Get a Document's Thing

Retrieves specific content from a document.

Overview

This procedure uses OCR (Optical Character Recognition) to analyze documents and extract various types of structured and unstructured data, including:

  • Fields: Key-value pairs extracted from forms

  • Lines: Individual text lines from the document

  • Pages: Individual pages that can be processed separately

  • Paragraphs: Grouped text blocks

  • Tables: Tabular data structures

  • Text: Full text or filtered text content

  • Words: Individual words from the document

  • Custom Fields: User-defined fields stored in document metadata

Input

| Concept | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| the document | file | The document from which content is to be retrieved. Must be a scanned document. | Yes | N/A |
| the thing | string | The specific type of content to retrieve from the document (e.g., 'field', 'line', 'paragraph', 'page', 'table', 'text', 'word'). Can include filters and qualifiers. | No | N/A |
| the department's textract query flag | boolean | If set, enables querying with Amazon Textract for advanced document analysis. | No | Not set |
| the department's ocr confidence threshold | number | The confidence threshold for OCR results. Content below this threshold will be ignored. Value must be between 0 and 1. | No | 0.7 |
| the department's duplicate column resolution mode | string | The mode for combining duplicate columns in a table (e.g., "merge cells"). | No | N/A |
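The effect of the OCR confidence threshold can be pictured with a short Python sketch. This is not the procedure's implementation: the `OcrItem` type, the `filter_by_confidence` helper, and the sample values are hypothetical. Only the default of 0.7 and the 0-to-1 range come from the Input section. The sketch also assumes that an item whose confidence exactly equals the threshold is kept, which the Input section does not specify.

```python
from dataclasses import dataclass


@dataclass
class OcrItem:
    """Hypothetical stand-in for one piece of recognized content."""
    text: str
    confidence: float  # assumed to be a score between 0 and 1


def filter_by_confidence(items, threshold=0.7):
    """Keep items whose confidence is at or above the threshold (assumed rule)."""
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    return [item for item in items if item.confidence >= threshold]


items = [
    OcrItem("Invoice", 0.98),
    OcrItem("T0tal", 0.41),   # low-confidence misread, dropped at the default
    OcrItem("Date", 0.70),
]
kept = filter_by_confidence(items)
print([item.text for item in kept])  # ['Invoice', 'Date']
```

Raising the threshold trades recall for precision: fewer misread items get through, but faint or handwritten content is more likely to be dropped.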

Output

| Concept | Description |
| --- | --- |
| result | Depending on the 'thing' requested, returns the specific content from the document. This could be a list of fields, lines, paragraphs, pages, tables, text strings, or other specified content. Each item may include confidence scores and location information. |

Examples

1. Getting Fields from a Document

Extracts all form fields from a scanned document.

2. Getting Lines from a Document

Retrieves all text lines from the document.

3. Getting Pages from a Document

Extracts individual pages that can be processed separately.

4. Getting a Specific Page

Gets the first page from the document.

5. Getting Tables from a Document

Extracts all tables found in the document.

6. Getting the Full Text

Retrieves the complete text content of the document.

7. Getting Text Content

Retrieves the text content of the document.

8. Getting Page Texts

Retrieves text content for each page separately.

9. Finding Specific Text

Searches for a specific line containing text.

10. Finding Text That Contains a String

Finds text that contains a specific substring.

11. Finding Lines Containing Text

Finds all lines that contain a specific string.
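As a rough illustration of what a "contains" filter does to a list of lines, here is a minimal Python sketch. The `lines_containing` helper and the sample lines are hypothetical and are not the procedure's own syntax. Case-insensitive matching is an assumption, since the matching rules are not stated here.

```python
def lines_containing(lines, needle, case_sensitive=False):
    """Return every line that contains `needle` as a substring."""
    if not case_sensitive:
        # Assumption: matching ignores case unless asked otherwise.
        needle = needle.lower()
        return [line for line in lines if needle in line.lower()]
    return [line for line in lines if needle in line]


lines = [
    "Invoice Number: 1042",
    "Bill To: Acme Corp",
    "Total Due: $310.00",
]
print(lines_containing(lines, "total"))  # ['Total Due: $310.00']
```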

12. Getting Filtered Fields

Gets selected checkbox fields from a form.

13. Getting Fields with Specific Values

Retrieves fields whose value matches a condition.
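Filtering fields by a condition, as in examples 12 and 13, amounts to applying a predicate to key-value pairs. The Python sketch below shows that idea under stated assumptions: the `Field` type, the `fields_where` helper, and the representation of a checked box as the string "selected" are all hypothetical and may differ from what the procedure actually returns.

```python
from dataclasses import dataclass


@dataclass
class Field:
    """Hypothetical key-value pair extracted from a form."""
    key: str
    value: str


def fields_where(fields, predicate):
    """Return the fields for which `predicate(field)` is true."""
    return [field for field in fields if predicate(field)]


fields = [
    Field("Name", "Dana Lee"),
    Field("Newsletter", "selected"),          # assumed checkbox encoding
    Field("Express shipping", "not selected"),
]

selected = fields_where(fields, lambda f: f.value == "selected")
print([f.key for f in selected])  # ['Newsletter']
```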

14. Getting Words from Document Pages

Extracts all words from the document's pages.

15. Getting Dates from a Document

Extracts all dates found in the document.

16. Getting Amounts from a Document

Extracts all monetary amounts from the document.

17. Getting a Specific Field by Name

Retrieves a specific field value from the document.

18. Getting Custom Metadata Fields

Retrieves custom fields stored in document metadata (e.g., from subdocument extraction).

19. Finding Tables with Specific Columns

Finds tables that contain specific column names.
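Selecting tables by column name can be thought of as a subset check on each table's header row. The following Python sketch assumes a hypothetical table shape (a dict with `columns` and `rows`) and a hypothetical `tables_with_columns` helper. Ignoring case and surrounding whitespace when comparing names is also an assumption.

```python
def tables_with_columns(tables, required):
    """Return tables whose headers include every name in `required`."""
    wanted = {name.strip().lower() for name in required}
    matches = []
    for table in tables:
        headers = {header.strip().lower() for header in table["columns"]}
        if wanted <= headers:  # every wanted name is present
            matches.append(table)
    return matches


tables = [
    {"columns": ["Item", "Quantity", "Price"],
     "rows": [["Widget", "2", "4.50"]]},
    {"columns": ["Date", "Description"],
     "rows": [["2024-01-05", "Setup fee"]]},
]

found = tables_with_columns(tables, ["item", "price"])
print(len(found))           # 1
print(found[0]["columns"])  # ['Item', 'Quantity', 'Price']
```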

20. Counting Pages

Gets the number of pages in a document. Three different approaches are available.

21. Getting Handwritten Lines

Retrieves only the handwritten lines in the document.

22. Getting the Document Filename

Retrieves the filename of the scanned document.
