Get a Document's Thing
Retrieves specific content from a document.
Overview
This procedure uses OCR (Optical Character Recognition) to analyze documents and extract various types of structured and unstructured data, including:
Fields: Key-value pairs extracted from forms
Lines: Individual text lines from the document
Pages: Individual pages that can be processed separately
Paragraphs: Grouped text blocks
Tables: Tabular data structures
Text: Full text or filtered text content
Words: Individual words from the document
Custom Fields: User-defined fields stored in document metadata
Input
the document
Type: file
Description: The document from which content is to be retrieved. Must be a scanned document.
Required: Yes
Default: N/A
the thing
Type: string
Description: The specific type of content to retrieve from the document (e.g., 'field', 'line', 'paragraph', 'page', 'table', 'text', 'word'). Can include filters and qualifiers.
Required: No
Default: N/A
the department's textract query flag
Type: boolean
Description: If set, enables querying with Amazon Textract for advanced document analysis.
Required: No
Default: Not set
the department's ocr confidence threshold
Type: number
Description: The confidence threshold for OCR results. Content below this threshold will be ignored. Value must be between 0 and 1.
Required: No
Default: 0.7
the department's duplicate column resolution mode
Type: string
Description: The mode for combining duplicate columns in a table (e.g., "merge cells").
Required: No
Default: N/A
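The OCR confidence threshold can be read as a simple filter over recognized items. The Python sketch below only illustrates that idea and is not the platform's syntax or implementation. The item shape (a dict with `text` and `confidence` keys) and the function name `filter_by_confidence` are hypothetical, and keeping an item whose score is exactly equal to the threshold is an assumption, since the description only says that content below the threshold is ignored.

```python
from typing import Dict, List, Union

# Hypothetical item shape: {"text": str, "confidence": float in [0, 1]}
Item = Dict[str, Union[str, float]]


def filter_by_confidence(items: List[Item], threshold: float = 0.7) -> List[Item]:
    """Keep only items whose confidence is at or above the threshold."""
    # The documented range for the threshold is 0 to 1.
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    # Assumption: a score equal to the threshold is kept.
    return [item for item in items if item["confidence"] >= threshold]


sample = [
    {"text": "Invoice", "confidence": 0.98},
    {"text": "lnv0ice", "confidence": 0.41},
    {"text": "Total", "confidence": 0.70},
]
kept = filter_by_confidence(sample)  # uses the documented default of 0.7
```

With the default of 0.7, the low-scoring second item is dropped and the other two remain.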
Output
result
The content requested through 'the thing'. Depending on the request, this can be a list of fields, lines, paragraphs, pages, tables, text strings, or other specified content. Each item may include confidence scores and location information.
Examples
1. Getting Fields from a Document
Extracts all form fields from a scanned document.
2. Getting Lines from a Document
Retrieves all text lines from the document.
3. Getting Pages from a Document
Extracts individual pages that can be processed separately.
4. Getting a Specific Page
Gets the first page from the document.
5. Getting Tables from a Document
Extracts all tables found in the document.
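When an extracted table repeats a column header, the duplicate column resolution mode input decides how those columns are combined. This page does not define what "merge cells" does, so the Python sketch below shows just one plausible reading: cells from same-named columns are joined per row, skipping empty cells. The function name `merge_duplicate_columns`, the header-plus-rows table shape, and the space separator are all assumptions for illustration, not the platform's behaviour.

```python
from typing import Dict, List, Tuple


def merge_duplicate_columns(
    header: List[str], rows: List[List[str]], sep: str = " "
) -> Tuple[List[str], List[List[str]]]:
    """Collapse columns that share a header name into a single column."""
    # Map each distinct header name to every position where it occurs,
    # preserving first-seen order.
    positions: Dict[str, List[int]] = {}
    for index, name in enumerate(header):
        positions.setdefault(name, []).append(index)

    merged_rows = []
    for row in rows:
        merged_rows.append([
            # Assumption: join the non-empty cells of duplicate columns.
            sep.join(row[i] for i in indexes if i < len(row) and row[i])
            for indexes in positions.values()
        ])
    return list(positions), merged_rows


header = ["Item", "Amount", "Amount"]
rows = [["Pen", "", "3.50"], ["Ink", "12.00", ""], ["Pad", "1", "2"]]
new_header, new_rows = merge_duplicate_columns(header, rows)
```

Here the two `Amount` columns collapse into one, and a row that has values in both keeps them side by side in a single cell.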
6. Getting the Full Text
Retrieves the complete text content of the document.
7. Getting Text Content
Retrieves the text content of the document.
8. Getting Page Texts
Retrieves text content for each page separately.
9. Finding Specific Text
Searches for a line that contains specific text.
10. Finding Text That Contains a String
Finds text that contains a specific substring.
11. Finding Lines Containing Text
Finds all lines that contain a specific string.
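As a rough picture of what a "lines containing" filter does, the Python sketch below selects lines by substring. It is an illustration only and not the platform's syntax. The function name `lines_containing` is hypothetical, and whether the real matching is case-sensitive is not stated on this page, so the sketch exposes that choice as a parameter.

```python
from typing import List


def lines_containing(lines: List[str], needle: str, ignore_case: bool = True) -> List[str]:
    """Return the lines that contain `needle` as a substring."""
    if ignore_case:
        # casefold() gives a case-insensitive comparison.
        folded = needle.casefold()
        return [line for line in lines if folded in line.casefold()]
    return [line for line in lines if needle in line]


doc_lines = [
    "Invoice Number: 1042",
    "Bill To: Acme Corp",
    "invoice date: 2024-01-15",
]
matches = lines_containing(doc_lines, "invoice")
```

With case ignored, the first and third lines match; with `ignore_case=False`, only the third does.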
12. Getting Filtered Fields
Gets selected checkbox fields from a form.
13. Getting Fields with Specific Values
Retrieves fields whose value matches a condition.
14. Getting Words from Document Pages
Extracts all words from the document's pages.
15. Getting Dates from a Document
Extracts all dates found in the document.
16. Getting Amounts from a Document
Extracts all monetary amounts from the document.
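How the procedure recognizes monetary amounts is not described on this page. As a stand-in, the Python sketch below uses a regular expression, which is a swapped-in technique rather than the platform's method. The function name `find_amounts` and the pattern (a currency symbol, digits with optional thousands separators, and an optional two-digit decimal part) are assumptions chosen for illustration.

```python
import re
from typing import List

# Assumed pattern: $, € or £, optional space, digits with optional
# comma-separated thousands groups, optional two-digit decimal part.
AMOUNT_PATTERN = re.compile(r"[$€£]\s?\d+(?:,\d{3})*(?:\.\d{2})?")


def find_amounts(text: str) -> List[str]:
    """Return every substring of `text` that looks like a monetary amount."""
    return AMOUNT_PATTERN.findall(text)


amounts = find_amounts("Subtotal $1,250.00, tax $98.75, total $1,348.75 due in 30 days")
```

Plain numbers without a currency symbol, such as the "30" in "30 days", are not matched by this pattern.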
17. Getting a Specific Field by Name
Retrieves a specific field value from the document.
18. Getting Custom Metadata Fields
Retrieves custom fields stored in document metadata (e.g., from subdocument extraction).
19. Finding Tables with Specific Columns
Finds tables that contain specific column names.
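Selecting tables by column name amounts to checking that every requested name appears in a table's header. The Python sketch below illustrates that check and is not the platform's syntax. The table shape (a dict with a `header` list), the function name `tables_with_columns`, and the case-insensitive, whitespace-trimmed comparison are assumptions.

```python
from typing import Dict, Iterable, List


def _normalize(name: str) -> str:
    # Assumption: column names match regardless of case or surrounding spaces.
    return name.strip().casefold()


def tables_with_columns(tables: List[Dict], required: Iterable[str]) -> List[Dict]:
    """Return the tables whose header contains every required column name."""
    wanted = {_normalize(name) for name in required}
    return [
        table for table in tables
        if wanted <= {_normalize(column) for column in table["header"]}
    ]


tables = [
    {"header": ["Item", "Qty", "Price"]},
    {"header": ["Date", "Description"]},
    {"header": [" item ", "PRICE", "Tax"]},
]
found = tables_with_columns(tables, ["item", "price"])
```

The first and third tables are returned because both contain an item column and a price column. The second is skipped.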
20. Counting Pages
Gets the number of pages in a document. These are 3 different approaches:
21. Getting Handwritten Lines
Retrieves only the handwritten lines in the document.
22. Getting the Document Filename
Retrieves the filename of the scanned document.
