Get a Document's Thing
Retrieves specific content from a document.
Overview
This procedure uses OCR (Optical Character Recognition) to analyze documents and extract various types of structured and unstructured data, including:
Fields: Key-value pairs extracted from forms
Lines: Individual text lines from the document
Pages: Individual pages that can be processed separately
Paragraphs: Grouped text blocks
Tables: Tabular data structures
Text: Full text or filtered text content
Words: Individual words from the document
Custom Fields: User-defined fields stored in document metadata
Input
| Name | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| the document | file | The document from which content is to be retrieved. Must be a scanned document. | Yes | N/A |
| the thing | string | The specific type of content to retrieve from the document (e.g., 'field', 'line', 'paragraph', 'page', 'table', 'text', 'word'). Can include filters and qualifiers. | No | N/A |
| the department's textract query flag | boolean | If set, enables querying with Amazon Textract for advanced document analysis. | No | Not set |
| the department's ocr confidence threshold | number | The confidence threshold for OCR results. Content below this threshold will be ignored. Value must be between 0 and 1. | No | 0.7 |
| the department's duplicate column resolution mode | string | The mode for combining duplicate columns in a table (e.g., "merge cells"). | No | N/A |
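The effect of the OCR confidence threshold can be pictured with a short Python sketch. This is an illustration only, not the platform's implementation: the `OcrItem` type, the `filter_by_confidence` helper, and the sample values are all hypothetical, and keeping items that sit exactly on the threshold is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class OcrItem:
    """Hypothetical stand-in for one piece of OCR output (a line, word, or field)."""
    text: str
    confidence: float  # assumed to lie between 0 and 1


def filter_by_confidence(items, threshold=0.7):
    """Keep only items whose confidence is at or above the threshold.

    The default of 0.7 mirrors the documented default. Keeping items that
    are exactly on the threshold is an assumption of this sketch.
    """
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    return [item for item in items if item.confidence >= threshold]


# Made-up sample data, for illustration only.
sample = [
    OcrItem("FACTURA VIRTUAL", 0.98),
    OcrItem("GAMARRA", 0.72),
    OcrItem("smudged text", 0.41),
]
kept = filter_by_confidence(sample)
print([item.text for item in kept])
```

With the default threshold of 0.7, the low-confidence item is dropped and the other two are kept. Raising the threshold trades recall for precision: fewer items are returned, but those that remain are more likely to have been read correctly.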
Output
result
Depending on the 'thing' requested, returns the specific content from the document. This could be a list of fields, lines, paragraphs, pages, tables, text strings, or other specified content. Each item may include confidence scores and location information.
Examples
1. Getting Fields from a Document
Extracts all form fields from a scanned document.
get the file as a scanned document
get the document's fields
2. Getting Lines from a Document
Retrieves all text lines from the document.
get the file as a scanned document
get the document's lines
3. Getting Pages from a Document
Extracts individual pages that can be processed separately.
get the file as a scanned document
get the document's pages
4. Getting a Specific Page
Gets the first page from the document.
get the file as a scanned document
get the document's pages
get the first page out of those
5. Getting Tables from a Document
Extracts all tables found in the document.
get the file as a scanned document
get the document's tables
6. Getting the Full Text
Retrieves the complete text content of the document.
get the file as a scanned document
get the document's full text
7. Getting Text Content
Retrieves the text content of the document.
get the file as a scanned document
get the document's text
8. Getting Page Texts
Retrieves text content for each page separately.
get the file as a scanned document
the document's page texts
9. Finding Specific Text
Searches for a specific line containing text.
get the file as a scanned document
the document's line which is "FACTURA VIRTUAL"
10. Finding Text That Contains a String
Finds text that contains a specific substring.
get the file as a scanned document
the document's text which contains "FACTURA VIRTUAL"
11. Finding Lines Containing Text
Finds all lines that contain a specific string.
get the file as a scanned document
find the document's lines which contain "GAMARRA"
12. Getting Filtered Fields
Gets selected checkbox fields from a form.
get the file as a scanned document
the document's selected fields
13. Getting Fields with Specific Values
Retrieves fields whose value matches a condition.
get the file as a scanned document
the document's fields whose value is selected
14. Getting Words from Document Pages
Extracts all words from the document's pages.
get the file as a scanned document
the document's pages's words
15. Getting Dates from a Document
Extracts all dates found in the document.
get the file as a scanned document
the document's dates
16. Getting Amounts from a Document
Extracts all monetary amounts from the document.
get the file as a scanned document
the document's amounts
17. Getting a Specific Field by Name
Retrieves a specific field from the document by name, here the second occurrence of "address".
get the file as a scanned document
get the document's second "address"
18. Getting Custom Metadata Fields
Retrieves custom fields stored in document metadata (e.g., from subdocument extraction).
get the file as a scanned document
get the document's invoice number
19. Finding Tables with Specific Columns
Finds tables that contain specific column names.
get the file as a scanned document
find the document's tables whose columns contain "Price"
20. Counting Pages
Gets the number of pages in a document. Three alternative approaches are shown:
get the file as a scanned document
get the number of the document's pages

get the file as a scanned document
get the document's pages
get the number of the above

get the file as a scanned document
get the document's pages
count the above
21. Getting Handwritten Lines
Counts only handwritten lines in the document.
get the file as a scanned document
count the document's handwritten lines
22. Getting the Document Filename
Retrieves the filename of the scanned document.
get the file as a scanned document
get the document's filename
