Get a Document's Thing

Retrieves specific content from a document.

Overview

This procedure uses OCR (Optical Character Recognition) to analyze documents and extract various types of structured and unstructured data, including:

Fields: Key-value pairs extracted from forms
Lines: Individual text lines from the document
Pages: Individual pages that can be processed separately
Paragraphs: Grouped text blocks
Tables: Tabular data structures
Text: Full text or filtered text content
Words: Individual words from the document
Custom Fields: User-defined fields stored in document metadata

Input

| Concept | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| the document | file | The document from which content is to be retrieved. Must be a scanned document. | Yes | N/A |
| the thing | string | The specific type of content to retrieve from the document (e.g., 'field', 'line', 'paragraph', 'page', 'table', 'text', 'word'). Can include filters and qualifiers. | | N/A |
| the department's textract query flag | boolean | If set, enables querying with Amazon Textract for advanced document analysis. | | Not set |
| the department's ocr confidence threshold | number | The confidence threshold for OCR results. Content below this threshold will be ignored. Value must be between 0 and 1. | | 0.7 |
| the department's duplicate column resolution mode | string | The mode for combining duplicate columns in a table (e.g., "merge cells"). | | N/A |
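
The OCR confidence threshold described above acts as a cutoff: content that OCR reports with a lower confidence is ignored. The Python sketch below illustrates that idea only. It assumes each OCR item is a dict with hypothetical `text` and `confidence` keys, and that items exactly at the threshold are kept; neither detail is specified on this page.

```python
def filter_by_confidence(items, threshold=0.7):
    """Keep only OCR items whose confidence meets the threshold.

    `items` is a list of dicts with hypothetical "text" and "confidence"
    keys; this shape is an assumption made for illustration.
    """
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    # Assumption: an item exactly at the threshold is kept.
    return [item for item in items if item["confidence"] >= threshold]


# Made-up sample data: one clear line and one low-confidence line.
lines = [
    {"text": "FACTURA VIRTUAL", "confidence": 0.98},
    {"text": "smudged stamp", "confidence": 0.41},
]
kept = filter_by_confidence(lines)
```

With the default of 0.7, only the first sample line survives. Lowering the threshold keeps more content at the cost of more OCR noise.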

Output

| Concept | Description |
| --- | --- |
| result | Depending on the 'thing' requested, returns the specific content from the document. This could be a list of fields, lines, paragraphs, pages, tables, text strings, or other specified content. Each item may include confidence scores and location information. |

Examples

1. Getting Fields from a Document

Extracts all form fields from a scanned document.

get the file as a scanned document
get the document's fields

2. Getting Lines from a Document

Retrieves all text lines from the document.

get the file as a scanned document
get the document's lines

3. Getting Pages from a Document

Extracts individual pages that can be processed separately.

get the file as a scanned document
get the document's pages

4. Getting a Specific Page

Gets the first page from the document.

get the file as a scanned document
get the document's pages
get the first page out of those

5. Getting Tables from a Document

Extracts all tables found in the document.

get the file as a scanned document
get the document's tables
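
When an extracted table repeats a column name, the duplicate column resolution mode listed under Input decides how those columns are combined. This page does not say what a mode such as "merge cells" does in detail, so the Python sketch below is only one plausible reading: cells from same-named columns are joined into a single column. The function name, the header-plus-rows table shape, and the space separator are all assumptions.

```python
def merge_duplicate_columns(header, rows, sep=" "):
    """Combine columns that share a name into one column (hypothetical).

    Non-empty cells from the duplicate columns are joined with `sep`.
    """
    # Map each distinct column name to the positions where it appears.
    positions = {}
    for index, name in enumerate(header):
        positions.setdefault(name, []).append(index)

    merged_header = list(positions)  # dicts keep first-seen order
    merged_rows = [
        [sep.join(row[i] for i in indexes if row[i])
         for indexes in positions.values()]
        for row in rows
    ]
    return merged_header, merged_rows


# Made-up table with a repeated "Price" column.
header = ["Item", "Price", "Price"]
rows = [["Widget", "10", ""], ["Gadget", "", "25"], ["Cable", "3", "USD"]]
merged_header, merged_rows = merge_duplicate_columns(header, rows)
```

In this sample, the two "Price" columns collapse into one, and a row with values in both keeps them side by side in a single cell.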

6. Getting the Full Text

Retrieves the complete text content of the document.

get the file as a scanned document
get the document's full text

7. Getting Text Content

Retrieves the text content of the document.

get the file as a scanned document
get the document's text

8. Getting Page Texts

Retrieves text content for each page separately.

get the file as a scanned document
the document's page texts

9. Finding Specific Text

Finds the line whose text is "FACTURA VIRTUAL".

get the file as a scanned document
the document's line which is "FACTURA VIRTUAL"

10. Finding Text That Contains a String

Finds text that contains a specific substring.

get the file as a scanned document
the document's text which contains "FACTURA VIRTUAL"

11. Finding Lines Containing Text

Finds all lines that contain a specific string.

get the file as a scanned document
find the document's lines which contain "GAMARRA"
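
Examples 9 to 11 differ in how text is matched: `which is` versus `which contains`. The Python sketch below illustrates that distinction under the assumption that `which is` compares the whole line and `contains` looks for a substring. The actual matching rules (case sensitivity, whitespace handling) are not stated on this page, and the helper names and sample lines are made up.

```python
def line_which_is(lines, target):
    """Return the first line that equals `target` exactly, or None."""
    return next((line for line in lines if line == target), None)


def lines_which_contain(lines, fragment):
    """Return every line that has `fragment` somewhere inside it."""
    return [line for line in lines if fragment in line]


# Made-up OCR lines for illustration.
lines = ["FACTURA VIRTUAL", "JR. GAMARRA 123", "GALERIA GAMARRA"]

exact = line_which_is(lines, "FACTURA VIRTUAL")
partial = lines_which_contain(lines, "GAMARRA")
```

Under these assumptions, an exact match yields at most one line, while a substring match can return several.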

12. Getting Filtered Fields

Gets selected checkbox fields from a form.

get the file as a scanned document
the document's selected fields

13. Getting Fields with Specific Values

Retrieves fields whose value matches a condition.

get the file as a scanned document
the document's fields whose value is selected

14. Getting Words from Document Pages

Extracts all words from the document's pages.

get the file as a scanned document
the document's pages's words

15. Getting Dates from a Document

Extracts all dates found in the document.

get the file as a scanned document
the document's dates

16. Getting Amounts from a Document

Extracts all monetary amounts from the document.

get the file as a scanned document
the document's amounts

17. Getting a Specific Field by Name

Retrieves a specific field by name and position, here the second field named "address".

get the file as a scanned document
get the document's second "address"

18. Getting Custom Metadata Fields

Retrieves custom fields stored in document metadata (e.g., from subdocument extraction).

get the file as a scanned document
get the document's invoice number

19. Finding Tables with Specific Columns

Finds tables that contain specific column names.

get the file as a scanned document
find the document's tables whose columns contain "Price"

20. Counting Pages

Gets the number of pages in a document. Three different approaches are shown:

get the file as a scanned document
get the number of the document's pages

get the file as a scanned document
get the document's pages
get the number of the above

get the file as a scanned document
get the document's pages
count the above

21. Counting Handwritten Lines

Counts only the handwritten lines in the document.

get the file as a scanned document
count the document's handwritten lines

22. Getting the Document Filename

Retrieves the filename of the scanned document.

get the file as a scanned document
get the document's filename

Last updated 3 months ago
