Get a Document's Thing

Retrieves specific content from a document.

Overview

This procedure uses OCR (Optical Character Recognition) to analyze a document and extract structured and unstructured content, including:

  • Fields: Key-value pairs extracted from forms

  • Lines: Individual text lines from the document

  • Pages: Individual pages that can be processed separately

  • Paragraphs: Grouped text blocks

  • Tables: Tabular data structures

  • Text: Full text or filtered text content

  • Words: Individual words from the document

  • Custom Fields: User-defined fields stored in document metadata

Input

| Concept | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| the document | file | The document from which content is to be retrieved. Must be a scanned document. | Yes | N/A |
| the thing | string | The specific type of content to retrieve from the document (e.g., 'field', 'line', 'paragraph', 'page', 'table', 'text', 'word'). Can include filters and qualifiers. | No | N/A |
| the department's textract query flag | boolean | If set, enables querying with Amazon Textract for advanced document analysis. | No | Not set |
| the department's ocr confidence threshold | number | The confidence threshold for OCR results. Content below this threshold will be ignored. Value must be between 0 and 1. | No | 0.7 |
| the department's duplicate column resolution mode | string | The mode for combining duplicate columns in a table (e.g., "merge cells"). | No | N/A |
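
The confidence threshold can be pictured as a simple filter over OCR results. The Python sketch below is illustrative only and is not the platform's implementation: the `OcrItem` type, its field names, the `apply_threshold` helper, and the sample words are all hypothetical, and it assumes that an item scoring exactly at the threshold is kept.

```python
from dataclasses import dataclass


@dataclass
class OcrItem:
    """Hypothetical OCR result: recognized text plus a confidence score."""
    text: str
    confidence: float


def apply_threshold(items, threshold=0.7):
    """Keep items whose confidence is at or above the threshold.

    Assumption: a score equal to the threshold is kept; the documentation
    only states that content below the threshold is ignored.
    """
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    return [item for item in items if item.confidence >= threshold]


# Made-up sample data for illustration.
words = [
    OcrItem("FACTURA", 0.98),
    OcrItem("V1RTUAL", 0.41),
    OcrItem("GAMARRA", 0.70),
]
kept = [item.text for item in apply_threshold(words)]
print(kept)
```

With the default of 0.7, the low-confidence word is dropped and the other two are kept; raising the threshold trades recall for fewer misread items.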

Output

| Concept | Description |
| --- | --- |
| result | Depending on the 'thing' requested, returns the specific content from the document. This could be a list of fields, lines, paragraphs, pages, tables, text strings, or other specified content. Each item may include confidence scores and location information. |

Examples

1. Getting Fields from a Document

Extracts all form fields from a scanned document.

get the file as a scanned document
get the document's fields

2. Getting Lines from a Document

Retrieves all text lines from the document.

get the file as a scanned document
get the document's lines

3. Getting Pages from a Document

Extracts individual pages that can be processed separately.

get the file as a scanned document
get the document's pages

4. Getting a Specific Page

Gets the first page from the document.

get the file as a scanned document
get the document's pages
get the first page out of those

5. Getting Tables from a Document

Extracts all tables found in the document.

get the file as a scanned document
get the document's tables

6. Getting the Full Text

Retrieves the complete text content of the document.

get the file as a scanned document
get the document's full text

7. Getting Text Content

Retrieves the text content of the document.

get the file as a scanned document
get the document's text

8. Getting Page Texts

Retrieves text content for each page separately.

get the file as a scanned document
the document's page texts

9. Finding Specific Text

Finds the line whose text is "FACTURA VIRTUAL".

get the file as a scanned document
the document's line which is "FACTURA VIRTUAL"

10. Finding Text That Contains a String

Finds text that contains a specific substring.

get the file as a scanned document
the document's text which contains "FACTURA VIRTUAL"

11. Finding Lines Containing Text

Finds all lines that contain a specific string.

get the file as a scanned document
find the document's lines which contain "GAMARRA"
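
The filter examples differ in how the text is matched. As a rough mental model, and assuming that `which is` means an exact match and `which contain` means a case-sensitive substring match (the documentation does not spell this out), the two filters behave like the hypothetical Python helpers below. The sample lines are made up.

```python
def lines_which_are(lines, target):
    """Return the lines whose text equals the target exactly (assumed semantics)."""
    return [line for line in lines if line == target]


def lines_which_contain(lines, fragment):
    """Return the lines that include the fragment as a substring (assumed semantics)."""
    return [line for line in lines if fragment in line]


# Made-up sample lines for illustration.
lines = ["FACTURA VIRTUAL", "JR. GAMARRA 653", "GALERIA GAMARRA"]

exact = lines_which_are(lines, "FACTURA VIRTUAL")
partial = lines_which_contain(lines, "GAMARRA")
print(exact)
print(partial)
```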

12. Getting Filtered Fields

Gets selected checkbox fields from a form.

get the file as a scanned document
the document's selected fields

13. Getting Fields with Specific Values

Retrieves fields whose value matches a condition.

get the file as a scanned document
the document's fields whose value is selected

14. Getting Words from Document Pages

Extracts all words from the document's pages.

get the file as a scanned document
the document's pages's words

15. Getting Dates from a Document

Extracts all dates found in the document.

get the file as a scanned document
the document's dates

16. Getting Amounts from a Document

Extracts all monetary amounts from the document.

get the file as a scanned document
the document's amounts

17. Getting a Specific Field by Name

Retrieves a specific field by name. An ordinal selects among several matches; here, the second "address" in the document.

get the file as a scanned document
get the document's second "address"
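
One way to picture the ordinal lookup is as a search over name and value pairs in which the same name can appear more than once. The Python sketch below is a hypothetical illustration, not the platform's behavior: the `nth_field` helper, the pair representation, the sample values, and the choice to return `None` when there are too few matches are all assumptions.

```python
def nth_field(fields, name, position):
    """Return the value of the position-th field called `name` (1-based).

    `fields` is a list of (name, value) pairs, because one name may
    occur several times. Returns None if there is no such occurrence
    (an assumption made for this sketch).
    """
    matches = [value for key, value in fields if key == name]
    if position < 1 or position > len(matches):
        return None
    return matches[position - 1]


# Made-up sample fields for illustration.
fields = [
    ("address", "12 Main St"),
    ("name", "Ana"),
    ("address", "Av. Gamarra 653"),
]
second_address = nth_field(fields, "address", 2)
print(second_address)
```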

18. Getting Custom Metadata Fields

Retrieves custom fields stored in document metadata (e.g., from subdocument extraction).

get the file as a scanned document
get the document's invoice number

19. Finding Tables with Specific Columns

Finds tables that contain specific column names.

get the file as a scanned document
find the document's tables whose columns contain "Price"
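
The column filter can be thought of as keeping only the tables whose header row includes the requested name. The Python sketch below is a hypothetical illustration: the dictionary layout with `columns` and `rows` keys, the `tables_with_column` helper, and the sample data are assumptions, and it treats "contain" as an exact, case-sensitive match on a column name.

```python
def tables_with_column(tables, column_name):
    """Return the tables that have a column named exactly `column_name`."""
    return [table for table in tables if column_name in table["columns"]]


# Made-up sample tables for illustration.
tables = [
    {"columns": ["Item", "Price"], "rows": [["Shirt", "25.00"]]},
    {"columns": ["Date", "Note"], "rows": [["2024-01-05", "Paid"]]},
]
priced = tables_with_column(tables, "Price")
print(len(priced))
```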

20. Counting Pages

Gets the number of pages in a document. Three different approaches are shown, separated by blank lines:

get the file as a scanned document
get the number of the document's pages

get the file as a scanned document
get the document's pages
get the number of the above

get the file as a scanned document
get the document's pages
count the above

21. Counting Handwritten Lines

Counts the handwritten lines in the document, ignoring printed text.

get the file as a scanned document
count the document's handwritten lines

22. Getting the Document Filename

Retrieves the filename of the scanned document.

get the file as a scanned document
get the document's filename
