Extract Data

Extracts data from text, documents, or files using LLMs.

Overview

This procedure extracts data from text, images, documents, and other files. It uses LLMs to identify and retrieve the text content, making the information in those sources easy to access and work with.


This procedure works on PDFs, documents, images, text, and other files. It does not currently support .csv files.

Syntax

Below is a line-by-line overview of the automation syntax.

extract data from the source

What does it do?

Instructs the system to begin data extraction from the specified source.

Where does it go?

This phrase should be written on a new line.

Is it required?

✅ Yes — This phrase is required.

Does it require data?

✅ Yes — Replace the source with a reference to a data source in your automation (for example, the file, the document, or the text).

Example

extract data from the document

the {position} field is "{name}"

What does it do?

Specifies the name of a field to be extracted from the source.

Where does it go?

Indented under extract data from the source.

Is it required?

✅ Yes — This phrase is required.

Does it require data?

✅ Yes — The position should be a word like "first", "second", "third", etc. to indicate the order of the field to extract from the source. The name should be a text value that identifies the field.

Example

the first field is "invoice number"

the {position} field's rule is "{rule}"

What does it do?

Specifies the rule to be followed for the field extraction.

Where does it go?

Indented under extract data from the source.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — The position should be a word like "first", "second", "third", etc. to indicate the order of the field to extract from the source. The rule should be a text value that specifies a rule for the field extraction.

Example

the first field's rule is "keep just the first four digits"

the {position} field's format is "{format}"

What does it do?

Specifies the format of the field to be extracted from the source.

Where does it go?

Indented under extract data from the source.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — The position should be a word like "first", "second", "third", etc. to indicate the order of the field to extract from the source. The format should be one of the following values: number, string, or date. The default format is string.

Example

the {position} field's default is "{default}"

What does it do?

Specifies the field's default value.

Where does it go?

Indented under extract data from the source.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — The position should be a word like "first", "second", "third", etc. to indicate the order of the field to extract from the source. The default should be the default value itself.

Example

the common default value is x

What does it do?

Specifies the global default value. A field's own default value overrides it.
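The precedence between the two defaults can be sketched in a few lines of Python. This is only an illustration of the rule stated here, not the platform's implementation; the function name `resolve_default` and the use of `None` to mean "not set" are assumptions made for the example.

```python
from typing import Optional


def resolve_default(field_default: Optional[str],
                    common_default: Optional[str]) -> Optional[str]:
    """Pick the fallback value for a field (hypothetical helper).

    A field-level default, when set, wins over the common default;
    otherwise the common default applies.
    """
    if field_default is not None:
        return field_default
    return common_default


# A field with its own default ignores the common default.
print(resolve_default("N/A", "unknown"))   # N/A
# A field without its own default falls back to the common default.
print(resolve_default(None, "unknown"))    # unknown
```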

Where does it go?

Indented under extract data from the source.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace x with any value to use as the common default.

Example

the openai model is "openai-model"

What does it do?

Specifies the name of the OpenAI model to use to generate the response.

Where does it go?

Indented under extract data from the source.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace openai-model with a text value containing the model's name. The default is gpt-4o-latest.

Example

the gemini model is "gemini-model"

What does it do?

Specifies the name of the Gemini model to use to generate the response.

Where does it go?

Indented under extract data from the source.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace gemini-model with a text value containing the model's name. The default model is gemini-2.5-pro (US region only).

Example

the input schema is x

What does it do?

Specifies a JSON or YAML schema that defines the fields to extract, their formats, rules, and default values. This allows programmatic field definition instead of using individual field syntax.

Where does it go?

Indented under extract data from the source.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace x with a stringified JSON object, a YAML string, or an S3 URL pointing to a JSON file.

JSON format: {"fields": [{"name": "field_name", "format": "string|number|date", "rule": "extraction rule", "default": "default_value"}], "common_default": "value"}

S3 URL format: s3://bucket-name/path/to/schema.json
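If the schema is assembled in code before being passed to the automation, a small helper can produce the stringified JSON in the shape shown above. The Python sketch below is an assumption-laden illustration: the helper name `build_input_schema` is hypothetical, and treating only `name` as mandatory mirrors the per-field syntax rather than any documented validation rule.

```python
import json


def build_input_schema(fields, common_default=None):
    """Return a stringified JSON schema in the documented shape.

    Hypothetical helper: only "name" is treated as mandatory here,
    which is an assumption based on the per-field syntax.
    """
    allowed_formats = {"string", "number", "date"}
    cleaned = []
    for field in fields:
        if "name" not in field:
            raise ValueError("every field needs a name")
        fmt = field.get("format", "string")  # string is the documented default
        if fmt not in allowed_formats:
            raise ValueError(f"unsupported format: {fmt}")
        entry = {"name": field["name"], "format": fmt}
        # Copy the optional keys only when they were provided.
        for key in ("rule", "default"):
            if key in field:
                entry[key] = field[key]
        cleaned.append(entry)
    schema = {"fields": cleaned}
    if common_default is not None:
        schema["common_default"] = common_default
    return json.dumps(schema)


schema_text = build_input_schema(
    [
        {"name": "invoice number", "rule": "keep just the first four digits"},
        {"name": "invoice date", "format": "date", "default": "unknown"},
    ],
    common_default="N/A",
)
print(schema_text)
```

The resulting string is what would stand in for x in the phrase, or be saved as the JSON file that an S3 URL points to.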

Example

the visual reference is x

What does it do?

Specifies a document or image to guide the LLM as a visual reference, improving accuracy when extracting data. This is useful in multi-document scenarios or when the reference differs from the source text.

Where does it go?

Indented under extract data from the source.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace x with a reference to a document, image, or file.

Example

the output format is "output-format"

What does it do?

Specifies an output format for the response.

Where does it go?

Indented under extract data from the source.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace output-format with a text value that is either table or json.

Example

the dpi is x

What does it do?

Specifies the DPI (dots per inch).

Where does it go?

Indented under extract data from the source.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace x with a numeric value representing the DPI. The default is 100.

Example

the extraction mode is "extraction-mode"

What does it do?

Specifies an extraction mode for the procedure.

Where does it go?

Indented under extract data from the source.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace extraction-mode with one of the following options:

  • "no ocr": Send the document image directly to the LLM without OCR text. Faster and usually just as accurate.

  • "file_upload": Upload the PDF directly to the LLM provider's Files API. Best for very large documents (100+ pages) as it avoids token limits from base64 encoding. Supports both OpenAI (gpt-4o models) and Gemini (gemini-2.0-flash, gemini-2.5-pro) models.
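The remark about base64 encoding comes down to simple arithmetic: base64 turns every 3 bytes of input into 4 output characters, so a file sent inline grows by about a third before it reaches the model. The short Python check below illustrates this with placeholder bytes standing in for a real PDF; it does not show how the platform itself encodes or uploads files.

```python
import base64

raw = bytes(3_000_000)  # 3 MB of placeholder bytes standing in for a PDF
encoded = base64.b64encode(raw)

# base64 emits 4 characters for every 3 input bytes.
print(len(raw))      # 3000000
print(len(encoded))  # 4000000
print(len(encoded) / len(raw))
```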

Example

Examples

1. Extract Multiple Fields from a Document

2. Extract Data from Text

3. Extract Multiple Fields from Text (Using Default Values)

4. Extract Data Using JSON Input Schema

5. Extract Data Using S3 URL Input Schema

6. Extract Data from Large Document Using File Upload Mode
