Extract Data from Multiple Documents

Extracts data from multiple documents in a single LLM call using file upload.

Overview

This procedure extracts data from multiple documents at once: it uploads every document to the LLM provider through the provider's Files API (OpenAI or Gemini) and then makes a single API call. Processing all documents in one call is more efficient than processing them individually and keeps extraction consistent across the set, which makes the procedure well suited to batch processing of large document sets.

PDF files are uploaded directly to the LLM provider without OCR processing, so scanned documents are not supported.

Syntax

Below is a line-by-line overview of the automation syntax.

extract data from the documents

What does it do?

Instructs the system to begin data extraction from multiple documents.

Where does it go?

This phrase should be written on a new line.

Is it required?

✅ Yes — This phrase is required.

Does it require data?

✅ Yes — The documents must be available as a list or collection in your automation.

Example

extract data from the documents

the {position} field is "{name}"

What does it do?

Specifies the name of a field to be extracted from the documents.

Where does it go?

Indented under extract data from the documents.

Is it required?

✅ Yes — This phrase is required.

Does it require data?

✅ Yes — The position should be a word like "first", "second", "third", etc. to indicate the order of the field to extract from the documents. The name should be a text value that identifies the field.
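
When the field list already exists in code, the position words can be generated rather than typed by hand. The following is a minimal Python sketch, not part of the platform: `ORDINALS` and `field_lines` are hypothetical names, and the sketch stops at "tenth" because the documentation only lists "first", "second", "third", etc. without stating an upper limit.

```python
# Position words for fields. The documentation gives "first", "second",
# "third", etc.; whether words past "tenth" are accepted is not stated,
# so this sketch deliberately stops there (assumption).
ORDINALS = ["first", "second", "third", "fourth", "fifth",
            "sixth", "seventh", "eighth", "ninth", "tenth"]


def field_lines(names, indent="    "):
    """Return one 'the <position> field is "<name>"' line per field name."""
    if len(names) > len(ORDINALS):
        raise ValueError("this sketch only covers the first ten positions")
    return [f'{indent}the {ORDINALS[i]} field is "{name}"'
            for i, name in enumerate(names)]


lines = field_lines(["invoice number", "invoice date", "total amount"])
print("\n".join(lines))
```

The generated lines are indented so they can be pasted under extract data from the documents.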

Example

the first field is "invoice number"

the {position} field's rule is "{rule}"

What does it do?

Specifies the rule to be followed for the field extraction.

Where does it go?

Indented under extract data from the documents.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — The position should be a word like "first", "second", "third", etc. to indicate the order of the field to extract from the documents. The rule should be a text value that specifies a rule for the field extraction.

Example

the first field's rule is "keep just the first four digits"

the {position} field's format is "{format}"

What does it do?

Specifies the format of the field to be extracted.

Where does it go?

Indented under extract data from the documents.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — The position should be a word like "first", "second", "third", etc. to indicate the order of the field to extract from the documents. The format should be one of the following values: number, string, or date. The default format is string.

Example

the first field's format is "number"

the {position} field's default is "{default}"

What does it do?

Specifies the field's default value.

Where does it go?

Indented under extract data from the documents.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — The position should be a word like "first", "second", "third", etc. to indicate the order of the field to extract from the documents. The default should be the default value itself.

Example

the first field's default value is 100

the common default value is x

What does it do?

Specifies the global default value. A field's own default value overrides it.
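
The precedence between the two kinds of default can be pictured with a short Python sketch. This illustrates the rule as described, not the platform's implementation: `resolve_default` is a hypothetical helper, and returning `None` when neither default is set is an assumption, since the documentation does not say what happens in that case.

```python
# Sentinel that distinguishes "no default was set" from a default of None.
_MISSING = object()


def resolve_default(field_default=_MISSING, common_default=_MISSING):
    """Pick the fallback value for a field that could not be extracted."""
    if field_default is not _MISSING:
        return field_default      # the field's own default wins
    if common_default is not _MISSING:
        return common_default     # otherwise fall back to the common default
    return None                   # assumption: nothing configured


print(resolve_default(100, "completed"))
print(resolve_default(common_default="completed"))
```

The first call returns the field's own default and the second returns the common default.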

Where does it go?

Indented under extract data from the documents.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace x with any value to use as the common default.

Example

the common default value is "completed"

the openai model is "openai-model"

What does it do?

Specifies the name of the OpenAI model to use to generate the response.

Where does it go?

Indented under extract data from the documents.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace openai-model with a text value containing the model's name. The default is gpt-4o-latest.

Example

the openai model is "gpt-4o"

the gemini model is "gemini-model"

What does it do?

Specifies the name of the Gemini model to use to generate the response.

Where does it go?

Indented under extract data from the documents.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace gemini-model with a text value containing the model's name. The default model is gemini-2.5-pro (US region only).

Example

the gemini model is "gemini-2.0-flash"

the input schema is x

What does it do?

Specifies a JSON or YAML schema that defines the fields to extract, their formats, rules, and default values. This allows programmatic field definition instead of using individual field syntax.

Where does it go?

Indented under extract data from the documents.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace x with a stringified JSON, a YAML string, or an S3 URL pointing to a JSON file.

JSON format: {"fields": [{"name": "field_name", "format": "string|number|date", "rule": "extraction rule", "default": "default_value"}], "common_default": "value"}

S3 URL format: s3://bucket-name/path/to/schema.json
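
Because the schema is passed as a string, it can help to check it before using it in an automation. The following Python sketch parses a stringified JSON schema and checks it against the shape shown above. `validate_schema` is a hypothetical helper; requiring a non-empty "fields" list and a non-empty "name" per field are assumptions, since the platform's own validation is not documented here, and the "N/A" value is only sample data.

```python
import json

# Formats listed in the documentation; "string" is the documented default.
ALLOWED_FORMATS = ("string", "number", "date")


def validate_schema(schema_text):
    """Parse a stringified JSON schema and check it against the documented shape.

    Hypothetical helper: the required-field checks below are assumptions,
    not documented platform behavior.
    """
    schema = json.loads(schema_text)
    if not isinstance(schema, dict):
        raise ValueError("the schema must be a JSON object")
    fields = schema.get("fields")
    if not isinstance(fields, list) or not fields:
        raise ValueError('"fields" must be a non-empty list')
    for field in fields:
        if not isinstance(field, dict):
            raise ValueError("each field must be a JSON object")
        name = field.get("name")
        if not isinstance(name, str) or not name:
            raise ValueError("each field needs a non-empty name")
        fmt = field.get("format", "string")  # format falls back to "string"
        if fmt not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {fmt!r}")
    return schema


schema = validate_schema(
    '{"fields": [{"name": "invoice_number", "format": "string"}], '
    '"common_default": "N/A"}'
)
print(schema["fields"][0]["name"])
```

A schema with a format outside number, string, or date raises a ValueError in this sketch, which catches typos before the automation runs.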

Example

the input schema is {"fields": [{"name": "invoice_number", "format": "string"}]}

the output format is "output-format"

What does it do?

Specifies an output format for the response.

Where does it go?

Indented under extract data from the documents.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace output-format with a text value that is either table or json.

Example

the output format is "table"

Examples

1. Extract Multiple Fields from Multiple Documents

the documents are the list of files
extract data from the documents
    the openai model is "gpt-4o"
    the first field is "invoice number"
    the first field's format is "string"

    the second field is "invoice date"
    the second field's format is "date"

    the third field is "total amount"
    the third field's format is "number"
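
The same three fields could presumably be supplied through the input schema instead of individual field lines. The following Python sketch builds the stringified JSON for that case from the documented JSON format. The variable names are illustrative, and treating the result as equivalent to the field lines above is an assumption based on the schema's description rather than a verified behavior.

```python
import json

# The three fields from the example above, in the documented schema shape.
fields = [
    {"name": "invoice number", "format": "string"},
    {"name": "invoice date", "format": "date"},
    {"name": "total amount", "format": "number"},
]

# Serialize to the stringified JSON that the input schema phrase expects.
schema_text = json.dumps({"fields": fields})
print(f"the input schema is {schema_text}")
```

The printed line can then be indented under extract data from the documents in place of the six field lines.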
