Extract Data

Extracts data from texts, documents, or files using large language models (LLMs).

Overview

This procedure extracts data from texts, documents, and files, such as PDFs. It uses AI models to identify and retrieve the requested content, making the information in those sources easier to access and work with.

Before using this procedure, make sure your Agent has learned the Document Processing Book. After learning the Book, publish your Agent and create a new Playground for the change to take effect.

Syntax

Below is a line-by-line overview of the automation syntax.

extract data from the {source}

What does it do?

Instructs the system to begin data extraction from the specified source.

Where does it go?

This phrase should be written on a new line.

Is it required?

✅ Yes — This phrase is required in the syntax.

Does it require input data?

✅ Yes — A name should be specified in place of source (for example, file, document, or text).

Example

extract data from the document

the {position} field is "{name}"

What does it do?

Specifies the name of a field to be extracted from the source.

Where does it go?

This phrase should be indented beneath extract data from the source.

Is it required?

✅ Yes — This phrase is required in the syntax.

Does it require input data?

✅ Yes — The position should be a word like "first", "second", "third", etc. to indicate the order of the field to extract from the source. The name should be a text value that identifies the field.

Example

the first field is "invoice number"

the {position} field's rule is "{rule}"

What does it do?

Specifies the rule to be followed for the field extraction.

Where does it go?

This phrase should be indented beneath extract data from the source.

Is it required?

🌟 No — This phrase is optional in the syntax.

Does it require input data?

✅ Yes — The position should be a word like "first", "second", "third", etc. to indicate the order of the field to extract from the source. The rule should be a text value that specifies a rule for the field extraction.

Example

the first field's rule is "keep just the first four digits"

the {position} field's format is "{format}"

What does it do?

Specifies the data format of the field to be extracted from the source.

Where does it go?

This phrase should be indented beneath extract data from the source.

Is it required?

🌟 No — This phrase is optional in the syntax.

Does it require input data?

✅ Yes — The position should be a word like "first", "second", "third", etc. to indicate the order of the field to extract from the source. The format should be one of the following values: number, string, date.

Example

the first field's format is "number"

the {position} field's default value is "{default}"

What does it do?

Specifies the field's default value.

Where does it go?

This phrase should be indented beneath extract data from the source.

Is it required?

🌟 No — This phrase is optional in the syntax.

Does it require input data?

✅ Yes — The position should be a word like "first", "second", "third", etc. to indicate the order of the field to extract from the source. The default should be the default value itself.

Example

the first field's default value is 100

the common default value is

What does it do?

Specifies a default value shared by all fields. A field's own default value, if set, takes precedence over it.
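
As a minimal sketch of that precedence, the snippet below combines only phrases documented on this page; the field names and values are made up for illustration. Assuming defaults apply when a field yields no value (this page does not spell out when they are used), the first field would fall back to the common default and the second to its own default.

```text
extract data from the document
    the common default value is "unknown"

    the first field is "customer name"

    the second field is "invoice date"
    the second field's format is "date"
    the second field's default value is "01.01.24"
```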

Where does it go?

This phrase should be indented beneath extract data from the source.

Is it required?

🌟 No — This phrase is optional in the syntax.

Does it require input data?

✅ Yes — Any value can be specified.

Example

the common default value is "completed"

the openai model is

What does it do?

Specifies the name of the OpenAI model to use to generate the response.

Where does it go?

This phrase should be indented beneath extract data from the source.

Is it required?

🌟 No — This phrase is optional in the syntax.

Does it require input data?

✅ Yes — A text value specifying the model's name.

Example

the openai model is "gpt-4.1"

Default

The default value is gpt-4o-latest.

the gemini model is

What does it do?

Specifies the name of the Gemini model to use to generate the response.

Where does it go?

This phrase should be indented beneath extract data from the source.

Is it required?

🌟 No — This phrase is optional in the syntax.

Does it require input data?

✅ Yes — A text value specifying the model's name; no default is set.

Example

the gemini model is "gemini-2.5-pro"

the visual reference is

What does it do?

Specifies a document or image that the LLM uses as a visual reference to improve extraction accuracy. The value refers to data defined earlier in the automation.

Where does it go?

This phrase should be indented beneath extract data from the source.

Is it required?

🌟 No — This phrase is optional in the syntax.

Does it require input data?

✅ Yes — A Text value should be specified.

Example

the visual reference is the file

the output format is

What does it do?

Specifies an output format for the response.

Where does it go?

This phrase should be indented beneath extract data from the source.

Is it required?

🌟 No — This phrase is optional in the syntax.

Does it require input data?

✅ Yes — A text value that is either "table" or "json".

Example

the output format is "table"
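
For comparison, here is a hedged sketch that requests JSON output instead. It uses only phrases documented above; the field names are illustrative, and the exact structure of the returned JSON is not specified on this page.

```text
extract data from the document
    the output format is "json"

    the first field is "invoice number"
    the first field's format is "number"

    the second field is "vendor name"
```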

Examples

1. Example 1

extract data from the file
    the first field is "invoice number"
    the first field's format is "number"
    the first field's rule is "keep just the first four digits"

    the second field is "invoice amounts"
    the second field's format is "string"
    the second field's rule is "keep just the amount without the currency"

    the third field is "invoice date"
    the third field's format is "date"

    the openai model is "gpt-4o-latest"
    the visual reference is the file

the data's invoice amounts
get the invoice number from the data

2. Example 2

extract data from the text
    the text is "Recently watched films: 'Inception', 'The Matrix', and 'The Godfather'"
    the output format is "list of texts"
    the first field is "movies"
    the first field's rule is "Select the movies that are sci-fi"

3. Example 3

extract data from the text
    the openai model is "gpt-4o-latest"
    the output format is "table"

    the first field is "wedding date"
    the first field's format is "date"
    the first field's rule is "format should be DD.MM.YY"
    the first field's default value is "21.05.22"

    the second field is "po number"
    the second field's format is "number"
    the second field's rule is "the po number has 4 digits"
    the second field's default value is 341
