# Extract Data

### Overview

This procedure extracts data from text, images, documents, and other files. It uses large language models (LLMs) to identify and retrieve the text content, making the information in those sources easy to access and work with.

{% hint style="warning" %}
Before using this procedure, ensure you have added the **Document Processing** Book to your agent. After learning the Book, make sure to create a new Playground for it to take effect.
{% endhint %}

{% hint style="info" %}
This procedure can be used to operate on **PDFs**, **documents**, **images**, **text** or **files**. It is *not* currently supported for `.csv` files.
{% endhint %}

### Syntax

Below is a line-by-line overview of the automation syntax. Expand each line to learn more.

<details>

<summary><code>extract data from the source</code></summary>

#### What does it do?

Instructs the system to begin data extraction from the specified source.

#### Where does it go?

This phrase should be written on a **new line**.

#### Is it required?

✅ Yes — This phrase is **required**.

#### Does it require data?

✅ Yes — Replace **the source** with a reference to a data source in your automation (for example, `the file`, `the document`, or `the text`).

#### Example

```
extract data from the document
```

</details>

<details>

<summary><code>the {position} field is "{name}"</code></summary>

#### What does it do?

Specifies the name of a field to be extracted from the source.

#### Where does it go?

Indented under `extract data from the source`.

#### Is it required?

✅ Yes — This phrase is **required**.

#### Does it require data?

✅ Yes — The **position** should be a word like "first", "second", "third", etc. to indicate the order of the field to extract from the source. The **name** should be a **text** value that identifies the field.

#### Example

```
the first field is "invoice number"
```

</details>

<details>

<summary><code>the {position} field's rule is "{rule}"</code></summary>

#### What does it do?

Specifies the rule to be followed for the field extraction.

#### Where does it go?

Indented under `extract data from the source`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — The **position** should be a word like "first", "second", "third", etc. to indicate the order of the field to extract from the source. The **rule** should be a **text** value that specifies a rule for the field extraction.

#### Example

```
the first field's rule is "keep just the first four digits"
```

</details>

<details>

<summary><code>the {position} field's format is "{format}"</code></summary>

#### What does it do?

Specifies the format of the field to be extracted from the source.

#### Where does it go?

Indented under `extract data from the source`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — The **position** should be a word like "first", "second", "third", etc. to indicate the order of the field to extract from the source. The **format** should be one of the following values: `number`, `string`, or `date`. The default format is `string`.

#### Example

```
the first field's format is "number"
```

</details>

<details>

<summary><code>the {position} field's default value is {default}</code></summary>

#### What does it do?

Specifies the field's default value.

#### Where does it go?

Indented under `extract data from the source`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — The **position** should be a word like "first", "second", "third", etc. to indicate the order of the field to extract from the source. The **default** should be the default value itself.

#### Example

```
the first field's default value is 100
```

</details>

<details>

<summary><code>the common default value is x</code></summary>

#### What does it do?

Specifies the global default value, used for any field that has no default of its own. A field's own default value takes precedence over the common default.

#### Where does it go?

Indented under `extract data from the source`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **x** with any value to use as the common default.

#### Example

```
the common default value is "completed"
```
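
The precedence described above can be sketched in Python. This is only an illustration, not the platform's implementation: it assumes a default is used only when nothing was extracted for a field, and `resolve_value` and `_UNSET` are hypothetical names introduced here. The sample data mirrors Example 3 on this page.

```python
# Illustrative sketch only: models the documented precedence
# (extracted value > field default > common default).
# Sentinel meaning "no default configured" (hypothetical helper).
_UNSET = object()


def resolve_value(extracted, field_default=_UNSET, common_default=_UNSET):
    """Return the value reported for one field."""
    if extracted is not None:
        return extracted          # a value found in the source always wins
    if field_default is not _UNSET:
        return field_default      # the field's own default overrides the common one
    if common_default is not _UNSET:
        return common_default     # fall back to the common default
    return None                   # nothing extracted and no default configured


# Same setup as Example 3: only the invoice date appears in the text.
extracted = {"invoice date": "21 jan 2023"}
field_defaults = {"invoice number": 1234, "invoice location": "San Jose"}
common_default = 5678
fields = ["invoice number", "invoice date", "invoice location", "invoice amount"]

results = {
    name: resolve_value(
        extracted.get(name),
        field_defaults.get(name, _UNSET),
        common_default,
    )
    for name in fields
}
print(results)
```

Here `invoice amount` has no default of its own, so it is the only field that receives the common default.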

</details>

<details>

<summary><code>the openai model is "openai-model"</code></summary>

#### What does it do?

Specifies the name of the OpenAI model to use to generate the response.

#### Where does it go?

Indented under `extract data from the source`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **openai-model** with a **text** value containing the model's [name](https://docs.kognitos.com/llms#available-llm-models). The default is `gpt-4o-latest`.

#### Example

```
the openai model is "gpt-4.1"
```

</details>

<details>

<summary><code>the gemini model is "gemini-model"</code></summary>

#### What does it do?

Specifies the name of the Gemini model to use to generate the response.

#### Where does it go?

Indented under `extract data from the source`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **gemini-model** with a **text** value containing the model's [name](https://docs.kognitos.com/llms#available-llm-models). The default model is `gemini-2.5-pro` *(US region only)*.

#### Example

```
the gemini model is "gemini-2.0-flash"
```

</details>

<details>

<summary><code>the input schema is x</code></summary>

#### What does it do?

Specifies a JSON or YAML schema that defines the fields to extract, their formats, rules, and default values. This allows programmatic field definition instead of using individual field syntax.

#### Where does it go?

Indented under `extract data from the source`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **x** with a stringified JSON, YAML string, or S3 URL pointing to a JSON file.

**JSON format**: `{"fields": [{"name": "field_name", "format": "string|number|date", "rule": "extraction rule", "default": "default_value"}], "common_default": "value"}`

**S3 URL format**: `s3://bucket-name/path/to/schema.json`

#### Example

```
the input schema is {"fields": [{"name": "invoice_number", "format": "string"}]}
```
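
If the schema is produced by a script, a stringified JSON value in the shape shown above can be generated with Python's standard `json` module. The following is a minimal sketch under stated assumptions: `build_input_schema` is a hypothetical helper, and the format check is a client-side convenience based on the three documented formats, not a description of how the platform validates schemas.

```python
import json

# The three formats listed in this documentation.
ALLOWED_FORMATS = {"string", "number", "date"}


def build_input_schema(fields, common_default=None):
    """Return a stringified JSON schema (hypothetical helper, not a platform API)."""
    cleaned = []
    for field in fields:
        if "name" not in field:
            raise ValueError("every field needs a 'name'")
        # Write the format explicitly; "string" is the documented default format.
        fmt = field.get("format", "string")
        if fmt not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {fmt!r}")
        entry = {"name": field["name"], "format": fmt}
        # "rule" and "default" are copied only when provided.
        for key in ("rule", "default"):
            if key in field:
                entry[key] = field[key]
        cleaned.append(entry)

    schema = {"fields": cleaned}
    if common_default is not None:
        schema["common_default"] = common_default
    return json.dumps(schema)


schema_text = build_input_schema(
    [
        {"name": "invoice_number", "rule": "extract the invoice number including the prefix"},
        {"name": "amount", "format": "number"},
        {"name": "shipping_address", "default": "Not specified"},
    ],
    common_default="N/A",
)

# The resulting string can be pasted after "the input schema is".
print(f"the input schema is {schema_text}")
```

Rejecting unknown formats before the automation runs keeps typos such as `"integer"` from reaching the procedure.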

</details>

<details>

<summary><code>the visual reference is x</code></summary>

#### What does it do?

Specifies a document or image to guide the LLM as a visual reference, improving accuracy when extracting data. This is useful in multi-document scenarios or when the reference differs from the source text.

#### Where does it go?

Indented under `extract data from the source`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **x** with a reference to a document, image, or file.

#### Example

```
the visual reference is the file
```

</details>

<details>

<summary><code>the output format is "output-format"</code></summary>

#### What does it do?

Specifies an output format for the response.

#### Where does it go?

Indented under `extract data from the source`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **output-format** with a **text** value that is either `table` or `json`.

#### Example

```
the output format is "table"
```

</details>

<details>

<summary><code>the dpi is x</code></summary>

#### What does it do?

Specifies the DPI (dots per inch).

#### Where does it go?

Indented under `extract data from the source`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **x** with a numeric value representing the DPI. The default is `100`.

#### Example

```
the dpi is 144
```

</details>

<details>

<summary><code>the extraction mode is "extraction-mode"</code></summary>

#### What does it do?

Specifies an extraction mode for the procedure.

#### Where does it go?

Indented under `extract data from the source`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **extraction-mode** with one of the following options:

* `"no ocr"`: Send the document image directly to the LLM without OCR text. Faster and usually just as accurate.
* `"file_upload"`: Upload the PDF directly to the LLM provider's Files API. Best for very large documents (100+ pages) as it avoids token limits from base64 encoding. Supports both OpenAI (`gpt-4o` models) and Gemini (`gemini-2.0-flash`, `gemini-2.5-pro`) models.

#### Example

```
the extraction mode is "no ocr"
```

</details>

### Examples

#### 1. Extract Multiple Fields from a Document

```
extract data from the document
    the dpi is 144
    the openai model is "gpt-4o-latest"
    the first field is "po number"
    the first field's format is "string"
    the first field's rule is "the po number has 10 characters"

    the second field is "due date"
    the second field's format is "date"
    the second field's rule is "format should be DD/MM/YY"
```
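
When the field list lives in another system, the positional lines above can be generated rather than typed by hand. The sketch below is one possible approach in Python, not a platform feature: `render_automation` and `ORDINALS` are hypothetical names, and ordinals beyond "fourth" are an assumption, since this page only shows "first" through "fourth". Field names and rules are not escaped, so values containing double quotes would need extra handling.

```python
# Ordinal words used as the {position} value; entries past "fourth"
# are assumed to work the same way and are not shown in this page.
ORDINALS = [
    "first", "second", "third", "fourth", "fifth",
    "sixth", "seventh", "eighth", "ninth", "tenth",
]


def render_automation(source, fields, indent="    "):
    """Build the automation text for a list of field definitions."""
    if len(fields) > len(ORDINALS):
        raise ValueError(f"this sketch supports at most {len(ORDINALS)} fields")

    lines = [f"extract data from {source}"]
    for position, field in zip(ORDINALS, fields):
        name = field["name"]
        lines.append(f'{indent}the {position} field is "{name}"')
        # Format and rule are optional phrases, so emit them only when given.
        if "format" in field:
            fmt = field["format"]
            lines.append(f"{indent}the {position} field's format is \"{fmt}\"")
        if "rule" in field:
            rule = field["rule"]
            lines.append(f"{indent}the {position} field's rule is \"{rule}\"")
    return "\n".join(lines)


automation = render_automation(
    "the document",
    [
        {"name": "po number", "format": "string", "rule": "the po number has 10 characters"},
        {"name": "due date", "format": "date", "rule": "format should be DD/MM/YY"},
    ],
)
print(automation)
```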

#### 2. Extract Data from Text

{% tabs %}
{% tab title="Automation" %}

```
the text is "The amount for the invoice number 123456 is Rs.1000."
extract data from the text
    the first field is "invoice numbers"
    the first field's format is "number"

    the second field is "invoice amount"
    the second field's format is "string"
    the second field's rule is "keep just the amount without the currency"
```

{% endtab %}

{% tab title="Results" %}
**the data's invoice numbers**: 123456
{% endtab %}
{% endtabs %}

#### 3. Extract Multiple Fields from Text (Using Default Values)

{% tabs %}
{% tab title="Automation" %}

```
the text is "The invoice date is 21 jan 2023"
extract data from the text
    the gemini model is "gemini-2.0-flash"
    the common default value is 5678

    the first field is "invoice number"
    the first field's default value is 1234

    the second field is "invoice date"

    the third field is "invoice location"
    the third field's default value is "San Jose"

    the fourth field is "invoice amount"
```

{% endtab %}

{% tab title="Results" %}
* **invoice amount**: 5678
* **invoice date**: 21 jan 2023
* **invoice location**: San Jose
* **invoice number**: 1234
{% endtab %}
{% endtabs %}

#### 4. Extract Data Using JSON Input Schema

{% tabs %}
{% tab title="Automation" %}

```
the text is "Invoice #INV-12345 dated 2023-01-15 for $1,500.00 shipped to 123 Main St, San Jose, CA"
the input schema is {
    "fields": [
        {
            "name": "invoice_number",
            "format": "string",
            "rule": "extract the invoice number including the prefix"
        },
        {
            "name": "invoice_date",
            "format": "date",
            "rule": "format as YYYY-MM-DD"
        },
        {
            "name": "amount",
            "format": "number",
            "rule": "extract just the numeric value without currency symbol"
        },
        {
            "name": "shipping_address",
            "format": "string",
            "default": "Not specified"
        }
    ],
    "common_default": "N/A"
}
extract data from the text
```

{% endtab %}

{% tab title="Results" %}
* **invoice\_number**: INV-12345
* **invoice\_date**: 2023-01-15
* **amount**: 1500.00
* **shipping\_address**: 123 Main St, San Jose, CA
{% endtab %}
{% endtabs %}

#### 5. Extract Data Using S3 URL Input Schema

{% tabs %}
{% tab title="Automation" %}

```
the text is "Order #ORD-98765 placed on 2023-12-01 for $2,750.50"
the input schema is s3://my-schemas-bucket/extract-schemas/order-schema.json
extract data from the text
```

{% endtab %}

{% tab title="Results" %}
* **order\_number**: ORD-98765
* **order\_date**: 2023-12-01
* **total\_amount**: 2750.50
{% endtab %}
{% endtabs %}

#### 6. Extract Data from Large Document Using File Upload Mode

{% tabs %}
{% tab title="Automation" %}

```
extract data from the document
    the extraction mode is "file_upload"
    the openai model is "gpt-4o-latest"
    the first field is "invoice number"
    the first field's format is "string"

    the second field is "total amount"
    the second field's format is "number"
```

{% endtab %}

{% tab title="Results" %}
* **invoice number**: INV-123456
* **total amount**: 5000.00
{% endtab %}
{% endtabs %}
