# Extract Data from Multiple Documents

### Overview

This procedure extracts data from multiple documents at once. It uploads all of the documents to the LLM provider through the provider's Files API (OpenAI or Gemini) and then makes a single API call. This is more efficient than processing documents individually and is particularly useful for batch processing of large document sets.

{% hint style="warning" %}
Before using this procedure, ensure you have added the **Document Processing** Book to your agent. After learning the Book, make sure to create a new Playground for it to take effect.
{% endhint %}

{% hint style="info" %}
This procedure uploads PDF files directly to the LLM provider without OCR processing, so scanned documents are not supported. All documents are processed in a single LLM call for better efficiency and consistency.
{% endhint %}

### Syntax

Below is a line-by-line overview of the automation syntax. Expand each line to learn more.

<details>

<summary><code>extract data from the documents</code></summary>

#### What does it do?

Instructs the system to begin data extraction from multiple documents.

#### Where does it go?

This phrase should be written on a **new line**.

#### Is it required?

✅ Yes — This phrase is **required**.

#### Does it require data?

✅ Yes — The documents must be available as a list or collection in your automation.

#### Example

```
extract data from the documents
```

</details>

<details>

<summary><code>the {position} field is "{name}"</code></summary>

#### What does it do?

Specifies the name of a field to be extracted from the documents.

#### Where does it go?

Indented under `extract data from the documents`.

#### Is it required?

✅ Yes — This phrase is **required**.

#### Does it require data?

✅ Yes — The **position** should be a word like "first", "second", "third", etc. to indicate the order of the field to extract from the documents. The **name** should be a **text** value that identifies the field.

#### Example

```
the first field is "invoice number"
```

</details>

<details>

<summary><code>the {position} field's rule is "{rule}"</code></summary>

#### What does it do?

Specifies a rule to follow when extracting the field.

#### Where does it go?

Indented under `extract data from the documents`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — The **position** should be a word like "first", "second", "third", etc. to indicate the order of the field to extract from the documents. The **rule** should be a **text** value that specifies a rule for the field extraction.

#### Example

```
the first field's rule is "keep just the first four digits"
```

</details>

<details>

<summary><code>the {position} field's format is "{format}"</code></summary>

#### What does it do?

Specifies the format of the field to be extracted.

#### Where does it go?

Indented under `extract data from the documents`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — The **position** should be a word like "first", "second", "third", etc. to indicate the order of the field to extract from the documents. The **format** should be one of the following values: `number`, `string`, or `date`. The default format is `string`.

#### Example

```
the first field's format is "number"
```

</details>

<details>

<summary><code>the {position} field's default is "{default}"</code></summary>

#### What does it do?

Specifies the field's default value.

#### Where does it go?

Indented under `extract data from the documents`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — The **position** should be a word like "first", "second", "third", etc. to indicate the order of the field to extract from the documents. The **default** should be the default value itself.

#### Example

```
the first field's default value is 100
```

</details>

<details>

<summary><code>the common default value is x</code></summary>

#### What does it do?

Specifies a global default value that applies to every field. A field's own default value, when set, takes precedence over the common default.

#### Where does it go?

Indented under `extract data from the documents`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **x** with any value to use as the common default.

#### Example

```
the common default value is "completed"
```
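
As a rough illustration of the precedence described above, the following Python sketch models how a fallback value might be chosen for a field. It is not Kognitos code. The function `resolve_default` and the `_UNSET` sentinel are hypothetical names, and returning `None` when no default is configured at all is an assumption.

```python
from typing import Any

# Hypothetical sentinel meaning "no default was configured" (illustration only).
_UNSET = object()


def resolve_default(field_default: Any = _UNSET, common_default: Any = _UNSET) -> Any:
    """Pick the fallback value for a field that could not be extracted.

    Assumed precedence, following the description on this page:
    1. the field's own default, if one was given;
    2. otherwise the common default, if one was given;
    3. otherwise None (this last case is an assumption).
    """
    if field_default is not _UNSET:
        return field_default
    if common_default is not _UNSET:
        return common_default
    return None


# A field-level default wins over the common default.
print(resolve_default(field_default=100, common_default="completed"))
# Without a field-level default, the common default is used.
print(resolve_default(common_default="completed"))
```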

</details>

<details>

<summary><code>the openai model is "openai-model"</code></summary>

#### What does it do?

Specifies the name of the OpenAI model to use to generate the response.

#### Where does it go?

Indented under `extract data from the documents`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **openai-model** with a **text** value containing the model's [name](https://docs.kognitos.com/llms#available-llm-models). The default is `gpt-4o-latest`.

#### Example

```
the openai model is "gpt-4o"
```

</details>

<details>

<summary><code>the gemini model is "gemini-model"</code></summary>

#### What does it do?

Specifies the name of the Gemini model to use to generate the response.

#### Where does it go?

Indented under `extract data from the documents`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **gemini-model** with a **text** value containing the model's [name](https://docs.kognitos.com/llms#available-llm-models). The default model is `gemini-2.5-pro` *(US region only)*.

#### Example

```
the gemini model is "gemini-2.0-flash"
```

</details>

<details>

<summary><code>the input schema is x</code></summary>

#### What does it do?

Specifies a JSON or YAML schema that defines the fields to extract, their formats, rules, and default values. This allows programmatic field definition instead of using individual field syntax.

#### Where does it go?

Indented under `extract data from the documents`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **x** with a stringified JSON, a YAML string, or an S3 URL pointing to a JSON file.

JSON format: `{"fields": [{"name": "field_name", "format": "string|number|date", "rule": "extraction rule", "default": "default_value"}], "common_default": "value"}`

S3 URL format: `s3://bucket-name/path/to/schema.json`

#### Example

```
the input schema is {"fields": [{"name": "invoice_number", "format": "string"}]}
```
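
When the field list is long or generated by another system, the schema string can be produced by a script rather than typed by hand. The following Python sketch builds a string in the JSON shape described above. The helper `build_input_schema` and the sample field names are hypothetical. Treating `name` as the only required key is also an assumption, as is filling in `string` when no format is given (this mirrors the documented default format).

```python
import json

# Formats listed on this page for a field.
ALLOWED_FORMATS = {"string", "number", "date"}


def build_input_schema(fields, common_default=None):
    """Return a stringified JSON schema in the documented shape.

    `fields` is a list of dicts with a "name" key and optional
    "format", "rule", and "default" keys (hypothetical helper).
    """
    schema_fields = []
    for spec in fields:
        if not spec.get("name"):
            raise ValueError("every field needs a non-empty 'name'")
        fmt = spec.get("format", "string")  # documented default format
        if fmt not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {fmt!r}")
        entry = {"name": spec["name"], "format": fmt}
        # Copy the optional keys only when they were provided.
        for key in ("rule", "default"):
            if key in spec:
                entry[key] = spec[key]
        schema_fields.append(entry)

    schema = {"fields": schema_fields}
    if common_default is not None:
        schema["common_default"] = common_default
    return json.dumps(schema)


schema_text = build_input_schema(
    [
        {"name": "invoice_number"},
        {"name": "total_amount", "format": "number", "default": 0},
    ],
    common_default="N/A",
)
print(schema_text)
```

The printed string can be pasted after `the input schema is`, or saved to a `.json` file and referenced by its S3 URL.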

</details>

<details>

<summary><code>the output format is "output-format"</code></summary>

#### What does it do?

Specifies an output format for the response.

#### Where does it go?

Indented under `extract data from the documents`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **output-format** with a text value that is either `table` or `json`.

#### Example

```
the output format is "table"
```

</details>

### Examples

#### 1. Extract Multiple Fields from Multiple Documents

{% tabs %}
{% tab title="Automation" %}

```
the documents are the list of files
extract data from the documents
    the openai model is "gpt-4o"
    the first field is "invoice number"
    the first field's format is "string"

    the second field is "invoice date"
    the second field's format is "date"

    the third field is "total amount"
    the third field's format is "number"
```

{% endtab %}

{% tab title="Results" %}
Returns a list of JSON objects, one per document, each containing the extracted fields.
{% endtab %}
{% endtabs %}
