# Extract Subdocuments

### Overview

This procedure extracts **multiple subdocuments** from a large document, either by identifying recurring content patterns (such as invoices or chapters) or by splitting the document into fixed-size chunks with optional overlap. Each subdocument becomes a separate document that can be processed independently, which makes the procedure well suited to batch processing of files that contain many documents.

{% hint style="warning" %}
Make sure to add the **Document Processing Book** to your agent before using this automation procedure.
{% endhint %}

### Syntax

Below is a line-by-line overview of the automation syntax. Expand each line to learn more.

<details>

<summary><code>extract subdocuments from {the source}</code></summary>

#### What does it do?

Begins extraction of multiple subdocuments from the specified document.

#### Where does it go?

This phrase should be written on a **new line**.

#### Is it required?

✅ Yes — This phrase is **required**.

#### Does it require data?

✅ Yes — Replace **the source** with a reference to a document or file.

</details>

<details>

<summary><code>the start page marker is "start-description"</code></summary>

#### What does it do?

Uses AI to identify pages that mark the beginning of new subdocuments.

#### Where does it go?

Indented under `extract subdocuments from {the source}`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **start-description** with a description of the content that marks the start of a new subdocument.

#### Example

```
the start page marker is "The beginning of a new invoice"
```

</details>

<details>

<summary><code>the end page marker is "end-description"</code></summary>

#### What does it do?

Uses AI to find the ending page for each subdocument based on content description *(inclusive)*.

#### Where does it go?

Indented under `extract subdocuments from {the source}`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **end-description** with a description of the content that marks the end of a subdocument. Use it together with the start page marker to extract non-contiguous subdocuments.

#### Example

```
the end page marker is "Page containing invoice total"
```

</details>

<details>

<summary><code>the excluded end page marker is "excluded-end-description"</code></summary>

#### What does it do?

Uses AI to find the ending page for each subdocument based on content description *(exclusive)*.

#### Where does it go?

Indented under `extract subdocuments from {the source}`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **excluded-end-description** with a description of the content that marks the end of a subdocument *(the matching page is not included)*. Use it together with the start page marker to extract non-contiguous subdocuments.

#### Example

```
the excluded end page marker is "Page containing Bill of Lading"
```

</details>

<details>

<summary><code>the subdocument size is n</code></summary>

#### What does it do?

Splits the document into fixed-size chunks, or limits the length of each subdocument when used with markers.

#### Where does it go?

Indented under `extract subdocuments from {the source}`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **n** with the number of pages per subdocument. The default is `10`.

#### Example

```
the subdocument size is 5
```

</details>

<details>

<summary><code>the subdocument overlap size is x</code></summary>

#### What does it do?

Specifies the number of pages shared between consecutive subdocuments.

#### Where does it go?

Indented under `extract subdocuments from {the source}`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **x** with the number of overlapping pages. This setting applies only when the subdocument size is used for fixed-size chunking. The default is `1`.

#### Example

```
the subdocument overlap size is 1
```

</details>

<details>

<summary><code>the openai model is "openai-model"</code></summary>

#### What does it do?

Specifies the OpenAI model to use for marker-based extraction.

#### Where does it go?

Indented under `extract subdocuments from {the source}`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **openai-model** with a valid [OpenAI model](https://docs.kognitos.com/llms#available-llm-models). The default is `gpt-4o`.

#### Example

```
the openai model is "gpt-4o"
```

</details>

<details>

<summary><code>the first field is "field-name"</code></summary>

#### What does it do?

Specifies a field to extract from each subdocument for identification. Additional fields follow the same pattern, for example `the second field is "invoice date"`.

#### Where does it go?

Indented under `extract subdocuments from {the source}`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **field-name** with the name of a field to extract. Use it with marker-based extraction.

#### Example

```
the first field is "invoice number"
```

</details>

<details>

<summary><code>the first field's format is "field-format"</code></summary>

#### What does it do?

Specifies the format of the first field.

#### Where does it go?

Indented under `extract subdocuments from {the source}`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **field-format** with "string", "number", or "date".

#### Example

```
the first field's format is "string"
```

</details>

### Examples

#### 1. Extract Non-Contiguous Invoice Subdocuments

Extracts only the invoices from a mixed document that contains both invoices and Bill of Lading (BOL) documents.

```
extract subdocuments from the document where
    the start page marker is "The beginning of a new invoice"
    the excluded end page marker is "Page containing Bill of Lading or the beginning of a new invoice"
```

#### 2. Extract Invoice Subdocuments with Field Extraction

Splits a batch invoice file into individual invoices and extracts key fields.

```
extract subdocuments from the document where
    the start page marker is "The beginning of a new invoice"
    the first field is "invoice number"
    the first field's format is "string"
    the second field is "invoice date"
    the second field's format is "string"
```

#### 3. Extract Invoices with Inclusive End Marker

Extracts each invoice from its start page through the page containing the invoice total (inclusive).

```
extract subdocuments from the document where
    the start page marker is "Page containing invoice header"
    the end page marker is "Page containing invoice total"
```
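
The difference between `the end page marker` (inclusive) and `the excluded end page marker` (exclusive) can be illustrated with a small Python sketch. This is not how the procedure is implemented: the AI-based page matching is replaced here by hand-picked page numbers, the function name `marker_page_ranges` is hypothetical, and the rule that a subdocument without a matching end runs to the last page is an assumption made only for this illustration.

```python
def marker_page_ranges(start_pages, end_pages, total_pages, inclusive=True):
    """Pair each start page with the first end-marker page that follows it.

    Pages are 1-based. With inclusive=True the end-marker page is kept
    (like `the end page marker`); with inclusive=False the subdocument stops
    on the page before it (like `the excluded end page marker`).
    """
    ranges = []
    for start in sorted(start_pages):
        # An excluded end page must come after the start page; otherwise
        # the subdocument would contain no pages at all.
        candidates = [
            page for page in sorted(end_pages)
            if (page >= start if inclusive else page > start)
        ]
        if candidates:
            end = candidates[0] if inclusive else candidates[0] - 1
        else:
            end = total_pages  # assumption: run to the end of the document
        ranges.append((start, end))
    return ranges


# Hypothetical 9-page file: invoices start on pages 1 and 6,
# and the end description matches pages 3 and 8.
print(marker_page_ranges([1, 6], [3, 8], 9, inclusive=True))   # [(1, 3), (6, 8)]
print(marker_page_ranges([1, 6], [3, 8], 9, inclusive=False))  # [(1, 2), (6, 7)]
```

In both runs pages 4, 5, and 9 belong to no subdocument, which is what makes the result non-contiguous.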

#### 4. Extract Fixed-Size Chunks with Overlap

Splits a large report into 5-page chunks with a 1-page overlap.

```
extract subdocuments from the report where
    the subdocument size is 5
    the subdocument overlap size is 1
```
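
To make the page arithmetic concrete, the following Python sketch lists the page ranges that a split like this could produce. It assumes that pages are numbered from 1 and that each chunk starts `size - overlap` pages after the previous one. The function name `chunk_page_ranges` is hypothetical and this is not the procedure's actual implementation, so real chunk boundaries may differ.

```python
def chunk_page_ranges(total_pages, size, overlap):
    """Return (first_page, last_page) tuples, 1-based and inclusive.

    Assumption: consecutive chunks share `overlap` pages, so each chunk
    starts `size - overlap` pages after the previous one.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")

    step = size - overlap
    ranges = []
    first = 1
    while first <= total_pages:
        last = min(first + size - 1, total_pages)
        ranges.append((first, last))
        if last == total_pages:
            break  # the final chunk reached the end of the document
        first += step
    return ranges


# A 12-page report split into 5-page chunks with a 1-page overlap
print(chunk_page_ranges(12, 5, 1))  # [(1, 5), (5, 9), (9, 12)]
```

Under these assumptions, pages 5 and 9 each appear in two chunks, and the last chunk is shorter because the report ends on page 12.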

#### 5. Extract Fixed-Size Chunks without Overlap

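Splits a document into 5-page chunks with no overlap. The overlap size is set to `0` explicitly because the default overlap is `1`.

```
extract subdocuments from the document where
    the subdocument size is 5
    the subdocument overlap size is 0
```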

#### 6. Extract Chapter-Based Subdocuments

Splits a document by identifying chapter beginnings.

```
extract subdocuments from the document where
    the start page marker is "Page containing the text 'Chapter'"
    the openai model is "gpt-4o"
```


