# Extract Subdocuments

### Overview

This procedure extracts **multiple subdocuments** from a large document, either by identifying recurring content patterns (such as invoices or chapters) or by splitting the document into fixed-size chunks with optional overlap. Each subdocument becomes a separate document that can be processed independently, which makes the procedure well suited to batch processing of files that bundle several documents.

{% hint style="warning" %}
Make sure to add the **Document Processing Book** to your agent before using this automation procedure.
{% endhint %}

### Syntax

Below is a line-by-line overview of the automation syntax. Expand each line to learn more.

<details>

<summary><code>extract subdocuments from {the source}</code></summary>

#### What does it do?

Begins extraction of multiple subdocuments from the specified document.

#### Where does it go?

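This phrase should be written on a **new line**. When optional phrases are indented beneath it, end the line with `where`, as the examples on this page do.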

#### Is it required?

✅ Yes — This phrase is **required**.

#### Does it require data?

✅ Yes — Replace **the source** with a reference to a document or file.

</details>

<details>

<summary><code>the start page marker is "start-description"</code></summary>

#### What does it do?

Uses AI to identify pages that mark the beginning of new subdocuments.

#### Where does it go?

Indented under `extract subdocuments from {the source}`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **start-description** with a description of content that marks new sections.

#### Example

```
the start page marker is "The beginning of a new invoice"
```

</details>

<details>

<summary><code>the end page marker is "end-description"</code></summary>

#### What does it do?

Uses AI to find the last page of each subdocument from a content description. The matching page is **included** in the subdocument.

#### Where does it go?

Indented under `extract subdocuments from {the source}`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **end-description** with a description of the content that marks the end of a subdocument. Use it together with the start page marker to extract non-contiguous subdocuments.

#### Example

```
the end page marker is "Page containing invoice total"
```

</details>

<details>

<summary><code>the excluded end page marker is "excluded-end-description"</code></summary>

#### What does it do?

Uses AI to find where each subdocument ends from a content description. The matching page is **excluded**, so the subdocument ends on the page before it.

#### Where does it go?

Indented under `extract subdocuments from {the source}`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **excluded-end-description** with a description of the content that marks the end of a subdocument *(the matching page is not included)*. Use it together with the start page marker to extract non-contiguous subdocuments.

#### Example

```
the excluded end page marker is "Page containing Bill of Lading"
```

</details>

<details>

<summary><code>the subdocument size is n</code></summary>

#### What does it do?

Splits the document into fixed-size chunks of `n` pages. When used together with markers, it instead limits the length of each subdocument.

#### Where does it go?

Indented under `extract subdocuments from {the source}`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **n** with the number of pages per subdocument. The default is `10`.

#### Example

```
the subdocument size is 5
```

</details>

<details>

<summary><code>the subdocument overlap size is x</code></summary>

#### What does it do?

Specifies pages of overlap between consecutive subdocuments.

#### Where does it go?

Indented under `extract subdocuments from {the source}`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **x** with the number of overlapping pages. This phrase applies only when `the subdocument size is n` is used for fixed-size chunking. The default is `1`.

#### Example

```
the subdocument overlap size is 1
```

</details>

<details>

<summary><code>the openai model is "openai-model"</code></summary>

#### What does it do?

Specifies the OpenAI model to use for marker-based extraction.

#### Where does it go?

Indented under `extract subdocuments from {the source}`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **openai-model** with a valid [OpenAI model](https://docs.kognitos.com/llms#available-llm-models). The default is `gpt-4o`.

#### Example

```
the openai model is "gpt-4o"
```

</details>

<details>

<summary><code>the first field is "field-name"</code></summary>

#### What does it do?

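Specifies a field to extract from each subdocument so that the subdocument can be identified. Additional fields follow the same pattern (`the second field is "field-name"`, and so on).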

#### Where does it go?

Indented under `extract subdocuments from {the source}`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **field-name** with the name of a field to extract. This phrase is used together with markers.

#### Example

```
the first field is "invoice number"
```

</details>

<details>

<summary><code>the first field's format is "field-format"</code></summary>

#### What does it do?

Specifies the format of the first field.

#### Where does it go?

Indented under `extract subdocuments from {the source}`.

#### Is it required?

❌ No — This phrase is **optional**.

#### Does it require data?

✅ Yes — Replace **field-format** with "string", "number", or "date".

#### Example

```
the first field's format is "string"
```

</details>

### Examples

#### 1. Extract Non-Contiguous Invoice Subdocuments

Extracts only the invoices from a mixed document that contains both invoices and Bill of Lading (BOL) documents.

```
extract subdocuments from the document where
    the start page marker is "The beginning of a new invoice"
    the excluded end page marker is "Page containing Bill of Lading or the beginning of a new invoice"
```

#### 2. Extract Invoice Subdocuments with Field Extraction

Splits a batch invoice file into individual invoices and extracts key fields.

```
extract subdocuments from the document where
    the start page marker is "The beginning of a new invoice"
    the first field is "invoice number"
    the first field's format is "string"
    the second field is "invoice date"
    the second field's format is "string"
```

#### 3. Extract Invoices with Inclusive End Marker

Extracts each invoice from its header page through the page containing the invoice total. The total page is included in the subdocument.

```
extract subdocuments from the document where
    the start page marker is "Page containing invoice header"
    the end page marker is "Page containing invoice total"
```
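
To make the difference between the inclusive and excluded end markers concrete, the following Python sketch shows one way page ranges could be derived once each page has been classified. It is an illustration only, not Kognitos code and not the procedure's actual implementation. The function name `split_by_markers`, the boolean per-page inputs (standing in for the AI classification), and the rule that a subdocument with no end marker runs until the page before the next start page are all assumptions.

```python
from typing import List, Optional, Tuple


def split_by_markers(
    is_start: List[bool],
    is_end: Optional[List[bool]] = None,
    end_inclusive: bool = True,
) -> List[Tuple[int, int]]:
    """Return 1-indexed, inclusive (first_page, last_page) ranges.

    Hypothetical model: is_start[i] / is_end[i] say whether page i + 1
    matched the start / end marker description.
    """
    total_pages = len(is_start)
    ranges: List[Tuple[int, int]] = []
    current: Optional[int] = None  # first page of the open subdocument

    for page in range(1, total_pages + 1):
        start_hit = is_start[page - 1]
        end_hit = is_end[page - 1] if is_end is not None else False

        # A new start page, or an excluded end page, closes the open
        # subdocument on the previous page.
        if current is not None and (start_hit or (end_hit and not end_inclusive)):
            ranges.append((current, page - 1))
            current = None

        if current is None and start_hit:
            current = page

        # An inclusive end page closes the open subdocument on this page.
        if current is not None and end_hit and end_inclusive:
            ranges.append((current, page))
            current = None

    # Assumption: an unclosed subdocument runs to the end of the document.
    if current is not None:
        ranges.append((current, total_pages))
    return ranges


# Inclusive end marker: header on pages 1 and 4, total on pages 3 and 5.
inclusive = split_by_markers(
    [True, False, False, True, False],
    [False, False, True, False, True],
    end_inclusive=True,
)

# Excluded end marker: invoices start on pages 1, 5, 6; pages 3, 4, 7 are BOLs.
excluded = split_by_markers(
    [True, False, False, False, True, True, False],
    [True, False, True, True, True, True, True],
    end_inclusive=False,
)
```

Under these assumptions, the inclusive case keeps the total page in each invoice, while the excluded case drops the Bill of Lading pages and closes an invoice on the page before the next one begins.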

#### 4. Extract Fixed-Size Chunks with Overlap

Splits a large report into 5-page chunks with 1-page overlap.

```
extract subdocuments from the report where
    the subdocument size is 5
    the subdocument overlap size is 1
```
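
As a rough illustration of how a subdocument size and overlap size could translate into page ranges, here is a minimal Python sketch. It is not Kognitos code and not the procedure's actual implementation. The function name `chunk_page_ranges` is hypothetical, and the sketch assumes that each chunk begins `overlap` pages before the previous chunk ends.

```python
from typing import List, Tuple


def chunk_page_ranges(total_pages: int, size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return 1-indexed, inclusive (first_page, last_page) ranges.

    Assumption: consecutive chunks share exactly `overlap` pages.
    """
    if size <= 0:
        raise ValueError("size must be a positive number of pages")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")

    step = size - overlap  # how far each new chunk advances
    ranges: List[Tuple[int, int]] = []
    start = 1
    while start <= total_pages:
        end = min(start + size - 1, total_pages)
        ranges.append((start, end))
        if end == total_pages:
            break  # the last page is covered; stop before emitting a redundant chunk
        start += step
    return ranges


with_overlap = chunk_page_ranges(total_pages=12, size=5, overlap=1)
without_overlap = chunk_page_ranges(total_pages=12, size=5, overlap=0)
```

Under these assumptions, a 12-page report with a size of 5 and an overlap of 1 yields pages 1-5, 5-9, and 9-12, while an overlap of 0 yields pages 1-5, 6-10, and 11-12.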

#### 5. Extract Fixed-Size Chunks without Overlap

Splits a document into 5-page chunks with no overlap.

```
extract subdocuments from the document where
    the subdocument size is 5
    the subdocument overlap size is 0
```

#### 6. Extract Chapter-Based Subdocuments

Splits a document by identifying chapter beginnings.

```
extract subdocuments from the document where
    the start page marker is "Page containing the text 'Chapter'"
    the openai model is "gpt-4o"
```
