Extract Multiple Subdocuments

Extracts multiple subdocuments from a document using markers or fixed-size chunking.

Overview

This procedure extracts multiple subdocuments from a large document by either identifying recurring content patterns (like invoices or chapters) or by splitting the document into fixed-size chunks with optional overlap. Each subdocument becomes a separate document that can be processed independently, making it ideal for batch processing of multi-document files.

Syntax

Below is a line-by-line overview of the automation syntax.

extract subdocuments from {the source}

What does it do?

Begins extraction of multiple subdocuments from the specified document.

Where does it go?

This phrase should be written on a new line.

Is it required?

✅ Yes — This phrase is required.

Does it require data?

✅ Yes — Replace the source with a reference to a document or file.

the start page marker is "start-description"

What does it do?

Uses AI to identify pages that mark the beginning of new subdocuments.

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace start-description with a description of content that marks new sections.

Example

the start page marker is "The beginning of a new invoice"

the subdocument size is n

What does it do?

Splits the document into fixed-size chunks.

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace n with the number of pages per subdocument. The default is 10.

Example

the subdocument size is 5

the subdocument overlap size is x

What does it do?

Specifies pages of overlap between consecutive subdocuments.

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace x with the number of overlapping pages. Only used with subdocument size. The default is 1.

Example

the subdocument overlap size is 1

the openai model is "openai-model"

What does it do?

Specifies the OpenAI model to use for marker-based extraction.

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace openai-model with a valid OpenAI model. The default is gpt-4o.

Example

the openai model is "gpt-4o"

the first field is "field-name"

What does it do?

Specifies the first field to extract from each subdocument for identification. Additional fields follow the same pattern (for example, the second field).

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace field-name with the name of a field to extract. Used with markers.

Example

the first field is "invoice number"

the first field's format is "field-format"

What does it do?

Specifies the format of the first field.

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace field-format with "string", "number", or "date".

Example

the first field's format is "string"

Examples

1. Extract Invoice Subdocuments with Field Extraction

Splits a batch invoice file into individual invoices and extracts key fields.

extract subdocuments from the document where
    the start page marker is "The beginning of a new invoice"
    the first field is "invoice number"
    the first field's format is "string"
    the second field is "invoice date"
    the second field's format is "string"
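
The grouping that marker-based extraction implies can be sketched in Python. This is only an illustration, not the procedure's actual implementation: the function name marker_page_ranges is hypothetical, and it assumes each subdocument runs from a detected start page up to the page before the next start page, with pages ahead of the first marker left unassigned.

```python
def marker_page_ranges(start_pages, total_pages):
    """Return 1-based (first_page, last_page) ranges, one per start page.

    Assumption: a subdocument spans from its start page to the page
    before the next start page; the last one runs to the end of the file.
    """
    starts = sorted(set(start_pages))
    ranges = []
    for i, first in enumerate(starts):
        # Use the next start page as the boundary, or the final page
        # of the file when there is no next start page.
        last = starts[i + 1] - 1 if i + 1 < len(starts) else total_pages
        ranges.append((first, last))
    return ranges


# Hypothetical 9-page batch file with invoices starting on pages 1, 4 and 6.
print(marker_page_ranges([1, 4, 6], total_pages=9))  # [(1, 3), (4, 5), (6, 9)]
```

Under these assumptions, a 9-page batch file with three detected start pages yields three subdocuments, each of which would then have its fields extracted separately.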

2. Extract Fixed-Size Chunks with Overlap

Splits a large report into 5-page chunks with 1-page overlap.

extract subdocuments from the report where
    the subdocument size is 5
    the subdocument overlap size is 1
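
To see which pages each chunk could cover, here is a minimal Python sketch of fixed-size chunking with overlap. It is an assumption about the semantics rather than the procedure's confirmed behavior: the function name chunk_page_ranges is hypothetical, and it assumes consecutive chunks share exactly the overlap number of pages, so each chunk starts size minus overlap pages after the previous one. The default values mirror the defaults documented above.

```python
def chunk_page_ranges(total_pages, size=10, overlap=1):
    """Return 1-based (first_page, last_page) ranges for fixed-size chunks."""
    if size < 1:
        raise ValueError("size must be at least 1")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be between 0 and size - 1")
    # Assumed stride: each chunk starts this many pages after the previous one.
    step = size - overlap
    ranges = []
    start = 1
    while start <= total_pages:
        end = min(start + size - 1, total_pages)  # last chunk may be shorter
        ranges.append((start, end))
        if end == total_pages:
            break
        start += step
    return ranges


# A hypothetical 12-page report split into 5-page chunks with 1-page overlap.
print(chunk_page_ranges(12, size=5, overlap=1))  # [(1, 5), (5, 9), (9, 12)]
```

In this sketch, pages 5 and 9 each appear in two chunks. Passing overlap=0 instead produces back-to-back chunks: (1, 5), (6, 10), (11, 12).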

3. Extract Fixed-Size Chunks without Overlap

Splits a document into 5-page chunks with no overlap.

extract subdocuments from the document where
    the subdocument size is 5
    the subdocument overlap size is 0

4. Extract Chapter-Based Subdocuments

Splits a document by identifying chapter beginnings.

extract subdocuments from the document where
    the start page marker is "Page containing the text 'Chapter'"
    the openai model is "gpt-4o"
