Extract Multiple Subdocuments

Extracts multiple subdocuments from a document using markers or fixed-size chunking.

Overview

This procedure extracts multiple subdocuments from a large document in one of two ways: by identifying recurring content patterns (such as invoices or chapters), or by splitting the document into fixed-size chunks with optional overlap. Each subdocument becomes a separate document that can be processed independently, which suits batch processing of files that bundle many documents together.

Syntax

Below is a line-by-line overview of the automation syntax.

extract subdocuments from {the source}

What does it do?

Begins extraction of multiple subdocuments from the specified document.

Where does it go?

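This phrase should be written on a new line. When optional phrases are indented beneath it, the line ends with the word where, as in the Examples section.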

Is it required?

✅ Yes — This phrase is required.

Does it require data?

✅ Yes — Replace the source with a reference to a document or file.

the start page marker is "start-description"

What does it do?

Uses AI to identify pages that mark the beginning of new subdocuments.

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace start-description with a description of content that marks new sections.

Example

the start page marker is "The beginning of a new invoice"

the end page marker is "end-description"

What does it do?

Uses AI to find the ending page for each subdocument based on a content description. The matching page is included in the subdocument (inclusive).

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace end-description with a description of what content marks the end. Used with start page marker for non-contiguous subdocuments.

Example

the end page marker is "Page containing invoice total"

the excluded end page marker is "excluded-end-description"

What does it do?

Uses AI to find the ending page for each subdocument based on a content description. The matching page is not included in the subdocument (exclusive).

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace excluded-end-description with a description of what content marks the end (page not included). Used with start page marker for non-contiguous subdocuments.

Example

the excluded end page marker is "Page containing Bill of Lading"

the subdocument size is n

What does it do?

Splits the document into fixed-size chunks or limits subdocument length when used with markers.

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace n with the number of pages per subdocument. The default is 10.

Example

the subdocument size is 5

the subdocument overlap size is x

What does it do?

Specifies pages of overlap between consecutive subdocuments.

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace x with the number of overlapping pages. Only used with subdocument size for chunking. The default is 1.

Example

the subdocument overlap size is 1

the openai model is "openai-model"

What does it do?

Specifies the OpenAI model to use for marker-based extraction.

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace openai-model with a valid OpenAI model. The default is gpt-4o.

Example

the openai model is "gpt-4o"

the first field is "field-name"

What does it do?

Specifies a field to extract from each subdocument for identification. Additional fields follow the same pattern (for example, the second field, as shown in Example 2).

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace field-name with the name of a field to extract. Used with markers.

Example

the first field is "invoice number"

the first field's format is "field-format"

What does it do?

Specifies the format of the first field.

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace field-format with "string", "number", or "date".

Example

the first field's format is "string"

Examples

1. Extract Non-Contiguous Invoice Subdocuments

Extracts only the invoices from a mixed document containing invoices and Bill of Lading (BOL) documents.

extract subdocuments from the document where
    the start page marker is "The beginning of a new invoice"
    the excluded end page marker is "Page containing Bill of Lading or the beginning of a new invoice"

2. Extract Invoice Subdocuments with Field Extraction

Splits a batch invoice file into individual invoices and extracts key fields.

extract subdocuments from the document where
    the start page marker is "The beginning of a new invoice"
    the first field is "invoice number"
    the first field's format is "string"
    the second field is "invoice date"
    the second field's format is "string"

3. Extract Invoices with Inclusive End Marker

Extracts each invoice from its start page through the page containing the invoice total (that page is included).

extract subdocuments from the document where
    the start page marker is "Page containing invoice header"
    the end page marker is "Page containing invoice total"
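
To make the difference between the inclusive and excluded end markers concrete, here is a minimal Python sketch of how page ranges could be derived once the marker pages are known. It is an illustration only, not the product's implementation: the function marker_ranges, its parameters, and the rule that a subdocument with no matching end page runs to the end of the document are all assumptions made for this example.

```python
def marker_ranges(start_pages, end_pages, total_pages, inclusive=True):
    """Pair each start page with the nearest following end-marker page.

    Hypothetical helper. Pages are 1-based; returns inclusive
    (first_page, last_page) tuples.
    """
    ends = sorted(end_pages)
    ranges = []
    for first in sorted(start_pages):
        if inclusive:
            # The end page may be the start page itself (a one-page subdocument).
            match = next((e for e in ends if e >= first), None)
            last = match if match is not None else total_pages
        else:
            # The end page must come after the start page and is left out.
            match = next((e for e in ends if e > first), None)
            last = match - 1 if match is not None else total_pages
        ranges.append((first, last))
    return ranges


# Inclusive: invoices start on pages 1 and 4, totals appear on pages 2 and 6.
print(marker_ranges([1, 4], [2, 6], total_pages=8))
# Excluded: invoices start on pages 1 and 4, Bills of Lading on pages 3 and 7.
print(marker_ranges([1, 4], [3, 7], total_pages=8, inclusive=False))
```

Both calls yield pages 1 to 2 and 4 to 6. The remaining pages belong to no subdocument, which is what makes the result non-contiguous.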

4. Extract Fixed-Size Chunks with Overlap

Splits a large report into 5-page chunks with 1-page overlap.

extract subdocuments from the report where
    the subdocument size is 5
    the subdocument overlap size is 1
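
The arithmetic behind fixed-size chunking with overlap can be sketched in Python as follows. This is a hedged illustration, not the product's code: the function chunk_pages is hypothetical, and it assumes that an overlap of k pages means each chunk begins k pages before the previous chunk ends. The default values mirror the documented defaults (size 10, overlap 1).

```python
def chunk_pages(total_pages: int, size: int = 10, overlap: int = 1) -> list[tuple[int, int]]:
    """Return 1-based inclusive (first_page, last_page) ranges for each chunk.

    Hypothetical helper; assumes consecutive chunks share `overlap` pages.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be between 0 and size - 1")
    step = size - overlap  # distance between the starts of consecutive chunks
    ranges = []
    start = 1
    while start <= total_pages:
        end = min(start + size - 1, total_pages)  # the last chunk may be shorter
        ranges.append((start, end))
        if end == total_pages:
            break
        start += step
    return ranges


# A 12-page report split into 5-page chunks with a 1-page overlap.
print(chunk_pages(12, size=5, overlap=1))  # [(1, 5), (5, 9), (9, 12)]
```

Under this assumption, pages 5 and 9 each appear in two chunks, so content that straddles a chunk boundary is present in full in at least one chunk.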

5. Extract Fixed-Size Chunks without Overlap

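Splits a document into 5-page chunks with no overlap. Because the documented default overlap is 1 page, the overlap is set to 0 explicitly.

extract subdocuments from the document where
    the subdocument size is 5
    the subdocument overlap size is 0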

6. Extract Chapter-Based Subdocuments

Splits a document by identifying chapter beginnings.

extract subdocuments from the document where
    the start page marker is "Page containing the text 'Chapter'"
    the openai model is "gpt-4o"
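
When only a start page marker is given, the subdocuments are contiguous. The Python sketch below shows one plausible way the page ranges follow from the detected start pages, including the optional length limit that the subdocument size phrase applies when used with markers. It is an assumption-laden illustration: split_at_starts and max_size are hypothetical names, and the product's actual behavior (for example, how pages before the first marker are handled) may differ.

```python
def split_at_starts(start_pages, total_pages, max_size=None):
    """Split a document at each detected start page.

    Hypothetical helper. Pages are 1-based; each subdocument runs up to the
    page before the next start page, optionally capped at max_size pages.
    """
    starts = sorted(set(start_pages))
    ranges = []
    for i, first in enumerate(starts):
        # Run to the page before the next start, or to the end of the document.
        last = starts[i + 1] - 1 if i + 1 < len(starts) else total_pages
        if max_size is not None:
            last = min(last, first + max_size - 1)  # apply the length limit
        ranges.append((first, last))
    return ranges


# Chapters detected on pages 1, 8, and 15 of a 20-page document.
print(split_at_starts([1, 8, 15], total_pages=20))              # [(1, 7), (8, 14), (15, 20)]
print(split_at_starts([1, 8, 15], total_pages=20, max_size=5))  # [(1, 5), (8, 12), (15, 19)]
```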
