Extract Multiple Subdocuments

Extracts multiple subdocuments from a document using markers or fixed-size chunking.

Overview

This procedure extracts multiple subdocuments from a large document, either by identifying recurring content patterns (such as invoices or chapters) or by splitting the document into fixed-size chunks with optional overlap. Each subdocument becomes a separate document that can be processed independently, which makes the procedure well suited to batch processing of files that bundle many documents.

Syntax

Below is a line-by-line overview of the automation syntax.

extract subdocuments from {the source}

What does it do?

Begins extraction of multiple subdocuments from the specified document.

Where does it go?

This phrase should be written on a new line.

Is it required?

✅ Yes — This phrase is required.

Does it require data?

✅ Yes — Replace the source with a reference to a document or file.

the start page marker is "start-description"

What does it do?

Uses AI to identify pages that mark the beginning of new subdocuments.

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace start-description with a description of content that marks new sections.

Example

the start page marker is "The beginning of a new invoice"

the end page marker is "end-description"

What does it do?

Uses AI to find the ending page of each subdocument based on a content description. The matching page is included in the subdocument.

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace end-description with a description of the content that marks the end. Use it together with the start page marker to extract non-contiguous subdocuments.

Example

the end page marker is "Page containing invoice total"

the excluded end page marker is "excluded-end-description"

What does it do?

Uses AI to find the ending page of each subdocument based on a content description. The matching page is not included in the subdocument.

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace excluded-end-description with a description of the content that marks the end (the matching page is not included). Use it together with the start page marker to extract non-contiguous subdocuments.

Example

the excluded end page marker is "Page containing Bill of Lading"

the subdocument size is n

What does it do?

Splits the document into fixed-size chunks or, when used with markers, limits the length of each subdocument.

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace n with the number of pages per subdocument. The default is 10.

Example
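
A sketch of this phrase with a value filled in. The value 5 is illustrative and matches the 5-page chunks described in the Examples section:

```text
the subdocument size is 5
```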

the subdocument overlap size is x

What does it do?

Specifies the number of pages shared between consecutive subdocuments.

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace x with the number of overlapping pages. This setting applies only when subdocument size is used for fixed-size chunking. The default is 1.

Example
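
A sketch of this phrase with a value filled in, here the stated default of 1:

```text
the subdocument overlap size is 1
```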

the openai model is "openai-model"

What does it do?

Specifies the OpenAI model to use for marker-based extraction.

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace openai-model with a valid OpenAI model. The default is gpt-4o.

Example
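
A sketch of this phrase with a value filled in. It uses gpt-4o, the default named above; any other value would need to be a model identifier that your account can use:

```text
the openai model is "gpt-4o"
```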

the first field is "field-name"

What does it do?

Specifies fields to extract from each subdocument for identification.

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace field-name with the name of a field to extract. Use this phrase with markers.

Example
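
A sketch of this phrase with a value filled in. The field name "invoice number" is hypothetical and chosen only for illustration; substitute a field that actually appears in your documents:

```text
the first field is "invoice number"
```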

the first field's format is "field-format"

What does it do?

Specifies the format of the first field.

Where does it go?

Indented under extract subdocuments from {the source}.

Is it required?

❌ No — This phrase is optional.

Does it require data?

✅ Yes — Replace field-format with "string", "number", or "date".

Example
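
A sketch of this phrase with one of the three formats listed above filled in:

```text
the first field's format is "string"
```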

Examples

1. Extract Non-Contiguous Invoice Subdocuments

Extracts only invoices from a mixed document containing invoices and BOL documents.
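
One way this could be written, assembled from the phrases documented above. It is a sketch, not a verified automation: {the source} is left as a placeholder, the indentation width is an assumption, and the marker descriptions are the sample values from the Syntax section. It also assumes that each invoice is followed by a Bill of Lading page, which ends the invoice without being included in it.

```text
extract subdocuments from {the source}
    the start page marker is "The beginning of a new invoice"
    the excluded end page marker is "Page containing Bill of Lading"
```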

2. Extract Invoice Subdocuments with Field Extraction

Splits a batch invoice file into individual invoices and extracts key fields.
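
A possible form, assembled from the phrases documented above. Treat it as a sketch: {the source} is left as a placeholder, the indentation width is an assumption, and the field name "invoice number" is hypothetical. Only the first field phrase is documented on this page, so the sketch extracts a single field.

```text
extract subdocuments from {the source}
    the start page marker is "The beginning of a new invoice"
    the first field is "invoice number"
    the first field's format is "string"
```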

3. Extract Invoices with Inclusive End Marker

Extracts invoices from start to a page containing the invoice total (included).
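
A sketch built from the phrases documented above, with {the source} left as a placeholder and an assumed indentation width. Because the end page marker is inclusive, the page containing the invoice total would be the last page of each extracted invoice.

```text
extract subdocuments from {the source}
    the start page marker is "The beginning of a new invoice"
    the end page marker is "Page containing invoice total"
```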

4. Extract Fixed-Size Chunks with Overlap

Splits a large report into 5-page chunks with 1-page overlap.
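
A sketch built from the phrases documented above, with {the source} left as a placeholder and an assumed indentation width. Since the overlap default is stated as 1, the second indented line is likely redundant and is shown only to make the setting explicit.

```text
extract subdocuments from {the source}
    the subdocument size is 5
    the subdocument overlap size is 1
```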

5. Extract Fixed-Size Chunks without Overlap

Splits a document into 5-page chunks with no overlap.
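
A sketch built from the phrases documented above, with {the source} left as a placeholder and an assumed indentation width. It assumes that an overlap size of 0 is accepted; because the default overlap is stated as 1, leaving the overlap line out would presumably not disable overlap.

```text
extract subdocuments from {the source}
    the subdocument size is 5
    the subdocument overlap size is 0
```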

6. Extract Chapter-Based Subdocuments

Splits a document by identifying chapter beginnings.
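
A sketch built from the phrases documented above, with {the source} left as a placeholder and an assumed indentation width. The marker description is hypothetical wording modeled on the invoice example; with no end marker given, each chapter is assumed to run until the next chapter start is detected.

```text
extract subdocuments from {the source}
    the start page marker is "The beginning of a new chapter"
```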
