Extract Multiple Subdocuments
Extracts multiple subdocuments from a document using markers or fixed-size chunking.
Overview
This procedure extracts multiple subdocuments from a large document by either identifying recurring content patterns (like invoices or chapters) or by splitting the document into fixed-size chunks with optional overlap. Each subdocument becomes a separate document that can be processed independently, making it ideal for batch processing of multi-document files.
Make sure to add the Document Processing Book to your agent before using this automation procedure.
Syntax
Below is a phrase-by-phrase overview of the automation syntax. Each phrase is optional and goes on its own line, indented under extract subdocuments from {the source}.
the start page marker is "start-description"
What does it do?
Uses AI to identify pages that mark the beginning of new subdocuments.
Where does it go?
Indented under extract subdocuments from {the source}.
Is it required?
❌ No — This phrase is optional.
Does it require data?
✅ Yes — Replace start-description with a description of content that marks new sections.
Example
the start page marker is "The beginning of a new invoice"
the end page marker is "end-description"
What does it do?
Uses AI to find the last page of each subdocument from a content description. The matching page is included in the subdocument.
Where does it go?
Indented under extract subdocuments from {the source}.
Is it required?
❌ No — This phrase is optional.
Does it require data?
✅ Yes — Replace end-description with a description of what content marks the end. Used with start page marker for non-contiguous subdocuments.
Example
the end page marker is "Page containing invoice total"
the excluded end page marker is "excluded-end-description"
What does it do?
Uses AI to find where each subdocument ends from a content description. The matching page is excluded, so the subdocument ends on the page before it.
Where does it go?
Indented under extract subdocuments from {the source}.
Is it required?
❌ No — This phrase is optional.
Does it require data?
✅ Yes — Replace excluded-end-description with a description of what content marks the end (page not included). Used with start page marker for non-contiguous subdocuments.
Example
the excluded end page marker is "Page containing Bill of Lading"
the subdocument size is n
What does it do?
Splits the document into fixed-size chunks. When used together with markers, it instead limits the length of each subdocument.
Where does it go?
Indented under extract subdocuments from {the source}.
Is it required?
❌ No — This phrase is optional.
Does it require data?
✅ Yes — Replace n with the number of pages per subdocument. The default is 10.
Example
the subdocument size is 5
the subdocument overlap size is x
What does it do?
Specifies pages of overlap between consecutive subdocuments.
Where does it go?
Indented under extract subdocuments from {the source}.
Is it required?
❌ No — This phrase is optional.
Does it require data?
✅ Yes — Replace x with the number of overlapping pages. Only used with subdocument size for chunking. The default is 1.
Example
the subdocument overlap size is 1
the openai model is "openai-model"
What does it do?
Specifies the OpenAI model to use for marker-based extraction.
Where does it go?
Indented under extract subdocuments from {the source}.
Is it required?
❌ No — This phrase is optional.
Does it require data?
✅ Yes — Replace openai-model with a valid OpenAI model. The default is gpt-4o.
Example
the openai model is "gpt-4o"
the first field is "field-name"
What does it do?
Specifies fields to extract from each subdocument for identification.
Where does it go?
Indented under extract subdocuments from {the source}.
Is it required?
❌ No — This phrase is optional.
Does it require data?
✅ Yes — Replace field-name with the name of a field to extract. Used with markers.
Example
Examples
1. Extract Non-Contiguous Invoice Subdocuments
Extracts only the invoices from a mixed document that contains both invoices and Bill of Lading (BOL) documents.
2. Extract Invoice Subdocuments with Field Extraction
Splits a batch invoice file into individual invoices and extracts key fields.
3. Extract Invoices with Inclusive End Marker
Extracts each invoice from its start page through the page containing the invoice total, which is included in the subdocument.
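To make the difference between the inclusive end marker used here and the excluded end marker used in example 1 concrete, the following Python sketch shows only the page-range arithmetic once the marker pages are known. It is not the procedure's implementation: the function `marker_ranges`, its parameters, and the page numbers are hypothetical, the AI step that locates marker pages is replaced by hard-coded lists, and running to the last page when no end marker follows a start page is an assumption.

```python
def marker_ranges(start_pages, end_pages, total_pages, inclusive=True):
    """Return 1-based inclusive (first_page, last_page) ranges.

    Hypothetical model: start_pages and end_pages stand in for the pages
    the AI matched against the start and end marker descriptions.
    """
    ends = sorted(end_pages)
    ranges = []
    for start in sorted(start_pages):
        # An inclusive end marker may sit on the start page itself;
        # an excluded one must come after it.
        first_candidate = start if inclusive else start + 1
        end_marker = next((p for p in ends if p >= first_candidate), None)
        if end_marker is None:
            last = total_pages  # assumption: no end marker found, run to the end
        elif inclusive:
            last = end_marker        # marker page is part of the subdocument
        else:
            last = end_marker - 1    # marker page is left out
        ranges.append((start, last))
    return ranges


# Inclusive: invoices start on pages 1 and 6, totals appear on pages 3 and 8.
print(marker_ranges([1, 6], [3, 8], total_pages=10, inclusive=True))
# [(1, 3), (6, 8)]

# Excluded: Bill of Lading pages 4 and 9 end each invoice but are not kept.
print(marker_ranges([1, 6], [4, 9], total_pages=10, inclusive=False))
# [(1, 3), (6, 8)]
```

In both calls pages 4, 5, 9, and 10 belong to no subdocument, which is what makes the result non-contiguous.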
4. Extract Fixed-Size Chunks with Overlap
Splits a large report into 5-page chunks with 1-page overlap.
5. Extract Fixed-Size Chunks without Overlap
Splits a document into 5-page chunks with no overlap.
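The two chunking examples above can be illustrated with a short Python sketch of the page arithmetic. It is a model under stated assumptions, not the procedure's code: the function name `chunk_pages` is hypothetical, and it assumes that an overlap of x means each chunk begins x pages before the previous chunk ends. The defaults mirror the documented defaults (size 10, overlap 1).

```python
def chunk_pages(total_pages: int, size: int = 10, overlap: int = 1) -> list[tuple[int, int]]:
    """Return 1-based inclusive (first_page, last_page) ranges for each chunk."""
    if size < 1:
        raise ValueError("size must be at least 1")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be between 0 and size - 1")

    step = size - overlap  # how far each chunk's first page advances
    ranges = []
    start = 1
    while start <= total_pages:
        end = min(start + size - 1, total_pages)  # last chunk may be shorter
        ranges.append((start, end))
        if end == total_pages:
            break
        start += step
    return ranges


# 12-page report, 5-page chunks, 1-page overlap (as in example 4)
print(chunk_pages(12, size=5, overlap=1))  # [(1, 5), (5, 9), (9, 12)]

# Same report, no overlap (as in example 5)
print(chunk_pages(12, size=5, overlap=0))  # [(1, 5), (6, 10), (11, 12)]
```

With overlap, pages 5 and 9 appear in two chunks each, which helps when content straddles a chunk boundary.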
6. Extract Chapter-Based Subdocuments
Splits a document by identifying chapter beginnings.
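When only a start page marker is given, a reasonable reading is that each subdocument runs from one start page up to the page before the next one. The Python sketch below models that reading. The function `split_at_starts` and the page numbers are hypothetical, the AI detection of chapter openings is replaced by a hard-coded list, and ignoring any pages before the first start page is an assumption.

```python
def split_at_starts(start_pages, total_pages):
    """Return 1-based inclusive (first_page, last_page) ranges, one per start page."""
    starts = sorted(set(start_pages))
    ranges = []
    for i, start in enumerate(starts):
        if i + 1 < len(starts):
            last = starts[i + 1] - 1  # stop just before the next chapter begins
        else:
            last = total_pages        # final chapter runs to the end of the document
        ranges.append((start, last))
    return ranges


# Chapters detected on pages 1, 14, and 30 of a 42-page document
print(split_at_starts([1, 14, 30], total_pages=42))
# [(1, 13), (14, 29), (30, 42)]
```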
