Extracts multiple subdocuments from a document using markers or fixed-size chunking.
Overview
This procedure extracts multiple subdocuments from a large document, either by identifying recurring content patterns (such as invoices or chapters) or by splitting the document into fixed-size chunks with optional overlap. Each subdocument becomes a separate document that can be processed independently, which suits batch processing of files that bundle many documents.
Make sure to add the Document Processing Book to your agent before using this automation procedure.
Syntax
Below is a line-by-line overview of the automation syntax.
extract subdocuments from {the source}
What does it do?
Begins extraction of multiple subdocuments from the specified document.
Where does it go?
This phrase should be written on a new line.
Is it required?
✅ Yes — This phrase is required.
Does it require data?
✅ Yes — Replace the source with a reference to a document or file.
the start page marker is "start-description"
What does it do?
Uses AI to identify pages that mark the beginning of new subdocuments.
Where does it go?
Indented under extract subdocuments from {the source}.
Is it required?
❌ No — This phrase is optional.
Does it require data?
✅ Yes — Replace start-description with a description of content that marks new sections.
Example
the start page marker is "The beginning of a new invoice"
the end page marker is "end-description"
What does it do?
Uses AI to find the ending page for each subdocument based on content description (inclusive).
Where does it go?
Indented under extract subdocuments from {the source}.
Is it required?
❌ No — This phrase is optional.
Does it require data?
✅ Yes — Replace end-description with a description of what content marks the end. Used with start page marker for non-contiguous subdocuments.
Example
the end page marker is "Page containing invoice total"
the excluded end page marker is "excluded-end-description"
What does it do?
Uses AI to find the ending page for each subdocument based on content description (exclusive).
Where does it go?
Indented under extract subdocuments from {the source}.
Is it required?
❌ No — This phrase is optional.
Does it require data?
✅ Yes — Replace excluded-end-description with a description of what content marks the end (page not included). Used with start page marker for non-contiguous subdocuments.
Example
the excluded end page marker is "Page containing Bill of Lading"
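The difference between the inclusive end page marker and the excluded end page marker can be sketched in Python. This is only an illustration, not the procedure's implementation: it assumes the AI step has already produced lists of matching page numbers, and the function name subdocument_ranges, its parameters, and the fallback to the last page when no end marker follows a start page are all hypothetical.

```python
from bisect import bisect_left, bisect_right


def subdocument_ranges(start_pages, end_pages, total_pages, inclusive=True):
    """Pair each start page with the next end-marker page.

    Hypothetical helper: page numbers are 1-based, and the page lists stand
    in for whatever the AI marker detection would return.
    """
    ends = sorted(end_pages)
    ranges = []
    for start in sorted(start_pages):
        if inclusive:
            # Inclusive marker: the end page may be the start page itself.
            i = bisect_left(ends, start)
        else:
            # Excluded marker: look for the first marker page after the start.
            i = bisect_right(ends, start)
        if i == len(ends):
            last = total_pages      # assumption: run to the end of the document
        elif inclusive:
            last = ends[i]          # marker page belongs to the subdocument
        else:
            last = ends[i] - 1      # marker page is left out
        ranges.append((start, last))
    return ranges


# Inclusive: invoices start on pages 1, 4, 8; totals appear on pages 2, 6, 9.
print(subdocument_ranges([1, 4, 8], [2, 6, 9], 10, inclusive=True))
# Excluded: a Bill of Lading (pages 3, 7) or a new invoice (pages 1, 4, 9) ends a subdocument.
print(subdocument_ranges([1, 4, 9], [1, 3, 4, 7, 9], 10, inclusive=False))
```

In the second call the Bill of Lading pages are never part of any range, which is how an excluded end marker can yield non-contiguous subdocuments.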
the subdocument size is n
What does it do?
Splits the document into fixed-size chunks. When used with markers, it limits the length of each subdocument instead.
Where does it go?
Indented under extract subdocuments from {the source}.
Is it required?
❌ No — This phrase is optional.
Does it require data?
✅ Yes — Replace n with the number of pages per subdocument. The default is 10.
Example
the subdocument size is 5
the subdocument overlap size is x
What does it do?
Specifies pages of overlap between consecutive subdocuments.
Where does it go?
Indented under extract subdocuments from {the source}.
Is it required?
❌ No — This phrase is optional.
Does it require data?
✅ Yes — Replace x with the number of overlapping pages. Only used with subdocument size for chunking. The default is 1.
Example
the subdocument overlap size is 1
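To make the interaction between subdocument size and overlap size concrete, here is a minimal Python sketch of fixed-size chunking. It assumes each chunk starts (size - overlap) pages after the previous one and that the last chunk may be shorter; the procedure's exact windowing is not documented here, and the function name chunk_pages is hypothetical. The defaults mirror the documented ones (size 10, overlap 1).

```python
def chunk_pages(total_pages, size=10, overlap=1):
    """Return 1-based, inclusive (first_page, last_page) ranges.

    Assumption: consecutive chunks share exactly `overlap` pages, so each
    chunk begins `size - overlap` pages after the previous one.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be between 0 and size - 1")
    step = size - overlap
    ranges = []
    start = 1
    while start <= total_pages:
        end = min(start + size - 1, total_pages)
        ranges.append((start, end))
        if end == total_pages:
            break  # the final page is covered; stop before emitting a redundant chunk
        start += step
    return ranges


print(chunk_pages(12, size=5, overlap=1))  # [(1, 5), (5, 9), (9, 12)]
print(chunk_pages(12, size=5, overlap=0))  # [(1, 5), (6, 10), (11, 12)]
```

With an overlap of 1, pages 5 and 9 each appear in two chunks, which keeps content that straddles a chunk boundary visible in both.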
the openai model is "openai-model"
What does it do?
Specifies the OpenAI model to use for marker-based extraction.
Where does it go?
Indented under extract subdocuments from {the source}.
Is it required?
❌ No — This phrase is optional.
Does it require data?
✅ Yes — Replace openai-model with a valid OpenAI model. The default is gpt-4o.
Example
the openai model is "gpt-4o"
the first field is "field-name"
What does it do?
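Specifies a field to extract from each subdocument for identification. Additional fields follow the same pattern with the next ordinal (for example, the second field is "invoice date").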
Where does it go?
Indented under extract subdocuments from {the source}.
Is it required?
❌ No — This phrase is optional.
Does it require data?
✅ Yes — Replace field-name with the name of a field to extract. Used with markers.
Example
the first field is "invoice number"
the first field's format is "field-format"
What does it do?
Specifies the format of the first field.
Where does it go?
Indented under extract subdocuments from {the source}.
Is it required?
❌ No — This phrase is optional.
Does it require data?
✅ Yes — Replace field-format with "string", "number", or "date".
Example
the first field's format is "string"
Examples
1. Extract Non-Contiguous Invoice Subdocuments
Extracts only invoices from a mixed document containing invoices and Bill of Lading (BOL) documents.
extract subdocuments from the document where
    the start page marker is "The beginning of a new invoice"
    the excluded end page marker is "Page containing Bill of Lading or the beginning of a new invoice"
2. Extract Invoice Subdocuments with Field Extraction
Splits a batch invoice file into individual invoices and extracts key fields.
extract subdocuments from the document where
    the start page marker is "The beginning of a new invoice"
    the first field is "invoice number"
    the first field's format is "string"
    the second field is "invoice date"
    the second field's format is "string"
3. Extract Invoices with Inclusive End Marker
Extracts each invoice from its start page through the page containing the invoice total (included).
extract subdocuments from the document where
    the start page marker is "Page containing invoice header"
    the end page marker is "Page containing invoice total"
4. Extract Fixed-Size Chunks with Overlap
Splits a large report into 5-page chunks with 1-page overlap.
extract subdocuments from the report where
    the subdocument size is 5
    the subdocument overlap size is 1
5. Extract Fixed-Size Chunks without Overlap
Splits a document into 5-page chunks with no overlap. Because the overlap defaults to 1, it is set to 0 explicitly.
extract subdocuments from the document where
    the subdocument size is 5
    the subdocument overlap size is 0
6. Extract Chapter-Based Subdocuments
Splits a document by identifying chapter beginnings.
extract subdocuments from the document where
    the start page marker is "Page containing the text 'Chapter'"
    the openai model is "gpt-4o"