Extract Multiple Subdocuments
Extracts multiple subdocuments from a document using markers or fixed-size chunking.
Overview
This procedure extracts multiple subdocuments from a large document by either identifying recurring content patterns (like invoices or chapters) or by splitting the document into fixed-size chunks with optional overlap. Each subdocument becomes a separate document that can be processed independently, making it ideal for batch processing of multi-document files.
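The fixed-size mode described above can be pictured with a short Python sketch. This is not the procedure's implementation. The function name `chunk_page_ranges`, the 1-based inclusive page ranges, and the way the final short chunk is handled are all assumptions made for illustration.

```python
def chunk_page_ranges(total_pages: int, size: int, overlap: int = 0) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (first_page, last_page) ranges.

    Hypothetical helper: each chunk holds up to `size` pages, and each new
    chunk starts `size - overlap` pages after the previous one.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")

    step = size - overlap
    ranges: list[tuple[int, int]] = []
    start = 1
    while start <= total_pages:
        end = min(start + size - 1, total_pages)
        ranges.append((start, end))
        if end == total_pages:
            # Assumption: stop once the last page is covered.
            break
        start += step
    return ranges


print(chunk_page_ranges(12, size=5, overlap=1))  # [(1, 5), (5, 9), (9, 12)]
print(chunk_page_ranges(12, size=5))             # [(1, 5), (6, 10), (11, 12)]
```

With an overlap of 1, the last page of one chunk is repeated as the first page of the next, which can help when content runs across a chunk boundary.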
Make sure to add the Document Processing Book to your agent before using this automation procedure.
Syntax
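Each statement begins with "extract subdocuments from the document where", followed by one parameter per line. The examples below use these parameters: the start page marker, the end page marker or the excluded end page marker, numbered fields with their formats, the subdocument size, the subdocument overlap size, and the openai model.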
Examples
1. Extract Non-Contiguous Invoice Subdocuments
Extracts only the invoices from a mixed document containing invoices and Bill of Lading (BOL) documents.
extract subdocuments from the document where
the start page marker is "The beginning of a new invoice"
the excluded end page marker is "Page containing Bill of Lading or the beginning of a new invoice"
2. Extract Invoice Subdocuments with Field Extraction
Splits a batch invoice file into individual invoices and extracts key fields.
extract subdocuments from the document where
the start page marker is "The beginning of a new invoice"
the first field is "invoice number"
the first field's format is "string"
the second field is "invoice date"
the second field's format is "string"
3. Extract Invoices with Inclusive End Marker
Extracts each invoice from its header page through the page containing the invoice total, which is included in the subdocument.
extract subdocuments from the document where
the start page marker is "Page containing invoice header"
the end page marker is "Page containing invoice total"
4. Extract Fixed-Size Chunks with Overlap
Splits a large report into 5-page chunks with 1-page overlap.
extract subdocuments from the report where
the subdocument size is 5
the subdocument overlap size is 1
5. Extract Fixed-Size Chunks without Overlap
Splits a document into 5-page chunks with no overlap.
extract subdocuments from the document where
the subdocument size is 5
6. Extract Chapter-Based Subdocuments
Splits a document by identifying chapter beginnings.
extract subdocuments from the document where
the start page marker is "Page containing the text 'Chapter'"
the openai model is "gpt-4o"
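The difference between the start page marker, the end page marker, and the excluded end page marker in the marker-based examples can be illustrated with a small Python sketch. It is only a model of the behavior the examples describe, not the procedure's implementation. The function `split_by_markers` is hypothetical, and the per-page booleans stand in for however the procedure decides that a page matches a marker description. The sketch also assumes that a new start page closes any subdocument that is still open, and that an open subdocument runs to the last page if nothing closes it.

```python
from typing import Optional, Sequence


def split_by_markers(
    is_start: Sequence[bool],
    is_end: Optional[Sequence[bool]] = None,
    exclude_end: bool = False,
) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (first_page, last_page) ranges.

    is_start[i] / is_end[i]: whether page i+1 matches the start / end marker.
    exclude_end=True models "the excluded end page marker": the matching
    page is left out of the subdocument it ends.
    """
    n = len(is_start)
    if is_end is None:
        is_end = [False] * n

    ranges: list[tuple[int, int]] = []
    open_start: Optional[int] = None  # 0-based index of the open subdocument's first page
    for page in range(n):
        # Close before this page on a new start or an excluded end marker.
        if open_start is not None and (is_start[page] or (exclude_end and is_end[page])):
            ranges.append((open_start + 1, page))
            open_start = None
        # Open a new subdocument on a start page.
        if open_start is None and is_start[page]:
            open_start = page
        # Inclusive end marker: this page is the last page of the subdocument.
        if open_start is not None and not exclude_end and is_end[page]:
            ranges.append((open_start + 1, page + 1))
            open_start = None
    if open_start is not None:
        ranges.append((open_start + 1, n))
    return ranges


# Pages: invoice, invoice, BOL, BOL, invoice, invoice (as in example 1).
starts = [True, False, False, False, True, False]
excluded_ends = [True, False, True, True, True, False]
print(split_by_markers(starts, excluded_ends, exclude_end=True))  # [(1, 2), (5, 6)]
```

In this model the Bill of Lading pages never land in any subdocument, which is how example 1 ends up with invoices only.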