Extract Data

Extracts data from sources such as texts, documents, or files using large language models (LLMs).

Before using this procedure, make sure your agent has the Document Processing Book. After the agent learns the Book, publish the agent and create a new Playground so that the change takes effect.

Overview

This procedure extracts data from texts, documents, and files such as PDFs. It uses AI models to identify and retrieve the requested content, making the information in those sources easy to access and work with.

Syntax

extract data from the {source}
  the creativity is {creativity}
  the openai model is "{model}"
  the gemini model is "{gemini model}"
  the output format is "{output format}"
  the visual reference is {visual reference}

  the first field is "{field}"
  the first field's rule is "{field rule}"
  the first field's default value is "{field default}"
  the first field's format is "{field format}"

Inputs

Required

  1. source

    • A variable that represents the source to extract data from.

    • Examples: document, file, text.

  2. field

    • A field to be extracted from the text or file.

    • You can add any number of fields, but at least 1 is required.

    • Examples: invoice number, date.
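
As a minimal sketch, a call that supplies only the required inputs might look like the following. It assumes a document is already available in the automation as the document, and the field name is illustrative.

extract data from the document
  the first field is "invoice number"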

Optional

  1. creativity

    • A number that controls the creativity of the response. Higher values produce more creative responses.

    • Default: 0.0

    • Range: 0.0-2.0.

  2. model

    • The OpenAI model to use to generate the response.

    • Default: gpt-4o-latest

  3. gemini model

    • The Gemini model to use to generate the response.

    • No default is set; a model name must be provided whenever this input is used.

    • Example: gemini-2.5-pro

  4. output format

    • The desired format of the response.

    • Default: string

    • Allowed Values:

      • string

      • text

      • date

      • table

      • list of texts

      • list of numbers

      • list of dates

      • list of records

      • structured data

      • json

  5. visual reference

    • A variable that represents a document or image used as a visual reference to help the model improve its accuracy. It refers to data defined earlier in the automation.

    • Examples: the file, the document, the text.

  6. field rule

    • The rule to be followed for the field extraction.

  7. field default

    • The field's default value.

  8. field format

    • The format of the field to be extracted.

    • Possible values: number, string, date.
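
The optional inputs can be combined in a single call. The following sketch is assembled from the Syntax section and uses illustrative values only: it assumes a Gemini model is wanted instead of the OpenAI default, and the field name, rule, and default value are hypothetical.

extract data from the document
  the creativity is 0.2
  the gemini model is "gemini-2.5-pro"
  the output format is "json"

  the first field is "purchase order number"
  the first field's rule is "keep only the digits"
  the first field's default value is "unknown"
  the first field's format is "string"

A low creativity value stays close to the default of 0.0, which is likely the safer choice when the goal is to copy values out of a document rather than generate new text.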

Example 1

extract data from the file
  the openai model is "gpt-4o-latest"
  the visual reference is the file

  the first field is "invoice number"
  the first field's format is "number"
  the first field's rule is "keep just the first four digits"

  the second field is "invoice amounts"
  the second field's format is "string"
  the second field's rule is "keep just the amount without the currency"

  the third field is "invoice date"
  the third field's format is "date"

the data's invoice amounts
get the invoice number from the data

Example 2

extract data from the text
  the text is "Recently watched films: 'Inception', 'The Matrix', and 'The Godfather'"
  the output format is "list of texts"
  the first field is "movies"
  the first field's rule is "Select the movies that are sci-fi"
movies:
- Inception
- The Matrix