Textract - Document extraction

Working with Textract OCR on Kognitos

Introduction

This manual explains how to use Textract with Kognitos. Textract is the default OCR set for your department. The Document Processing Book covers the same material and is a useful reference when switching between different OCRs.

1: Setting Up for Textract

Integrating Amazon Textract with Kognitos enables the extraction of text, forms, and tables from images or PDFs, transforming them into actionable, searchable data. This section describes Textract's capabilities so that you know what to expect from extraction.

1.1 Understanding Textract Capabilities

Amazon Textract goes beyond simple OCR (Optical Character Recognition) to identify the contents of fields in forms and information stored in tables. It's designed to understand the layout and structure of documents, which allows for more accurate extraction of data. Textract can handle a variety of document formats, including scanned documents, PDFs, and photos of documents.

Key capabilities include:

  • Text Extraction: Retrieves lines and words of text from documents, maintaining their layout.
  • Form Extraction: Identifies form labels and the corresponding data.
  • Table Recognition: Extracts tables from documents, recognizing rows, columns, and cells.

Understanding these capabilities is crucial for effectively using Textract with Kognitos to automate data extraction tasks, enhance data entry processes, and streamline document workflows.
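
The three capabilities listed above correspond to different block types in the JSON that Textract returns. The following Python sketch is illustrative only: the sample response is hand-written and heavily trimmed, and the mapping table and the function name count_blocks_by_capability are assumptions made for this example, not part of Kognitos or Textract.

```python
from collections import defaultdict

# Hand-written sample shaped like a trimmed Textract response.
# Real responses carry more fields (Geometry, Relationships, Id, ...).
sample_response = {
    "Blocks": [
        {"BlockType": "LINE", "Text": "Invoice Number: INV-001"},
        {"BlockType": "WORD", "Text": "Invoice"},
        {"BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"]},
        {"BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"]},
        {"BlockType": "TABLE"},
        {"BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1},
    ]
}

# Assumed grouping of Textract block types into the capabilities above.
CAPABILITY_BY_BLOCK_TYPE = {
    "LINE": "text",
    "WORD": "text",
    "KEY_VALUE_SET": "forms",
    "TABLE": "tables",
    "CELL": "tables",
}


def count_blocks_by_capability(response):
    """Count how many blocks in a response belong to each capability."""
    counts = defaultdict(int)
    for block in response.get("Blocks", []):
        capability = CAPABILITY_BY_BLOCK_TYPE.get(block["BlockType"], "other")
        counts[capability] += 1
    return dict(counts)


summary = count_blocks_by_capability(sample_response)
print(summary)  # {'text': 2, 'forms': 2, 'tables': 2}
```

A count like this can be a quick way to check whether a document actually contains forms or tables before building a process around them.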


2: Extracting Document Content

Integrating Amazon Textract with Kognitos enables users to extract specific content from documents, transforming scanned documents into actionable, searchable data. This section guides you through the process of basic content extraction and adjusting OCR confidence thresholds to ensure data accuracy.

2.1 Basic Content Extraction

Extracting specific pieces of information, such as dates and invoice numbers, from documents is a straightforward process in Kognitos when integrated with Textract. This functionality is crucial for automating data entry, enhancing document management systems, and streamlining workflows.

  • Extracting Dates from Documents
    Extracting dates is a common requirement for processing documents like invoices, contracts, and letters. Textract identifies dates in the document content and extracts them.
    Example:

    get the document
    get the document's date # <----
    
    

    This command retrieves the date from the document, simplifying tasks such as document sorting, archiving, and compliance checks.

  • Extracting Invoice Numbers
    Invoice processing often requires extracting unique identifiers like invoice numbers. Textract's ability to recognize and extract such information reduces manual review and data entry errors.
    Example:

    get the document
    the department's OCR confidence threshold is 80
    get the document's invoice number # <----
    

    Setting a higher OCR confidence threshold makes it more likely that the extracted invoice number is accurate, which improves data reliability for further processing.
    The threshold line is optional; the same extraction also runs without it.
    Example:

    get the document
    get the document's invoice number # <----
    

2.2 Using OCR Confidence Threshold

The OCR confidence threshold is the minimum confidence level an OCR result must reach to be considered valid. Adjusting this threshold lets you balance accuracy against the breadth of data extracted.

  • Adjusting the OCR Confidence Threshold
    Depending on the document's quality and the importance of accuracy for the use case, adjusting the OCR confidence threshold can significantly impact the results. A higher threshold reduces the risk of incorrect data extraction but may omit less clear text.
    Example:

    get the document
    the department's OCR confidence threshold is 80
    get the document's invoice number # <----
    
    

    In this example, setting the OCR confidence threshold to 80 ensures that only data recognized with high confidence is extracted, suitable for scenarios where accuracy is paramount.
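
The effect of a threshold can be pictured as a simple filter over extracted values. The Python sketch below is a conceptual illustration, not how Kognitos implements the setting: the field dictionary, the sample values, and the function name filter_by_confidence are hypothetical, and the source does not state whether the comparison is inclusive or how rejected values are handled, so "at or above the threshold" is an assumption.

```python
def filter_by_confidence(fields, threshold):
    """Keep only fields whose confidence is at or above the threshold.

    Assumption: confidence is expressed on a 0-100 scale, matching the
    value 80 used in the example above.
    """
    return {
        name: field["value"]
        for name, field in fields.items()
        if field["confidence"] >= threshold
    }


# Hypothetical extraction results for one document.
extracted = {
    "invoice number": {"value": "INV-001", "confidence": 97.4},
    "date": {"value": "2024-03-1S", "confidence": 62.0},  # smudged digit
}

accepted = filter_by_confidence(extracted, 80)
print(accepted)  # {'invoice number': 'INV-001'}
```

With the threshold at 80 the poorly recognized date is dropped, while a lower threshold such as 50 would keep both values, including the misread one. This is the accuracy versus breadth trade-off described above.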

By leveraging Textract's capabilities with Kognitos, users can automate the extraction of critical information from documents, significantly reducing manual data entry and enhancing data accuracy. The ability to adjust the OCR confidence threshold further refines this process, allowing for customization based on specific needs and document quality.


3: Advanced Textract Operations

Leveraging Amazon Textract's advanced features through Kognitos enables more complex document processing tasks, such as extracting specific types of information through queries. This section explores how to utilize Textract queries for targeted information extraction and manage multiple queries for comprehensive document analysis.

3.1 Extracting with Textract Queries

Textract queries allow for the extraction of particular data points from documents, making it possible to focus on specific information like dates, names, invoice numbers, and more. This targeted approach is invaluable for applications requiring precise data extraction from varied document formats.

  • Extracting Specific Information Using Queries
    When you need to extract specific types of information from a document, Textract queries can be defined to pinpoint exactly what you're looking for. This capability is crucial for processing documents where only certain data points are relevant to your workflow.
    Example:

    get the scanned document
    the document's textract queries are "DATE", "NAME"
    the scanned document's textract is # <----
    
    

    This command uses Textract queries to extract dates and names from the scanned document, demonstrating how to retrieve multiple types of information simultaneously.

  • Handling Multiple Queries
    For comprehensive document analysis, handling multiple queries efficiently is key. Textract's integration with Kognitos allows for the execution of several queries in a single operation, streamlining the data extraction process.
    Example:

    get the scanned document
    the scanned document's textract is # <----
    
    

    Although this snippet does not specify the queries explicitly, the same operation can handle multiple Textract queries and extract a wide range of information from the document's content.
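
For readers curious about what several queries in a single operation look like at the Textract level, the Python sketch below builds the request parameters for Textract's AnalyzeDocument operation with the QUERIES feature. It is a sketch under assumptions: how Kognitos translates the textract queries line into a request is not documented here, and the helper name build_queries_request, the alias scheme, and the placeholder bytes are hypothetical. No call to AWS is made.

```python
def build_queries_request(document_bytes, queries):
    """Build AnalyzeDocument parameters that carry several queries at once."""
    return {
        "Document": {"Bytes": document_bytes},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {
            "Queries": [
                # Alias is optional in Textract; deriving it from the query
                # text is a choice made for this example only.
                {"Text": query, "Alias": query.lower().replace(" ", "_")}
                for query in queries
            ]
        },
    }


# Placeholder bytes stand in for a real scanned document.
request = build_queries_request(b"placeholder-document-bytes", ["DATE", "NAME"])
print([q["Text"] for q in request["QueriesConfig"]["Queries"]])  # ['DATE', 'NAME']

# With AWS credentials configured, a request like this could be sent with
# boto3: boto3.client("textract").analyze_document(**request)
```

All queries travel in one request, which is why adding a second or third query does not require a separate extraction step.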

3.2 Managing Query Results

Once Textract queries have been executed, managing and utilizing the extracted data effectively is crucial. The results, typically returned as a JSON object, contain valuable information that can be integrated into databases, used for further analysis, or leveraged to automate business processes.

  • Interpreting Textract Query Results
    Understanding the structure of Textract's query results is essential for extracting actionable insights. The JSON output includes details such as the extracted text, confidence levels, and the location of the text within the document, enabling precise data handling and integration.
  • Utilizing Extracted Data
    The data extracted via Textract queries can be used in various ways, depending on your application's needs. Whether updating records in a database, triggering workflows based on extracted information, or conducting further analysis, the flexibility of Kognitos allows for seamless integration of Textract data into your processes.
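
As an illustration of the result structure, the Python sketch below pairs each query with its answer. In Textract's JSON output, a QUERY block points to its QUERY_RESULT block through an ANSWER relationship. The sample blocks are hand-written and trimmed (location data is omitted), and the function name collect_query_answers is hypothetical; how Kognitos exposes these results inside a process may differ.

```python
# Hand-written, trimmed sample of Textract query blocks.
sample_blocks = [
    {
        "Id": "q1",
        "BlockType": "QUERY",
        "Query": {"Text": "DATE"},
        "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}],
    },
    {"Id": "a1", "BlockType": "QUERY_RESULT", "Text": "2024-03-15", "Confidence": 96.0},
    # A query with no ANSWER relationship: nothing was found for it.
    {"Id": "q2", "BlockType": "QUERY", "Query": {"Text": "NAME"}},
]


def collect_query_answers(blocks):
    """Map each query text to its first answer, or None if unanswered."""
    blocks_by_id = {block["Id"]: block for block in blocks}
    answers = {}
    for block in blocks:
        if block["BlockType"] != "QUERY":
            continue
        # Gather the ids of all answer blocks linked to this query.
        answer_ids = [
            answer_id
            for relationship in block.get("Relationships", [])
            if relationship["Type"] == "ANSWER"
            for answer_id in relationship["Ids"]
        ]
        result = blocks_by_id.get(answer_ids[0]) if answer_ids else None
        answers[block["Query"]["Text"]] = (
            {"text": result["Text"], "confidence": result["Confidence"]}
            if result is not None
            else None
        )
    return answers


answers = collect_query_answers(sample_blocks)
print(answers["DATE"])  # {'text': '2024-03-15', 'confidence': 96.0}
print(answers["NAME"])  # None
```

Keeping unanswered queries in the output as None makes missing data explicit, so a downstream step can decide whether to request manual review.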

4: Using Extracted Data in Kognitos

With the extracted information now structured and accessible, you can leverage Kognitos to automate further operations, such as:

  • Updating records in a database based on extracted form data.
  • Triggering workflows or alerts based on specific data points, like due dates or payment amounts from invoices.
  • Conducting data analysis by integrating extracted table data into analytics platforms.
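
As one example of a trigger based on an extracted data point, the Python sketch below decides whether an invoice needs a payment alert. It is illustrative only: the function name needs_payment_alert, the seven-day warning window, and the assumption that the due date arrives as an ISO-formatted string (YYYY-MM-DD) are all choices made for this example.

```python
from datetime import date


def needs_payment_alert(due_date_text, today, warn_days=7):
    """Return True when the due date is within warn_days of today or already past.

    Assumption: due_date_text is an ISO date string such as "2024-03-20".
    """
    due = date.fromisoformat(due_date_text)
    return (due - today).days <= warn_days


today = date(2024, 3, 15)
print(needs_payment_alert("2024-03-20", today))  # True: due in 5 days
print(needs_payment_alert("2024-04-20", today))  # False: due in 36 days
```

Dates extracted from real documents often come in other formats, so a normalization step would usually be needed before a comparison like this one.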

By mastering advanced Textract operations with Kognitos, users can unlock powerful document processing capabilities, extracting specific information with precision and integrating extracted data into broader workflows and systems. This advanced functionality enhances the automation potential of document-based processes, driving efficiency and accuracy across operations.