Overview of the PDF integration.
The following documentation is for PDF v1.7.0.
Overview
This integration enables converting PDF files to other formats, providing document transformation capabilities for your automation workflows.
Setup
The following integrations need to be connected to your Kognitos workspace:
PDF
Steps
Follow these steps to connect the integration in Kognitos:
Configure
Add a name for the connection. You'll be prompted for authentication details if needed. Then, click on Connect.
Actions
The following actions are available in the PDF integration:
1. Convert a pdf file to docx
Convert a PDF file to DOCX format.
2. Extract pages from a pdf
Extracts specific pages from a PDF file to create a new PDF.
3. Get a pdf's fields
Gets all form fields from a PDF file.
4. Get a pdf's labels
Gets all text labels (text spans) from a PDF file.
5. Get a pdf's lines
Gets all text lines from a PDF file.
6. Get a pdf's page count
Gets the total number of pages in a PDF file.
7. Read a pdf file
Read text from a PDF file, optionally by page.
8. Remove duplicates from a pdf
Removes duplicate pages from a PDF based on text similarity.
9. Set a pdf field's value
Sets the value of a form field in a PDF.
Concepts
Pdf
Configuration for PDF to DOCX conversion.All fields have optimal defaults. Override specific values as needed.
debug
Set to True for debugging layout issues
optional[boolean]
ignore_page_error
Continue conversion even if a page fails
optional[boolean]
parse_lattice_table
Parse tables with visible borders
optional[boolean]
parse_stream_table
Parse tables without visible borders
optional[boolean]
extract_stream_table
Extract stream tables separately
optional[boolean]
clip_image_res_ratio
Resolution ratio (4x = 288dpi from 72dpi base)
optional[number]
min_section_height
Minimum height for a valid section
optional[number]
max_line_spacing_ratio
Maximum line spacing ratio
optional[number]
line_overlap_threshold
Delete overlapping lines (higher = less aggressive)
optional[number]
line_break_width_ratio
Break line if too narrow
optional[number]
line_break_free_space_ratio
Break line if too much free space
optional[number]
line_separate_threshold
Distance threshold for separate lines
optional[number]
new_paragraph_free_space_ratio
New paragraph threshold
optional[number]
lines_left_aligned_threshold
Left alignment threshold (points)
optional[number]
lines_right_aligned_threshold
Right alignment threshold (points)
optional[number]
lines_center_aligned_threshold
Center alignment threshold (points)
optional[number]
connected_border_tolerance
Border connection tolerance
optional[number]
max_border_width
Maximum border width
optional[number]
min_border_clearance
Minimum clearance between borders
optional[number]
page_margin_factor_top
Top margin reduction factor [0,1]
optional[number]
page_margin_factor_bottom
Bottom margin reduction factor [0,1]
optional[number]
shape_min_dimension
Ignore shapes smaller than this
optional[number]
float_image_ignorable_gap
Float image gap threshold
optional[number]
min_svg_gap_dx
Merge vector graphics horizontal gap
optional[number]
min_svg_gap_dy
Merge vector graphics vertical gap
optional[number]
min_svg_w
Minimum SVG width
optional[number]
min_svg_h
Minimum SVG height
optional[number]
delete_end_line_hyphen
Keep hyphens at line ends
optional[boolean]
multi_processing
Enable for faster conversion of large files
optional[boolean]
cpu_count
0 = use all CPUs, or specify number
optional[number]
Pdf field
Represents a form field extracted from a PDF.
name
The field name
text
value
The current field value
optional[text]
type
The field type (text, checkbox, combobox, etc.)
text
page
Page number (0-indexed)
number
bbox
Bounding box coordinates
json
Pdf bounding box
Represents the bounding box coordinates for a PDF element.
x0
Left x-coordinate
number
y0
Top y-coordinate
number
x1
Right x-coordinate
number
y1
Bottom y-coordinate
number
Pdf label
Represents a text label (text span) extracted from a PDF.
text
The label text content
text
page
Page number (0-indexed)
number
bbox
Bounding box coordinates
json
Pdf line
Represents a text line extracted from a PDF.
text
The complete line text
text
page
Page number (0-indexed)
number
bbox
Bounding box coordinates
json
Last updated
Was this helpful?

