What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
- Pipeline/Backend: StandardPdfPipeline + DoclingParseDocumentBackend (default: docling_parse)
- Key Options:
  - from_formats: Supported input formats include docx, pptx, html, image, pdf, asciidoc, md (including txt, text, qmd, rmd), csv, xlsx, xml_uspto, xml_jats, xml_xbrl, mets_gbs, json_docling, audio, vtt, latex
  - to_formats: Supported output formats include md, json, yaml, html, html_split_page, text, doctags, vtt
  - pdf_backend: Allowed values: pypdfium2, docling_parse, dlparse_v1, dlparse_v2, dlparse_v4 (default: docling_parse)
  - do_ocr (default: True): Use OCR
  - force_ocr: Replace existing text with OCR-generated text
  - ocr_engine, ocr_lang: OCR engine and language options
  - image_export_mode: placeholder, embedded, referenced
  - do_table_structure, table_mode, table_cell_matching: Table extraction options (see the Table Structure Models section below for details on TableFormer V1 and V2)
  - do_code_enrichment, do_formula_enrichment: Code/formula recognition
  - vlm_pipeline_preset, vlm_pipeline_custom_config, picture_description_preset, picture_description_custom_config, code_formula_preset, code_formula_custom_config: New model inference engine and preset options for VLM, picture description, and code/formula extraction
  - images_scale: Image resolution multiplier (default 1.0; CLI default 2.0; values above 2.0 may cause bugs)
  - generate_page_images, generate_picture_images: Extract page/picture images
  - force_backend_text: Force backend text extraction
  - layout_custom_config, table_structure_custom_config: Custom model configs for layout/table structure (see the Table Structure Models section below)
  - Additional options for chart extraction, picture description, and more
- Deprecated: picture_description_local, picture_description_api, vlm_pipeline_model, vlm_pipeline_model_local, vlm_pipeline_model_api (use the new preset/custom config options instead)
- Picture Description Filtering (applies to both PictureDescriptionAPI and PictureDescriptionLocal):
  - classification_allow (List[PictureClassificationLabel] or None): Only describe pictures whose predicted class is in this allow-list
  - classification_deny (List[PictureClassificationLabel] or None): Do not describe pictures whose predicted class is in this deny-list
  - classification_min_confidence (float): Minimum classification confidence required before a picture can be described
Table Structure Models
Docling supports two versions of the TableFormer model for table structure recognition. Both models extract table structure from PDF documents but differ in their configuration and text extraction behavior.
TableFormer V1
The original TableFormer model, configured using TableStructureOptions:
- Configuration Class: TableStructureOptions
- Kind: "docling_tableformer"
- Key Options:
  - do_cell_matching: Controls cell matching behavior. The default is complex (it depends on table_mode and the backend):
    - When True: Matches predictions back to PDF cells using a separate cell matching pipeline
    - When False: Lets the table structure model define the text cells, ignoring PDF cells
  - table_mode: Controls the accuracy vs. speed tradeoff
- Text Extraction: Has a separate cell matching pipeline that can match predictions back to PDF cells
- CLI Model Download: docling models download --model tableformer
TableFormer V2
The newer TableFormer model with improved performance and simplified configuration:
- Configuration Class: TableStructureV2Options
- Kind: "docling_tableformer_v2"
- Key Options:
  - do_cell_matching (default: True): Controls cell matching behavior
    - When True: Matches predictions back to PDF cells. Can break table output if PDF cells are merged across table columns.
    - When False: Lets the table structure model define the text cells, ignoring PDF cells.
- Text Extraction:
  - When do_cell_matching=True: Prefers text from cluster cells (which include OCR-assigned cells), falling back to PDF backend text extraction if the cluster cell text is empty
  - When do_cell_matching=False: Extracts text directly from the PDF backend
- CLI Model Download: docling models download --model tableformerv2
Usage Note: The table_structure_custom_config option in PdfPipelineOptions can be used to specify custom model configurations for either TableFormer V1 or V2.
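As an illustration, selecting TableFormer V2 through table_structure_custom_config might look like the sketch below. The dict shape passed to table_structure_custom_config and the import paths are assumptions based on the option names described above; consult the pipeline options code for the exact schema.

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_table_structure = True
# Select TableFormer V2 and disable cell matching so the model defines
# the text cells itself (useful when PDF cells merge across columns).
pipeline_options.table_structure_custom_config = {
    "kind": "docling_tableformer_v2",
    "do_cell_matching": False,
}

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
```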
PDF (continued)
- Pipeline Option Overrides: The Python API allows you to override pipeline options at conversion time for a given format using the format_options argument. Only do_* flags (such as do_ocr, do_table_structure, do_code_enrichment, do_formula_enrichment, etc.) can be changed, and only from True to False. All other options must remain identical to those used at pipeline initialization. Attempting to enable a do_* flag or change other fields will result in an error. This enables per-call disabling of enrichment features without reinitializing the pipeline.
- Exporting Scanned/Image-Based PDFs: When processing scanned or image-based PDFs with force_full_page_ocr=True, the layout model classifies full-page scans as PictureItem and the OCR text is stored as children of those picture nodes. To export this OCR text via export_to_markdown() or export_to_text(), you must set the traverse_pictures=True parameter. Without this parameter, the export functions return empty results even though OCR text exists in the document.
# Required for scanned/image-based PDFs processed with full-page OCR
result = converter.convert(source="scanned.pdf")
text = result.document.export_to_text(traverse_pictures=True)
markdown = result.document.export_to_markdown(traverse_pictures=True)
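A per-call override of a do_* flag might look like the following sketch. It assumes convert() accepts a format_options argument for conversion-time overrides, as described above; verify the exact signature against your Docling version.

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Initialize the converter once, with formula enrichment enabled.
opts = PdfPipelineOptions(do_ocr=True, do_formula_enrichment=True)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)

# Per-call override: only do_* flags may flip, and only True -> False.
# All other fields must match the options used at initialization.
fast_opts = PdfPipelineOptions(do_ocr=True, do_formula_enrichment=False)
result = converter.convert(
    "report.pdf",
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=fast_opts)},
)
```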
- Notes: Only PDF supports image resolution adjustment. For more details, see the pipeline options code and example. Refer to the Python SDK documentation for usage of format_options. See the API reference for details on the new preset/custom config fields and deprecated options.
DOCX
- Pipeline/Backend: SimplePipeline + MsWordDocumentBackend
- Key Options:
  - Enrichment options (code, formula, chart, image description)
  - Header/Footer Export: Only supported via the Python API by setting included_content_layers={ContentLayer.BODY, ContentLayer.FURNITURE}; the default export excludes headers/footers
- Processing:
  - Multiple Equations in Paragraphs: When a DOCX paragraph contains multiple sibling OMML equations (e.g., multiple <m:oMath> elements), each equation is extracted as a separate FORMULA item in the document structure. This applies to both:
    - Standalone equation paragraphs: Paragraphs containing only equations (no surrounding text) produce multiple separate FORMULA items, one for each equation
    - Inline equations: Multiple equations within text-containing paragraphs are preserved as distinct formula items
  - Previously, multiple sibling equations in a single paragraph were concatenated into a single LaTeX string; this has been fixed so that each equation is maintained as a separate document item
- Notes: Headers/footers are automatically detected as the FURNITURE layer. The CLI/Serve API exports only the BODY layer. See the example.
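A minimal sketch of exporting DOCX headers/footers via the Python API, assuming ContentLayer is importable from docling_core.types.doc (the usual location in docling-core; verify for your version):

```python
from docling.document_converter import DocumentConverter
from docling_core.types.doc import ContentLayer

result = DocumentConverter().convert("report.docx")
# Include the FURNITURE layer (headers/footers) alongside BODY;
# the default export includes only BODY content.
md = result.document.export_to_markdown(
    included_content_layers={ContentLayer.BODY, ContentLayer.FURNITURE}
)
```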
PPTX
- Pipeline/Backend: SimplePipeline + MsPowerpointDocumentBackend
- Key Options:
  - ConvertPipelineOptions (enrichment: image classification/description, chart extraction)
  - PaginatedPipelineOptions (image scaling, page image generation)
- Processing:
- Each slide is treated as a page
- Extracts text (paragraphs, lists, indentation, master styles), images (using PIL), tables (cell/span/header), slide notes (furniture)
- Tables and images include provenance (location info)
- Notes: Image resolution adjustment is not supported (depends on backend quality). Pipeline code reference.
XLSX
- Pipeline/Backend: SimplePipeline + MsExcelDocumentBackend
- Key Options:
  - treat_singleton_as_text (default: False): Treat 1x1 cells as TextItem
  - gap_tolerance (default: 0): Table merging tolerance for empty cells
  - Enrichment options (image description, chart extraction)
- Processing:
- Each sheet is treated as a page
- Table detection via flood-fill, image extraction (bounding box based on cell anchor)
- Includes provenance (location info), auto page size calculation
- Notes: Table detection algorithm and singleton cell handling are configurable. Backend options code.
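The flood-fill table detection mentioned above can be illustrated with a simplified, self-contained sketch. This is not Docling's actual implementation: it treats each connected region of non-empty cells as one table candidate and returns its bounding box; a gap_tolerance > 0 would additionally bridge short runs of empty cells.

```python
from collections import deque

def detect_tables(grid):
    """Group connected non-empty cells into rectangular table regions.

    Simplified flood fill over a 2D list of cell values; empty cells are
    None. Returns bounding boxes as (min_row, min_col, max_row, max_col).
    """
    rows = len(grid)
    cols = len(grid[0]) if grid else 0
    seen = set()
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] is None or (r, c) in seen:
                continue
            # Flood-fill the connected component of non-empty cells.
            queue = deque([(r, c)])
            seen.add((r, c))
            comp = []
            while queue:
                cr, cc = queue.popleft()
                comp.append((cr, cc))
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and grid[nr][nc] is not None and (nr, nc) not in seen):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            # The bounding box of the component becomes one table candidate.
            rs = [p[0] for p in comp]
            cs = [p[1] for p in comp]
            boxes.append((min(rs), min(cs), max(rs), max(cs)))
    return boxes

grid = [
    ["A",  "B",  None, None],
    [1,    2,    None, None],
    [None, None, None, "X"],
]
print(detect_tables(grid))  # two regions: the 2x2 block and the lone "X" cell
```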
Markdown
- Pipeline/Backend: SimplePipeline + MarkdownDocumentBackend
- Supported File Extensions: .md (Markdown), .txt and .text (plain text), .qmd (Quarto Markdown), .rmd and .Rmd (R Markdown)
- MIME Type Support: text/markdown, text/x-markdown, text/plain
- Processing:
  - Parses Markdown syntax to extract structured content (headings, paragraphs, lists, code blocks, etc.)
  - Plain-text files (.txt, .text) and files with the text/plain MIME type are processed through the Markdown backend
  - Markdown supersets (Quarto and R Markdown) are supported; the backend handles prose and heading structure, while language-specific code chunk metadata (e.g., {r}, {python}) is passed through as fenced code blocks
- Notes: USPTO patent files distributed as .txt files (APS format, identified by a PATN\r\n prefix) are detected and routed to the USPTO XML backend instead of the Markdown backend.
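The APS detection described above amounts to sniffing the file prefix. A simplified, illustrative helper (not Docling's actual code) might look like:

```python
def is_uspto_aps(raw: bytes) -> bool:
    """Return True if a .txt payload looks like a USPTO APS patent file.

    APS files begin with a 'PATN' record terminated by CRLF; such files
    should be routed to the USPTO XML backend, not the Markdown backend.
    Simplified illustration of the detection described above.
    """
    return raw.startswith(b"PATN\r\n")

print(is_uspto_aps(b"PATN\r\nWKU  039305848\r\n"))  # True
print(is_uspto_aps(b"# Just a markdown heading\n"))  # False
```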
HTML
- Pipeline/Backend: SimplePipeline + HTMLDocumentBackend
- Installation Requirements: HTML rendering (with headless browser support) requires the htmlrender extra: pip install docling[htmlrender]. This installs Playwright and related dependencies.
- Key Options (HTMLBackendOptions):
  - render_page (bool, default: False): Enable headless browser rendering to capture page images and element bounding boxes
  - render_page_width (int, default: 794): Render page width in CSS pixels (A4 @ 96 DPI)
  - render_page_height (int, default: 1123): Render page height in CSS pixels (A4 @ 96 DPI)
  - render_page_orientation (Literal["portrait", "landscape"], default: "portrait"): Page orientation
  - render_print_media (bool, default: True): Use print media emulation when rendering
  - render_wait_until (Literal["load", "domcontentloaded", "networkidle"], default: "networkidle"): Playwright wait condition before extracting the DOM
  - render_wait_ms (int, default: 0): Extra delay in milliseconds after load
  - render_device_scale (float, default: 1.0): Device scale factor for rendering
  - page_padding (int, default: 0): Padding in CSS pixels applied to the HTML body before rendering
  - render_full_page (bool, default: False): Capture a single full-height page image instead of paginating
  - render_dpi (int, default: 96): DPI used for page images created from rendering
  - fetch_images (bool, default: False): Fetch and embed images from the HTML
  - enable_local_fetch (bool): Enable fetching resources from the local filesystem
  - source_uri (Path or str): Base URI for resolving relative paths in the HTML
- Processing:
- Reading order is preserved from the HTML DOM tree
- Supports HTML form elements: checkboxes, radio buttons, text inputs, and other input fields
- Supports key-value pair conventions where HTML elements with matching IDs (e.g., "key1" and "key1_value1") are automatically paired as key-value relationships
- When render_page=True, uses a Playwright headless browser to materialize HTML pages into images
- Adds provenance with bounding boxes to all elements in the converted document when rendering is enabled
- Can handle local file paths and remote URLs
- A heuristic glues independent inline HTML elements with single-character text into larger text blocks
- Support for inline styling (bold, italic, etc.)
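The key-value pairing convention can be illustrated with a small, self-contained sketch over a list of element ids. This is an illustration of the id-matching idea, not Docling's actual heuristic:

```python
import re

def pair_key_values(ids):
    """Pair element ids like 'key1' with 'key1_value1' into key-value links.

    Simplified illustration of the id-matching convention described above:
    an id of the form '<key>_value<N>' is paired with an element whose id
    is exactly '<key>', when such an element exists.
    """
    pairs = []
    id_set = set(ids)
    for elem_id in ids:
        m = re.fullmatch(r"(.+)_value\d+", elem_id)
        if m and m.group(1) in id_set:
            pairs.append((m.group(1), elem_id))
    return pairs

ids = ["key1", "key1_value1", "key2", "key2_value1", "standalone"]
print(pair_key_values(ids))  # [('key1', 'key1_value1'), ('key2', 'key2_value1')]
```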
- Usage Example:
from pathlib import Path
from docling.datamodel.backend_options import HTMLBackendOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, HTMLFormatOption
html_options = HTMLBackendOptions(
render_page=True,
render_page_width=794,
render_page_height=1123,
render_device_scale=2.0,
render_page_orientation="portrait",
render_print_media=True,
render_wait_until="networkidle",
render_wait_ms=500,
render_full_page=True,
render_dpi=144,
page_padding=16,
fetch_images=True,
)
converter = DocumentConverter(
format_options={
InputFormat.HTML: HTMLFormatOption(backend_options=html_options)
}
)
result = converter.convert("path/to/file.html")
doc = result.document
LaTeX
- Pipeline/Backend: SimplePipeline + LatexDocumentBackend
- Key Options (LatexBackendOptions):
  - parse_timeout (default: 30.0 seconds): Maximum time allowed for parsing a LaTeX document. Set to None to disable the timeout. This prevents pylatexenc from spinning indefinitely when parsing legacy arXiv documents with complex or malformed macro environments. If parsing exceeds this timeout, the conversion falls back to raw text extraction rather than structured parsing, and a warning is logged.
- Processing:
  - Parses LaTeX source using pylatexenc to extract structured content (sections, equations, tables, etc.)
  - Pre-processes custom macros (e.g., \be/\ee shortcuts for equations)
  - Timeout enforcement runs parsing in a daemon thread to allow graceful fallback on timeout
- Notes: The parse_timeout option is particularly useful for processing legacy arXiv documents that may contain complex or malformed macro environments. To configure the timeout:
from docling.datamodel.backend_options import LatexBackendOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, LatexFormatOption
# Increase timeout to 60 seconds
latex_options = LatexBackendOptions(
parse_timeout=60.0
)
converter = DocumentConverter(
format_options={
InputFormat.LATEX: LatexFormatOption(backend_options=latex_options)
}
)
# Or disable timeout entirely
latex_options = LatexBackendOptions(
parse_timeout=None
)
converter = DocumentConverter(
format_options={
InputFormat.LATEX: LatexFormatOption(backend_options=latex_options)
}
)
XBRL
XBRL (eXtensible Business Reporting Language) is a standard XML-based format used globally by companies, regulators, and financial institutions for exchanging business and financial information in a structured, machine-readable format. It's widely adopted for regulatory filings (e.g., SEC filings in the US).
- Pipeline/Backend: SimplePipeline + XBRLDocumentBackend
- Key Options (XBRLBackendOptions):
  - enable_local_fetch (default: True): Enable fetching taxonomy files from the local filesystem
  - enable_remote_fetch (default: True): Enable fetching taxonomy files from remote URLs
  - taxonomy: Path to a local taxonomy directory containing schema and linkbase files
- Processing:
- Parses XBRL instance documents and validates against taxonomy
- Extracts metadata, text blocks, and numeric facts with comprehensive enrichment
- Converts HTML text blocks to structured content
- Numeric facts are extracted as key-value pairs with graph representation, including:
- Period: Distinguishes between instant (point-in-time) and duration (start-end) data
- Unit/Currency: Captures the measurement unit (e.g., USD, shares)
- Decimals: Captures decimal precision information
- Dimensions: Captures dimensional context (e.g., geographical segments, product lines)
- Presentation Linkbase: Captures parent-child relationships that define the hierarchical structure of concepts in the taxonomy
- Calculation Linkbase: Captures summation relationships between concepts, including weights that indicate how child items contribute to parent totals
- Supports offline parsing with local taxonomy packages
- Configuration Example:
from docling.datamodel.backend_options import XBRLBackendOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, XBRLFormatOption
# Configure for offline operation
backend_options = XBRLBackendOptions(
enable_local_fetch=True,
enable_remote_fetch=False,
taxonomy="path/to/taxonomy"
)
converter = DocumentConverter(
allowed_formats=[InputFormat.XML_XBRL],
format_options={
InputFormat.XML_XBRL: XBRLFormatOption(backend_options=backend_options)
}
)
result = converter.convert("path/to/financial_report.xml")
- Notes:
- For completely offline parsing, a taxonomy package (ZIP file containing URL remappings) is required in addition to local schema files
- If no taxonomy package is provided, set enable_remote_fetch=True to fetch remote taxonomy files (they are cached locally for reuse)
- The XBRL backend transforms raw XBRL instances into structured relational graphs within the DoclingDocument using GraphCell and GraphLink structures, preserving concept hierarchies and linkbase relationships
- See the XBRL conversion example for a complete end-to-end workflow with SEC EDGAR filings
Audio and Video Files
📖 For comprehensive documentation including installation instructions, RAG pipeline use cases, model customization, detailed limitations, and additional examples, see the Audio & video processing guide.
Docling's ASR (Automatic Speech Recognition) pipeline transcribes audio and video files into structured documents using Whisper models. Video files have their audio track automatically extracted before transcription. On Apple Silicon, mlx-whisper is used for optimized local inference; on other hardware, native Whisper is used.
- Supported Formats:
- Audio: WAV, MP3, M4A, AAC, OGG, FLAC
- Video: MP4, AVI, MOV (audio track extracted automatically)
- Installation: ASR is an optional extra. Install with pip install "docling[asr]". Some formats (M4A, AAC, OGG, FLAC) and all video formats require ffmpeg to be installed and available on your PATH.
- Pipeline/Backend: AsrPipeline
- Key Options (AsrPipelineOptions):
  - asr_options: Specifies the ASR model (e.g., asr_model_specs.WHISPER_TURBO)
- Usage Example:
from pathlib import Path
from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline
pipeline_options = AsrPipelineOptions()
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO
converter = DocumentConverter(
format_options={
InputFormat.AUDIO: AudioFormatOption(
pipeline_cls=AsrPipeline,
pipeline_options=pipeline_options,
)
}
)
result = converter.convert(Path("recording.mp3"))
print(result.document.export_to_markdown())
- Output Format: Paragraph-level Markdown with timestamps per segment (e.g., [time: 0.0-4.0] Transcribed text here). Suitable for RAG pipelines, summarization, and search indexing.
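For downstream use (e.g., time-aligned chunking for RAG), the timestamped output can be parsed back into segments. A small illustrative helper, assuming the [time: start-end] line format shown above:

```python
import re

def parse_segments(markdown: str):
    """Parse ASR output lines of the form '[time: 0.0-4.0] text' into
    (start, end, text) tuples.

    Illustrative helper for downstream indexing, based on the output
    format described above.
    """
    pattern = re.compile(r"\[time:\s*([\d.]+)-([\d.]+)\]\s*(.*)")
    segments = []
    for line in markdown.splitlines():
        m = pattern.match(line.strip())
        if m:
            segments.append((float(m.group(1)), float(m.group(2)), m.group(3)))
    return segments

sample = "[time: 0.0-4.0] Welcome to the show.\n[time: 4.0-9.5] Today we discuss parsing."
print(parse_segments(sample))
```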
DocumentConverter Initialization Parameters
The DocumentConverter class supports several initialization parameters that control global conversion behavior:
- allowed_formats: List of allowed input formats. By default, any format supported by Docling is allowed.
- format_options: Dictionary of format-specific options (e.g., PdfPipelineOptions, AsrPipelineOptions). See the format-specific sections above for details.
- progress_callback: Optional callback function that receives structured progress events during conversion, including:
  - Document start/complete events (DocumentProgressEvent): Emitted when a document begins or finishes processing. Includes the document name and page count (if available).
  - Pipeline phase transitions (PhaseProgressEvent): Emitted when entering or completing a phase (BUILD, ASSEMBLE, ENRICH).
  - Individual page completions (PageProgressEvent): Emitted when each page finishes processing. Includes the current page number and the total page count.

When no callback is provided (the default), no progress events are emitted and there is zero overhead.
Usage Example:
from docling.datamodel.progress_event import ProgressEvent
from docling.document_converter import DocumentConverter
def on_progress(event: ProgressEvent):
print(event.event_type, event.document_name)
converter = DocumentConverter(progress_callback=on_progress)
result = converter.convert(source="https://arxiv.org/pdf/2408.09869")
CLI Support: The CLI also supports progress tracking via the --progress flag:
docling --progress FILE
Additional Notes
- Only PDF supports image resolution adjustment (images_scale). The default PDF backend is now docling_parse.
- DOCX header/footer export is only available via the Python API.
- PPTX/XLSX support enrichment options and pagination (slide/sheet level).
- Pipeline Option Overrides: For all formats, the Python API supports disabling enrichment-related do_* flags at conversion time using the format_options argument. Only disabling (True → False) is allowed; all other options must remain unchanged. See the PDF section above for details.
- Model Inference Engines and Presets: New fields (vlm_pipeline_preset, vlm_pipeline_custom_config, picture_description_preset, picture_description_custom_config, code_formula_preset, code_formula_custom_config) allow selection of model inference engines and presets for VLM, picture description, and code/formula extraction. The previous options (picture_description_local, picture_description_api, vlm_pipeline_model, vlm_pipeline_model_local, vlm_pipeline_model_api) are deprecated and should be replaced with the new fields.
VLM Engine Options
VllmVlmEngineOptions
The vLLM engine provides high-throughput serving for vision-language models (VLMs). Key configuration options include:
- cudagraph_mode (VllmCudaGraphMode, default: PIECEWISE): Controls the CUDA graph capture mode for the vLLM v1 engine. CUDA graphs reduce kernel-launch overhead by replaying a recorded sequence of CUDA operations instead of launching each kernel individually.
Available modes:
- NONE: Disable CUDA graphs entirely; everything runs in eager mode. Fastest startup, lowest steady-state throughput. Best for short-lived processes, notebooks, and debugging.
- FULL: Capture the entire forward pass as one monolithic CUDA graph. Maximum graph coverage but requires very static execution shapes; may fail with some models or dynamic workloads.
- PIECEWISE: Capture segments of the model (e.g., transformer blocks) as multiple smaller graphs between selected ops. Handles dynamic shapes better than FULL while still accelerating most of the forward pass. (This is the default)
- FULL_AND_PIECEWISE: Hybrid mode - FULL graphs for decode-only batches; PIECEWISE graphs for prefill and mixed prefill+decode batches. Usually the best throughput option for typical LLM serving workloads.
- FULL_DECODE_ONLY: FULL CUDA graphs only for decode batches; prefill and mixed batches run in eager mode. Dramatically reduces graph-capture time and memory footprint compared to FULL_AND_PIECEWISE while still accelerating token generation.
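Selecting a mode might look like the sketch below. The import path is an assumption; the class and enum names follow the descriptions above, so check the API reference for the exact location.

```python
# Sketch: selecting a CUDA graph capture mode for the vLLM engine.
# Import paths are assumptions; verify against your Docling version.
from docling.datamodel.pipeline_options import (
    VllmCudaGraphMode,
    VllmVlmEngineOptions,
)

engine_options = VllmVlmEngineOptions(
    # FULL_AND_PIECEWISE usually gives the best throughput for serving
    # workloads; use NONE for notebooks/debugging where startup dominates.
    cudagraph_mode=VllmCudaGraphMode.FULL_AND_PIECEWISE,
)
```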
KServe v2 API Engine Options
The KServe v2 API engines (ApiKserveV2ImageClassificationEngine and ApiKserveV2ObjectDetectionEngine) support both HTTP and gRPC transports. gRPC is the default transport and is more efficient than HTTP for binary tensor payloads.
ApiKserveV2ImageClassificationEngineOptions
- url (str): Endpoint URL for KServe v2 transport. For transport='http', use http(s)://host[:port] or plain host:port. For transport='grpc', use plain host:port.
- model_name (str): Name of the model to invoke on the KServe v2 server.
- version (Optional[str]): Optional model version. If omitted, the server default is used.
- transport (Literal["grpc", "http"], default: "grpc"): Transport protocol for KServe v2 calls. Use 'grpc' for binary tensor payloads (the default), or 'http' for JSON REST.
- headers (Dict[str, str], default: {}): Optional HTTP headers for authentication/routing when transport='http'.
- grpc_metadata (Dict[str, str], default: {}): Optional gRPC metadata for authentication/routing when transport='grpc'. HTTP headers are not reused in gRPC mode.
- grpc_use_tls (bool, default: False): Whether to use TLS for the gRPC channel. When False, plain-text h2c is used.
- grpc_max_message_bytes (int, default: 67108864): Maximum send/receive gRPC message size in bytes (default 64 MB).
- grpc_use_binary_data (bool, default: True): Whether to request/expect binary tensor payloads on gRPC output tensors. Set to False for servers that do not support binary_data output parameters.
- timeout (float, default: 60.0): Per-request timeout in seconds for both HTTP and gRPC calls.
- request_parameters (Dict[str, Any], default: {}): Optional additional parameters to include in the inference request.
ApiKserveV2ObjectDetectionEngineOptions
The options are identical to ApiKserveV2ImageClassificationEngineOptions above: url, model_name, version, transport, headers, grpc_metadata, grpc_use_tls, grpc_max_message_bytes, grpc_use_binary_data, timeout, and request_parameters.
Notes on KServe v2 Transport
- gRPC is the default transport method for KServe v2 API engines, providing more efficient binary tensor payloads compared to HTTP's JSON encoding.
- When using transport='grpc', use the grpc_metadata parameter instead of headers for authentication/routing. HTTP headers are not reused in gRPC mode.
- The url format differs by transport: for HTTP, include the protocol (e.g., http://localhost:8000); for gRPC, use plain host:port (e.g., localhost:8000).
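A configuration sketch for the image-classification engine over gRPC, following the option names listed above. The import path is an assumption; verify it against the API reference.

```python
# Sketch: configuring the KServe v2 image-classification engine over gRPC.
# Import path is an assumption; option names follow the list above.
from docling.datamodel.pipeline_options import (
    ApiKserveV2ImageClassificationEngineOptions,
)

engine_options = ApiKserveV2ImageClassificationEngineOptions(
    url="localhost:8001",            # plain host:port for gRPC transport
    model_name="picture-classifier", # hypothetical model name on the server
    transport="grpc",
    grpc_metadata={"authorization": "Bearer <token>"},  # not HTTP headers
    grpc_use_tls=False,              # plain-text h2c channel
    grpc_max_message_bytes=64 * 1024 * 1024,
    timeout=60.0,
)
```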