What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
- Pipeline/Backend: StandardPdfPipeline + DoclingParseDocumentBackend (default: docling_parse)
- Key Options:
  - from_formats: Supported input formats include docx, pptx, html, image, pdf, asciidoc, md (including txt, text, qmd, rmd), csv, xlsx, xml_uspto, xml_jats, xml_xbrl, mets_gbs, json_docling, audio, vtt, latex
  - to_formats: Supported output formats include md, json, yaml, html, html_split_page, text, doctags, vtt
  - pdf_backend: Allowed values: pypdfium2, docling_parse, dlparse_v1, dlparse_v2, dlparse_v4 (default: docling_parse)
  - do_ocr (default: True): Use OCR
  - force_ocr: Replace existing text with OCR-generated text
  - ocr_engine, ocr_lang: OCR engine and language options
  - image_export_mode: placeholder, embedded, referenced
  - do_table_structure, table_mode, table_cell_matching: Table extraction options (see the Table Structure Models section below for details on TableFormer V1 and V2)
  - do_code_enrichment, do_formula_enrichment: Code/formula recognition
  - vlm_pipeline_preset, vlm_pipeline_custom_config, picture_description_preset, picture_description_custom_config, code_formula_preset, code_formula_custom_config: New model inference engine and preset options for VLM, picture description, and code/formula extraction
  - images_scale: Image resolution multiplier (default 1.0; CLI default 2.0; values above 2.0 may cause bugs)
  - generate_page_images, generate_picture_images: Extract page/picture images
  - force_backend_text: Force backend text extraction
  - layout_custom_config, table_structure_custom_config: Custom model configs for layout/table structure (see the Table Structure Models section below)
  - Additional options for chart extraction, picture description, and more
- Deprecated: picture_description_local, picture_description_api, vlm_pipeline_model, vlm_pipeline_model_local, vlm_pipeline_model_api (use the new preset/custom config options instead)
- Picture Description Filtering (applies to both PictureDescriptionAPI and PictureDescriptionLocal):
  - classification_allow (List[PictureClassificationLabel] or None): Only describe pictures whose predicted class is in this allow-list
  - classification_deny (List[PictureClassificationLabel] or None): Do not describe pictures whose predicted class is in this deny-list
  - classification_min_confidence (float): Minimum classification confidence required before a picture can be described
Table Structure Models
Docling supports two versions of the TableFormer model for table structure recognition. Both models extract table structure from PDF documents but differ in their configuration and text extraction behavior.
TableFormer V1
The original TableFormer model, configured using TableStructureOptions:
- Configuration Class: TableStructureOptions
- Kind: "docling_tableformer"
- Key Options:
  - do_cell_matching: Controls cell matching behavior. The default is complex (it depends on table_mode and the backend):
    - When True: Matches predictions back to PDF cells using a separate cell matching pipeline
    - When False: Lets the table structure model define the text cells, ignoring PDF cells
  - table_mode: Controls the accuracy vs. speed tradeoff
- Text Extraction: Has a separate cell matching pipeline that can match predictions back to PDF cells
- CLI Model Download: docling models download --model tableformer
TableFormer V2
The newer TableFormer model with improved performance and simplified configuration:
- Configuration Class: TableStructureV2Options
- Kind: "docling_tableformer_v2"
- Key Options:
  - do_cell_matching (default: True): Controls cell matching behavior
    - When True: Matches predictions back to PDF cells. Can break table output if PDF cells are merged across table columns.
    - When False: Lets the table structure model define the text cells, ignoring PDF cells.
- Text Extraction:
  - When do_cell_matching=True: Prefers text from cluster cells (which include OCR-assigned cells), falling back to PDF backend text extraction if the cluster cell text is empty
  - When do_cell_matching=False: Extracts text directly from the PDF backend
- CLI Model Download: docling models download --model tableformerv2
Usage Note: The table_structure_custom_config option in PdfPipelineOptions can be used to specify custom model configurations for either TableFormer V1 or V2.
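As an illustration, selecting TableFormer V2 through table_structure_custom_config might look like the sketch below. The dict shape passed to table_structure_custom_config and the import paths are assumptions based on the option names described above; consult the pipeline options code for the exact schema.

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_table_structure = True
# Select TableFormer V2 and disable cell matching so the model defines
# the text cells itself (useful when PDF cells merge across columns).
pipeline_options.table_structure_custom_config = {
    "kind": "docling_tableformer_v2",
    "do_cell_matching": False,
}

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
```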
PDF (continued)
- Pipeline Option Overrides: The Python API allows you to override pipeline options at conversion time for a given format using the format_options argument. Only do_* flags (such as do_ocr, do_table_structure, do_code_enrichment, do_formula_enrichment, etc.) can be changed, and only from True to False. All other options must remain identical to those used at pipeline initialization. Attempting to enable a do_* flag or change other fields will result in an error. This enables per-call disabling of enrichment features without reinitializing the pipeline.
- Exporting Scanned/Image-Based PDFs: When processing scanned or image-based PDFs with force_full_page_ocr=True, the layout model classifies full-page scans as PictureItem and the OCR text is stored as children of those picture nodes. To export this OCR text via export_to_markdown() or export_to_text(), you must set the traverse_pictures=True parameter. Without this parameter, the export functions return empty results even though OCR text exists in the document.
# Required for scanned/image-based PDFs processed with full-page OCR
result = converter.convert(source="scanned.pdf")
text = result.document.export_to_text(traverse_pictures=True)
markdown = result.document.export_to_markdown(traverse_pictures=True)
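A per-call override of a do_* flag might look like the following sketch. It assumes convert() accepts a format_options argument for conversion-time overrides, as described above; verify the exact signature against your Docling version.

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Initialize the converter once, with formula enrichment enabled.
opts = PdfPipelineOptions(do_ocr=True, do_formula_enrichment=True)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)

# Per-call override: only do_* flags may flip, and only True -> False.
# All other fields must match the options used at initialization.
fast_opts = PdfPipelineOptions(do_ocr=True, do_formula_enrichment=False)
result = converter.convert(
    "report.pdf",
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=fast_opts)},
)
```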
- Notes: Only PDF supports image resolution adjustment. For more details, see the pipeline options code and example. Refer to the Python SDK documentation for usage of format_options. See the API reference for details on the new preset/custom config fields and deprecated options.
DOCX
- Pipeline/Backend: SimplePipeline + MsWordDocumentBackend
- Key Options:
  - Enrichment options (code, formula, chart, image description)
  - Header/Footer Export: Only supported via the Python API by setting included_content_layers={ContentLayer.BODY, ContentLayer.FURNITURE}; the default export excludes headers/footers
- Processing:
  - Multiple Equations in Paragraphs: When a DOCX paragraph contains multiple sibling OMML equations (e.g., multiple <m:oMath> elements), each equation is extracted as a separate FORMULA item in the document structure. This applies to both:
    - Standalone equation paragraphs: Paragraphs containing only equations (no surrounding text) produce multiple separate FORMULA items, one for each equation
    - Inline equations: Multiple equations within text-containing paragraphs are preserved as distinct formula items
  - Previously, multiple sibling equations in a single paragraph were concatenated into a single LaTeX string; this has been fixed so that each equation is maintained as a separate document item
- Notes: Headers/footers are automatically detected as the FURNITURE layer. The CLI/Serve API exports only the BODY layer. See the example.
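A minimal sketch of exporting DOCX headers/footers via the Python API, assuming ContentLayer is importable from docling_core.types.doc (the usual location in docling-core; verify for your version):

```python
from docling.document_converter import DocumentConverter
from docling_core.types.doc import ContentLayer

result = DocumentConverter().convert("report.docx")
# Include the FURNITURE layer (headers/footers) alongside BODY;
# the default export includes only BODY content.
md = result.document.export_to_markdown(
    included_content_layers={ContentLayer.BODY, ContentLayer.FURNITURE}
)
```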
PPTX
- Pipeline/Backend: SimplePipeline + MsPowerpointDocumentBackend
- Key Options:
  - ConvertPipelineOptions (enrichment: image classification/description, chart extraction)
  - PaginatedPipelineOptions (image scaling, page image generation)
- Processing:
- Each slide is treated as a page
- Extracts text (paragraphs, lists, indentation, master styles), images (using PIL), tables (cell/span/header), slide notes (furniture)
- Tables and images include provenance (location info)
- Notes: Image resolution adjustment is not supported (depends on backend quality). Pipeline code reference.
XLSX
- Pipeline/Backend: SimplePipeline + MsExcelDocumentBackend
- Key Options:
  - treat_singleton_as_text (default: False): Treat 1x1 cells as TextItem
  - gap_tolerance (default: 0): Table merging tolerance for empty cells
  - Enrichment options (image description, chart extraction)
- Processing:
- Each sheet is treated as a page
- Table detection via flood-fill, image extraction (bounding box based on cell anchor)
- Includes provenance (location info), auto page size calculation
- Notes: Table detection algorithm and singleton cell handling are configurable. Backend options code.
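The flood-fill table detection mentioned above can be illustrated with a simplified, self-contained sketch. This is not Docling's actual implementation: it treats each connected region of non-empty cells as one table candidate and returns its bounding box; a gap_tolerance > 0 would additionally bridge short runs of empty cells.

```python
from collections import deque

def detect_tables(grid):
    """Group connected non-empty cells into rectangular table regions.

    Simplified flood fill over a 2D list of cell values; empty cells are
    None. Returns bounding boxes as (min_row, min_col, max_row, max_col).
    """
    rows = len(grid)
    cols = len(grid[0]) if grid else 0
    seen = set()
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] is None or (r, c) in seen:
                continue
            # Flood-fill the connected component of non-empty cells.
            queue = deque([(r, c)])
            seen.add((r, c))
            comp = []
            while queue:
                cr, cc = queue.popleft()
                comp.append((cr, cc))
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and grid[nr][nc] is not None and (nr, nc) not in seen):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            # The bounding box of the component becomes one table candidate.
            rs = [p[0] for p in comp]
            cs = [p[1] for p in comp]
            boxes.append((min(rs), min(cs), max(rs), max(cs)))
    return boxes

grid = [
    ["A",  "B",  None, None],
    [1,    2,    None, None],
    [None, None, None, "X"],
]
print(detect_tables(grid))  # two regions: the 2x2 block and the lone "X" cell
```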
Markdown
- Pipeline/Backend: SimplePipeline + MarkdownDocumentBackend
- Supported File Extensions: .md (Markdown), .txt and .text (plain text), .qmd (Quarto Markdown), .rmd and .Rmd (R Markdown)
- MIME Type Support: text/markdown, text/x-markdown, text/plain
- Processing:
  - Parses Markdown syntax to extract structured content (headings, paragraphs, lists, code blocks, etc.)
  - Plain-text files (.txt, .text) and files with the text/plain MIME type are processed through the Markdown backend
  - Markdown supersets (Quarto and R Markdown) are supported; the backend handles prose and heading structure, while language-specific code chunk metadata (e.g., {r}, {python}) is passed through as fenced code blocks
- Notes: USPTO patent files distributed as .txt files (APS format, identified by a PATN\r\n prefix) are detected and routed to the USPTO XML backend instead of the Markdown backend.
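The APS detection described above amounts to sniffing the file prefix. A simplified, illustrative helper (not Docling's actual code) might look like:

```python
def is_uspto_aps(raw: bytes) -> bool:
    """Return True if a .txt payload looks like a USPTO APS patent file.

    APS files begin with a 'PATN' record terminated by CRLF; such files
    should be routed to the USPTO XML backend, not the Markdown backend.
    Simplified illustration of the detection described above.
    """
    return raw.startswith(b"PATN\r\n")

print(is_uspto_aps(b"PATN\r\nWKU  039305848\r\n"))  # True
print(is_uspto_aps(b"# Just a markdown heading\n"))  # False
```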
HTML
- Pipeline/Backend: SimplePipeline + HTMLDocumentBackend
- Installation Requirements: HTML rendering (with headless browser support) requires the htmlrender extra: pip install docling[htmlrender]. This installs Playwright and related dependencies.
- Key Options (HTMLBackendOptions):
  - render_page (bool, default: False): Enable headless browser rendering to capture page images and element bounding boxes
  - render_page_width (int, default: 794): Render page width in CSS pixels (A4 @ 96 DPI)
  - render_page_height (int, default: 1123): Render page height in CSS pixels (A4 @ 96 DPI)
  - render_page_orientation (Literal["portrait", "landscape"], default: "portrait"): Page orientation
  - render_print_media (bool, default: True): Use print media emulation when rendering
  - render_wait_until (Literal["load", "domcontentloaded", "networkidle"], default: "networkidle"): Playwright wait condition before extracting the DOM
  - render_wait_ms (int, default: 0): Extra delay in milliseconds after load
  - render_device_scale (float, default: 1.0): Device scale factor for rendering
  - page_padding (int, default: 0): Padding in CSS pixels applied to the HTML body before rendering
  - render_full_page (bool, default: False): Capture a single full-height page image instead of paginating
  - render_dpi (int, default: 96): DPI used for page images created from rendering
  - fetch_images (bool, default: False): Fetch and embed images from the HTML
  - enable_local_fetch (bool): Enable fetching resources from the local filesystem
  - source_uri (Path or str): Base URI for resolving relative paths in the HTML
- Processing:
- Reading order is preserved from the HTML DOM tree
- Supports HTML form elements: checkboxes, radio buttons, text inputs, and other input fields
- Supports key-value pair conventions where HTML elements with matching IDs (e.g., "key1" and "key1_value1") are automatically paired as key-value relationships
- When render_page=True, uses a Playwright headless browser to materialize HTML pages into images
- Adds provenance with bounding boxes to all elements in the converted document when rendering is enabled
- Can handle local file paths and remote URLs
- A heuristic glues independent inline HTML elements with single-character text into larger text blocks
- Support for inline styling (bold, italic, etc.)
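The key-value pairing convention can be illustrated with a small, self-contained sketch over a list of element ids. This is an illustration of the id-matching idea, not Docling's actual heuristic:

```python
import re

def pair_key_values(ids):
    """Pair element ids like 'key1' with 'key1_value1' into key-value links.

    Simplified illustration of the id-matching convention described above:
    an id of the form '<key>_value<N>' is paired with an element whose id
    is exactly '<key>', when such an element exists.
    """
    pairs = []
    id_set = set(ids)
    for elem_id in ids:
        m = re.fullmatch(r"(.+)_value\d+", elem_id)
        if m and m.group(1) in id_set:
            pairs.append((m.group(1), elem_id))
    return pairs

ids = ["key1", "key1_value1", "key2", "key2_value1", "standalone"]
print(pair_key_values(ids))  # [('key1', 'key1_value1'), ('key2', 'key2_value1')]
```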
- Usage Example:
from pathlib import Path
from docling.datamodel.backend_options import HTMLBackendOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, HTMLFormatOption
html_options = HTMLBackendOptions(
render_page=True,
render_page_width=794,
render_page_height=1123,
render_device_scale=2.0,
render_page_orientation="portrait",
render_print_media=True,
render_wait_until="networkidle",
render_wait_ms=500,
render_full_page=True,
render_dpi=144,
page_padding=16,
fetch_images=True,
)
converter = DocumentConverter(
format_options={
InputFormat.HTML: HTMLFormatOption(backend_options=html_options)
}
)
result = converter.convert("path/to/file.html")
doc = result.document
LaTeX
- Pipeline/Backend: SimplePipeline + LatexDocumentBackend
- Key Options (LatexBackendOptions):
  - parse_timeout (default: 30.0 seconds): Maximum time allowed for parsing a LaTeX document. Set to None to disable the timeout. This prevents pylatexenc from spinning indefinitely when parsing legacy arXiv documents with complex or malformed macro environments. If parsing exceeds this timeout, the conversion falls back to raw text extraction rather than structured parsing, and a warning is logged.
- Processing:
  - Parses LaTeX source using pylatexenc to extract structured content (sections, equations, tables, etc.)
  - Pre-processes custom macros (e.g., \be/\ee shortcuts for equations)
  - Timeout enforcement runs parsing in a daemon thread to allow graceful fallback on timeout
- Notes: The parse_timeout option is particularly useful for processing legacy arXiv documents that may contain complex or malformed macro environments. To configure the timeout:
from docling.datamodel.backend_options import LatexBackendOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, LatexFormatOption
# Increase timeout to 60 seconds
latex_options = LatexBackendOptions(
parse_timeout=60.0
)
converter = DocumentConverter(
format_options={
InputFormat.LATEX: LatexFormatOption(backend_options=latex_options)
}
)
# Or disable timeout entirely
latex_options = LatexBackendOptions(
parse_timeout=None
)
converter = DocumentConverter(
format_options={
InputFormat.LATEX: LatexFormatOption(backend_options=latex_options)
}
)
XBRL
XBRL (eXtensible Business Reporting Language) is a standard XML-based format used globally by companies, regulators, and financial institutions for exchanging business and financial information in a structured, machine-readable format. It's widely adopted for regulatory filings (e.g., SEC filings in the US).
- Pipeline/Backend: SimplePipeline + XBRLDocumentBackend
- Key Options (XBRLBackendOptions):
  - enable_local_fetch (default: True): Enable fetching taxonomy files from the local filesystem
  - enable_remote_fetch (default: True): Enable fetching taxonomy files from remote URLs
  - taxonomy: Path to a local taxonomy directory containing schema and linkbase files
- Processing:
- Parses XBRL instance documents and validates against taxonomy
- Extracts metadata, text blocks, and numeric facts with comprehensive enrichment
- Converts HTML text blocks to structured content
- Numeric facts are extracted as key-value pairs with graph representation, including:
- Period: Distinguishes between instant (point-in-time) and duration (start-end) data
- Unit/Currency: Captures the measurement unit (e.g., USD, shares)
- Decimals: Captures decimal precision information
- Dimensions: Captures dimensional context (e.g., geographical segments, product lines)
- Presentation Linkbase: Captures parent-child relationships that define the hierarchical structure of concepts in the taxonomy
- Calculation Linkbase: Captures summation relationships between concepts, including weights that indicate how child items contribute to parent totals
- Supports offline parsing with local taxonomy packages
- Configuration Example:
from docling.datamodel.backend_options import XBRLBackendOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, XBRLFormatOption
# Configure for offline operation
backend_options = XBRLBackendOptions(
enable_local_fetch=True,
enable_remote_fetch=False,
taxonomy="path/to/taxonomy"
)
converter = DocumentConverter(
allowed_formats=[InputFormat.XML_XBRL],
format_options={
InputFormat.XML_XBRL: XBRLFormatOption(backend_options=backend_options)
}
)
result = converter.convert("path/to/financial_report.xml")
- Notes:
- For completely offline parsing, a taxonomy package (ZIP file containing URL remappings) is required in addition to local schema files
- If no taxonomy package is provided, set enable_remote_fetch=True to fetch remote taxonomy files (they are cached locally for reuse)
- The XBRL backend transforms raw XBRL instances into structured relational graphs within the DoclingDocument using GraphCell and GraphLink structures, preserving concept hierarchies and linkbase relationships
- See the XBRL conversion example for a complete end-to-end workflow with SEC EDGAR filings
Audio and Video Files
📖 For comprehensive documentation including installation instructions, RAG pipeline use cases, model customization, detailed limitations, and additional examples, see the Audio & video processing guide.
Docling's ASR (Automatic Speech Recognition) pipeline transcribes audio and video files into structured documents using Whisper models. Video files have their audio track automatically extracted before transcription. On Apple Silicon, mlx-whisper is used for optimized local inference; on other hardware, native Whisper is used.
- Supported Formats:
- Audio: WAV, MP3, M4A, AAC, OGG, FLAC
- Video: MP4, AVI, MOV (audio track extracted automatically)
- Installation: ASR is an optional extra. Install with pip install "docling[asr]". Some formats (M4A, AAC, OGG, FLAC) and all video formats require ffmpeg to be installed and available on your PATH.
- Pipeline/Backend: AsrPipeline
- Key Options (AsrPipelineOptions):
  - asr_options: Specifies the ASR model (e.g., asr_model_specs.WHISPER_TURBO)
- Usage Example:
from pathlib import Path
from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline
pipeline_options = AsrPipelineOptions()
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO
converter = DocumentConverter(
format_options={
InputFormat.AUDIO: AudioFormatOption(
pipeline_cls=AsrPipeline,
pipeline_options=pipeline_options,
)
}
)
result = converter.convert(Path("recording.mp3"))
print(result.document.export_to_markdown())
- Output Format: Paragraph-level Markdown with timestamps per segment (e.g., [time: 0.0-4.0] Transcribed text here). Suitable for RAG pipelines, summarization, and search indexing.
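For downstream use (e.g., time-aligned chunking for RAG), the timestamped output can be parsed back into segments. A small illustrative helper, assuming the [time: start-end] line format shown above:

```python
import re

def parse_segments(markdown: str):
    """Parse ASR output lines of the form '[time: 0.0-4.0] text' into
    (start, end, text) tuples.

    Illustrative helper for downstream indexing, based on the output
    format described above.
    """
    pattern = re.compile(r"\[time:\s*([\d.]+)-([\d.]+)\]\s*(.*)")
    segments = []
    for line in markdown.splitlines():
        m = pattern.match(line.strip())
        if m:
            segments.append((float(m.group(1)), float(m.group(2)), m.group(3)))
    return segments

sample = "[time: 0.0-4.0] Welcome to the show.\n[time: 4.0-9.5] Today we discuss parsing."
print(parse_segments(sample))
```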
DocumentConverter Initialization Parameters
The DocumentConverter class supports several initialization parameters that control global conversion behavior:
- allowed_formats: List of allowed input formats. By default, any format supported by Docling is allowed.
- format_options: Dictionary of format-specific options (e.g., PdfPipelineOptions, AsrPipelineOptions). See the format-specific sections above for details.
- progress_callback: Optional callback function that receives structured progress events during conversion, including:
  - Document start/complete events (DocumentProgressEvent): Emitted when a document begins or finishes processing. Includes the document name and page count (if available).
  - Pipeline phase transitions (PhaseProgressEvent): Emitted when entering or completing a phase (BUILD, ASSEMBLE, ENRICH).
  - Individual page completions (PageProgressEvent): Emitted when each page finishes processing. Includes the current page number and the total page count.

When no callback is provided (the default), no progress events are emitted and there is zero overhead.
Usage Example:
from docling.datamodel.progress_event import ProgressEvent
from docling.document_converter import DocumentConverter
def on_progress(event: ProgressEvent):
print(event.event_type, event.document_name)
converter = DocumentConverter(progress_callback=on_progress)
result = converter.convert(source="https://arxiv.org/pdf/2408.09869")
CLI Support: The CLI also supports progress tracking via the --progress flag:
docling --progress FILE
Additional Notes
- Only PDF supports image resolution adjustment (images_scale). The default PDF backend is now docling_parse.
- DOCX header/footer export is only available via the Python API.
- PPTX/XLSX support enrichment options and pagination (slide/sheet level).
- Pipeline Option Overrides: For all formats, the Python API supports disabling enrichment-related do_* flags at conversion time using the format_options argument. Only disabling (True → False) is allowed; all other options must remain unchanged. See the PDF section above for details.
- Model Inference Engines and Presets: New fields (vlm_pipeline_preset, vlm_pipeline_custom_config, picture_description_preset, picture_description_custom_config, code_formula_preset, code_formula_custom_config) allow selection of model inference engines and presets for VLM, picture description, and code/formula extraction. The previous options (picture_description_local, picture_description_api, vlm_pipeline_model, vlm_pipeline_model_local, vlm_pipeline_model_api) are deprecated and should be replaced with the new fields.
VLM Engine Options
VllmVlmEngineOptions
The vLLM engine provides high-throughput serving for vision-language models (VLMs). Key configuration options include:
- cudagraph_mode (VllmCudaGraphMode, default: PIECEWISE): Controls the CUDA graph capture mode for the vLLM v1 engine. CUDA graphs reduce kernel-launch overhead by replaying a recorded sequence of CUDA operations instead of launching each kernel individually.
Available modes:
- NONE: Disable CUDA graphs entirely; everything runs in eager mode. Fastest startup, lowest steady-state throughput. Best for short-lived processes, notebooks, and debugging.
- FULL: Capture the entire forward pass as one monolithic CUDA graph. Maximum graph coverage but requires very static execution shapes; may fail with some models or dynamic workloads.
- PIECEWISE: Capture segments of the model (e.g., transformer blocks) as multiple smaller graphs between selected ops. Handles dynamic shapes better than FULL while still accelerating most of the forward pass. (This is the default)
- FULL_AND_PIECEWISE: Hybrid mode - FULL graphs for decode-only batches; PIECEWISE graphs for prefill and mixed prefill+decode batches. Usually the best throughput option for typical LLM serving workloads.
- FULL_DECODE_ONLY: FULL CUDA graphs only for decode batches; prefill and mixed batches run in eager mode. Dramatically reduces graph-capture time and memory footprint compared to FULL_AND_PIECEWISE while still accelerating token generation.
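Selecting a mode might look like the sketch below. The import path is an assumption; the class and enum names follow the descriptions above, so check the API reference for the exact location.

```python
# Sketch: selecting a CUDA graph capture mode for the vLLM engine.
# Import paths are assumptions; verify against your Docling version.
from docling.datamodel.pipeline_options import (
    VllmCudaGraphMode,
    VllmVlmEngineOptions,
)

engine_options = VllmVlmEngineOptions(
    # FULL_AND_PIECEWISE usually gives the best throughput for serving
    # workloads; use NONE for notebooks/debugging where startup dominates.
    cudagraph_mode=VllmCudaGraphMode.FULL_AND_PIECEWISE,
)
```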
KServe v2 API Engine Options
The KServe v2 API engines (ApiKserveV2ImageClassificationEngine and ApiKserveV2ObjectDetectionEngine) support both HTTP and gRPC transports. gRPC is the default transport and is more efficient than HTTP for binary tensor payloads.
ApiKserveV2ImageClassificationEngineOptions
- url (str): Endpoint URL for KServe v2 transport. For transport='http', use http(s)://host[:port] or plain host:port. For transport='grpc', use plain host:port.
- model_name (str): Name of the model to invoke on the KServe v2 server.
- version (Optional[str]): Optional model version. If omitted, the server default is used.
- transport (Literal["grpc", "http"], default: "grpc"): Transport protocol for KServe v2 calls. Use 'grpc' for binary tensor payloads (the default), or 'http' for JSON REST.
- headers (Dict[str, str], default: {}): Optional HTTP headers for authentication/routing when transport='http'.
- grpc_metadata (Dict[str, str], default: {}): Optional gRPC metadata for authentication/routing when transport='grpc'. HTTP headers are not reused in gRPC mode.
- grpc_use_tls (bool, default: False): Whether to use TLS for the gRPC channel. When False, plain-text h2c is used.
- grpc_max_message_bytes (int, default: 67108864): Maximum send/receive gRPC message size in bytes (default 64 MB).
- grpc_use_binary_data (bool, default: True): Whether to request/expect binary tensor payloads on gRPC output tensors. Set to False for servers that do not support binary_data output parameters.
- timeout (float, default: 60.0): Per-request timeout in seconds for both HTTP and gRPC calls.
- request_parameters (Dict[str, Any], default: {}): Optional additional parameters to include in the inference request.
ApiKserveV2ObjectDetectionEngineOptions
The options are identical to ApiKserveV2ImageClassificationEngineOptions above: url, model_name, version, transport, headers, grpc_metadata, grpc_use_tls, grpc_max_message_bytes, grpc_use_binary_data, timeout, and request_parameters.
Notes on KServe v2 Transport
- gRPC is the default transport method for KServe v2 API engines, providing more efficient binary tensor payloads compared to HTTP's JSON encoding.
- When using transport='grpc', use the grpc_metadata parameter instead of headers for authentication/routing. HTTP headers are not reused in gRPC mode.
- The url format differs by transport: for HTTP, include the protocol (e.g., http://localhost:8000); for gRPC, use plain host:port (e.g., localhost:8000).
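A configuration sketch for the image-classification engine over gRPC, following the option names listed above. The import path is an assumption; verify it against the API reference.

```python
# Sketch: configuring the KServe v2 image-classification engine over gRPC.
# Import path is an assumption; option names follow the list above.
from docling.datamodel.pipeline_options import (
    ApiKserveV2ImageClassificationEngineOptions,
)

engine_options = ApiKserveV2ImageClassificationEngineOptions(
    url="localhost:8001",            # plain host:port for gRPC transport
    model_name="picture-classifier", # hypothetical model name on the server
    transport="grpc",
    grpc_metadata={"authorization": "Bearer <token>"},  # not HTTP headers
    grpc_use_tls=False,              # plain-text h2c channel
    grpc_max_message_bytes=64 * 1024 * 1024,
    timeout=60.0,
)
```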