What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?

PDF


Table Structure Models

Docling supports two versions of the TableFormer model for table structure recognition. Both models extract table structure from PDF documents but differ in their configuration and text extraction behavior.

TableFormer V1

The original TableFormer model, configured using TableStructureOptions:

TableFormer V2

The newer TableFormer model with improved performance and simplified configuration:

Usage Note: The table_structure_custom_config option in PdfPipelineOptions can be used to specify custom model configurations for either TableFormer V1 or V2.


PDF (continued)

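# Required for scanned/image-based PDFs processed with full-page OCR:
# traverse_pictures=True makes the export include text nested under picture items.
# Assumes `converter` is an already-configured DocumentConverter.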
result = converter.convert(source="scanned.pdf")
text = result.document.export_to_text(traverse_pictures=True)
markdown = result.document.export_to_markdown(traverse_pictures=True)

DOCX


PPTX


XLSX


Markdown


HTML

from pathlib import Path
from docling.datamodel.backend_options import HTMLBackendOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, HTMLFormatOption

html_options = HTMLBackendOptions(
    render_page=True,
    render_page_width=794,
    render_page_height=1123,
    render_device_scale=2.0,
    render_page_orientation="portrait",
    render_print_media=True,
    render_wait_until="networkidle",
    render_wait_ms=500,
    render_full_page=True,
    render_dpi=144,
    page_padding=16,
    fetch_images=True,
)

converter = DocumentConverter(
    format_options={
        InputFormat.HTML: HTMLFormatOption(backend_options=html_options)
    }
)

result = converter.convert(Path("path/to/file.html"))
doc = result.document

LaTeX

from docling.datamodel.backend_options import LatexBackendOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, LatexFormatOption

# Increase timeout to 60 seconds
latex_options = LatexBackendOptions(
    parse_timeout=60.0
)

converter = DocumentConverter(
    format_options={
        InputFormat.LATEX: LatexFormatOption(backend_options=latex_options)
    }
)

# Or disable timeout entirely
latex_options = LatexBackendOptions(
    parse_timeout=None
)

converter = DocumentConverter(
    format_options={
        InputFormat.LATEX: LatexFormatOption(backend_options=latex_options)
    }
)

XBRL

XBRL (eXtensible Business Reporting Language) is an XML-based standard used globally by companies, regulators, and financial institutions to exchange business and financial information in a structured, machine-readable form. It is widely adopted for regulatory filings (e.g., SEC filings in the US).

from docling.datamodel.backend_options import XBRLBackendOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, XBRLFormatOption

# Configure for offline operation
backend_options = XBRLBackendOptions(
    enable_local_fetch=True,
    enable_remote_fetch=False,
    taxonomy="path/to/taxonomy"
)

converter = DocumentConverter(
    allowed_formats=[InputFormat.XML_XBRL],
    format_options={
        InputFormat.XML_XBRL: XBRLFormatOption(backend_options=backend_options)
    }
)

result = converter.convert("path/to/financial_report.xml")

Audio and Video Files


Docling's ASR (Automatic Speech Recognition) pipeline transcribes audio and video files into structured documents using Whisper models. Video files have their audio track automatically extracted before transcription. On Apple Silicon, mlx-whisper is used for optimized local inference; on other hardware, native Whisper is used.

from pathlib import Path
from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline

pipeline_options = AsrPipelineOptions()
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO

converter = DocumentConverter(
    format_options={
        InputFormat.AUDIO: AudioFormatOption(
            pipeline_cls=AsrPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert(Path("recording.mp3"))
print(result.document.export_to_markdown())
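
The backend choice described above (mlx-whisper on Apple Silicon, native Whisper elsewhere) can be pictured with a small standard-library sketch. This is not Docling's actual selection code, and how the library detects the platform is not stated here. The helper name `pick_whisper_backend` and the returned labels are hypothetical.

```python
import platform


def pick_whisper_backend(system=None, machine=None):
    """Hypothetical helper mirroring the rule described in the text."""
    # Fall back to the current host when no values are supplied.
    system = system or platform.system()
    machine = machine or platform.machine()
    # Apple Silicon Macs report the pair Darwin / arm64.
    if system == "Darwin" and machine == "arm64":
        return "mlx-whisper"
    return "native-whisper"


# Report which backend the rule would select on this machine.
print(pick_whisper_backend())
```

The optional arguments exist only so the rule can be checked for other platforms without running on them, for example `pick_whisper_backend("Linux", "x86_64")`.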

For installation instructions (pip install "docling[asr]"), RAG pipeline examples, model customization, a table of limitations with workarounds, and best practices, see the Audio & video processing guide.


DocumentConverter Initialization Parameters

The DocumentConverter class supports several initialization parameters that control global conversion behavior:

When no progress_callback is provided (the default), no progress events are emitted, so progress tracking adds no overhead.

Usage Example:

from docling.datamodel.progress_event import ProgressEvent
from docling.document_converter import DocumentConverter

def on_progress(event: ProgressEvent):
    print(event.event_type, event.document_name)

converter = DocumentConverter(progress_callback=on_progress)
result = converter.convert(source="https://arxiv.org/pdf/2408.09869")
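
A callback can also accumulate events instead of printing them. The sketch below relies only on the two attributes used in the example above, `event_type` and `document_name`. The `ProgressCollector` class is a hypothetical helper, not part of Docling, and the event type names are made up. Stand-in objects replace real `ProgressEvent` instances so the snippet runs without the library installed.

```python
from collections import Counter, defaultdict
from types import SimpleNamespace


class ProgressCollector:
    """Hypothetical helper: tallies progress events per document."""

    def __init__(self):
        # Maps document name -> Counter of event types seen for it.
        self.counts = defaultdict(Counter)

    def __call__(self, event):
        # str() keeps the key printable whether event_type is an enum
        # member or a plain string.
        self.counts[event.document_name][str(event.event_type)] += 1

    def summary(self):
        # Convert to plain dicts for easy printing or logging.
        return {name: dict(c) for name, c in self.counts.items()}


collector = ProgressCollector()

# Stand-in events for illustration only; real events come from Docling.
for kind in ("started", "page_done", "page_done", "finished"):
    collector(SimpleNamespace(event_type=kind, document_name="report.pdf"))

print(collector.summary())
```

Because the instance is callable with a single event argument, it could be passed in the same way as the function above: `DocumentConverter(progress_callback=collector)`.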

CLI Support: The CLI also supports progress tracking via the --progress flag:

docling --progress FILE

Additional Notes


VLM Engine Options

VllmVlmEngineOptions

The vLLM engine provides high-throughput serving for vision-language models (VLMs). Key configuration options include:

Available modes:


KServe v2 API Engine Options

The KServe v2 API engines (ApiKserveV2ImageClassificationEngine and ApiKserveV2ObjectDetectionEngine) support both HTTP and gRPC transports. gRPC is the default transport and is more efficient than HTTP for binary tensor payloads.
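
To illustrate why a binary transport suits tensor payloads, the sketch below builds a request body in the general shape of a KServe v2 REST inference call and compares its size with the same values packed as raw 32-bit floats. It does not use Docling's engine classes. The tensor name `input_0`, the shape, and the values are hypothetical, and the helper `build_infer_request` exists only for this example.

```python
import json
import struct


def build_infer_request(name, shape, values):
    """Build a request body shaped like a KServe v2 REST inference call."""
    return {
        "inputs": [
            {"name": name, "shape": shape, "datatype": "FP32", "data": values}
        ]
    }


# A tiny 1x3x2x2 tensor, flattened in row-major order (illustrative values).
shape = [1, 3, 2, 2]
values = [i / 12 for i in range(12)]

# JSON spells every float out as text.
json_body = json.dumps(build_infer_request("input_0", shape, values)).encode()

# A binary encoding needs 4 bytes per FP32 element.
raw_bytes = struct.pack(f"<{len(values)}f", *values)

print(len(json_body), "bytes as JSON vs", len(raw_bytes), "bytes raw")
```

The gap widens with tensor size, since every element costs several text characters in JSON but a fixed four bytes in binary form.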

ApiKserveV2ImageClassificationEngineOptions

ApiKserveV2ObjectDetectionEngineOptions

Notes on KServe v2 Transport
