Docling contains the following pipelines, each with specific purposes and selection criteria:
Production Pipelines
- StandardPdfPipeline
  - Purpose: Multi-threaded PDF processing with layout detection, OCR, and table structure extraction.
  - Auto-Selected For: PDF, IMAGE, METS_GBS formats.
  - Manual Only: No (default for relevant formats).
  - Strengths: Always provides bounding boxes, well-tested, faster due to threading.
  - Limitations: Traditional OCR may struggle with complex layouts.
  - Scanned Documents: This is the default pipeline for scanned documents, using OCR (EasyOCR by default, with options for Tesseract or RapidOCR).
- SimplePipeline
  - Purpose: Direct conversion for structured formats.
  - Auto-Selected For: DOCX, PPTX, XLSX, HTML, Markdown (.md), plain-text (.txt, .text), Quarto (.qmd), R Markdown (.Rmd), AsciiDoc, LaTeX, CSV, JSON_DOCLING, VTT, XML variants.
  - Manual Only: No (default for these formats).
- VlmPipeline
  - Purpose: Vision-Language Model document understanding.
  - Auto-Selected For: None.
  - Manual Only: Yes (must be selected with --pipeline vlm).
  - Strengths: Better semantic understanding of complex layouts, tables, and figures.
  - Limitations: More computationally expensive; bounding boxes only available with DocTags presets.
  - Scanned Documents: Can be used for scanned documents when semantic understanding is critical, but not auto-selected.
- ExtractionVlmPipeline
  - Purpose: Schema-based structured data extraction using NuExtract VLM.
  - Auto-Selected For: None.
  - Manual Only: Yes.
- AsrPipeline
  - Purpose: Speech-to-text transcription using Whisper models.
  - Auto-Selected For: Audio files.
  - Manual Only: No (default for audio).
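
The auto-selection rules listed above can be restated as a small lookup table. This is only an illustrative sketch in plain Python, not Docling's API: the names `AUTO_SELECTED`, `MANUAL_ONLY`, and `default_pipeline` are hypothetical, and the format labels are shorthand for the formats named in the list (they are assumptions, not necessarily the identifiers Docling uses internally).

```python
from typing import Optional

# Hypothetical summary of the list above; not part of Docling itself.
# Format labels are shorthand chosen for this sketch.
AUTO_SELECTED = {
    "StandardPdfPipeline": {"PDF", "IMAGE", "METS_GBS"},
    "SimplePipeline": {
        "DOCX", "PPTX", "XLSX", "HTML", "MARKDOWN",
        "ASCIIDOC", "LATEX", "CSV", "JSON_DOCLING", "VTT",
    },
    "AsrPipeline": {"AUDIO"},
}

# Pipelines that are never auto-selected and must be requested explicitly.
MANUAL_ONLY = ("VlmPipeline", "ExtractionVlmPipeline")


def default_pipeline(format_label: str) -> Optional[str]:
    """Return the pipeline auto-selected for a format label, or None if unknown."""
    label = format_label.strip().upper()
    for pipeline, labels in AUTO_SELECTED.items():
        if label in labels:
            return pipeline
    return None


if __name__ == "__main__":
    for label in ("pdf", "docx", "audio", "unknown"):
        print(label, "->", default_pipeline(label))
```

Note that the manual-only pipelines never appear as a lookup result, which mirrors the "Auto-Selected For: None" entries above.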
Experimental Pipeline
- ThreadedLayoutVlmPipeline
  - Purpose: Hybrid layout model detection plus VLM processing with spatial context.
  - Status: Experimental, not production ready.
Deprecated/Alias Pipelines
- LegacyStandardPdfPipeline: Deprecated, replaced by StandardPdfPipeline.
- ThreadedStandardPdfPipeline: Alias for StandardPdfPipeline.
Scanned Documents Clarification
- StandardPdfPipeline with OCR is the default and only auto-selected pipeline for scanned documents. It uses OCR to extract text and layout information.
- VlmPipeline can be manually selected for scanned documents when advanced semantic understanding is needed, but is not auto-selected.
- The previous mention of "another pipeline" for scanned documents referred to StandardPdfPipeline with OCR enabled, not a separate pipeline.
Pipeline Comparison for Scanned Documents
| Criterion | StandardPdfPipeline | VlmPipeline |
|---|---|---|
| Bounding boxes | Always available | Only with DocTags presets |
| Speed | Faster | Slower (GPU-intensive) |
| Complex layouts | Good | Better semantic understanding |
| Resource usage | Lower | Higher |
| OCR quality | Depends on engine | Integrated in model |
No other pipelines exist in Docling for scanned documents.
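
One way to read the comparison table is as a simple decision rule between the two pipelines. The sketch below is a hypothetical heuristic, not something Docling provides: `ScanJob`, its fields, and `choose_scanned_pipeline` are invented for illustration, and the ordering of the checks is an assumption about how the trade-offs might be weighed.

```python
from dataclasses import dataclass


@dataclass
class ScanJob:
    """Hypothetical description of a scanned-document job (not a Docling type)."""
    needs_bounding_boxes: bool
    complex_layout: bool
    gpu_available: bool
    doctags_preset: bool = False  # True if the chosen VLM preset emits DocTags


def choose_scanned_pipeline(job: ScanJob) -> str:
    """Pick a pipeline name for a scanned document using the table's trade-offs."""
    # StandardPdfPipeline is the default: faster, lower resource usage,
    # and bounding boxes are always available.
    if not job.complex_layout:
        return "StandardPdfPipeline"
    # VlmPipeline is slower and GPU-intensive, so skip it without a GPU.
    if not job.gpu_available:
        return "StandardPdfPipeline"
    # VlmPipeline only yields bounding boxes with DocTags presets.
    if job.needs_bounding_boxes and not job.doctags_preset:
        return "StandardPdfPipeline"
    return "VlmPipeline"


if __name__ == "__main__":
    job = ScanJob(needs_bounding_boxes=True, complex_layout=True,
                  gpu_available=True, doctags_preset=True)
    print(choose_scanned_pipeline(job))
```

Because VlmPipeline is never auto-selected, a result of "VlmPipeline" here would still have to be requested manually, for example with --pipeline vlm.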
For more details, see the pipelines in the Docling repository.