What are all the pipelines that exist in Docling, including their purposes, selection criteria, and how they handle scanned documents?

Docling contains the following pipelines, each with specific purposes and selection criteria:

Production Pipelines

StandardPdfPipeline
- Purpose: Multi-threaded PDF processing with layout detection, OCR, and table structure extraction.
- Auto-Selected For: PDF, IMAGE, METS_GBS formats.
- Manual Only: No (default for relevant formats).
- Strengths: Always provides bounding boxes, well-tested, faster due to threading.
- Limitations: Traditional OCR may struggle with complex layouts.
- Scanned Documents: This is the default pipeline for scanned documents, using OCR (EasyOCR by default, with options for Tesseract or RapidOCR).
SimplePipeline
- Purpose: Direct conversion for structured formats.
- Auto-Selected For: DOCX, PPTX, XLSX, HTML, Markdown (.md), plain-text (.txt, .text), Quarto (.qmd), R Markdown (.Rmd), AsciiDoc, LaTeX, CSV, JSON_DOCLING, VTT, XML variants.
- Manual Only: No (default for these formats).
VlmPipeline
- Purpose: Vision-Language Model document understanding.
- Auto-Selected For: None.
- Manual Only: Yes (must be selected with --pipeline vlm).
- Strengths: Better semantic understanding of complex layouts, tables, and figures.
- Limitations: More computationally expensive; bounding boxes only available with DocTags presets.
- Scanned Documents: Can be used for scanned documents when semantic understanding is critical, but not auto-selected.
ExtractionVlmPipeline
- Purpose: Schema-based structured data extraction using VLM models (NuExtract, Granite Vision).
- Auto-Selected For: None.
- Manual Only: Yes.
- Model Selection: Configurable via VlmExtractionPipelineOptions (default: NuExtract-2B, alternative: Granite Vision 4.1).
AsrPipeline
- Purpose: Speech-to-text transcription using Whisper models.
- Auto-Selected For: Audio files.
- Manual Only: No (default for audio).
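
The auto-selection rules above can be summarized as a lookup from input format to default pipeline. The following is a minimal sketch of that idea, not Docling's internal code: the format keys (`"MD"`, `"AUDIO"`, and so on) and the names `AUTO_SELECTED`, `MANUAL_ONLY`, and `default_pipeline` are hypothetical, and the format list is shortened for illustration.

```python
# Illustrative lookup only; this does not import or call Docling.
# Pipeline names come from the list above; format keys are simplified labels.
AUTO_SELECTED = {
    "StandardPdfPipeline": ["PDF", "IMAGE", "METS_GBS"],
    "SimplePipeline": ["DOCX", "PPTX", "XLSX", "HTML", "MD", "ASCIIDOC",
                       "LATEX", "CSV", "JSON_DOCLING", "VTT"],
    "AsrPipeline": ["AUDIO"],
}

# Pipelines that are never picked automatically and must be requested explicitly.
MANUAL_ONLY = {"VlmPipeline", "ExtractionVlmPipeline"}

# Invert the table: one default pipeline per format.
DEFAULT_PIPELINE = {
    fmt: pipeline
    for pipeline, formats in AUTO_SELECTED.items()
    for fmt in formats
}


def default_pipeline(fmt: str) -> str:
    """Return the pipeline name that would be auto-selected for a format label."""
    key = fmt.strip().upper()
    if key not in DEFAULT_PIPELINE:
        raise KeyError(f"no default pipeline recorded for format {fmt!r}")
    return DEFAULT_PIPELINE[key]


print(default_pipeline("pdf"))    # StandardPdfPipeline
print(default_pipeline("docx"))   # SimplePipeline
print(default_pipeline("audio"))  # AsrPipeline
```

Note that the manual-only pipelines never appear as values in `DEFAULT_PIPELINE`, which mirrors the "Auto-Selected For: None" entries above.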

Experimental Pipeline

ThreadedLayoutVlmPipeline
- Purpose: Hybrid layout model detection plus VLM processing with spatial context.
- Status: Experimental, not production ready.

Deprecated/Alias Pipelines

LegacyStandardPdfPipeline: Deprecated, replaced by StandardPdfPipeline.
ThreadedStandardPdfPipeline: Alias for StandardPdfPipeline.

Scanned Documents Clarification

StandardPdfPipeline with OCR is the default and only auto-selected pipeline for scanned documents. It uses OCR to extract text and layout information.
VlmPipeline can be manually selected for scanned documents when advanced semantic understanding is needed, but is not auto-selected.
The previous mention of "another pipeline" for scanned documents referred to StandardPdfPipeline with OCR enabled, not a separate pipeline.
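
Because VlmPipeline is never auto-selected, it has to be requested explicitly. As a minimal sketch, the helper below only builds the command line; it does not run anything. It assumes the `docling` CLI is installed and on the PATH, uses the `--pipeline vlm` flag mentioned above, and treats `scanned.pdf` and the function name `docling_command` as placeholders.

```python
import shlex
from typing import List, Optional


def docling_command(source: str, pipeline: Optional[str] = None) -> List[str]:
    """Build (but do not execute) a docling CLI invocation for one input file."""
    cmd = ["docling"]
    if pipeline is not None:
        # Explicit override; without it the default pipeline for the format is used.
        cmd += ["--pipeline", pipeline]
    cmd.append(source)
    return cmd


# Default behaviour: a scanned PDF goes through StandardPdfPipeline with OCR.
print(shlex.join(docling_command("scanned.pdf")))
# Manual selection of the VLM pipeline for the same file.
print(shlex.join(docling_command("scanned.pdf", pipeline="vlm")))
```

The resulting list can be passed to `subprocess.run` once the command looks right.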

Pipeline Comparison for Scanned Documents

| Criterion | StandardPdfPipeline | VlmPipeline |
| --- | --- | --- |
| Bounding boxes | Always available | Only with DocTags presets |
| Speed | Faster | Slower (GPU-intensive) |
| Complex layouts | Good | Better semantic understanding |
| Resource usage | Lower | Higher |
| OCR quality | Depends on engine | Integrated in model |

No other pipelines exist in Docling for scanned documents.
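
The trade-offs in the comparison can be turned into a simple decision rule. The sketch below is one possible heuristic, not a recommendation from Docling itself: `ScanRequirements` and `choose_scanned_pipeline` are hypothetical names, and the rule (prefer StandardPdfPipeline unless the layout is complex and a GPU is available) is an assumption derived from the criteria above.

```python
from dataclasses import dataclass


@dataclass
class ScanRequirements:
    """Hypothetical description of what a scanned-document job needs."""
    needs_bounding_boxes: bool = False
    complex_layout: bool = False
    gpu_available: bool = False


def choose_scanned_pipeline(req: ScanRequirements) -> str:
    """Pick a pipeline name for a scanned document using the comparison criteria."""
    # StandardPdfPipeline always provides bounding boxes; VlmPipeline only
    # does so with DocTags presets, so play it safe when boxes are required.
    if req.needs_bounding_boxes:
        return "StandardPdfPipeline"
    # VlmPipeline is slower and GPU-intensive, so only choose it when the
    # layout is complex and the hardware can carry it.
    if req.complex_layout and req.gpu_available:
        return "VlmPipeline"
    # Otherwise stay with the default, lower-cost pipeline.
    return "StandardPdfPipeline"


print(choose_scanned_pipeline(ScanRequirements()))
print(choose_scanned_pipeline(ScanRequirements(complex_layout=True, gpu_available=True)))
```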

For more details, see the pipeline modules in the Docling repository.