What are all the pipelines that exist in Docling, including their purposes, selection criteria, and how they handle scanned documents?
Type: Answer
Status: Published
Created: Feb 25, 2026
Updated: Mar 20, 2026
Created by: Dosu Bot
Updated by: Dosu Bot

Docling provides the following pipelines. Each entry lists its purpose and whether it is selected automatically or only on request:

Production Pipelines

  1. StandardPdfPipeline

    • Purpose: Multi-threaded PDF processing with layout detection, OCR, and table structure extraction.
    • Auto-Selected For: PDF, IMAGE, METS_GBS formats.
    • Manual Only: No (default for relevant formats).
    • Strengths: Always provides bounding boxes, is well-tested, and is fast thanks to multi-threaded processing.
    • Limitations: Traditional OCR may struggle with complex layouts.
    • Scanned Documents: This is the default pipeline for scanned documents, using OCR (EasyOCR by default, with options for Tesseract or RapidOCR).
  2. SimplePipeline

    • Purpose: Direct conversion for structured formats.
    • Auto-Selected For: DOCX, PPTX, XLSX, HTML, Markdown (.md), plain-text (.txt, .text), Quarto (.qmd), R Markdown (.Rmd), AsciiDoc, LaTeX, CSV, JSON_DOCLING, VTT, XML variants.
    • Manual Only: No (default for these formats).
  3. VlmPipeline

    • Purpose: Vision-Language Model document understanding.
    • Auto-Selected For: None.
    • Manual Only: Yes (must be selected explicitly, for example with --pipeline vlm on the command line).
    • Strengths: Better semantic understanding of complex layouts, tables, and figures.
    • Limitations: More computationally expensive; bounding boxes only available with DocTags presets.
    • Scanned Documents: Can be used for scanned documents when semantic understanding is critical, but not auto-selected.
  4. ExtractionVlmPipeline

    • Purpose: Schema-based structured data extraction using NuExtract VLM.
    • Auto-Selected For: None.
    • Manual Only: Yes.
  5. AsrPipeline

    • Purpose: Speech-to-text transcription using Whisper models.
    • Auto-Selected For: Audio files.
    • Manual Only: No (default for audio).
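
The auto-selection rules listed above can be sketched as a simple lookup table. This is an illustrative model only, not Docling's internal code: the format labels (such as "MD" or "AUDIO") and the helper name default_pipeline_for are hypothetical, and only a subset of the SimplePipeline formats is shown.

```python
# Illustrative lookup only; the keys are hypothetical labels taken from the
# list above, not necessarily the names Docling uses internally.
DEFAULT_PIPELINE = {
    "PDF": "StandardPdfPipeline",
    "IMAGE": "StandardPdfPipeline",
    "METS_GBS": "StandardPdfPipeline",
    "DOCX": "SimplePipeline",
    "PPTX": "SimplePipeline",
    "XLSX": "SimplePipeline",
    "HTML": "SimplePipeline",
    "MD": "SimplePipeline",
    "CSV": "SimplePipeline",
    "AUDIO": "AsrPipeline",
}

# Pipelines that are never chosen automatically and must be requested.
MANUAL_ONLY = {"VlmPipeline", "ExtractionVlmPipeline"}


def default_pipeline_for(fmt: str) -> str:
    """Return the pipeline name that would be auto-selected for a format label."""
    key = fmt.strip().upper()
    try:
        return DEFAULT_PIPELINE[key]
    except KeyError:
        raise ValueError(f"no default pipeline listed for format {fmt!r}") from None


if __name__ == "__main__":
    for label in ("pdf", "docx", "audio"):
        print(label, "->", default_pipeline_for(label))
```

Note that no entry maps to a manual-only pipeline, which matches the "Auto-Selected For: None" rows above.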

Experimental Pipeline

  • ThreadedLayoutVlmPipeline
    • Purpose: Hybrid layout model detection plus VLM processing with spatial context.
    • Status: Experimental, not production ready.

Deprecated/Alias Pipelines

  • LegacyStandardPdfPipeline: Deprecated, replaced by StandardPdfPipeline.
  • ThreadedStandardPdfPipeline: Alias for StandardPdfPipeline.

Scanned Documents Clarification

  • StandardPdfPipeline with OCR is the default and only auto-selected pipeline for scanned documents. It uses OCR to extract text and layout information.
  • VlmPipeline can be manually selected for scanned documents when advanced semantic understanding is needed, but is not auto-selected.
  • Any earlier mention of "another pipeline" for scanned documents refers to StandardPdfPipeline with OCR enabled, not to a separate pipeline.
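
As a rough illustration of the two options for scanned PDFs, the sketch below builds a converter that either keeps the default StandardPdfPipeline with OCR or explicitly requests VlmPipeline. The helper name build_converter and the file name scan.pdf are hypothetical. The import paths and option names reflect the Docling Python API as commonly documented and may differ between releases, so treat this as a starting point and check it against the installed version.

```python
def build_converter(use_vlm: bool = False):
    """Return a DocumentConverter for PDFs, optionally forcing VlmPipeline.

    Hypothetical helper; Docling is imported lazily so that this module
    can be loaded even where Docling is not installed.
    """
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    if use_vlm:
        # Manual selection: VlmPipeline is never auto-selected.
        from docling.pipeline.vlm_pipeline import VlmPipeline

        pdf_option = PdfFormatOption(pipeline_cls=VlmPipeline)
    else:
        # Default path: StandardPdfPipeline with OCR and table structure.
        options = PdfPipelineOptions()
        options.do_ocr = True
        options.do_table_structure = True
        pdf_option = PdfFormatOption(pipeline_options=options)

    return DocumentConverter(format_options={InputFormat.PDF: pdf_option})


if __name__ == "__main__":
    # "scan.pdf" is a placeholder path for a scanned document.
    result = build_converter(use_vlm=False).convert("scan.pdf")
    print(result.document.export_to_markdown())
```

Passing use_vlm=True is the programmatic counterpart of choosing the VLM pipeline on the command line.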

Pipeline Comparison for Scanned Documents

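| Criterion       | StandardPdfPipeline | VlmPipeline                   |
|-----------------|---------------------|-------------------------------|
| Bounding boxes  | Always available    | Only with DocTags presets     |
| Speed           | Faster              | Slower (GPU-intensive)        |
| Complex layouts | Good                | Better semantic understanding |
| Resource usage  | Lower               | Higher                        |
| OCR quality     | Depends on engine   | Integrated in model           |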

No other pipelines exist in Docling for scanned documents.
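
One way to turn the comparison above into a decision is a small heuristic. The sketch below is an assumption about how the criteria might be weighed, not a rule that Docling applies. The ScanRequirements fields and the suggest_pipeline function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanRequirements:
    """Hypothetical description of what a scanned-document job needs."""
    needs_bounding_boxes: bool = True
    complex_layout: bool = False
    gpu_available: bool = False
    uses_doctags_preset: bool = False


def suggest_pipeline(req: ScanRequirements) -> str:
    """Suggest a pipeline name from the comparison criteria (heuristic only)."""
    # Bounding boxes are always available from StandardPdfPipeline, but from
    # VlmPipeline only with DocTags presets.
    if req.needs_bounding_boxes and not req.uses_doctags_preset:
        return "StandardPdfPipeline"
    # VlmPipeline is slower and GPU-intensive, so suggest it only when the
    # layout is complex and a GPU is available.
    if req.complex_layout and req.gpu_available:
        return "VlmPipeline"
    return "StandardPdfPipeline"


if __name__ == "__main__":
    print(suggest_pipeline(ScanRequirements()))
    print(suggest_pipeline(ScanRequirements(
        needs_bounding_boxes=False, complex_layout=True, gpu_available=True)))
```

The heuristic falls back to StandardPdfPipeline in every ambiguous case, mirroring its role as the default for scanned documents.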

For more details, see the pipelines in the Docling repository.