Best Practice: Single Docling Pipeline (No Manual Combination Needed)
Docling automatically handles mixed PDFs (digital + scanned in one file). Each page is analyzed by bitmap coverage:
- Digital pages → text extracted directly from PDF (OCR not run)
- Scanned pages → OCR runs automatically only on bitmap areas
- Mixed pages → native text + OCR combined, duplicates filtered via spatial analysis
You do not need to manually combine PyMuPDF, EasyOCR, Tesseract, and a VLM pipeline. A single Docling pipeline handles all three page types.
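The per-page behaviour described above can be pictured with a small sketch. This is an illustration of the decision logic only, not Docling's actual implementation: `PageInfo`, `plan_page`, and the `bitmap_threshold` value are hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    """Hypothetical per-page summary; not a Docling class."""
    has_native_text: bool    # the PDF carries an embedded text layer on this page
    bitmap_coverage: float   # fraction of the page area covered by bitmaps, 0.0 to 1.0


def plan_page(page: PageInfo, bitmap_threshold: float = 0.05) -> str:
    """Return the extraction strategy for one page.

    bitmap_threshold is an assumed cutoff used only for illustration.
    """
    needs_ocr = page.bitmap_coverage > bitmap_threshold
    if page.has_native_text and needs_ocr:
        return "native+ocr"   # mixed page: merge both sources, then drop overlapping duplicates
    if needs_ocr:
        return "ocr"          # scanned page: OCR the bitmap regions
    return "native"           # digital page: read the text layer, skip OCR


if __name__ == "__main__":
    for page in (PageInfo(True, 0.0), PageInfo(False, 0.95), PageInfo(True, 0.4)):
        print(page, "->", plan_page(page))
```

The point of the sketch is that the choice is made page by page, so a single file can mix all three strategies without any manual routing.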
Recommended Configuration (No GPU, Cost-Efficient)
import os

# Set the thread limit before importing Docling so that the numeric
# libraries it loads pick the value up at import time.
os.environ["OMP_NUM_THREADS"] = "4"

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    RapidOcrOptions,
    TableFormerMode,
    TableStructureOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=RapidOcrOptions(),  # CPU-friendly, ~6s/page
    images_scale=2.0,  # Higher resolution for numeric accuracy
    do_table_structure=True,
    table_structure_options=TableStructureOptions(
        mode=TableFormerMode.ACCURATE,
        do_cell_matching=False,  # Important for scanned pages
    ),
    # Disable optional outputs and enrichments that are not needed here.
    generate_page_images=False,
    generate_picture_images=False,
    do_code_enrichment=False,
    do_formula_enrichment=False,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# Convert the PDF and export the parsed document as Markdown.
result = converter.convert("financial_document.pdf")
md = result.document.export_to_markdown()
Why This Is Sufficient
| Concern | Answer |
|---|---|
| Need PyMuPDF? | ❌ No. Docling's backend extracts native text automatically |
| Need EasyOCR + Tesseract? | ❌ One engine is enough. RapidOCR is the best fit for CPU: ~6s/page vs ~70s/page for EasyOCR on CPU |
| Need VLM pipeline? | ❌ Not worth it without a GPU or with a limited API budget. The standard PDF pipeline with OCR is sufficient |
| Sensitive data? | ✅ All processing is 100% local, no data sent to any API |
| Need to combine all tools? | ❌ One Docling pipeline already auto-combines native text + OCR per page |
Tips for Financial Document Accuracy
- images_scale=2.0: higher resolution makes small numbers more readable
- TableFormerMode.ACCURATE: more precise table extraction (slower but more accurate)
- Post-processing validation: validate number formats (currency, thousands, decimals) with regex after extraction
- Estimated time: ~50 mixed pages ≈ 5–15 minutes on CPU
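As an illustration of the post-processing validation tip, here is a minimal sketch of a regex check for extracted amounts. The accepted format (optional minus sign, optional `$`, `€`, or `£`, comma thousands separators, exactly two decimal places when decimals are present) is an assumption and should be adapted to the conventions of your documents; `is_valid_amount` and `find_suspect_amounts` are hypothetical helper names.

```python
import re

# Assumed format: optional minus, optional currency symbol, digits either
# grouped in thousands with commas or ungrouped, optional two-digit decimals.
AMOUNT_RE = re.compile(r"-?[$€£]?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?")


def is_valid_amount(token: str) -> bool:
    """Return True if the token looks like a well-formed amount."""
    token = token.strip()
    # Accounting-style negatives such as "(1,234.56)" are unwrapped first.
    if token.startswith("(") and token.endswith(")"):
        token = token[1:-1]
    return AMOUNT_RE.fullmatch(token) is not None


def find_suspect_amounts(tokens: list[str]) -> list[str]:
    """Return the tokens that fail the format check, in their original order."""
    return [t for t in tokens if not is_valid_amount(t)]


if __name__ == "__main__":
    extracted = ["$1,234.56", "1,23,456", "12O.50", "(500)"]
    print(find_suspect_amounts(extracted))
```

A check like this catches typical OCR slips, such as a letter `O` read in place of a zero or a misplaced thousands separator, but it cannot tell whether a well-formed number is the right number.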
When to Add VLM/OpenAI API
Try the above pipeline first. Only add VLM (via OpenAI API at ~$0.01–$0.04/page) if specific pages have poor results (e.g., extremely complex tables with merged cells). Send only those problematic pages to the API to minimize cost.
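The selective approach can be sketched as follows. The quality signal used here, a per-page count of values that failed some format check, is a hypothetical metric, and `pages_needing_vlm` and `estimate_cost_usd` are illustrative names. The price range defaults to the ~$0.01–$0.04/page figure quoted above.

```python
def pages_needing_vlm(suspect_counts: dict[int, int], max_suspects: int = 0) -> list[int]:
    """Pick the pages worth re-processing with a VLM.

    suspect_counts maps page number -> number of suspicious values found
    on that page (a hypothetical quality metric).
    """
    return sorted(page for page, count in suspect_counts.items() if count > max_suspects)


def estimate_cost_usd(num_pages: int, low: float = 0.01, high: float = 0.04) -> tuple[float, float]:
    """Return the (low, high) cost estimate in USD for the given number of pages."""
    return round(num_pages * low, 2), round(num_pages * high, 2)


if __name__ == "__main__":
    counts = {1: 0, 2: 3, 3: 0, 7: 1}   # example data, not real results
    pages = pages_needing_vlm(counts)
    print("Pages to send:", pages)
    print("Estimated cost (USD):", estimate_cost_usd(len(pages)))
```

Because the cost scales with the number of pages sent, filtering first keeps the API bill proportional to the number of problem pages rather than to the size of the document.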