What is the best practice for extracting text from a mixed PDF (digital + scanned pages) for enterprise financial document processing without a GPU and with limited costs?
Type: Answer
Status: Published
Created: May 23, 2026
Updated: May 23, 2026
Created by: Dosu Bot
Updated by: Dosu Bot

Best Practice: Single Docling Pipeline (No Manual Combination Needed)

Docling automatically handles mixed PDFs (digital + scanned in one file). Each page is analyzed by bitmap coverage:

  • Digital pages → text extracted directly from PDF (OCR not run)
  • Scanned pages → OCR runs automatically only on bitmap areas
  • Mixed pages → native text + OCR combined, duplicates filtered via spatial analysis

You do not need to manually combine PyMuPDF, EasyOCR, Tesseract, and a VLM pipeline. A single Docling pipeline handles all of it.
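To make the per-page decision above concrete, here is a minimal sketch of a bitmap-coverage check. It is not Docling's implementation: the function names, the rectangle format, and the 5% threshold are illustrative assumptions, and overlapping bitmaps are simply summed and capped rather than merged.

```python
from typing import List, Tuple

# Hypothetical rectangle format: (x0, y0, x1, y1) in page units.
Rect = Tuple[float, float, float, float]


def bitmap_coverage(page_width: float, page_height: float, bitmaps: List[Rect]) -> float:
    """Return the fraction of the page area covered by bitmap rectangles.

    Overlapping rectangles are counted twice, so the result is capped at 1.0.
    """
    page_area = page_width * page_height
    if page_area <= 0:
        return 0.0
    covered = sum(max(0.0, x1 - x0) * max(0.0, y1 - y0) for x0, y0, x1, y1 in bitmaps)
    return min(1.0, covered / page_area)


def classify_page(has_native_text: bool, coverage: float, threshold: float = 0.05) -> str:
    """Classify a page as 'digital', 'scanned', or 'mixed' (threshold is an assumed value)."""
    if coverage < threshold:
        # Too little bitmap content to be worth running OCR on.
        return "digital"
    return "mixed" if has_native_text else "scanned"
```

For example, a page with embedded text and a scanned stamp covering a quarter of the page would come out as "mixed", which is the case where native text and OCR output need to be merged and de-duplicated.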


import os

# Set the thread cap before importing Docling so that the OpenMP-based
# libraries it loads pick the value up when they initialize.
os.environ["OMP_NUM_THREADS"] = "4"

from docling.datamodel.pipeline_options import (
    PdfPipelineOptions, RapidOcrOptions,
    TableStructureOptions, TableFormerMode
)
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=RapidOcrOptions(), # CPU-friendly, ~6s/page
    images_scale=2.0, # Higher resolution for numeric accuracy
    do_table_structure=True,
    table_structure_options=TableStructureOptions(
        mode=TableFormerMode.ACCURATE,
        do_cell_matching=False, # Important for scanned pages
    ),
    generate_page_images=False,
    generate_picture_images=False,
    do_code_enrichment=False,
    do_formula_enrichment=False,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("financial_document.pdf")
md = result.document.export_to_markdown()

Why This Is Sufficient

| Concern | Answer |
| --- | --- |
| Need PyMuPDF? | ❌ No. Docling's backend extracts native text automatically. |
| Need EasyOCR + Tesseract? | ❌ One engine is enough. RapidOCR is the best fit for CPU: ~6s/page vs ~70s/page for EasyOCR on CPU. |
| Need VLM pipeline? | ❌ Not worth it without a GPU or with a limited API budget. StandardPipeline + OCR is sufficient. |
| Sensitive data? | ✅ All processing is 100% local; no data is sent to any API. |
| Need to combine all tools? | ❌ One Docling pipeline already combines native text + OCR per page. |

Tips for Financial Document Accuracy

  1. images_scale=2.0: higher resolution makes small numbers more readable
  2. TableFormerMode.ACCURATE: slower than the fast mode, but extracts tables more precisely
  3. Post-processing validation: check number formats (currency, thousands separators, decimals) with regex after extraction
  4. Estimated time: ~50 mixed pages ≈ 5–15 minutes on CPU (at ~6s/page, OCR alone accounts for about 5 minutes if every page is scanned)
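The post-processing validation in tip 3 could look roughly like the sketch below. It assumes one specific convention (comma thousands separators, a dot decimal point, an optional `$`, `€`, or `£` prefix, and parentheses or a minus sign for negatives); `parse_amount` is a hypothetical helper, not part of Docling, and the pattern would need adjusting for other locales.

```python
import re
from decimal import Decimal
from typing import Optional

# Assumed number convention: 1,234.56 / $1,234.56 / (1,234.56) / -1234.56
_AMOUNT = re.compile(
    r"""
    ^(?P<open>\()?                     # optional opening parenthesis (accounting negative)
    (?P<sign>-)?                       # optional minus sign
    [$€£]?                             # optional currency symbol
    (?P<int>\d{1,3}(?:,\d{3})+|\d+)    # grouped thousands, or plain digits
    (?P<frac>\.\d+)?                   # optional decimal part
    (?P<close>\))?$                    # optional closing parenthesis
    """,
    re.VERBOSE,
)


def parse_amount(token: str) -> Optional[Decimal]:
    """Return the token as a Decimal, or None if it is not a well-formed amount."""
    m = _AMOUNT.match(token.strip())
    if m is None:
        return None
    # Parentheses must be balanced: "(100" or "100)" is treated as malformed.
    if bool(m.group("open")) != bool(m.group("close")):
        return None
    value = Decimal(m.group("int").replace(",", "") + (m.group("frac") or ""))
    if m.group("open") or m.group("sign"):
        value = -value
    return value
```

Tokens that return `None`, such as `1,23.4` or `1.2O4` (with a letter O from a misread zero), are candidates for manual review or re-extraction. Using `Decimal` rather than `float` avoids introducing rounding error into monetary values.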

When to Add VLM/OpenAI API

Try the above pipeline first. Only add a VLM (via the OpenAI API at ~$0.01–$0.04/page) if specific pages give poor results (e.g., extremely complex tables with merged cells). Send only those problematic pages to the API to minimize cost.
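One way to decide which pages count as "problematic" is to escalate only those where too many numeric tokens fail validation, then estimate the API bill before sending anything. The sketch below is an assumption-laden illustration: the input format, the 10% cutoff, and both function names are hypothetical, and the per-page price range is simply the ~$0.01–$0.04 figure quoted above, expressed in cents.

```python
from typing import Dict, List, Tuple

# Per-page price range from the text above, in US cents (integers avoid float rounding).
COST_CENTS_LOW, COST_CENTS_HIGH = 1, 4


def pages_to_escalate(
    bad_tokens: Dict[int, Tuple[int, int]], max_bad_ratio: float = 0.1
) -> List[int]:
    """Select pages whose share of malformed numeric tokens exceeds the cutoff.

    bad_tokens maps page number -> (malformed numeric tokens, total numeric tokens).
    """
    selected = []
    for page_no, (bad, total) in sorted(bad_tokens.items()):
        if total > 0 and bad / total > max_bad_ratio:
            selected.append(page_no)
    return selected


def estimate_cost_cents(n_pages: int) -> Tuple[int, int]:
    """Return the (low, high) estimated VLM cost in cents for n_pages pages."""
    return n_pages * COST_CENTS_LOW, n_pages * COST_CENTS_HIGH
```

With this approach, a 50-page document where only two pages exceed the cutoff would cost an estimated 2 to 8 cents to reprocess, compared with 50 cents to 2 dollars for sending every page.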
