What is the best practice for processing a mixed PDF (digital + scanned pages) containing sensitive financial data in Docling, given no GPU and limited cost?
Type: Answer
Status: Published
Created: May 23, 2026 by Dosu Bot
Updated: May 23, 2026 by Dosu Bot

Best Practice: Mixed PDF, CPU-Only, Sensitive Financial Data in Docling

Key Insight: No Manual Combination Needed

Docling automatically handles mixed PDFs (digital + scanned in one file). Per-page bitmap coverage analysis determines:

  • Digital pages → text extracted directly from PDF (no OCR)
  • Scanned pages → OCR runs automatically only on bitmap areas
  • Mixed pages → native text + OCR combined, duplicates filtered via spatial analysis
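
To make the per-page decision concrete, here is a minimal sketch of how a coverage-based rule could route a page. It is an illustration only, not Docling's actual implementation: the function names `bitmap_coverage` and `route_page`, the rectangle format, and the 0.75 full-page threshold are all assumptions chosen for the example.

```python
from typing import List, Tuple

# Hypothetical rectangle format: (x0, y0, x1, y1) in page units.
Rect = Tuple[float, float, float, float]


def bitmap_coverage(page_w: float, page_h: float, bitmaps: List[Rect]) -> float:
    """Fraction of the page area covered by bitmap rectangles.

    Overlapping rectangles are not deduplicated, so the result is capped at 1.0.
    """
    page_area = page_w * page_h
    covered = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in bitmaps)
    return min(covered / page_area, 1.0)


def route_page(page_w, page_h, bitmaps, full_page_threshold=0.75):
    """Pick a processing route for one page (illustrative threshold)."""
    coverage = bitmap_coverage(page_w, page_h, bitmaps)
    if coverage == 0.0:
        return "native_text"            # digital page: no OCR needed
    if coverage >= full_page_threshold:
        return "full_page_ocr"          # scanned page: OCR the whole page
    return "native_text_plus_ocr"       # mixed page: OCR only the bitmap areas


print(route_page(612, 792, []))                      # digital page
print(route_page(612, 792, [(0, 0, 612, 792)]))      # full-page scan
print(route_page(612, 792, [(50, 50, 300, 250)]))    # page with one embedded image
```

The three calls correspond to the three bullet cases above: a page with no bitmaps, a page that is one large bitmap, and a page with native text plus a smaller embedded image.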

Recommended CPU-only configuration:

import os

# Set the thread count before importing Docling so that OpenMP-backed
# libraries loaded during import pick it up.
os.environ["OMP_NUM_THREADS"] = "4"

from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions, RapidOcrOptions,
    TableStructureOptions, TableFormerMode,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=RapidOcrOptions(), # CPU-friendly, ~6s/page
    images_scale=2.0, # Higher resolution = better number accuracy
    do_table_structure=True,
    table_structure_options=TableStructureOptions(
        mode=TableFormerMode.ACCURATE, # Most precise for financial tables
        do_cell_matching=False, # Important for scanned pages
    ),
    generate_page_images=False,
    generate_picture_images=False,
    do_code_enrichment=False,
    do_formula_enrichment=False,
    accelerator_options=AcceleratorOptions(num_threads=4, device=AcceleratorDevice.CPU),
)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

result = converter.convert("laporan_keuangan.pdf")
doc = result.document

# Export for AI agent
md = doc.export_to_markdown(traverse_pictures=True) # traverse_pictures=True is required
os.makedirs("output", exist_ok=True) # Make sure the output directory exists
doc.save_as_json("output/document.json") # Lossless backup

Key configuration choices:

| Option | Reason |
|---|---|
| RapidOcrOptions() | CPU-friendly OCR, ~6s/page (vs EasyOCR ~70s/page on CPU) |
| images_scale=2.0 | 2x resolution improves readability of small numbers |
| TableFormerMode.ACCURATE | Most precise mode for financial tables |
| do_cell_matching=False | Let TableFormer define cells independently (critical for scanned pages) |
| traverse_pictures=True | Required during export so OCR text from scanned pages is included |

OCR Engine Comparison (CPU-Only)

| Engine | Speed (CPU) | Accuracy | Notes |
|---|---|---|---|
| RapidOCR | ~6s/page | Good | Best choice for CPU |
| EasyOCR | ~70s/page | Better for numeric tables | Too slow without GPU |
| Tesseract | ~6s/page | Good for standard text | Stable and lightweight |

Recommendation: Use RapidOCR on CPU. EasyOCR is more accurate for numeric tables but roughly 12x slower without a GPU (~70s vs ~6s per page).


If the PDF Contains Multiple Logical Documents (e.g., invoice + report + receipt)

Detect document boundaries by keyword first, then process each logical document independently:

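import fitz # PyMuPDF

BOUNDARY_KEYWORDS = ["INVOICE", "FAKTUR", "LAPORAN KEUANGAN", "BANK STATEMENT", "RECEIPT"]

def detect_boundaries(pdf_path):
    """Return a list of {"type", "start_page", "end_page"} dicts (0-based, inclusive)."""
    documents = []
    current_doc = {"type": "unknown", "start_page": 0}
    with fitz.open(pdf_path) as doc:
        page_count = len(doc)
        for page_num in range(page_count):
            # Only the embedded text layer is searched; scanned pages without
            # a text layer return an empty string and never match a keyword.
            text = doc[page_num].get_text().upper()
            for keyword in BOUNDARY_KEYWORDS:
                if keyword in text:
                    # Close the previous document, unless this is the first page.
                    if page_num > 0:
                        current_doc["end_page"] = page_num - 1
                        documents.append(current_doc)
                    current_doc = {"type": keyword.lower(), "start_page": page_num}
                    break
    # Close the final document at the last page.
    current_doc["end_page"] = page_count - 1
    documents.append(current_doc)
    return documents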

Then run Docling on each split document separately.
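
One way to do this without writing intermediate files is to turn each detected boundary into a page range. The sketch below assumes the list-of-dicts shape returned by `detect_boundaries` (0-based, inclusive page indices) and produces 1-based inclusive tuples. The helper name `to_page_ranges` and the sample data are hypothetical.

```python
def to_page_ranges(documents):
    """Convert 0-based inclusive boundaries into 1-based inclusive page ranges."""
    ranges = []
    for d in documents:
        start, end = d["start_page"], d["end_page"]
        if end < start:
            continue  # skip empty segments
        ranges.append((d["type"], (start + 1, end + 1)))
    return ranges


# Example boundaries in the shape produced by detect_boundaries().
sample = [
    {"type": "invoice", "start_page": 0, "end_page": 2},
    {"type": "laporan keuangan", "start_page": 3, "end_page": 9},
]
for doc_type, page_range in to_page_ranges(sample):
    print(doc_type, page_range)
```

Each tuple could then be passed to Docling, for example as `converter.convert(pdf_path, page_range=page_range)`, assuming the installed Docling version exposes a 1-based `page_range` argument on `convert`. Check the API reference for your version before relying on it.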


Estimated Processing Time (50 pages, CPU)

| Scenario | Time |
|---|---|
| All digital pages (no OCR) | ~1–2 minutes |
| 50% scanned pages | ~5–8 minutes |
| All scanned pages | ~10–15 minutes |
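
The 50% row is consistent with interpolating linearly between the other two. The following sketch shows that arithmetic for other page mixes. It assumes time scales linearly with page count and uses only the 50-page figures from the table above. The function name `estimate_minutes` is hypothetical, and real timings depend on hardware and page content.

```python
# Minutes for 50 pages, taken from the table above: (low, high).
DIGITAL_PER_50 = (1, 2)     # all digital pages, no OCR
SCANNED_PER_50 = (10, 15)   # all scanned pages


def estimate_minutes(digital_pages, scanned_pages):
    """Rough (low, high) estimate in minutes, assuming linear scaling."""
    low = (DIGITAL_PER_50[0] * digital_pages + SCANNED_PER_50[0] * scanned_pages) / 50
    high = (DIGITAL_PER_50[1] * digital_pages + SCANNED_PER_50[1] * scanned_pages) / 50
    return low, high


print(estimate_minutes(25, 25))   # 50 pages, half of them scanned
```

For 25 digital and 25 scanned pages this gives 5.5 to 8.5 minutes, in line with the ~5–8 minutes quoted for the 50% scanned scenario.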

When to Use VLM Pipeline (OpenAI API)

Skip the VLM pipeline at first. Use it only if, after reviewing the initial output, specific pages turn out to have very complex layouts (e.g., merged cells or deeply nested hierarchical tables). At ~$0.01–$0.04 per page for GPT-4o, sending only the problematic pages keeps costs low.
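
To put numbers on the selective approach, here is a small cost sketch using the per-page range quoted above. The function name `vlm_cost_usd` and the example page counts are assumptions for illustration.

```python
# Per-page cost range in US cents, from the estimate quoted above.
COST_CENTS_PER_PAGE = (1, 4)


def vlm_cost_usd(pages):
    """Return the (low, high) cost in USD for sending `pages` pages."""
    low_cents, high_cents = COST_CENTS_PER_PAGE
    return pages * low_cents / 100, pages * high_cents / 100


print(vlm_cost_usd(5))    # only the problematic pages of a 50-page PDF
print(vlm_cost_usd(50))   # the whole document
```

With 5 flagged pages the range is $0.05 to $0.20, compared with $0.50 to $2.00 for sending all 50 pages.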

All built-in OCR engines (RapidOCR, EasyOCR, Tesseract) are free, local, and send no data to external APIs.