Best Practice: Mixed PDF, CPU-Only, Sensitive Financial Data in Docling
Key Insight: No Manual Combination Needed
Docling automatically handles mixed PDFs (digital and scanned pages in one file). It analyzes bitmap coverage on each page to decide how that page is processed:
- Digital pages → text extracted directly from PDF (no OCR)
- Scanned pages → OCR runs automatically only on bitmap areas
- Mixed pages → native text + OCR combined, duplicates filtered via spatial analysis
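The per-page decision described above can be pictured with a small sketch. It is an illustration only, not Docling's actual implementation: the names `bitmap_coverage` and `classify_page`, the rectangle format, and the 5% / 95% thresholds are all assumptions chosen for the example.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page units


def bitmap_coverage(page: Rect, bitmaps: List[Rect]) -> float:
    """Fraction of the page area covered by bitmaps.

    Overlapping bitmaps are counted more than once, so the result is capped at 1.0.
    """
    px0, py0, px1, py1 = page
    page_area = (px1 - px0) * (py1 - py0)
    if page_area <= 0:
        return 0.0
    covered = 0.0
    for x0, y0, x1, y1 in bitmaps:
        # Clip each bitmap to the page before measuring it.
        width = max(0.0, min(x1, px1) - max(x0, px0))
        height = max(0.0, min(y1, py1) - max(y0, py0))
        covered += width * height
    return min(1.0, covered / page_area)


def classify_page(page: Rect, bitmaps: List[Rect],
                  low: float = 0.05, high: float = 0.95) -> str:
    """Hypothetical thresholds: below `low` is digital, above `high` is scanned."""
    coverage = bitmap_coverage(page, bitmaps)
    if coverage < low:
        return "digital"
    if coverage > high:
        return "scanned"
    return "mixed"
```

A page with no bitmaps is classified as digital, a page fully covered by one image as scanned, and anything in between as mixed.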
Recommended Pipeline (CPU-Only, No Cost)
```python
import os

# Set before importing Docling so native libraries see it at load time.
os.environ["OMP_NUM_THREADS"] = "4"

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    RapidOcrOptions,
    TableStructureOptions,
    TableFormerMode,
)
from docling.datamodel.base_models import InputFormat
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=RapidOcrOptions(),  # CPU-friendly, ~6s/page
    images_scale=2.0,  # Higher resolution = better number accuracy
    do_table_structure=True,
    table_structure_options=TableStructureOptions(
        mode=TableFormerMode.ACCURATE,  # Most precise for financial tables
        do_cell_matching=False,  # Important for scanned pages
    ),
    # Disable features that are not needed for text and table extraction.
    generate_page_images=False,
    generate_picture_images=False,
    do_code_enrichment=False,
    do_formula_enrichment=False,
    accelerator_options=AcceleratorOptions(num_threads=4, device=AcceleratorDevice.CPU),
)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

result = converter.convert("laporan_keuangan.pdf")
doc = result.document

# Export for AI agent
md = doc.export_to_markdown(traverse_pictures=True)  # traverse_pictures=True is required

# Lossless backup; make sure the target directory exists first.
os.makedirs("output", exist_ok=True)
doc.save_as_json("output/document.json")
```
Key configuration choices:
| Option | Reason |
|---|---|
| RapidOcrOptions() | CPU-friendly OCR, ~6s/page (vs EasyOCR ~70s/page on CPU) |
| images_scale=2.0 | 2x resolution improves readability of small numbers |
| TableFormerMode.ACCURATE | Most precise mode for financial tables |
| do_cell_matching=False | Let TableFormer define cells independently (critical for scanned pages) |
| traverse_pictures=True | Required during export so OCR text from scanned pages is included |
OCR Engine Comparison (CPU-Only)
| Engine | Speed (CPU) | Accuracy | Notes |
|---|---|---|---|
| RapidOCR ✅ | ~6s/page | Good | Best choice for CPU |
| EasyOCR | ~70s/page | Better for numeric tables | Too slow without GPU |
| Tesseract | ~6s/page | Good for standard text | Stable and lightweight |
Recommendation: Use RapidOCR on CPU. EasyOCR is more accurate for numeric tables but, at ~70s/page versus ~6s/page, more than 10x slower without a GPU.
If the PDF Contains Multiple Logical Documents (e.g., invoice + report + receipt)
Split by keyword first, then process each logical document independently:
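```python
import fitz  # PyMuPDF

BOUNDARY_KEYWORDS = ["INVOICE", "FAKTUR", "LAPORAN KEUANGAN", "BANK STATEMENT", "RECEIPT"]


def detect_boundaries(pdf_path):
    """Return a list of {"type", "start_page", "end_page"} dicts (0-based, inclusive)."""
    doc = fitz.open(pdf_path)
    documents = []
    current_doc = {"type": "unknown", "start_page": 0}
    for page_num in range(len(doc)):
        # Pages without a text layer (pure scans) yield no text here,
        # so they stay attached to the current document.
        text = doc[page_num].get_text().upper()
        for keyword in BOUNDARY_KEYWORDS:
            if keyword in text:
                if page_num > 0:
                    # Close the previous document before starting a new one.
                    current_doc["end_page"] = page_num - 1
                    documents.append(current_doc)
                current_doc = {"type": keyword.lower(), "start_page": page_num}
                break
    # Close the last open document at the final page.
    current_doc["end_page"] = len(doc) - 1
    documents.append(current_doc)
    doc.close()
    return documents
```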
Then run Docling on each split document separately.
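One way to connect the detected boundaries to the converter is to turn each 0-based segment into a 1-based, inclusive page range. The helper below is a minimal sketch: `to_page_ranges` is a hypothetical name, and it assumes segment dictionaries shaped like the output of `detect_boundaries`. Passing each range to Docling through a `page_range` argument of `converter.convert(...)` is likewise an assumption to verify against your installed Docling version.

```python
from typing import Dict, List, Tuple


def to_page_ranges(segments: List[Dict]) -> List[Tuple[str, Tuple[int, int]]]:
    """Convert 0-based {"start_page", "end_page"} segments to 1-based inclusive ranges."""
    ranges = []
    for seg in segments:
        start, end = seg["start_page"], seg["end_page"]
        if end < start:
            continue  # skip empty segments
        ranges.append((seg["type"], (start + 1, end + 1)))
    return ranges


# Example input shaped like the output of detect_boundaries().
segments = [
    {"type": "invoice", "start_page": 0, "end_page": 2},
    {"type": "laporan keuangan", "start_page": 3, "end_page": 9},
]
for doc_type, page_range in to_page_ranges(segments):
    # Assumed usage: converter.convert(pdf_path, page_range=page_range)
    print(doc_type, page_range)
```

Keeping the conversion in one place avoids off-by-one mistakes when the detector counts pages from 0 and the converter counts from 1.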
Estimated Processing Time (50 pages, CPU)
| Scenario | Time |
|---|---|
| All digital pages (no OCR) | ~1–2 minutes |
| 50% scanned pages | ~5–8 minutes |
| All scanned pages | ~10–15 minutes |
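For files of other sizes, the table can be turned into a rough estimator by interpolating linearly between the all-digital and all-scanned rows. This is a back-of-the-envelope sketch: `estimate_minutes` is a hypothetical helper, the rates are simply the 50-page figures from the table, and real throughput depends on hardware and page complexity.

```python
from typing import Tuple

# Minutes per 50 pages as (low, high) bounds, taken from the table.
DIGITAL_PER_50 = (1, 2)
SCANNED_PER_50 = (10, 15)


def estimate_minutes(digital_pages: int, scanned_pages: int) -> Tuple[float, float]:
    """Return a (low, high) estimate in minutes for the given page mix."""
    low = (digital_pages * DIGITAL_PER_50[0] + scanned_pages * SCANNED_PER_50[0]) / 50
    high = (digital_pages * DIGITAL_PER_50[1] + scanned_pages * SCANNED_PER_50[1]) / 50
    return low, high


print(estimate_minutes(25, 25))
```

For a 50-page file that is half scanned, this yields 5.5 to 8.5 minutes, in line with the table's middle row.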
When to Use VLM Pipeline (OpenAI API)
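Skip the VLM pipeline for now. After reviewing the initial output, use it only for specific pages whose layouts are too complex for the standard pipeline (e.g., merged cells or deeply nested hierarchical tables). At ~$0.01–$0.04/page for GPT-4o, sending only the problematic pages keeps costs low. Keep in mind that those pages are sent to an external API, which matters for sensitive financial data.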
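The cost of selective VLM use follows directly from the per-page price range quoted above. The snippet below is a minimal arithmetic sketch; `vlm_cost_usd` is a hypothetical helper, and actual pricing depends on image size and the current API rates.

```python
from typing import Tuple

# Per-page price range quoted above for GPT-4o, in US cents.
PRICE_CENTS = (1, 4)


def vlm_cost_usd(problem_pages: int) -> Tuple[float, float]:
    """Return the (low, high) cost in USD for sending `problem_pages` pages."""
    low = problem_pages * PRICE_CENTS[0] / 100
    high = problem_pages * PRICE_CENTS[1] / 100
    return low, high


print(vlm_cost_usd(5))   # only the problematic pages
print(vlm_cost_usd(50))  # the whole 50-page file
```

Sending 5 problematic pages costs between $0.05 and $0.20, compared with $0.50 to $2.00 for all 50 pages.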
All built-in OCR engines (RapidOCR, EasyOCR, Tesseract) are free, local, and send no data to external APIs.