What is the recommended architecture and implementation for processing a mixed PDF bundle (digital + scanned pages) for an enterprise financial payment verification system (like MRT Jakarta's AI-Augmented Payment Flow), with constraints of no GPU and limited cost?

Recommended Architecture for Mixed PDF Processing (No GPU, Enterprise Financial)

Overview Flow

Requester uploads PDF bundle
       ↓
PyMuPDF split by keyword (< 1 second)
       ↓
Docling extract per logical document (CPU, ~6s/scanned page)
       ↓
Structured JSON/Markdown per document type
       ↓
AI Agent (OpenAI) for business logic verification
       ↓
Recommendation → Human validation

Key Principles

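Docling handles mixed PDFs automatically. With OCR enabled, digital pages go through native text extraction and scanned pages go through OCR, so no manual per-page configuration is needed.
Split the PDF bundle first using simple rule-based keyword detection rather than an AI classifier. Financial and government documents follow consistent formats, so a fixed marker list is usually enough. Keyword detection only sees pages that have a text layer; a scanned page with no embedded text stays attached to the preceding segment.
There is no need to combine PyMuPDF, EasyOCR, Tesseract, and a VLM for text extraction. Docling handles both OCR and native text internally, and PyMuPDF is used only for splitting. The VLM pipeline is unnecessary here and too costly without a GPU.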
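
To gauge how much of a bundle will go through OCR before running Docling, a rough check on the amount of embedded text per page can help. The sketch below is a heuristic and not something Docling exposes: the function name `likely_scanned_pages` and the `min_chars` threshold of 20 are illustrative assumptions, and the input is assumed to be a list of page texts such as those returned by PyMuPDF's `get_text()`.

```python
def likely_scanned_pages(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return 0-based indices of pages whose text layer is nearly empty.

    Heuristic only: a page with fewer than `min_chars` non-whitespace
    characters is assumed to be a scan (hypothetical threshold).
    """
    scanned = []
    for index, text in enumerate(page_texts):
        visible = "".join(text.split())  # drop all whitespace
        if len(visible) < min_chars:
            scanned.append(index)
    return scanned


pages = [
    "INVOICE No. 001\nTotal: Rp 1.250.000",  # digital page
    "",                                       # scan with no text layer
    "  \n ",                                  # scan with stray whitespace
    "BERITA ACARA SERAH TERIMA barang",       # digital page
]
print(likely_scanned_pages(pages))  # [1, 2]
```

The flagged pages are also the ones the keyword split cannot see, so a bundle with many of them may need a manual look at the segment boundaries.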

Step 1: Split PDF Bundle (PyMuPDF)

import fitz # PyMuPDF

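# Markers are checked in insertion order and the first hit wins, so a longer
# marker must come before any marker it contains ("FAKTUR PAJAK" before "FAKTUR").
DOCUMENT_MARKERS = {
    "FAKTUR PAJAK": "faktur_pajak",
    "INVOICE": "invoice",
    "FAKTUR": "invoice",
    "KWITANSI": "kwitansi",
    "GOODS RECEIPT": "goods_receipt",
    "BAST": "bast",
    "BERITA ACARA": "bast",
    "SPP": "spp",
    "SURAT PERINTAH": "spp",
}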

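def split_bundle(pdf_path: str) -> list[dict]:
    """Split a bundle into segments with 0-based, inclusive page ranges."""
    segments = []
    current = {"type": "unknown", "start": 0}

    with fitz.open(pdf_path) as doc:
        page_count = len(doc)
        for page_num in range(page_count):
            # Scanned pages have no text layer, so they never match a marker
            # and stay in the current segment.
            text = doc[page_num].get_text().upper()
            for marker, doc_type in DOCUMENT_MARKERS.items():
                # A marker for a different type starts a new segment.
                if marker in text and (page_num == 0 or doc_type != current["type"]):
                    if page_num > current["start"]:
                        current["end"] = page_num - 1
                        segments.append(current)
                    current = {"type": doc_type, "start": page_num}
                    break

    # Close the last open segment at the final page.
    current["end"] = page_count - 1
    segments.append(current)
    return segments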
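
Plain substring checks can misfire in two ways: a short marker such as `SPP` can match inside a longer word, and a shorter marker such as `FAKTUR` also matches pages containing `FAKTUR PAJAK`, so the result depends on lookup order. One possible refinement, sketched below, is to match on word boundaries and try longer markers first. The helper name `detect_doc_type` and the reduced marker table `MARKERS` are illustrative assumptions, not part of the original pipeline.

```python
import re
from typing import Optional

# Reduced, illustrative marker table (assumption).
MARKERS = {
    "FAKTUR": "invoice",
    "FAKTUR PAJAK": "faktur_pajak",
    "INVOICE": "invoice",
    "SPP": "spp",
}


def _compile(marker: str) -> "re.Pattern[str]":
    # Whole-word match; any run of whitespace is allowed between marker words.
    body = r"\s+".join(re.escape(word) for word in marker.split())
    return re.compile(r"\b" + body + r"\b")


# Longest markers first so "FAKTUR PAJAK" wins over "FAKTUR".
_PATTERNS = [
    (_compile(marker), doc_type)
    for marker, doc_type in sorted(MARKERS.items(), key=lambda kv: -len(kv[0]))
]


def detect_doc_type(page_text: str) -> Optional[str]:
    """Return the document type of the first matching marker, or None."""
    text = page_text.upper()
    for pattern, doc_type in _PATTERNS:
        if pattern.search(text):
            return doc_type
    return None


print(detect_doc_type("Faktur  Pajak No. 010"))   # faktur_pajak
print(detect_doc_type("Faktur penjualan"))        # invoice
print(detect_doc_type("SPPD perjalanan dinas"))   # None
```

Sorting by length removes the dependence on dictionary order, at the cost of a small one-time regex compilation.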

Step 2: Docling Extraction Per Logical Document (CPU-Only)

import os

# Cap CPU threads before importing Docling so the numerical backends pick it up.
os.environ["OMP_NUM_THREADS"] = "4"

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions, RapidOcrOptions,
    TableStructureOptions, TableFormerMode,
)
from docling.datamodel.base_models import InputFormat

def create_converter():
    pipeline_options = PdfPipelineOptions(
        do_ocr=True,
        ocr_options=RapidOcrOptions(), # CPU-friendly, ~6s/page
        images_scale=2.0, # Higher resolution for numeric accuracy
        do_table_structure=True,
        table_structure_options=TableStructureOptions(
            mode=TableFormerMode.ACCURATE,
            do_cell_matching=False, # Important for scanned pages
        ),
        generate_page_images=False,
        generate_picture_images=False,
        do_code_enrichment=False,
        do_formula_enrichment=False,
    )
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )

Step 3: Structured Output per Document Type

def process_bundle(pdf_path: str) -> dict:
    converter = create_converter()
    segments = split_bundle(pdf_path)
    extracted = {}

    for seg in segments:
        # split_bundle uses 0-based pages; Docling's page_range is 1-based and inclusive.
        page_range = (seg["start"] + 1, seg["end"] + 1)
        result = converter.convert(pdf_path, page_range=page_range)
        doc = result.document

        entry = {
            "markdown": doc.export_to_markdown(traverse_pictures=True),
            "tables": [table.export_to_dataframe(doc=doc) for table in doc.tables],
            "json": doc.export_to_dict(),
        }

        extracted.setdefault(seg["type"], []).append(entry)

    return extracted

# Output example:
# {
# "invoice": [{"markdown": "...", "tables": [...], "json": {...}}],
# "kwitansi": [{"markdown": "...", "tables": [...], "json": {...}}],
# "faktur_pajak": [{"markdown": "...", "tables": [...], "json": {...}}],
# }
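
Because the result is keyed by document type, a cheap completeness check can run before any call to the AI agent. The sketch below assumes a hypothetical set of required types (`REQUIRED_TYPES`) and a hypothetical helper `missing_documents`; the real list depends on the payment policy being verified.

```python
# Assumed policy: every payment bundle must contain these document types.
REQUIRED_TYPES = {"invoice", "kwitansi", "faktur_pajak", "bast", "spp"}


def missing_documents(extracted: dict, required: set[str] = REQUIRED_TYPES) -> list[str]:
    """Return the required document types with no extracted entry, sorted."""
    present = {doc_type for doc_type, entries in extracted.items() if entries}
    return sorted(required - present)


bundle = {
    "invoice": [{"markdown": "..."}],
    "kwitansi": [{"markdown": "..."}],
    "unknown": [{"markdown": "..."}],  # pages before the first marker
}
print(missing_documents(bundle))  # ['bast', 'faktur_pajak', 'spp']
```

A non-empty result can be returned to the requester immediately, which avoids spending agent calls on an incomplete bundle.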

Configuration Rationale

Setting	Reason
`RapidOcrOptions()`	Fast CPU OCR engine (~6s/page vs ~70s/page for EasyOCR on CPU)
`images_scale=2.0`	Higher resolution improves accuracy for small numbers
`TableFormerMode.ACCURATE`	Most precise mode for financial tables
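`do_cell_matching=False`	Critical for scanned pages: lets TableFormer define cells independently instead of matching them back to PDF text cells
`traverse_pictures=True`	Includes text nested inside picture elements in the export, which is where OCR text from scanned pages can end up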

What You Do NOT Need

❌ EasyOCR (too slow on CPU: ~70s/page)
❌ VLM Pipeline (requires GPU or costly OpenAI API per page)
❌ AI/LLM-based boundary detection (rule-based keywords are sufficient for structured financial documents)
❌ LayoutLM / Donut (overkill)
❌ One giant combined JSON (split per logical document type)

Estimated Performance (50 pages, CPU-only)

Scenario	Time
All digital pages (no OCR)	~1–2 minutes
50% scanned pages	~5–8 minutes
All scanned pages	~10–15 minutes

Cost: $0 for extraction (all local). OpenAI cost applies only to the AI agent verification layer, not the document extraction step.
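
The table implies rough per-page rates: 1–2 minutes for 50 digital pages is about 1.2–2.4 s per page, and 10–15 minutes for 50 scanned pages is about 12–18 s per page. Scaling those rates linearly gives a quick estimate for other bundle mixes. The sketch below is only that arithmetic; the function name `estimate_seconds` is illustrative, and linear scaling is an assumption rather than a measured result.

```python
# Bounds taken from the table above, in seconds per 50-page bundle.
PAGES_IN_TABLE = 50
DIGITAL_BOUNDS = (60, 120)    # ~1-2 minutes, all digital
SCANNED_BOUNDS = (600, 900)   # ~10-15 minutes, all scanned


def estimate_seconds(digital_pages: int, scanned_pages: int) -> tuple[float, float]:
    """Return (low, high) processing-time estimates in seconds."""
    low = (digital_pages * DIGITAL_BOUNDS[0] + scanned_pages * SCANNED_BOUNDS[0]) / PAGES_IN_TABLE
    high = (digital_pages * DIGITAL_BOUNDS[1] + scanned_pages * SCANNED_BOUNDS[1]) / PAGES_IN_TABLE
    return low, high


low, high = estimate_seconds(digital_pages=25, scanned_pages=25)
print(f"{low / 60:.1f}-{high / 60:.1f} minutes")  # 5.5-8.5 minutes
```

For the half-scanned case this gives 5.5–8.5 minutes, in line with the ~5–8 minutes listed in the table.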