What is the recommended architecture and implementation for processing a mixed PDF bundle (digital + scanned pages) for an enterprise financial payment verification system (like MRT Jakarta's AI-Augmented Payment Flow), with constraints of no GPU and limited cost?
Type: Answer
Status: Published
Created: May 23, 2026
Updated: May 23, 2026
Created by: Dosu Bot
Updated by: Dosu Bot

Overview Flow

1. The requester uploads a PDF bundle.
2. PyMuPDF splits the bundle by keyword (< 1 second).
3. Docling extracts each logical document (CPU, ~6 s per scanned page).
4. Extraction yields structured JSON/Markdown per document type.
5. An AI agent (OpenAI) verifies the business logic.
6. The agent's recommendation goes to a human for validation.

Key Principles

  • Docling handles mixed PDFs automatically: it detects whether each page is digital or scanned. Digital pages use native text extraction, and scanned pages trigger OCR. No manual per-page configuration is needed.
  • Split the PDF bundle first using simple rule-based keyword detection, not AI classifiers. Financial and government documents have consistent formats.
  • There is no need to combine PyMuPDF + EasyOCR + Tesseract + a VLM for extraction. Docling handles OCR and native text internally. The VLM pipeline is unnecessary and too costly without a GPU.
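
The rule-based detection described above can be sketched without any PDF library. This is a minimal illustration under stated assumptions, not the pipeline's actual classifier: the `MARKERS` table and the `classify_page` helper are hypothetical names, and the word-boundary matching is an assumption added here so that a short marker such as `SPP` does not match inside a longer token.

```python
import re
from typing import Optional

# Hypothetical marker table for illustration: phrase -> document type.
MARKERS = {
    "FAKTUR PAJAK": "faktur_pajak",
    "FAKTUR": "invoice",
    "INVOICE": "invoice",
    "KWITANSI": "kwitansi",
    "SPP": "spp",
}

# Compile longest phrases first so "FAKTUR PAJAK" is tested before "FAKTUR".
# \b word boundaries (an assumption of this sketch) reject partial-word hits.
_PATTERNS = [
    (re.compile(r"\b" + re.escape(phrase) + r"\b"), doc_type)
    for phrase, doc_type in sorted(MARKERS.items(), key=lambda kv: -len(kv[0]))
]


def classify_page(text: str) -> Optional[str]:
    """Return the document type of the first matching marker, or None."""
    upper = text.upper()
    for pattern, doc_type in _PATTERNS:
        if pattern.search(upper):
            return doc_type
    return None
```

A page with no text layer yields an empty string and therefore `None`, so a caller has to decide how to treat unlabelled pages.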

Step 1: Split PDF Bundle (PyMuPDF)

import fitz # PyMuPDF

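# Marker phrase -> logical document type.
# Order matters: the loop below stops at the first match, so a longer phrase
# ("FAKTUR PAJAK") must come before any shorter phrase it contains ("FAKTUR").
DOCUMENT_MARKERS = {
    "FAKTUR PAJAK": "faktur_pajak",
    "INVOICE": "invoice",
    "FAKTUR": "invoice",
    "KWITANSI": "kwitansi",
    "GOODS RECEIPT": "goods_receipt",
    "BAST": "bast",
    "BERITA ACARA": "bast",
    "SPP": "spp",
    "SURAT PERINTAH": "spp",
}

def split_bundle(pdf_path: str) -> list[dict]:
    """Return segments as dicts with "type", "start", and "end" (0-based, inclusive)."""
    doc = fitz.open(pdf_path)
    segments = []
    current = {"type": "unknown", "start": 0}

    for page_num in range(len(doc)):
        # Pages without a text layer (pure scans) return an empty string here,
        # so they match no marker and simply extend the current segment.
        text = doc[page_num].get_text().upper()
        for marker, doc_type in DOCUMENT_MARKERS.items():
            # A new segment starts when a marker for a different type appears.
            if marker in text and (page_num == 0 or doc_type != current["type"]):
                if page_num > current["start"]:
                    current["end"] = page_num - 1
                    segments.append(current)
                current = {"type": doc_type, "start": page_num}
                break

    # Close the final segment at the last page.
    current["end"] = len(doc) - 1
    segments.append(current)
    doc.close()
    return segments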
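
The grouping logic in `split_bundle` can be exercised without a PDF by separating it from text extraction. The sketch below is an illustrative refactoring, not part of the original answer: `segments_from_labels` is a hypothetical helper that takes one label per page (`None` when no marker was found, as happens for continuation pages or scans without a text layer) and applies the same "new type starts a new segment" rule.

```python
from typing import Optional


def segments_from_labels(labels: list[Optional[str]]) -> list[dict]:
    """Group per-page labels into contiguous segments (0-based, inclusive)."""
    segments: list[dict] = []
    for page_num, label in enumerate(labels):
        # A labelled page opens a new segment only if its type differs
        # from the segment currently being built.
        starts_new = label is not None and (
            not segments or label != segments[-1]["type"]
        )
        if not segments or starts_new:
            segments.append(
                {"type": label or "unknown", "start": page_num, "end": page_num}
            )
        else:
            # Unlabelled or same-type pages extend the current segment.
            segments[-1]["end"] = page_num
    return segments
```

Two behaviours are worth noting. Consecutive documents of the same type merge into one segment, exactly as in `split_bundle`, and unlabelled pages are attached to whatever precedes them, so a bundle that starts with scanned pages produces a leading `"unknown"` segment that may deserve manual review. Unlike `split_bundle`, this sketch returns an empty list for an empty input.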

Step 2: Docling Extraction Per Logical Document (CPU-Only)

import os

# Cap CPU threads before importing Docling, so OpenMP-backed libraries
# see the setting when they initialise.
os.environ["OMP_NUM_THREADS"] = "4"

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions, RapidOcrOptions,
    TableStructureOptions, TableFormerMode,
)
from docling.datamodel.base_models import InputFormat

def create_converter():
    pipeline_options = PdfPipelineOptions(
        do_ocr=True,
        ocr_options=RapidOcrOptions(), # CPU-friendly, ~6s/page
        images_scale=2.0, # Higher resolution for numeric accuracy
        do_table_structure=True,
        table_structure_options=TableStructureOptions(
            mode=TableFormerMode.ACCURATE,
            do_cell_matching=False, # Important for scanned pages
        ),
        generate_page_images=False,
        generate_picture_images=False,
        do_code_enrichment=False,
        do_formula_enrichment=False,
    )
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )
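
The thread cap above is hard-coded to 4. On hosts with fewer cores, a value derived from the machine may be preferable. The following is a minimal sketch, assuming a hypothetical helper named `configure_cpu_threads`; the default limit of 4 simply mirrors the value used in this answer and is not a tuned recommendation.

```python
import os


def configure_cpu_threads(max_threads: int = 4) -> int:
    """Set OMP_NUM_THREADS to min(max_threads, available cores) and return it."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    threads = max(1, min(max_threads, available))
    os.environ["OMP_NUM_THREADS"] = str(threads)
    return threads
```

Calling it before the Docling imports is the safer order, since environment variables of this kind are generally read when the numeric libraries initialise.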

Step 3: Structured Output per Document Type

def process_bundle(pdf_path: str) -> dict:
    converter = create_converter()
    segments = split_bundle(pdf_path)
    extracted = {}

    for seg in segments:
        # Segments are 0-based; Docling's page_range is 1-based and inclusive.
        page_range = (seg["start"] + 1, seg["end"] + 1)
        result = converter.convert(pdf_path, page_range=page_range)
        doc = result.document

        entry = {
            "markdown": doc.export_to_markdown(traverse_pictures=True),
            "tables": [table.export_to_dataframe(doc=doc) for table in doc.tables],
            "json": doc.export_to_dict(),
        }

        extracted.setdefault(seg["type"], []).append(entry)

    return extracted

# Output example:
# {
#     "invoice": [{"markdown": "...", "tables": [...], "json": {...}}],
#     "kwitansi": [{"markdown": "...", "tables": [...], "json": {...}}],
#     "faktur_pajak": [{"markdown": "...", "tables": [...], "json": {...}}],
# }
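
The `tables` entries returned by `process_bundle` are pandas DataFrames, which cannot be passed to `json.dumps` directly. Before handing the result to the AI agent, they need to be converted. The sketch below is one possible approach under stated assumptions: `build_agent_payload` is a hypothetical helper, and dropping the full `json` export to keep the prompt small is a design choice of this sketch, not something the answer prescribes.

```python
import json
from typing import Any


def build_agent_payload(extracted: dict[str, list[dict[str, Any]]]) -> str:
    """Serialise the extraction result into a compact JSON string."""
    payload = {}
    for doc_type, entries in extracted.items():
        payload[doc_type] = [
            {
                "markdown": entry["markdown"],
                # DataFrame.to_dict(orient="records") gives one dict per row.
                "tables": [
                    table.to_dict(orient="records")
                    for table in entry.get("tables", [])
                ],
            }
            for entry in entries
        ]
    # default=str is a fallback for cell values json cannot encode natively.
    return json.dumps(payload, ensure_ascii=False, default=str)
```

The resulting string groups every logical document under its type, so the agent can compare, for example, an invoice against its kwitansi without parsing the whole bundle.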

Configuration Rationale

| Setting | Reason |
|---|---|
| RapidOcrOptions() | Best CPU OCR engine (~6 s/page vs ~70 s/page for EasyOCR on CPU) |
| images_scale=2.0 | Higher resolution improves accuracy for small numbers |
| TableFormerMode.ACCURATE | Most precise mode for financial tables |
| do_cell_matching=False | Critical for scanned pages: lets TableFormer define cells independently |
| traverse_pictures=True | Required in export to include OCR text from scanned pages |

What You Do NOT Need

  • ❌ EasyOCR (too slow on CPU: ~70s/page)
  • ❌ VLM Pipeline (requires GPU or costly OpenAI API per page)
  • ❌ AI/LLM-based boundary detection (rule-based keywords are sufficient for structured financial documents)
  • ❌ LayoutLM / Donut (overkill)
  • ❌ One giant combined JSON (split per logical document type)

Estimated Performance (50 pages, CPU-only)

| Scenario | Time |
|---|---|
| All digital pages (no OCR) | ~1–2 minutes |
| 50% scanned pages | ~5–8 minutes |
| All scanned pages | ~10–15 minutes |
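
For bundles of other sizes, the table can be turned into a rough estimate by scaling its two end points linearly. This is a back-of-the-envelope sketch, assuming processing time grows linearly with page count; `estimate_seconds` and the constants are illustrative names, and the figures are taken directly from the all-digital and all-scanned rows above.

```python
REFERENCE_PAGES = 50
DIGITAL_SECONDS = (60, 120)   # 50 digital pages: ~1-2 minutes
SCANNED_SECONDS = (600, 900)  # 50 scanned pages: ~10-15 minutes


def estimate_seconds(digital_pages: int, scanned_pages: int) -> tuple[float, float]:
    """Return a (low, high) processing-time estimate in seconds."""
    low = (
        digital_pages * DIGITAL_SECONDS[0] + scanned_pages * SCANNED_SECONDS[0]
    ) / REFERENCE_PAGES
    high = (
        digital_pages * DIGITAL_SECONDS[1] + scanned_pages * SCANNED_SECONDS[1]
    ) / REFERENCE_PAGES
    return low, high
```

For 25 digital and 25 scanned pages this gives 330 to 510 seconds (5.5 to 8.5 minutes), close to the table's ~5–8 minutes for the 50% scanned case.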

Cost: $0 for extraction (all local). OpenAI cost applies only to the AI agent verification layer, not the document extraction step.