## Cause of `std::bad_alloc` in Docling
The `std::bad_alloc` errors occur because Docling's `StandardPdfPipeline` uses the docling-parse C++ backend, which loads page images into memory for layout detection, OCR, and table structure analysis. When processing large PDFs (700+ pages with images and math formulas), the C++ backend accumulates memory internally and eventually exhausts available RAM, even on machines with 32 GB.
This issue is specific to PDFs and image files. DOCX, XLSX, and PPTX use `SimplePipeline` (direct XML parsing) and are not affected.
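Before adopting a workaround, it can help to confirm the accumulation pattern by logging the process's peak resident memory around each conversion call. The helper below is an illustrative sketch using only the standard library (not part of Docling's API); note that on Linux `ru_maxrss` is reported in KiB.

```python
import resource


def peak_rss_mib():
    """Peak resident set size of this process, in MiB (Linux reports KiB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def run_and_report(label, fn, *args, **kwargs):
    """Run fn and print the process-wide peak RSS afterwards.

    Wrap each converter.convert(...) call with this to see whether
    peak memory keeps climbing across successive conversions.
    """
    result = fn(*args, **kwargs)
    print(f"{label}: peak RSS {peak_rss_mib():.0f} MiB")
    return result
```

If the reported peak grows monotonically across batches, memory is being retained by the backend rather than released between conversions.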
## Recommended Workarounds
### 1. Process in page range batches
```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    generate_page_images=False,
    generate_picture_images=False,
    ocr_batch_size=1,
    layout_batch_size=1,
    table_batch_size=1,
)

# Process 50 pages at a time, creating a fresh converter per batch
for start in range(1, 701, 50):
    end = min(start + 49, 700)
    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )
    result = converter.convert("your_large.pdf", page_range=(start, end))
    # process result...
    del converter  # help garbage collection
```
Note: Create a new `DocumentConverter` instance per batch, as the backend may retain caches between calls.
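Since `del converter` only drops the Python reference and cannot force the C++ backend to release its allocations, a stricter variant is to run each batch in a short-lived worker process: when the process exits, the OS reclaims everything it allocated. The sketch below is illustrative, not Docling API; the `WORKER` script is a stub where the real `DocumentConverter(...).convert(...)` call would go.

```python
import subprocess
import sys

# Hypothetical worker script. In real use it would build a
# DocumentConverter and convert only pages [start, end] of the PDF,
# then exit so the OS reclaims all backend memory.
WORKER = r"""
import sys
start, end = int(sys.argv[1]), int(sys.argv[2])
# ... DocumentConverter(...).convert(pdf_path, page_range=(start, end)) ...
print(f"done {start}-{end}")
"""


def convert_in_subprocesses(total_pages, batch_size=50):
    """Run each page-range batch in its own short-lived Python process."""
    outputs = []
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        proc = subprocess.run(
            [sys.executable, "-c", WORKER, str(start), str(end)],
            capture_output=True, text=True, check=True,
        )
        outputs.append(proc.stdout.strip())
    return outputs
```

Each worker would need to serialize its output (e.g. write Markdown or JSON to disk) for the parent process to collect, since in-memory results die with the subprocess.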
### 2. Switch to the pypdfium2 backend
This resolves the memory accumulation but trades off table quality on complex or math-heavy PDFs (cell extraction becomes more fragmented):
```python
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            backend=PyPdfiumDocumentBackend,
            pipeline_options=PdfPipelineOptions(
                do_table_structure=True,
                do_cell_matching=False,  # recommended for pypdfium2
            ),
        )
    }
)
```
## Known Limitation: Cross-Page Tables
When batching with `page_range`, a table that spans a batch boundary (e.g., one that starts on page 100 and continues on page 101) produces separate `TableItem` objects with no continuation metadata. Docling has no built-in cross-page table merging. Recommended mitigations:
- Use `pypdfium2` externally to extract PDF bookmarks/outlines and split at chapter/section boundaries instead of arbitrary page counts.
- Implement LLM-assisted post-processing to detect and merge tables at batch boundaries.
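A cheap heuristic can handle many cases before reaching for an LLM: if the last table of one batch ends on that batch's final page and the first table of the next batch starts on the following page with the same column count, treat them as one table. The sketch below uses plain dicts as stand-ins for Docling's `TableItem` objects; the `page` and `rows` fields are illustrative, not Docling's actual schema.

```python
def merge_boundary_tables(prev_batch, next_batch, boundary_page):
    """Merge a table that was split across a batch boundary.

    prev_batch / next_batch: tables in reading order, each a dict
    {"page": int, "rows": [[cell, ...], ...]}.
    boundary_page: last page number of the earlier batch.
    Returns the two lists with the candidate pair joined into one table.
    """
    if not prev_batch or not next_batch:
        return prev_batch, next_batch
    tail, head = prev_batch[-1], next_batch[0]
    # Only merge when both fragments have rows of the same width.
    same_width = (
        tail["rows"] and head["rows"]
        and len(tail["rows"][0]) == len(head["rows"][0])
    )
    if tail["page"] == boundary_page and head["page"] == boundary_page + 1 and same_width:
        merged = {"page": tail["page"], "rows": tail["rows"] + head["rows"]}
        return prev_batch[:-1] + [merged], next_batch[1:]
    return prev_batch, next_batch
```

Column count alone will produce false positives (two unrelated tables of equal width on adjacent pages), which is where an LLM-assisted check of header text and cell content earns its keep.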
## Summary Table
| Backend | Memory | Table Quality (Math PDFs) |
|---|---|---|
| `docling-parse` (default) | Accumulates ❌ | High ✅ |
| `pypdfium2` | Efficient ✅ | Reduced ❌ |
For math-heavy academic papers, `docling-parse` with `page_range` batching is preferred, since it preserves table extraction quality.