When processing large PDFs (700+ pages) with Docling's StandardPdfPipeline, what causes the `std::bad_alloc` errors and what are the recommended workarounds?
Type: Answer
Status: Published
Created: Apr 17, 2026
Updated: Apr 17, 2026
Created by: Dosu Bot
Updated by: Dosu Bot

Cause of `std::bad_alloc` in Docling

The `std::bad_alloc` errors occur because Docling's StandardPdfPipeline uses the docling-parse C++ backend, which loads page images into memory for layout detection, OCR, and table structure analysis. When processing large PDFs (700+ pages with images and math formulas), the C++ backend accumulates memory internally and eventually exhausts available RAM, even on machines with 32 GB of RAM.
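
For reference, the failure reproduces with the defaults alone: a DocumentConverter constructed without format_options routes PDFs through StandardPdfPipeline on the docling-parse backend (the file name below is a placeholder):

from docling.document_converter import DocumentConverter

converter = DocumentConverter()  # defaults to StandardPdfPipeline + docling-parse for PDFs
result = converter.convert("large_700_page.pdf")  # memory grows page by page; may abort with std::bad_alloc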

This issue is specific to PDFs and image files. DOCX, XLSX, and PPTX use SimplePipeline (direct XML parsing) and are not affected.


1. Process in page-range batches

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat

pipeline_options = PdfPipelineOptions(
    generate_page_images=False,
    generate_picture_images=False,
    ocr_batch_size=1,
    layout_batch_size=1,
    table_batch_size=1,
)

# Process 50 pages at a time, creating a fresh converter per batch
for start in range(1, 701, 50):
    end = min(start + 49, 700)
    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )
    result = converter.convert("your_large.pdf", page_range=(start, end))
    # process result...
    del converter  # drop the reference so the backend can be garbage-collected

Note: Create a new DocumentConverter instance per batch, as the backend may retain caches between calls.
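
If memory still creeps upward across batches, an explicit collection pass at the end of each iteration may help. This is a sketch under one assumption: the retained buffers are reachable from Python-side wrapper objects (a leak held purely on the C++ side would not be freed this way).

import gc

for start in range(1, 701, 50):
    ...  # convert the batch as above
    del converter
    gc.collect()  # force-release wrapper objects holding native backend buffers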

2. Switch to the pypdfium2 backend

The pypdfium2 backend resolves the memory accumulation, but the trade-off on complex, math-heavy PDFs is more fragmented table cell extraction:

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(do_table_structure=True)
# do_cell_matching lives on table_structure_options, not on PdfPipelineOptions itself
pipeline_options.table_structure_options.do_cell_matching = False  # recommended for pypdfium2

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            backend=PyPdfiumDocumentBackend,
            pipeline_options=pipeline_options,
        )
    }
)

Known Limitation: Cross-Page Tables

When batching with page_range, tables that span a batch boundary (e.g., a table that starts on page 100 and continues on page 101) produce separate TableItem objects with no continuation metadata. Docling has no built-in cross-page table merging. The recommended mitigations are:

  1. Use pypdfium2 directly to extract the PDF's bookmarks/outline and split at chapter or section boundaries instead of at arbitrary page counts (see the first sketch below).
  2. Implement LLM-assisted post-processing to detect and merge tables at batch boundaries (a heuristic pre-filter is sketched below as well).
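
A minimal sketch of the outline-based split (item 1), assuming the PDF carries a usable bookmark tree. In pypdfium2 4.x, get_toc() yields outline items with a level, a title, and a zero-based page_index (None for entries without a page destination):

import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("your_large.pdf")
n_pages = len(pdf)

# Zero-based start pages of top-level bookmarks (chapters/sections)
starts = sorted({item.page_index for item in pdf.get_toc()
                 if item.level == 0 and item.page_index is not None})

# 1-based (start, end) tuples, directly usable as convert(page_range=...)
ranges = [(s + 1, starts[i + 1] if i + 1 < len(starts) else n_pages)
          for i, s in enumerate(starts)]

For item 2, a cheap heuristic pre-filter can narrow down which table pairs are worth sending to an LLM. Two assumptions here: page_range preserves the document's original page numbers in prov.page_no, and equal column counts (data.num_cols) are a plausible continuation signal:

def boundary_table_candidates(doc_a, doc_b, last_page_a, first_page_b):
    """Hypothetical helper: pair tables ending on one batch's last page with
    tables starting on the next batch's first page."""
    tails = [t for t in doc_a.tables if t.prov and t.prov[-1].page_no == last_page_a]
    heads = [t for t in doc_b.tables if t.prov and t.prov[0].page_no == first_page_b]
    return [(a, b) for a in tails for b in heads
            if a.data.num_cols == b.data.num_cols]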

Summary Table

Backend                 | Memory         | Table Quality (Math PDFs)
docling-parse (default) | Accumulates ❌ | High ✅
pypdfium2               | Efficient ✅   | Reduced ❌

For math-heavy academic papers, docling-parse + page_range batching is preferred to preserve table extraction quality.