## Cause of `std::bad_alloc` in Docling
The `std::bad_alloc` errors occur because Docling's `StandardPdfPipeline` uses the docling-parse C++ backend, which loads page images into memory for layout detection, OCR, and table structure analysis. When processing large PDFs (700+ pages with images and math formulas), the C++ backend accumulates memory internally and eventually exhausts available RAM, even on machines with 32 GB.
This issue is specific to PDFs and image files. DOCX, XLSX, and PPTX use `SimplePipeline` (direct XML parsing) and are not affected.
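Before adopting a workaround, it can help to confirm the accumulation pattern by logging the process's peak resident memory around each conversion call. The helper below is an illustrative sketch using only the standard library (not part of Docling's API); note that on Linux `ru_maxrss` is reported in KiB.

```python
import resource


def peak_rss_mib():
    """Peak resident set size of this process, in MiB (Linux reports KiB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def run_and_report(label, fn, *args, **kwargs):
    """Run fn and print the process-wide peak RSS afterwards.

    Wrap each converter.convert(...) call with this to see whether
    peak memory keeps climbing across successive conversions.
    """
    result = fn(*args, **kwargs)
    print(f"{label}: peak RSS {peak_rss_mib():.0f} MiB")
    return result
```

If the reported peak grows monotonically across batches, memory is being retained by the backend rather than released between conversions.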
## Recommended Workarounds
### 1. Process in page range batches
```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    generate_page_images=False,
    generate_picture_images=False,
    ocr_batch_size=1,
    layout_batch_size=1,
    table_batch_size=1,
)

# Process 50 pages at a time, creating a fresh converter per batch
for start in range(1, 701, 50):
    end = min(start + 49, 700)
    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )
    result = converter.convert("your_large.pdf", page_range=(start, end))
    # process result...
    del converter  # help garbage collection
```
Note: Create a new `DocumentConverter` instance per batch, as the backend may retain caches between calls.
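Since `del converter` only drops the Python reference and cannot force the C++ backend to release its allocations, a stricter variant is to run each batch in a short-lived worker process: when the process exits, the OS reclaims everything it allocated. The sketch below is illustrative, not Docling API; the `WORKER` script is a stub where the real `DocumentConverter(...).convert(...)` call would go.

```python
import subprocess
import sys

# Hypothetical worker script. In real use it would build a
# DocumentConverter and convert only pages [start, end] of the PDF,
# then exit so the OS reclaims all backend memory.
WORKER = r"""
import sys
start, end = int(sys.argv[1]), int(sys.argv[2])
# ... DocumentConverter(...).convert(pdf_path, page_range=(start, end)) ...
print(f"done {start}-{end}")
"""


def convert_in_subprocesses(total_pages, batch_size=50):
    """Run each page-range batch in its own short-lived Python process."""
    outputs = []
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        proc = subprocess.run(
            [sys.executable, "-c", WORKER, str(start), str(end)],
            capture_output=True, text=True, check=True,
        )
        outputs.append(proc.stdout.strip())
    return outputs
```

Each worker would need to serialize its output (e.g. write Markdown or JSON to disk) for the parent process to collect, since in-memory results die with the subprocess.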
### 2. Switch to the pypdfium2 backend
This resolves the memory accumulation but trades off table quality on complex or math-heavy PDFs (cell extraction becomes more fragmented):
```python
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            backend=PyPdfiumDocumentBackend,
            pipeline_options=PdfPipelineOptions(
                do_table_structure=True,
                do_cell_matching=False,  # recommended for pypdfium2
            ),
        )
    }
)
```
## Known Limitation: Cross-Page Tables
When batching with `page_range`, a table that spans a batch boundary (e.g., one that starts on page 100 and continues on page 101) produces separate `TableItem` objects with no continuation metadata. Docling has no built-in cross-page table merging. Recommended mitigations:
- Use `pypdfium2` externally to extract PDF bookmarks/outlines and split at chapter/section boundaries instead of arbitrary page counts.
- Implement LLM-assisted post-processing to detect and merge tables at batch boundaries.
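A cheap heuristic can handle many cases before reaching for an LLM: if the last table of one batch ends on that batch's final page and the first table of the next batch starts on the following page with the same column count, treat them as one table. The sketch below uses plain dicts as stand-ins for Docling's `TableItem` objects; the `page` and `rows` fields are illustrative, not Docling's actual schema.

```python
def merge_boundary_tables(prev_batch, next_batch, boundary_page):
    """Merge a table that was split across a batch boundary.

    prev_batch / next_batch: tables in reading order, each a dict
    {"page": int, "rows": [[cell, ...], ...]}.
    boundary_page: last page number of the earlier batch.
    Returns the two lists with the candidate pair joined into one table.
    """
    if not prev_batch or not next_batch:
        return prev_batch, next_batch
    tail, head = prev_batch[-1], next_batch[0]
    # Only merge when both fragments have rows of the same width.
    same_width = (
        tail["rows"] and head["rows"]
        and len(tail["rows"][0]) == len(head["rows"][0])
    )
    if tail["page"] == boundary_page and head["page"] == boundary_page + 1 and same_width:
        merged = {"page": tail["page"], "rows": tail["rows"] + head["rows"]}
        return prev_batch[:-1] + [merged], next_batch[1:]
    return prev_batch, next_batch
```

Column count alone will produce false positives (two unrelated tables of equal width on adjacent pages), which is where an LLM-assisted check of header text and cell content earns its keep.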
## Summary Table
| Backend | Memory | Table Quality (Math PDFs) |
|---|---|---|
| `docling-parse` (default) | Accumulates ❌ | High ✅ |
| `pypdfium2` | Efficient ✅ | Reduced ❌ |
For math-heavy academic papers, `docling-parse` with `page_range` batching is preferred, since it preserves table extraction quality.