Here is a step-by-step guide to converting a complex payment advice PDF to Excel using Docling:
1. Set Up Your Environment
Install Python 3.9+, create a virtual environment, and install the required dependencies:
pip install docling pandas openpyxl
2. Configure the Converter for Complex Tables
Payment advice PDFs often have merged cells and complex headers. Use TableFormer V2 and enable OCR if the PDF is scanned:
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions, EasyOcrOptions, TableStructureV2Options
from docling.datamodel.base_models import InputFormat
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True # Enable if PDF is scanned/image-based
pipeline_options.ocr_options = EasyOcrOptions(lang=["en"], use_gpu=False)
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options = TableStructureV2Options(do_cell_matching=True)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
3. Convert Your PDF
result = converter.convert("payment_advice.pdf")
4. Extract Tables and Export to Excel
for i, table in enumerate(result.document.tables):
    df = table.export_to_dataframe(doc=result.document)
    df.to_excel(f"payment_table_{i}.xlsx", index=False)
    print(f"Exported table {i} with shape {df.shape}")
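If a single workbook is easier to review than one file per table, the DataFrames can be written to separate sheets instead. The following is a minimal sketch that assumes the tables have already been collected into a list of DataFrames. The helper name `write_tables_to_workbook` and the `table_{i}` sheet names are illustrative and not part of Docling.

```python
from pathlib import Path
from typing import Sequence, Union

import pandas as pd


def write_tables_to_workbook(
    frames: Sequence[pd.DataFrame], path: Union[str, Path]
) -> list:
    """Write each DataFrame to its own sheet and return the sheet names."""
    if not frames:
        # A workbook needs at least one sheet, so fail early with a clear message.
        raise ValueError("no tables to write")

    sheet_names = []
    # openpyxl is the engine pandas uses for .xlsx output.
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        for i, df in enumerate(frames):
            name = f"table_{i}"  # hypothetical naming scheme
            df.to_excel(writer, sheet_name=name, index=False)
            sheet_names.append(name)
    return sheet_names


# Possible usage with the conversion result from step 3:
# frames = [t.export_to_dataframe(doc=result.document) for t in result.document.tables]
# write_tables_to_workbook(frames, "payment_tables.xlsx")
```

Keeping one sheet per table preserves the order in which the tables were extracted, which helps when comparing the workbook against the PDF page by page.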
5. (Optional) Save the Full Document as JSON
The JSON export is lossless, so you can reload it and re-export tables later without re-converting the PDF:
result.document.save_as_json("payment_advice.json")
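Reloading might look like the sketch below. It assumes your installed version exposes `DoclingDocument.load_from_json` in `docling_core.types.doc`, so check that against your environment. The helper name `reexport_tables` and its `out_dir` parameter are hypothetical.

```python
from pathlib import Path


def reexport_tables(json_path, out_dir="."):
    """Reload a saved Docling JSON file and write each table to its own .xlsx file."""
    # Imported inside the function so the module loads even where docling-core
    # is not installed; assumption: load_from_json exists in your version.
    from docling_core.types.doc import DoclingDocument

    doc = DoclingDocument.load_from_json(json_path)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for i, table in enumerate(doc.tables):
        df = table.export_to_dataframe(doc=doc)
        target = out_dir / f"payment_table_{i}.xlsx"
        df.to_excel(target, index=False)
        written.append(target)
    return written


# Possible usage:
# reexport_tables("payment_advice.json", out_dir="reexported")
```

No OCR or table-structure model runs on this path, so it is a cheap way to try different export settings on an already converted document.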
6. Validate and Refine
- Open the exported Excel files and check for accuracy (column alignment, merged cells, missing data).
- If results are poor, try these tweaks:
  - Set do_cell_matching=False if merged PDF cells break the column structure.
  - For scanned PDFs, use EasyOCR, which may handle financial/numeric tables more accurately than RapidOCR.
  - For very complex tables with hierarchies, consider using GraniteVisionTableStructureOptions (requires a GPU and pip install docling[vlm]).
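Part of the manual check can be automated before opening Excel. The sketch below flags the symptoms that misaligned columns or broken merged cells tend to leave in a DataFrame: duplicated headers, entirely empty columns, and a high share of blank cells. The function name `table_health` and the report keys are illustrative, and what counts as "too many" blank cells is left for you to decide.

```python
import pandas as pd


def table_health(df: pd.DataFrame) -> dict:
    """Summarise common extraction problems in one exported table."""
    empty_columns = []
    missing_cells = 0

    # Walk columns by position so duplicated header names do not cause ambiguity.
    for pos in range(df.shape[1]):
        col = df.iloc[:, pos]
        # Treat NaN/None and whitespace-only strings as missing.
        is_missing = col.isna() | col.astype(str).str.strip().eq("")
        missing_cells += int(is_missing.sum())
        if len(col) > 0 and is_missing.all():
            empty_columns.append(str(df.columns[pos]))

    duplicated = df.columns[df.columns.duplicated()].unique()
    total_cells = df.shape[0] * df.shape[1]

    return {
        "shape": df.shape,
        "duplicate_headers": [str(name) for name in duplicated],
        "empty_columns": empty_columns,
        "missing_ratio": missing_cells / total_cells if total_cells else 0.0,
    }


# Possible usage inside the export loop from step 4:
# report = table_health(df)
# if report["empty_columns"] or report["duplicate_headers"]:
#     print(f"Table {i} may need different table options: {report}")
```

A table that reports empty columns or duplicated headers is a reasonable candidate for re-running with the tweaks listed above.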
Summary Checklist
| # | Task | Notes |
|---|---|---|
| 1 | Install Python + dependencies | pip install docling pandas openpyxl |
| 2 | Configure pipeline options | Enable OCR + TableFormer V2 |
| 3 | Convert PDF | converter.convert("file.pdf") |
| 4 | Export tables to Excel | table.export_to_dataframe() → .to_excel() |
| 5 | Save JSON (optional) | For future re-exports without reconverting |
| 6 | Validate & tune | Adjust OCR/table options if needed |