What are the steps to convert a complex payment advice PDF to Excel using Docling?

Here is a step-by-step guide to converting a complex payment advice PDF to Excel using Docling:

1. Set Up Your Environment

Install Python 3.9+, create a virtual environment, and install the required dependencies:

```shell
pip install docling pandas openpyxl
```
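Before going further, it can help to confirm that the three packages are importable from the active virtual environment. The sketch below uses only the standard library; the helper name `missing_packages` is hypothetical and not part of Docling.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["docling", "pandas", "openpyxl"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

If anything is reported missing, re-run the `pip install` command with the virtual environment activated.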

2. Configure the Converter for Complex Tables

Payment advice PDFs often have merged cells and complex headers. Use TableFormer V2 for table structure recognition, and enable OCR only if the PDF is scanned or image-based:

```python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions, EasyOcrOptions, TableStructureV2Options
from docling.datamodel.base_models import InputFormat

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True  # Enable if PDF is scanned/image-based
pipeline_options.ocr_options = EasyOcrOptions(lang=["en"], use_gpu=False)
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options = TableStructureV2Options(do_cell_matching=True)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
```

3. Convert Your PDF

```python
result = converter.convert("payment_advice.pdf")
```
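If you have several payment advice files, one approach is to loop over them and record failures instead of stopping at the first bad PDF. This is a minimal sketch: `convert_all` is a hypothetical helper, and it assumes only that the converter object exposes a `convert(path)` method that raises an exception when a file cannot be processed.

```python
from pathlib import Path


def convert_all(converter, pdf_paths):
    """Convert each PDF, returning (results, failures).

    `results` maps each path to the object returned by `converter.convert`;
    `failures` maps each path that raised to a short error message.
    """
    results, failures = {}, {}
    for path in map(Path, pdf_paths):
        try:
            results[path] = converter.convert(str(path))
        except Exception as exc:  # keep going so one bad file does not stop the batch
            failures[path] = f"{type(exc).__name__}: {exc}"
    return results, failures
```

A call might look like `results, failures = convert_all(converter, Path("advices").glob("*.pdf"))`, where the `advices` directory name is only an illustration.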

4. Extract Tables and Export to Excel

```python
for i, table in enumerate(result.document.tables):
    df = table.export_to_dataframe(doc=result.document)
    df.to_excel(f"payment_table_{i}.xlsx", index=False)
    print(f"Exported table {i} with shape {df.shape}")
```
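The loop above writes one file per table. If a single workbook with one sheet per table is more convenient, something like the following could work. It is a sketch, not a Docling feature: `safe_sheet_name` and `write_workbook` are hypothetical helpers, and `frames` is assumed to be a dict mapping a label to a pandas DataFrame. Excel limits sheet names to 31 characters and rejects the characters `[ ] : * ? / \`, so labels are sanitized first.

```python
import re

# Characters Excel rejects in sheet names.
_INVALID_SHEET_CHARS = re.compile(r"[\[\]:*?/\\]")


def safe_sheet_name(label, index):
    """Build an Excel-safe sheet name; the index prefix keeps names unique."""
    cleaned = _INVALID_SHEET_CHARS.sub("_", label).strip().strip("'") or "table"
    return f"{index:02d}_{cleaned}"[:31].rstrip("'")


def write_workbook(frames, path):
    """Write each DataFrame in the `frames` dict to its own sheet of one .xlsx file."""
    import pandas as pd  # imported here so safe_sheet_name works without pandas

    if not frames:
        raise ValueError("no tables to write")
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        for i, (label, df) in enumerate(frames.items()):
            df.to_excel(writer, sheet_name=safe_sheet_name(label, i), index=False)
```

For example, `write_workbook({f"table_{i}": t.export_to_dataframe(doc=result.document) for i, t in enumerate(result.document.tables)}, "payment_advice.xlsx")` would collect every extracted table into one file.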

5. (Optional) Save the Full Document as JSON

The JSON export is lossless, so you can reload the document and re-export tables later without re-converting the PDF:

```python
result.document.save_as_json("payment_advice.json")
```

6. Validate and Refine

Open the exported Excel files and check column alignment, merged cells, and missing data.
If results are poor, try these tweaks:
- Set `do_cell_matching=False` if merged PDF cells break the column structure.
- For scanned PDFs, keep EasyOCR as the OCR engine; it can be more accurate than RapidOCR on financial and numeric tables.
- For very complex tables with hierarchies, consider using `GraniteVisionTableStructureOptions` (requires a GPU and `pip install "docling[vlm]"`).
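Part of the missing-data check can be automated before opening Excel. The following is a minimal sketch with a hypothetical helper, `find_blank_cells`, that works on plain rows (for a DataFrame, `df.values.tolist()` gives them) and reports the position of every empty cell.

```python
import math


def find_blank_cells(rows):
    """Return (row_index, column_index) for every empty cell in `rows`.

    A cell counts as empty if it is None, NaN, or a string of only whitespace.
    """
    blanks = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            is_nan = isinstance(value, float) and math.isnan(value)
            is_empty_text = isinstance(value, str) and not value.strip()
            if value is None or is_nan or is_empty_text:
                blanks.append((r, c))
    return blanks
```

A cluster of blanks in one column often points to a misaligned or merged column, which is a cue to try the `do_cell_matching` tweak above.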

Summary Checklist

| # | Task | Notes |
|---|------|-------|
| 1 | Install Python + dependencies | `pip install docling pandas openpyxl` |
| 2 | Configure pipeline options | Enable OCR + TableFormer V2 |
| 3 | Convert PDF | `converter.convert("file.pdf")` |
| 4 | Export tables to Excel | `table.export_to_dataframe()` → `.to_excel()` |
| 5 | Save JSON (optional) | For future re-exports without reconverting |
| 6 | Validate & tune | Adjust OCR/table options if needed |