What are the steps to convert a complex payment advice PDF to Excel using Docling?
Type: Answer
Status: Published
Created: May 21, 2026
Updated: May 21, 2026
Created by: Dosu Bot
Updated by: Dosu Bot

Here is a step-by-step guide to converting a complex payment advice PDF to Excel using Docling:

1. Set Up Your Environment

Install Python 3.9+, create a virtual environment, and install the required dependencies:

pip install docling pandas openpyxl
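
Before moving on, it can help to confirm that the three packages are importable from the active virtual environment. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of Docling.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # These are the import names of the packages installed in this step.
    missing = missing_packages(["docling", "pandas", "openpyxl"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

If anything is reported missing, the most likely cause is that the virtual environment was not activated before running pip.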

2. Configure the Converter for Complex Tables

Payment advice PDFs often have merged cells and multi-row headers. Use TableFormer V2 for table structure, and enable OCR only if the PDF is scanned:

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions, EasyOcrOptions, TableStructureV2Options
from docling.datamodel.base_models import InputFormat

pipeline_options = PdfPipelineOptions()
# OCR is only needed for scanned/image-based PDFs; set to False for digital PDFs
pipeline_options.do_ocr = True
pipeline_options.ocr_options = EasyOcrOptions(lang=["en"], use_gpu=False)
# Detect table structure and match predicted cells back to the PDF text cells
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options = TableStructureV2Options(do_cell_matching=True)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

3. Convert Your PDF

result = converter.convert("payment_advice.pdf")
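
Conversion of a difficult PDF can fail or succeed only partially, so it may be worth checking the result status before exporting anything. The sketch below assumes that `convert` accepts `raises_on_error=False` and that `result.status` is a string-valued enum with values such as `"success"` and `"partial_success"`; verify both against your installed Docling version. The helper name `convert_checked` is hypothetical.

```python
def convert_checked(converter, source):
    """Convert `source` and return the result, or None if conversion did not succeed.

    `converter` is expected to be a docling DocumentConverter (assumption:
    raises_on_error=False returns a result with a failure status instead of raising).
    """
    result = converter.convert(source, raises_on_error=False)
    # Compare on the enum's string value so this helper needs no docling import.
    status = getattr(result.status, "value", result.status)
    if status not in ("success", "partial_success"):
        print(f"Conversion of {source} ended with status: {status}")
        return None
    if status == "partial_success":
        print(f"Warning: {source} converted only partially; check the tables.")
    return result
```

With this in place, `result = convert_checked(converter, "payment_advice.pdf")` can be followed by an `if result is None:` guard before the export step.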

4. Extract Tables and Export to Excel

for i, table in enumerate(result.document.tables):
    df = table.export_to_dataframe(doc=result.document)
    df.to_excel(f"payment_table_{i}.xlsx", index=False)
    print(f"Exported table {i} with shape {df.shape}")
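
If one workbook with a sheet per table is more convenient than one file per table, the loop above can be adapted along these lines. This is a sketch, not the only way to do it: `safe_sheet_name` and `export_tables_to_workbook` are hypothetical helpers, and the sheet naming scheme is an assumption. The one hard constraint is Excel's own: sheet names may not exceed 31 characters or contain `\ / ? * [ ] :`.

```python
import re

# Characters Excel rejects in sheet names: \ / ? * [ ] :
_FORBIDDEN = re.compile(r"[\\/?*\[\]:]")


def safe_sheet_name(label, index):
    """Build a sheet name from a label and table index that fits Excel's rules."""
    cleaned = _FORBIDDEN.sub("_", label).strip() or "table"
    suffix = f"_{index}"
    # Truncate the label so that label + suffix stays within 31 characters.
    return cleaned[: 31 - len(suffix)] + suffix


def export_tables_to_workbook(document, path, label="table"):
    """Write every table of a DoclingDocument to one workbook; return the sheet count."""
    import pandas as pd  # lazy import so safe_sheet_name works without pandas

    tables = list(document.tables)
    if not tables:
        return 0  # openpyxl cannot save a workbook that has no sheets
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        for i, table in enumerate(tables):
            df = table.export_to_dataframe(doc=document)
            df.to_excel(writer, sheet_name=safe_sheet_name(label, i), index=False)
    return len(tables)
```

For example, `export_tables_to_workbook(result.document, "payment_advice.xlsx", label="payment")` would produce sheets named `payment_0`, `payment_1`, and so on.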

5. (Optional) Save the Full Document as JSON

JSON is lossless, so you can reload the document and re-export tables later without re-converting the PDF:

result.document.save_as_json("payment_advice.json")
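
Re-exporting from the saved JSON might look like the following sketch. It assumes `DoclingDocument.load_from_json` from the `docling_core` package (installed as a dependency of Docling) is available in your version; `table_output_path` and `reexport_from_json` are hypothetical helper names, and the output file naming is an arbitrary choice.

```python
from pathlib import Path


def table_output_path(json_path, index, out_dir="."):
    """Derive an .xlsx file name from the saved JSON name and the table index."""
    stem = Path(json_path).stem
    return Path(out_dir) / f"{stem}_table_{index}.xlsx"


def reexport_from_json(json_path, out_dir="."):
    """Reload a saved DoclingDocument and export each of its tables to Excel."""
    # Lazy import (assumption: this classmethod exists in your docling_core version).
    from docling_core.types.doc import DoclingDocument

    doc = DoclingDocument.load_from_json(json_path)
    written = []
    for i, table in enumerate(doc.tables):
        target = table_output_path(json_path, i, out_dir)
        table.export_to_dataframe(doc=doc).to_excel(target, index=False)
        written.append(target)
    return written
```

Because no PDF parsing or OCR runs here, this path is much faster than a fresh conversion when only the export logic changes.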

6. Validate and Refine

  • Open the exported Excel files and check for accuracy (column alignment, merged cells, missing data).
  • If results are poor, try these tweaks:
    • Set do_cell_matching=False if merged PDF cells break column structure.
    • For scanned PDFs, use EasyOCR (more accurate for financial/numeric tables than RapidOCR).
    • For very complex tables with hierarchies, consider using GraniteVisionTableStructureOptions (requires GPU and pip install docling[vlm]).
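
Part of the accuracy check can be automated. The sketch below flags cells in an amount column that do not parse as numbers, which is a common symptom of OCR errors or misaligned columns. It assumes amounts use a comma as the thousands separator and a period as the decimal point, with parentheses for negatives; adjust for other locales. The helper names are hypothetical.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Parse a cell such as '1,234.50' or '(250.00)' into a Decimal, else None."""
    cleaned = str(text).strip().replace(",", "")
    # Accounting style: a value in parentheses is negative.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    if not value.is_finite():
        return None  # rejects 'nan' and 'inf', e.g. from empty pandas cells
    return -value if negative else value


def unparsable_cells(rows, amount_col):
    """Return (row_index, raw_value) for cells in amount_col that are not numbers."""
    return [
        (i, row[amount_col])
        for i, row in enumerate(rows)
        if parse_amount(row[amount_col]) is None
    ]
```

With a DataFrame from step 4, `unparsable_cells(df.values.tolist(), amount_col=2)` would list the rows to inspect by hand, where `2` stands in for whichever column position holds the amounts in your table.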

Summary Checklist

| # | Task | Notes |
|---|------|-------|
| 1 | Install Python + dependencies | pip install docling pandas openpyxl |
| 2 | Configure pipeline options | Enable OCR + TableFormer V2 |
| 3 | Convert PDF | converter.convert("file.pdf") |
| 4 | Export tables to Excel | table.export_to_dataframe().to_excel() |
| 5 | Save JSON (optional) | For future re-exports without reconverting |
| 6 | Validate & tune | Adjust OCR/table options if needed |