How does Docling reconstruct reading order in an unsupervised manner, without calling a large language model (LLM)?

Docling reconstructs reading order using a purely rule-based spatial algorithm implemented in the ReadingOrderPredictor class from docling-ibm-models. No ML inference is used for the ordering step; it relies only on geometric relationships between the bounding boxes detected by the upstream layout model.

How It Works

1. Element Segregation

Elements are split into page headers, body, and footers so headers always precede body content and footers come last.
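As a rough illustration of this step, the sketch below partitions elements by label and concatenates the three groups. The Element class, the segregate helper, and the label strings "page_header" and "page_footer" are assumptions made for this example, not the library's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical stand-in for a layout element (not a Docling type)."""
    cid: int
    label: str  # assumed label strings, e.g. "page_header", "text", "page_footer"


def segregate(elements):
    """Split elements into (headers, body, footers), preserving input order."""
    headers = [e for e in elements if e.label == "page_header"]
    footers = [e for e in elements if e.label == "page_footer"]
    body = [e for e in elements if e.label not in ("page_header", "page_footer")]
    return headers, body, footers


elements = [
    Element(0, "page_footer"),
    Element(1, "text"),
    Element(2, "page_header"),
    Element(3, "table"),
]
headers, body, footers = segregate(elements)

# Headers first, then body, then footers.
ordered_ids = [e.cid for e in headers + body + footers]
print(ordered_ids)  # [2, 1, 3, 0]
```

In this sketch the body group keeps its input order; the later steps are what actually order the body elements.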

2. Spatial Graph Construction

The algorithm builds directed adjacency maps (up/down predecessor-successor relationships), using R-tree spatial indexing for efficient lookups. For each element, it queries the elements located above it and checks two geometric predicates:

is_strictly_above()
overlaps_horizontally()

It also detects when intermediate elements interrupt the path between two vertically-aligned elements, preventing incorrect links.
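The graph construction can be sketched as follows. This is a minimal illustration, not the library's code: the predicate names come from the text above, but their bodies, the Box class, the build_links helper, and the top-left coordinate origin are all assumptions. A brute-force scan over all pairs stands in for the R-tree lookup to keep the example dependency-free.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box; ASSUMES a top-left origin, so t < b."""
    l: float
    t: float
    r: float
    b: float


def is_strictly_above(a, b):
    """True if a ends at or before the top edge of b (assumed semantics)."""
    return a.b <= b.t


def overlaps_horizontally(a, b):
    """True if the x-extents of a and b intersect (assumed semantics)."""
    return a.l < b.r and b.l < a.r


def build_links(boxes):
    """Return (up, down) adjacency maps keyed by element index."""
    n = len(boxes)
    up = {i: [] for i in range(n)}
    down = {i: [] for i in range(n)}
    for j in range(n):
        for i in range(n):
            if i == j:
                continue
            if not (is_strictly_above(boxes[i], boxes[j])
                    and overlaps_horizontally(boxes[i], boxes[j])):
                continue
            # Drop the link i -> j if some element k sits between them
            # and overlaps both horizontally.
            interrupted = any(
                k not in (i, j)
                and is_strictly_above(boxes[i], boxes[k])
                and is_strictly_above(boxes[k], boxes[j])
                and overlaps_horizontally(boxes[k], boxes[i])
                and overlaps_horizontally(boxes[k], boxes[j])
                for k in range(n)
            )
            if not interrupted:
                up[j].append(i)
                down[i].append(j)
    return up, down


# Three stacked boxes in one column plus one box in a second column.
boxes = [Box(0, 0, 10, 10), Box(0, 20, 10, 30), Box(0, 40, 10, 50), Box(20, 0, 30, 10)]
up, down = build_links(boxes)
print(up)    # {0: [], 1: [0], 2: [1], 3: []}
print(down)  # {0: [1], 1: [2], 2: [], 3: []}
```

Note that element 0 is not linked directly to element 2, because element 1 interrupts the path between them.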

3. Horizontal Dilation for Multi-Column Handling

Bounding boxes are expanded horizontally (by up to 15% of the page width) so that columns are detected implicitly through proximity patterns, without an explicit column detection model. An expansion is applied only if it does not create overlaps with neighboring elements.
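One way to picture the dilation rule is the sketch below. It makes several simplifying assumptions that the text does not confirm: the padding is applied in full on each side, the result is clamped to the page, and a box is either widened completely or left untouched. Box, intersects, and dilate_horizontally are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Box:
    """Axis-aligned box; ASSUMES a top-left origin, so t < b."""
    l: float
    t: float
    r: float
    b: float


def intersects(a, b):
    """True if the two boxes overlap in both x and y."""
    return a.l < b.r and b.l < a.r and a.t < b.b and b.t < a.b


def dilate_horizontally(boxes, page_width, fraction=0.15):
    """Widen each box by fraction * page_width per side unless that collides."""
    pad = fraction * page_width
    result = []
    for i, box in enumerate(boxes):
        wide = replace(box, l=max(0.0, box.l - pad), r=min(page_width, box.r + pad))
        collides = any(
            intersects(wide, other) for j, other in enumerate(boxes) if j != i
        )
        # Keep the original box when widening would run into a neighbor.
        result.append(box if collides else wide)
    return result


boxes = [
    Box(10, 0, 40, 10),   # left-column line
    Box(50, 0, 90, 10),   # right-column line on the same row
    Box(12, 20, 20, 30),  # short line with no neighbor on its row
]
dilated = dilate_horizontally(boxes, page_width=100)
```

Here the two boxes sharing a row block each other and stay unchanged, while the short line is widened, so it overlaps horizontally with more of the column above it.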

4. Head Finding and DFS Traversal

Elements with no predecessors ("heads") are identified and sorted spatially:

Top-to-bottom if they overlap horizontally
Left-to-right otherwise

A non-recursive depth-first search (DFS) starting from these heads produces the final reading order sequence.
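The head-finding and traversal steps can be sketched as below. This is an illustrative version under stated assumptions, not the library's implementation: boxes are (l, t, r, b) tuples with a top-left origin, the up/down maps are hard-coded, and the rule that a node is emitted only once all of its predecessors have been emitted is a choice made for this example. All function names are hypothetical.

```python
from functools import cmp_to_key


def overlaps_horizontally(a, b):
    """Boxes are (l, t, r, b) tuples; True if their x-extents intersect."""
    return a[0] < b[2] and b[0] < a[2]


def find_heads(up):
    """Elements with no predecessors."""
    return [node for node, preds in up.items() if not preds]


def sort_heads(heads, boxes):
    """Top-to-bottom if two heads overlap horizontally, else left-to-right."""
    def compare(i, j):
        a, b = boxes[i], boxes[j]
        if overlaps_horizontally(a, b):
            return (a[1] > b[1]) - (a[1] < b[1])  # compare top edges
        return (a[0] > b[0]) - (a[0] < b[0])      # compare left edges
    return sorted(heads, key=cmp_to_key(compare))


def reading_order(heads, up, down):
    """Iterative DFS using an explicit stack instead of recursion."""
    visited, order = set(), []
    for head in heads:
        stack = [head]
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            # Defer a node until every predecessor has been emitted; it is
            # pushed again when its last predecessor is visited.
            if any(p not in visited for p in up[node]):
                continue
            visited.add(node)
            order.append(node)
            stack.extend(reversed(down[node]))
    return order


# Two columns (0 over 1 on the left, 2 over 3 on the right) and a
# full-width element 4 below both.
boxes = {
    0: (0, 0, 40, 10), 1: (0, 20, 40, 30),
    2: (60, 0, 100, 10), 3: (60, 20, 100, 30),
    4: (0, 40, 100, 50),
}
up = {0: [], 1: [0], 2: [], 3: [2], 4: [1, 3]}
down = {0: [1], 1: [4], 2: [3], 3: [4], 4: []}

heads = sort_heads(find_heads(up), boxes)
order = reading_order(heads, up, down)
print(order)  # [0, 1, 2, 3, 4]
```

The left column is read in full before the right column, and the full-width element is emitted only after both columns.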

After establishing reading order, the system also runs relationship detection for captions, footnotes, and mergeable text elements.

Key Files

Core algorithm: docling_ibm_models/reading_order/reading_order_rb.py
Pipeline integration: docling/models/stages/reading_order/readingorder_model.py
Spatial primitives: docling_core/types/doc/base.py

Why It Works Without LLMs

Spatial positioning encodes reading order for most documents (top-to-bottom, left-to-right conventions), and the ordering problem reduces to a topological sort over a directed graph, which a DFS solves efficiently and deterministically.
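The topological-sort view can be checked with Python's standard library. In the sketch below, the up map is a made-up example of predecessor links, and respects_predecessors is a hypothetical helper; graphlib.TopologicalSorter is used only as an independent reference, not because Docling uses it.

```python
from graphlib import TopologicalSorter

# Hypothetical predecessor map: element -> elements that must come before it.
up = {0: [], 1: [0], 2: [], 3: [2], 4: [1, 3]}


def respects_predecessors(order, up):
    """True if every element appears after all of its predecessors."""
    position = {node: index for index, node in enumerate(order)}
    return all(
        position[pred] < position[node]
        for node, preds in up.items()
        for pred in preds
    )


# A candidate reading order, e.g. one produced by a DFS over the graph.
candidate = [0, 1, 2, 3, 4]

# TopologicalSorter takes a mapping of node -> predecessors.
reference = list(TopologicalSorter(up).static_order())

print(respects_predecessors(candidate, up))  # True
print(respects_predecessors(reference, up))  # True
```

Several valid orders usually exist for the same graph; the spatial sorting of heads and successors is what picks one of them consistently.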

Known Limitations

The 15% dilation threshold is fixed and not user-configurable, which can cause issues with complex multi-column layouts.
Reading order quality depends heavily on the accuracy of upstream layout detection.