Why are bounding boxes not available for items (such as images, tables, text) when using VlmPipeline with certain VLM presets (like Qwen/Markdown)?

Bounding boxes are not available for items when using VlmPipeline with Markdown-based VLM presets (such as Qwen and LightOnOCR-2-1B) because these models output only plain Markdown text, which lacks spatial or structural tags. The Markdown parser in Docling only creates picture elements from explicit Markdown image syntax (![alt](url)), which VLMs typically do not generate. As a result, no PictureItem objects (or bounding boxes) are created for images, and similarly, tables and text do not receive bounding box information.

In contrast, structured formats like DocTags and DOCLANG carry spatial information as bounding box coordinates. DocTags-based presets (like granite_docling or smoldocling) output explicit location tags (e.g., <picture><loc_104><loc_85>...</picture>), while DOCLANG outputs structured XML with provenance data. In both cases, Docling can read the coordinates from the model output and assign bounding boxes to items. StandardPdfPipeline also provides bounding boxes by analyzing the PDF layout directly.
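To make the difference concrete, here is a minimal sketch of why tagged output yields boxes and plain Markdown does not. It is not Docling's actual parser. The helper `extract_boxes` is hypothetical, the third and fourth coordinates in the sample string are invented for illustration (the example above elides them), and the (x0, y0, x1, y1) ordering is an assumption.

```python
import re

# Matches a single location token such as <loc_104>.
LOC_TOKEN = re.compile(r"<loc_(\d+)>")


def extract_boxes(model_output: str, tag: str = "picture"):
    """Collect the first four location tokens inside each <tag>...</tag> element.

    Hypothetical helper: returns a list of 4-tuples, assumed here to mean
    (x0, y0, x1, y1). Elements with fewer than four tokens are skipped.
    """
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    boxes = []
    for body in re.findall(pattern, model_output, flags=re.DOTALL):
        coords = [int(value) for value in LOC_TOKEN.findall(body)]
        if len(coords) >= 4:
            boxes.append(tuple(coords[:4]))
    return boxes


# Tagged output: coordinates are present, so a box can be recovered.
# (The last two values are made up for this example.)
doctags_like = "<picture><loc_104><loc_85><loc_400><loc_300></picture>"

# Plain Markdown output: no location tokens exist, so there is nothing to recover.
markdown_like = "# Title\n\nSome paragraph text describing a figure.\n"

print(extract_boxes(doctags_like))   # [(104, 85, 400, 300)]
print(extract_boxes(markdown_like))  # []
```

The point is that the missing boxes are a property of the model output itself: no post-processing of plain Markdown text can recover coordinates that were never emitted.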

Summary:

Markdown VLMs (e.g., Qwen, LightOnOCR-2-1B): Only output text, no spatial tags → no bounding boxes for any items.
DocTags VLMs (e.g., granite_docling, smoldocling): Output spatial tags → bounding boxes are available.
DOCLANG VLMs: Output structured XML with provenance data → bounding boxes are available.
StandardPdfPipeline: Uses layout analysis → bounding boxes are available.
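A quick way to see which case applies to a converted document is to count the items that carry a bounding box. The sketch below uses duck typing and stand-in objects so it runs without Docling installed. It assumes each item exposes a `prov` list whose entries have a `bbox` attribute, and `count_items_with_bbox` is a hypothetical helper, not part of Docling.

```python
from types import SimpleNamespace


def count_items_with_bbox(items):
    """Return (with_box, without_box) for an iterable of document items.

    Assumption: each item has a `prov` list, and each entry may carry a
    `bbox`. Items with no provenance count as having no bounding box.
    """
    with_box = 0
    without_box = 0
    for item in items:
        prov = getattr(item, "prov", None) or []
        if any(getattr(entry, "bbox", None) is not None for entry in prov):
            with_box += 1
        else:
            without_box += 1
    return with_box, without_box


# Stand-ins for illustration only: one item with provenance, one without.
located = SimpleNamespace(prov=[SimpleNamespace(page_no=1, bbox=(104, 85, 400, 300))])
unlocated = SimpleNamespace(prov=[])

print(count_items_with_bbox([located, unlocated]))  # (1, 1)
```

With a real conversion result, the items could be taken from the document's item iterator, for example `(item for item, _level in doc.iterate_items())`, assuming that method yields (item, level) pairs in your installed version. A Markdown-based preset would be expected to report zero items with boxes.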

Key insight: Only VLMs trained to output structured formats with spatial information (DocTags or DOCLANG) or pipelines that analyze document layout (StandardPdfPipeline) can provide bounding boxes for items.