Does Docling automatically detect the language of a document when sending images to Tesseract OCR, and how can it be configured?

Type: Answer
Status: Published
Created: Apr 15, 2026 by Dosu Bot
Updated: Apr 15, 2026 by Dosu Bot

By default, Docling does not automatically detect the language. It uses a fixed list of languages passed to Tesseract, with the default being ["fra", "deu", "spa", "eng"] (French, German, Spanish, English).

However, optional auto-detection is available: setting lang=["auto"] causes Docling to run Tesseract's OSD (Orientation and Script Detection), which detects the document's script; Docling then maps that script to a Tesseract language via the map_tesseract_script() function. Note that OSD identifies the script (e.g., Latin or Cyrillic), not the specific language. No external libraries (e.g., langdetect, fasttext) are used; detection depends solely on Tesseract.
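
To make the script-to-language step more concrete, here is a minimal sketch of reading the script name out of Tesseract's OSD text output (as printed when running with --psm 0) and turning it into a language string. The helper names parse_osd_script and script_to_lang, the alias table, and the "script/" prefix are assumptions for illustration only; they are not Docling's actual implementation, and map_tesseract_script() may behave differently.

```python
import re
from typing import Optional

# Sample text in the format Tesseract prints for OSD (page segmentation mode 0).
SAMPLE_OSD_OUTPUT = """\
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 2.69
Script: Cyrillic
Script confidence: 1.11
"""


def parse_osd_script(osd_text: str) -> Optional[str]:
    """Return the script name from OSD output, or None if no script line exists."""
    # "Script confidence:" does not match because the colon must follow "Script".
    match = re.search(r"^Script:\s*(\S+)\s*$", osd_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def script_to_lang(script: str) -> str:
    """Hypothetical mapping from an OSD script name to a Tesseract script model."""
    # Illustrative aliases only (assumption); not taken from Docling's source.
    aliases = {"Han": "HanS", "Katakana": "Japanese", "Hiragana": "Japanese"}
    return "script/" + aliases.get(script, script)


if __name__ == "__main__":
    script = parse_osd_script(SAMPLE_OSD_OUTPUT)
    if script is not None:
        print(script, "->", script_to_lang(script))
```

Because the detected value is a script, a Latin-script page would resolve to a generic Latin model in this sketch rather than to a specific language such as Czech; setting the language list manually remains the more precise option when the language is known.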

Example – enabling auto-detection:

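from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions

# "auto" asks Docling to pick the language from Tesseract's OSD result.
# force_full_page_ocr is set on the OCR options object.
ocr_options = TesseractCliOcrOptions(lang=["auto"], force_full_page_ocr=True)
pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=ocr_options,
)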

Example – manual language setting (e.g., Czech):

ocr_options = TesseractCliOcrOptions(lang=["ces", "eng"])

Important notes:

For auto-detection, the osd.traineddata file must be installed, along with .traineddata files for any detected languages.
The TESSDATA_PREFIX environment variable must be set before starting Python.
The default docling-serve Docker image only includes eng and osd — a custom image is required for additional languages.
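
The prerequisites above can be checked before running a conversion. The following is a minimal sketch, assuming TESSDATA_PREFIX points directly at the directory that holds the .traineddata files (the convention in recent Tesseract versions; older versions expected the parent directory). The function name missing_traineddata is hypothetical and not part of Docling or Tesseract.

```python
import os
from pathlib import Path
from typing import Iterable, List, Optional


def missing_traineddata(
    langs: Iterable[str], tessdata_dir: Optional[str] = None
) -> List[str]:
    """Return the language codes whose <code>.traineddata file is not found."""
    # Fall back to TESSDATA_PREFIX when no directory is passed explicitly.
    directory = tessdata_dir or os.environ.get("TESSDATA_PREFIX")
    if not directory:
        raise RuntimeError("TESSDATA_PREFIX is not set and no directory was given")
    base = Path(directory)
    return [lang for lang in langs if not (base / f"{lang}.traineddata").is_file()]


if __name__ == "__main__":
    # "osd" is needed for auto-detection; the others are example languages.
    try:
        print("Missing models:", missing_traineddata(["osd", "eng", "ces"]))
    except RuntimeError as exc:
        print("Cannot check models:", exc)
```

Running such a check at startup, for example inside a custom docling-serve image, surfaces a missing osd.traineddata or language file early instead of at OCR time.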
