Does Docling automatically detect the language of a document when sending images to Tesseract OCR, and how can it be configured?
Type: Answer
Status: Published
Created: Apr 15, 2026
Updated: Apr 15, 2026
Created by: Dosu Bot
Updated by: Dosu Bot

By default, Docling does not automatically detect the language. It uses a fixed list of languages passed to Tesseract, with the default being ["fra", "deu", "spa", "eng"] (French, German, Spanish, English).

However, optional auto-detection is available: setting lang=["auto"] causes Docling to use Tesseract's OSD (Orientation and Script Detection). OSD detects the document's script (for example Latin or Cyrillic) rather than a specific language, and Docling maps the detected script to a Tesseract language model via the map_tesseract_script() function. No external libraries (e.g., langdetect, fasttext) are used; detection depends solely on Tesseract.
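To make the script-versus-language distinction concrete, here is a minimal sketch of reading the script out of OSD output. It assumes the text format printed by `tesseract <image> stdout --psm 0`. The `SCRIPT_TO_LANGS` table and both helper functions are hypothetical names made up for illustration; they are not Docling's map_tesseract_script().

```python
import re

# Sample text in the format assumed for `tesseract <image> stdout --psm 0`.
SAMPLE_OSD = """Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 15.33
Script: Cyrillic
Script confidence: 2.50
"""

# Hypothetical script-to-languages table, for illustration only.
# This is NOT the mapping Docling applies internally.
SCRIPT_TO_LANGS = {
    "Latin": ["eng", "fra", "deu", "spa"],
    "Cyrillic": ["rus", "ukr"],
    "Arabic": ["ara"],
}


def parse_osd_script(osd_text):
    """Return (script, confidence) parsed from OSD text, or None if no script line exists."""
    script = re.search(r"^Script:\s*(\S+)", osd_text, re.MULTILINE)
    if script is None:
        return None
    conf = re.search(r"^Script confidence:\s*([\d.]+)", osd_text, re.MULTILINE)
    confidence = float(conf.group(1)) if conf else 0.0
    return script.group(1), confidence


def langs_for_script(script, fallback=("eng",)):
    """Look up candidate languages for a script, using the fallback for unknown scripts."""
    return list(SCRIPT_TO_LANGS.get(script, fallback))


result = parse_osd_script(SAMPLE_OSD)
if result is not None:
    script, confidence = result
    print(script, confidence, langs_for_script(script))
```

A script such as Latin covers many languages, which is why an explicit language list tends to be the safer choice when the document language is already known.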

Example – enabling auto-detection:

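from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# force_full_page_ocr is a field of the OCR options, so it is set there
ocr_options = TesseractCliOcrOptions(lang=["auto"], force_full_page_ocr=True)
pipeline_options = PdfPipelineOptions(do_ocr=True, ocr_options=ocr_options)

# Hand the pipeline options to the converter for PDF inputs
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)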

Example – manual language setting (e.g., Czech):

ocr_options = TesseractCliOcrOptions(lang=["ces", "eng"])

Important notes:

  • For auto-detection, the osd.traineddata file must be installed, along with .traineddata files for any detected languages.
  • The TESSDATA_PREFIX environment variable must be set before starting Python.
  • The default docling-serve Docker image only includes eng and osd, so a custom image is required for additional languages.
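
The notes above can be checked before running a conversion. The following is a minimal sketch, assuming TESSDATA_PREFIX points at the directory that directly contains the .traineddata files (older Tesseract versions may expect the parent directory instead). The function name `missing_traineddata` is hypothetical and not part of Docling or Tesseract.

```python
import os
from pathlib import Path


def missing_traineddata(langs, tessdata_dir=None):
    """Return the names among osd and `langs` that have no .traineddata file.

    Assumes the tessdata directory directly contains files such as
    eng.traineddata; falls back to TESSDATA_PREFIX when no directory is given.
    """
    base = tessdata_dir or os.environ.get("TESSDATA_PREFIX")
    if not base:
        raise RuntimeError("TESSDATA_PREFIX is not set")
    base = Path(base)
    required = ["osd", *langs]  # osd is needed for lang=["auto"]
    return [name for name in required if not (base / f"{name}.traineddata").is_file()]
```

For example, calling `missing_traineddata(["ces", "eng"])` inside a container would list whichever of osd, ces, and eng still need to be installed; an empty list means all of the files were found.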