By default, Docling does not automatically detect the language. It uses a fixed list of languages passed to Tesseract, with the default being ["fra", "deu", "spa", "eng"] (French, German, Spanish, English).
However, optional auto-detection is available: setting lang=["auto"] causes Docling to use Tesseract's OSD (Orientation and Script Detection), which detects the document's script and maps it to a language via the map_tesseract_script() function. No external libraries (e.g., langdetect, fasttext) are used — everything depends solely on Tesseract.
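To make the OSD step more concrete, here is a minimal sketch of how a detected script could be pulled out of Tesseract's OSD text output (the "Script: ..." line printed by tesseract --psm 0) and turned into a language code. The helper names and the mapping table are hypothetical and only for illustration; Docling's actual map_tesseract_script() may use different names and values.

```python
import re
from typing import Optional

# Hypothetical, illustrative subset of script -> language-code mappings.
# This is NOT Docling's real table; it only shows the shape of the idea.
HYPOTHETICAL_SCRIPT_TO_LANG = {
    "Latin": "eng",
    "Cyrillic": "rus",
    "Arabic": "ara",
}


def parse_osd_script(osd_output: str) -> Optional[str]:
    """Extract the script name from Tesseract OSD text output, if present."""
    # Matches the "Script: Latin" line but not "Script confidence: 2.50".
    match = re.search(r"^Script:\s*(\S+)", osd_output, flags=re.MULTILINE)
    return match.group(1) if match else None


def pick_language(osd_output: str, fallback: str = "eng") -> str:
    """Map the detected script to a language code, or use the fallback."""
    script = parse_osd_script(osd_output)
    return HYPOTHETICAL_SCRIPT_TO_LANG.get(script, fallback)
```

Note that OSD identifies a script, not a language, so any mapping of this kind is a best guess: several languages share the Latin script, for example.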
Example – enabling auto-detection:
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions
ocr_options = TesseractCliOcrOptions(lang=["auto"])
pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    force_full_page_ocr=True,
    ocr_options=ocr_options,
)
Example – manual language setting (e.g., Czech):
# Tesseract language codes: "ces" = Czech, "eng" = English
ocr_options = TesseractCliOcrOptions(lang=["ces", "eng"])
Important notes:
- For auto-detection, the osd.traineddata file must be installed, along with .traineddata files for any detected languages.
- The TESSDATA_PREFIX environment variable must be set before starting Python.
- The default docling-serve Docker image only includes eng and osd; a custom image is required for additional languages.
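The notes above can be checked before running a conversion. The following is a rough pre-flight sketch using only the standard library; the function name missing_traineddata is hypothetical, and it assumes TESSDATA_PREFIX points directly at the directory holding the .traineddata files (the convention in recent Tesseract versions; older versions may expect the parent directory).

```python
import os
from pathlib import Path
from typing import List, Optional


def missing_traineddata(langs: List[str], tessdata_dir: Optional[str] = None) -> List[str]:
    """Return the language codes whose .traineddata file was not found."""
    if tessdata_dir is None:
        # Assumption: TESSDATA_PREFIX is the directory containing the files.
        tessdata_dir = os.environ.get("TESSDATA_PREFIX")
        if not tessdata_dir:
            raise RuntimeError("TESSDATA_PREFIX is not set")
    base = Path(tessdata_dir)
    return [lang for lang in langs if not (base / f"{lang}.traineddata").is_file()]
```

For example, calling missing_traineddata(["osd", "ces", "eng"]) in the default docling-serve image would be expected to report "ces" as missing, which signals that a custom image is needed.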