Using Docling in Python on WSL2#
What is Docling?#
Docling is a free, open-source document conversion library (MIT License, copyright IBM) for GenAI applications. It supports PDF, DOCX, PPTX, XLSX, HTML, Markdown, LaTeX, images, and audio/video, with capabilities including:
- Advanced layout and table extraction via deep learning models
- OCR for scanned documents
- Chunking optimized for RAG (Retrieval-Augmented Generation)
- Export to Markdown, JSON, HTML, and plain text
- Official integrations with LlamaIndex and LangChain
1. Prerequisites & WSL2 Setup#
Enable WSL2 (PowerShell as Administrator)#
wsl --install
wsl --set-default-version 2
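Restart when prompted. On current Windows builds, wsl --install also installs Ubuntu by default; if it does not, install Ubuntu from the Microsoft Store.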
Install dependencies (Ubuntu)#
sudo apt update && sudo apt upgrade -y
sudo apt install -y python3 python3-pip python3-venv \
    tesseract-ocr libtesseract-dev libleptonica-dev \
    pkg-config ffmpeg build-essential
Requires Python 3.9+
Create a virtual environment#
python3 -m venv ~/docling-env
source ~/docling-env/bin/activate
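Before installing Docling, it can help to confirm that the steps above took effect. The following is a minimal sketch using only the standard library; the function name check_prerequisites and the list of tools it looks for are illustrative choices, not part of Docling.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 9), tools=("tesseract", "ffmpeg", "pkg-config")):
    """Return a small report on the interpreter, the venv, and required CLI tools."""
    return {
        # Tuple comparison: (3, 11, ...) >= (3, 9) is True.
        "python_ok": sys.version_info >= min_python,
        # Inside a virtual environment, sys.prefix differs from sys.base_prefix.
        "in_venv": sys.prefix != getattr(sys, "base_prefix", sys.prefix),
        # shutil.which returns None when a tool is not on PATH.
        "missing_tools": [tool for tool in tools if shutil.which(tool) is None],
    }


if __name__ == "__main__":
    print(check_prerequisites())
```

If in_venv is False, the environment was probably not activated in the current shell; if missing_tools is non-empty, re-run the apt install step.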
2. Installation#
# Basic
pip install docling
# With optional extras
pip install "docling[tesserocr]" # High-quality OCR
pip install "docling[vlm]" # Vision-Language Models
pip install "docling[htmlrender]" # Headless HTML rendering
pip install "docling[asr]" # Audio/video transcription
pip install "docling[tesserocr,vlm,htmlrender,asr]" # Full install
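An extra that failed to build can go unnoticed until the first conversion. As a rough sketch, the standard library can report what actually landed in the environment. The helper names below are hypothetical, and the module names checked by default (tesserocr, easyocr) are assumptions about which packages the extras pull in.

```python
from importlib import metadata, util


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def available_modules(modules=("docling", "tesserocr", "easyocr")):
    """Map each top-level module name to whether it can be imported."""
    return {name: util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    print("docling version:", installed_version("docling"))
    print(available_modules())
```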
3. Basic Usage#
Simple PDF conversion#
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("documento.pdf")
# Export to Markdown
print(result.document.export_to_markdown())
Export to multiple formats#
md = result.document.export_to_markdown()
text = result.document.export_to_text()
html = result.document.export_to_html()
result.document.save_as_markdown("output.md")
result.document.save_as_json("output.json")
result.document.save_as_html("output.html")
Convert multiple documents#
sources = ["doc1.pdf", "doc2.docx", "presentation.pptx"]
results = converter.convert_all(sources)
for res in results:
    print(res.document.export_to_markdown()[:500])
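In a larger batch, one unreadable file should not abort the whole run. The sketch below assumes that convert_all accepts raises_on_error=False and that each result exposes input.file and a status enum whose member names include SUCCESS and PARTIAL_SUCCESS; the function name convert_batch and the output layout are illustrative.

```python
from pathlib import Path


def convert_batch(converter, sources, out_dir):
    """Convert sources, write one Markdown file per success, return (written, failed)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    # Assumption: raises_on_error=False makes the converter report failures
    # through res.status instead of raising on the first bad document.
    for res in converter.convert_all(sources, raises_on_error=False):
        source_name = Path(res.input.file).name
        if res.status.name in ("SUCCESS", "PARTIAL_SUCCESS"):
            target = out_dir / f"{Path(source_name).stem}.md"
            target.write_text(res.document.export_to_markdown(), encoding="utf-8")
            written.append(target)
        else:
            failed.append(source_name)
    return written, failed
```

Called as convert_batch(converter, sources, "out"), it returns the paths written and the names of the documents that failed, which can be retried or logged separately.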
4. Advanced Usage#
OCR Configuration#
from docling.datamodel.pipeline_options import PdfPipelineOptions, EasyOcrOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = EasyOcrOptions(
    lang=["en", "pt"],
    use_gpu=True,
    confidence_threshold=0.6,
)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
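The configuration above uses EasyOCR, which names languages with two-letter codes. Tesseract, installed in the prerequisites step, uses three-letter codes instead ("eng", "por"). The following sketch shows one way to switch engines; it assumes TesseractCliOcrOptions is available in docling.datamodel.pipeline_options and that the matching Tesseract language packs (for example tesseract-ocr-por) are installed. The mapping table and helper names are illustrative and cover only a few languages.

```python
# EasyOCR-style codes -> Tesseract-style codes (illustrative subset only).
EASYOCR_TO_TESSERACT = {"en": "eng", "pt": "por", "es": "spa", "de": "deu", "fr": "fra"}


def to_tesseract_langs(langs):
    """Translate two-letter language codes into Tesseract's three-letter codes."""
    unknown = [code for code in langs if code not in EASYOCR_TO_TESSERACT]
    if unknown:
        raise ValueError(f"no Tesseract mapping for: {unknown}")
    return [EASYOCR_TO_TESSERACT[code] for code in langs]


def tesseract_pipeline_options(langs):
    """Build PDF pipeline options that run OCR through the Tesseract CLI."""
    # Imported lazily so the mapping helper works without Docling installed.
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        TesseractCliOcrOptions,
    )

    options = PdfPipelineOptions()
    options.do_ocr = True
    options.ocr_options = TesseractCliOcrOptions(lang=to_tesseract_langs(langs))
    return options
```

The returned options object can be passed to PdfFormatOption(pipeline_options=...) in the same way as the EasyOCR configuration.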
Chunking for RAG#
from docling.chunking import HybridChunker
chunker = HybridChunker(max_tokens=512)
chunks = list(chunker.chunk(dl_doc=result.document))
for chunk in chunks:
    enriched_text = chunker.contextualize(chunk=chunk)
    print(enriched_text[:200])
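For a RAG pipeline, the chunks usually need to be stored with some metadata next to the text. A minimal sketch, assuming each chunk carries a meta object with an optional headings list; the record layout and the name chunks_to_records are illustrative, not a Docling API.

```python
def chunks_to_records(chunker, document, source=None):
    """Turn chunker output into plain dicts that can be embedded and indexed."""
    records = []
    for index, chunk in enumerate(chunker.chunk(dl_doc=document)):
        meta = getattr(chunk, "meta", None)
        records.append(
            {
                "id": index,
                # contextualize() prepends heading context to the chunk text.
                "text": chunker.contextualize(chunk=chunk),
                # Assumption: meta.headings is a list of strings or None.
                "headings": list(getattr(meta, "headings", None) or []),
                "source": source,
            }
        )
    return records
```

With the objects from the example above, chunks_to_records(chunker, result.document, source="documento.pdf") yields a list ready to pass to an embedding model or vector store client.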
LangChain Integration#
pip install langchain-docling
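from pathlib import Path
from langchain_docling import DoclingLoader
# DoclingLoader expects a file path or URL (or an iterable of them), not a directory
paths = [str(p) for p in Path("documentos").glob("*.pdf")]
loader = DoclingLoader(file_path=paths)
docs = loader.load()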
LlamaIndex Integration#
pip install llama-index-readers-docling llama-index-node-parser-docling
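from llama_index.readers.docling import DoclingReader
from llama_index.node_parser.docling import DoclingNodeParser
# DoclingNodeParser works on Docling's JSON export, so request that format from the reader
reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
docs = reader.load_data(file_path="documento.pdf")
nodes = DoclingNodeParser().get_nodes_from_documents(docs)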
5. Integration with an AI Agent (e.g., Hermes Agent)#
Option A: Python script as a skill#
#!/usr/bin/env python3
import json
import sys

from docling.document_converter import DocumentConverter


def convert_document(file_path: str, output_format: str = "markdown") -> dict:
    converter = DocumentConverter()
    result = converter.convert(file_path)
    exporters = {
        "markdown": result.document.export_to_markdown,
        "text": result.document.export_to_text,
        "html": result.document.export_to_html,
    }
    # Fall back to Markdown for unknown formats and report the format actually used
    if output_format not in exporters:
        output_format = "markdown"
    content = exporters[output_format]()
    return {"status": "success", "format": output_format, "content": content}


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: docling_convert.py FILE [markdown|text|html]")
    file_path = sys.argv[1]
    fmt = sys.argv[2] if len(sys.argv) > 2 else "markdown"
    print(json.dumps(convert_document(file_path, fmt), ensure_ascii=False, indent=2))
Usage: python docling_convert.py documento.pdf markdown
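On the agent side, the script can be invoked as a subprocess and its JSON output parsed. How a given agent framework registers skills is not covered here; the sketch below only shows the generic call, and the name run_docling_skill and the 600-second default timeout are assumptions.

```python
import json
import subprocess
import sys


def run_docling_skill(script_path, file_path, fmt="markdown", timeout=600):
    """Run the conversion script and return its JSON result as a dict."""
    proc = subprocess.run(
        [sys.executable, script_path, file_path, fmt],
        capture_output=True,
        encoding="utf-8",
        timeout=timeout,
    )
    if proc.returncode != 0:
        # Surface the script's stderr so the agent can report what went wrong.
        return {"status": "error", "message": proc.stderr.strip()}
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        return {"status": "error", "message": f"invalid JSON from script: {exc}"}
```

Using sys.executable keeps the call inside the active virtual environment, so the subprocess sees the same Docling installation as the caller.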
Option B: docling-serve as an HTTP API#
pip install "docling-serve[ui]"
docling-serve run --host 0.0.0.0 --port 5001
import requests
# Convert via file upload
with open("documento.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:5001/v1/convert/file",
        files={"files": ("documento.pdf", f, "application/pdf")},
        data={"to_formats": "md"},
    )
print(response.json())
# Convert via URL
response = requests.post(
    "http://localhost:5001/v1/convert/source",
    json={
        "options": {"to_formats": ["md"]},
        "sources": [{"kind": "http", "url": "https://arxiv.org/pdf/2408.09869"}],
    },
)
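The examples above print or discard the raw response. To get at the converted text, the JSON body has to be unpacked. The sketch below assumes the response carries a top-level status field and the Markdown under document.md_content; that shape is an assumption and should be checked against the response of the docling-serve version in use. The helper name extract_markdown is illustrative.

```python
def extract_markdown(payload):
    """Pull the Markdown text out of a docling-serve JSON response body."""
    status = payload.get("status")
    # Assumption: the server reports "success" or "partial_success" here.
    if status not in ("success", "partial_success"):
        raise RuntimeError(
            f"conversion failed: status={status!r}, errors={payload.get('errors')}"
        )
    # Assumption: Markdown output lives under document.md_content.
    markdown = (payload.get("document") or {}).get("md_content")
    if markdown is None:
        raise KeyError("response contains no document.md_content field")
    return markdown
```

A typical call would be extract_markdown(response.json()), preceded by response.raise_for_status() to catch HTTP-level errors first.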
6. Troubleshooting on WSL2#
| Issue | Solution |
|---|---|
| Docling freezes on WSL2 | Use PyPdfiumDocumentBackend |
| PyTorch/CUDA errors | Pin a known-good PyTorch version, e.g. pip install torch==2.5.1 |
| Memory leak on successive conversions | Set pipeline_options.generate_parsed_pages = False; call gc.collect() after each conversion |
| Slow filesystem performance | Always work in native Linux paths (~/docs/), not /mnt/c/ |
| WSL2 memory limits | Set memory=8GB under the [wsl2] section of %USERPROFILE%\.wslconfig, then run wsl --shutdown to apply it |
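The memory-leak workaround from the table can be sketched as a generator that handles one document at a time and collects garbage in between. The name convert_sequentially is illustrative; the converter is any object with a convert(source) method, such as a DocumentConverter.

```python
import gc


def convert_sequentially(converter, sources):
    """Yield (source, markdown) pairs, freeing each result before the next one."""
    for source in sources:
        result = converter.convert(source)
        markdown = result.document.export_to_markdown()
        # Drop the reference to the heavy result object, then collect.
        del result
        gc.collect()
        yield source, markdown
```

Because it is a generator, only one conversion result is alive at any time, provided the caller writes each Markdown string to disk instead of accumulating them in a list.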
Fix for Docling freezing:#
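from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(backend=PyPdfiumDocumentBackend)}
)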
Note: Official Windows/WSL2 support is limited. The Docling team recommends native Linux for production use.