## HybridChunker Parameters

Here are all `HybridChunker()` parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `tokenizer` | `BaseTokenizer` | default tokenizer | Tokenizer to use (controls `max_tokens`) |
| `max_tokens` | `int` | from tokenizer | Maximum tokens per chunk (backward compatibility) |
| `merge_peers` | `bool` | `True` | Merge undersized chunks with the same headings |
| `repeat_table_header` | `bool` | `True` | Repeat table headers when a table is split |
| `omit_header_on_overflow` | `bool` | `False` | Omit headers if they would cause overflow |
| `serializer_provider` | `BaseSerializerProvider` | default | Controls text serialization |
| `always_emit_headings` | `bool` | `False` | Emit headings even for empty sections |
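To illustrate what `repeat_table_header` does conceptually, here is a minimal pure-Python sketch, not Docling's actual implementation: when a table must be split across chunks, each piece re-emits the header row so every chunk remains self-describing. The `split_table` helper is a hypothetical name introduced for this example.

```python
def split_table(header: list[str], rows: list[list[str]],
                max_rows: int, repeat_header: bool = True) -> list[list[list[str]]]:
    """Split a table into row groups, optionally repeating the header in each."""
    pieces = []
    for start in range(0, len(rows), max_rows):
        body = rows[start:start + max_rows]
        # With repeat_header, every piece starts with the header row.
        pieces.append(([header] + body) if repeat_header else body)
    return pieces

header = ["name", "qty"]
rows = [["a", "1"], ["b", "2"], ["c", "3"]]
pieces = split_table(header, rows, max_rows=2)
print(len(pieces))   # 2 pieces
print(pieces[1][0])  # header repeated in the second piece: ['name', 'qty']
```

Disabling the flag would drop the header from every piece after the first, which saves tokens but makes later pieces ambiguous on their own.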
## Import

```python
from docling.chunking import HybridChunker
```
## Usage Examples

Basic usage with `max_tokens`:

```python
chunker = HybridChunker(max_tokens=512)
```

Disable peer merging:

```python
chunker = HybridChunker(max_tokens=512, merge_peers=False)
```
With a custom HuggingFace tokenizer:

```python
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
from transformers import AutoTokenizer

tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
    max_tokens=256,
)
chunker = HybridChunker(tokenizer=tokenizer)
```
Full example with chunking and contextualization:

```python
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("path/to/document.pdf")
doc = result.document

chunker = HybridChunker(max_tokens=512)
chunks = list(chunker.chunk(dl_doc=doc))

for i, chunk in enumerate(chunks):
    print(f"Chunk {i}")
    print(f"Text: {chunk.text[:200]}")
    # Contextualized text prepends section headings; use this for embeddings
    enriched = chunker.contextualize(chunk=chunk)
    print(f"Contextualized: {enriched[:200]}")
```
## How HybridChunker Works (4-Stage Pipeline)

1. **Hierarchical Chunking** – Uses an inner `HierarchicalChunker` to create initial chunks based on document structure.
2. **Group Document Items** – Combines consecutive elements (paragraphs, list items, tables) until the token limit is reached.
3. **Split Oversized Items** – Splits single elements that exceed `max_tokens`; tables use `LineBasedTokenChunker` with header repetition, while text uses the `semchunk` library.
4. **Merge Peers (optional)** – When `merge_peers=True`, merges consecutive undersized chunks that share the same section headings.
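The merge-peers stage can be sketched in a few lines of plain Python. This is an illustration of the idea only, using a whitespace word count in place of a real tokenizer; the `Chunk` class and `merge_peers` function here are hypothetical names, not Docling's internals.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    headings: tuple[str, ...]
    text: str

def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer: whitespace word count.
    return len(text.split())

def merge_peers(chunks: list[Chunk], max_tokens: int) -> list[Chunk]:
    """Merge consecutive undersized chunks that share the same headings."""
    merged: list[Chunk] = []
    for chunk in chunks:
        if (merged
                and merged[-1].headings == chunk.headings
                and count_tokens(merged[-1].text + " " + chunk.text) <= max_tokens):
            # Same heading path and combined size fits: fold into the previous chunk.
            merged[-1] = Chunk(chunk.headings, merged[-1].text + " " + chunk.text)
        else:
            merged.append(chunk)
    return merged

chunks = [
    Chunk(("Intro",), "one two"),
    Chunk(("Intro",), "three"),
    Chunk(("Methods",), "four five"),
]
print(len(merge_peers(chunks, max_tokens=8)))  # 2: the two "Intro" chunks merged
```

Note that merging stops at heading boundaries, which is why chunks from different sections never combine even when both are undersized.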
The `contextualize()` method prepends hierarchical section headings to the chunk text, which is recommended when embedding chunks for RAG applications.
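Conceptually, contextualization amounts to prefixing the chunk's heading path to its text before embedding. The sketch below shows that idea only; the `contextualize_text` helper is hypothetical, and Docling's own serialization format may differ.

```python
def contextualize_text(headings: list[str], text: str) -> str:
    """Prefix the hierarchical heading path so the embedding carries context."""
    if not headings:
        return text
    return "\n".join(headings) + "\n" + text

enriched = contextualize_text(["Chapter 2", "Results"], "Accuracy improved by 4%.")
print(enriched)
```

Embedding the enriched string instead of the raw text lets retrieval match queries about a section's topic even when the chunk body never repeats the section title.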