What are all the parameters of HybridChunker in Docling, and how do you use it with a custom tokenizer?
Type: Answer
Status: Published
Created: Mar 24, 2026 by Dosu Bot
Updated: Mar 24, 2026 by Dosu Bot

HybridChunker Parameters

Here are all HybridChunker() parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| tokenizer | BaseTokenizer | default tokenizer | Tokenizer to use (controls max_tokens) |
| max_tokens | int | from tokenizer | Maximum tokens per chunk (backward compat) |
| merge_peers | bool | True | Merge undersized chunks with same headings |
| repeat_table_header | bool | True | Repeat table headers when table is split |
| omit_header_on_overflow | bool | False | Omit headers if they'd cause overflow |
| serializer_provider | BaseSerializerProvider | default | Controls text serialization |
| always_emit_headings | bool | False | Emit headings even for empty sections |
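
To make the effect of repeat_table_header concrete, here is a toy sketch in plain Python. It is not Docling's implementation: the split_table function is hypothetical, and it splits by row count rather than by token count to keep the example short.

```python
def split_table(header: str, rows: list[str], max_rows: int,
                repeat_header: bool = True) -> list[str]:
    """Split table rows into pieces of at most max_rows rows each.

    Hypothetical helper for illustration only; Docling splits by tokens.
    """
    pieces = []
    for start in range(0, len(rows), max_rows):
        body = rows[start:start + max_rows]
        # The first piece always carries the header; later pieces only
        # carry it when repeat_header is enabled.
        include_header = repeat_header or start == 0
        lines = ([header] if include_header else []) + body
        pieces.append("\n".join(lines))
    return pieces


rows = ["a | 1", "b | 2", "c | 3"]
for piece in split_table("name | value", rows, max_rows=2):
    print(piece)
    print("---")
```

With the header repeated, every piece stays interpretable on its own, which matters when pieces are embedded and retrieved independently.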

Import

from docling.chunking import HybridChunker

Usage Examples

Basic usage with max_tokens:

chunker = HybridChunker(max_tokens=512)

Disable peer merging:

chunker = HybridChunker(max_tokens=512, merge_peers=False)

With a custom HuggingFace tokenizer:

from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
from transformers import AutoTokenizer

tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
    max_tokens=256,
)
chunker = HybridChunker(tokenizer=tokenizer)

Full example with chunking and contextualization:

from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

converter = DocumentConverter()
result = converter.convert("path/to/document.pdf")
doc = result.document

chunker = HybridChunker(max_tokens=512)
chunks = list(chunker.chunk(dl_doc=doc))

for i, chunk in enumerate(chunks):
    print(f"Chunk {i} text: {chunk.text[:200]}")
    # Contextualized text prepends section headings; use this for embeddings
    enriched = chunker.contextualize(chunk=chunk)
    print(f"Chunk {i} contextualized: {enriched[:200]}")

How HybridChunker Works (4-Stage Pipeline)

  1. Hierarchical Chunking – Uses an inner HierarchicalChunker to create initial chunks based on document structure.
  2. Group Document Items – Combines consecutive elements (paragraphs, list items, tables) until the token limit is reached.
  3. Split Oversized Items – Splits single elements that exceed max_tokens: tables use LineBasedTokenChunker with header repetition, and text uses the semchunk library.
  4. Merge Peers (optional) – When merge_peers=True, merges consecutive undersized chunks that share the same section headings.
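
As a rough illustration of stage 4, the sketch below merges consecutive chunks that share the same headings as long as the combined text stays within the token limit. Everything here is a simplified stand-in, not Docling's code: the Chunk dataclass, the whitespace-based count_tokens, and the merge_peers function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    # Hypothetical stand-in for a Docling chunk: headings plus text.
    headings: tuple[str, ...]
    text: str


def count_tokens(text: str) -> int:
    # Whitespace splitting stands in for a real tokenizer.
    return len(text.split())


def merge_peers(chunks: list[Chunk], max_tokens: int) -> list[Chunk]:
    """Merge neighbours with identical headings while the result fits."""
    merged: list[Chunk] = []
    for chunk in chunks:
        if merged:
            prev = merged[-1]
            combined = prev.text + "\n" + chunk.text
            same_section = prev.headings == chunk.headings
            if same_section and count_tokens(combined) <= max_tokens:
                # Replace the previous chunk with the merged one.
                merged[-1] = Chunk(prev.headings, combined)
                continue
        merged.append(chunk)
    return merged


chunks = [
    Chunk(("Intro",), "a b"),
    Chunk(("Intro",), "c d"),
    Chunk(("Methods",), "e f"),
    Chunk(("Methods",), "g h i j k"),
]
for c in merge_peers(chunks, max_tokens=5):
    print(c.headings, repr(c.text))
```

In this toy run the two "Intro" chunks merge, while the two "Methods" chunks stay separate because merging them would exceed the limit. Setting merge_peers=False corresponds to skipping this step entirely.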

The contextualize() method prepends hierarchical section headings to the chunk text. Embedding this contextualized text, rather than the raw chunk.text, is recommended for RAG applications.
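
The idea behind contextualization can be sketched in a few lines of plain Python. This is an assumption-laden toy, not the library's method: the standalone contextualize function and its newline delimiter are hypothetical choices made for illustration.

```python
def contextualize(headings: list[str], text: str, delim: str = "\n") -> str:
    """Prepend section headings (outermost first) to the chunk text.

    Hypothetical stand-in for illustration; the delimiter is an assumption.
    """
    return delim.join([*headings, text])


enriched = contextualize(
    ["User Guide", "Installation"],
    "Run the installer and restart the service.",
)
print(enriched)
```

A chunk such as "Run the installer and restart the service." says little on its own; with its headings attached, the embedding also captures which section it came from.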