What are all the parameters of HybridChunker in Docling, and how do you use it with a custom tokenizer?
Type: Answer
Status: Published
Created: Mar 24, 2026
Updated: May 13, 2026
Created by: Dosu Bot
Updated by: Dosu Bot

HybridChunker Parameters

Here are all HybridChunker() parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| tokenizer | BaseTokenizer | default tokenizer | Tokenizer to use (controls max_tokens) |
| max_tokens | int | from tokenizer | Maximum tokens per chunk (backward compat) |
| merge_peers | bool | True | Merge undersized chunks with same headings |
| repeat_table_header | bool | True | Repeat table headers when table is split |
| omit_header_on_overflow | bool | False | Omit headers if they'd cause overflow |
| serializer_provider | BaseSerializerProvider | default | Controls text serialization |
| always_emit_headings | bool | False | Emit headings even for empty sections |
| use_markdown_images | bool | False | Serialize images as markdown references and add has_image to metadata |
| image_placeholder | str | "![IMAGE]" | Placeholder for images when use_markdown_images=False |
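
The effect of repeat_table_header is easiest to see on a toy table. The sketch below is a standalone illustration, not Docling code: the function split_table and its parameters are hypothetical, and it splits by row count for simplicity rather than by token count.

```python
from typing import List


def split_table(
    header: str, rows: List[str], rows_per_chunk: int, repeat_header: bool = True
) -> List[str]:
    """Split table rows into chunks, optionally repeating the header in each chunk."""
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        part = rows[start:start + rows_per_chunk]
        # The first chunk always carries the header; later chunks only on request.
        if repeat_header or start == 0:
            part = [header] + part
        chunks.append("\n".join(part))
    return chunks


rows = ["apples | 3", "pears | 5", "plums | 2"]
with_header = split_table("item | qty", rows, rows_per_chunk=2)
without_header = split_table("item | qty", rows, rows_per_chunk=2, repeat_header=False)
print(with_header[1])
print(without_header[1])
```

With the header repeated, the second chunk still says what its columns mean, which matters when each chunk is embedded or retrieved on its own.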

Import

from docling.chunking import HybridChunker

Usage Examples

Basic usage with max_tokens:

chunker = HybridChunker(max_tokens=512)

Disable peer merging:

chunker = HybridChunker(max_tokens=512, merge_peers=False)

With a custom HuggingFace tokenizer:

from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
from transformers import AutoTokenizer

tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
    max_tokens=256,
)
chunker = HybridChunker(tokenizer=tokenizer)
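
To make the tokenizer's role concrete, here is a minimal standalone sketch of what a chunker needs from it: a way to count tokens and a budget to compare against. This is a toy, not a Docling class; the name WhitespaceTokenizer and its methods are hypothetical, and counting whitespace-separated words is only a stand-in for a real model tokenizer.

```python
from dataclasses import dataclass


@dataclass
class WhitespaceTokenizer:
    """Hypothetical stand-in for a tokenizer wrapper (not part of Docling)."""

    max_tokens: int = 256

    def count_tokens(self, text: str) -> int:
        # Toy rule: one token per whitespace-separated word.
        return len(text.split())

    def fits(self, text: str) -> bool:
        # True when the text stays within the chunk budget.
        return self.count_tokens(text) <= self.max_tokens


tok = WhitespaceTokenizer(max_tokens=5)
print(tok.count_tokens("HybridChunker respects the token budget"))
print(tok.fits("one two three four five six"))
```

The tokenizer should match the embedding model that will consume the chunks, because token counts differ between tokenizers and a chunk that fits one budget may overflow another.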

Full example with chunking and contextualization:

from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

converter = DocumentConverter()
result = converter.convert("path/to/document.pdf")
doc = result.document

chunker = HybridChunker(max_tokens=512)
chunks = list(chunker.chunk(dl_doc=doc))

for i, chunk in enumerate(chunks):
    print(f"=== Chunk {i} ===")
    print(f"Text: {chunk.text[:200]}")
    # contextualize() prepends the section headings; use its output for embeddings
    enriched = chunker.contextualize(chunk=chunk)
    print(f"Contextualized: {enriched[:200]}")

How HybridChunker Works (4-Stage Pipeline)

  1. Hierarchical Chunking – Uses an inner HierarchicalChunker to create initial chunks based on document structure.
  2. Group Document Items – Combines consecutive elements (paragraphs, list items, tables) until the token limit is reached.
  3. Split Oversized Items – Splits single elements that exceed max_tokens. Tables are split with LineBasedTokenChunker, which repeats headers, and text is split with the semchunk library.
  4. Merge Peers (optional) – When merge_peers=True, merges consecutive undersized chunks that share the same section headings.
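
The peer-merging stage can be sketched in a few lines. The code below is a simplified illustration under stated assumptions, not Docling's implementation: chunks are modeled as (headings, text) tuples, tokens are counted as whitespace-separated words, and the names count_tokens and merge_peers are hypothetical helpers.

```python
from typing import List, Tuple

Chunk = Tuple[Tuple[str, ...], str]  # (section headings, chunk text)


def count_tokens(text: str) -> int:
    # Toy tokenizer: one token per whitespace-separated word.
    return len(text.split())


def merge_peers(chunks: List[Chunk], max_tokens: int) -> List[Chunk]:
    """Merge consecutive chunks that share headings and together fit the budget."""
    merged: List[Chunk] = []
    for headings, text in chunks:
        if merged:
            prev_headings, prev_text = merged[-1]
            combined = prev_text + "\n" + text
            if prev_headings == headings and count_tokens(combined) <= max_tokens:
                # Same section and still within budget: fold into the previous chunk.
                merged[-1] = (headings, combined)
                continue
        merged.append((headings, text))
    return merged


chunks = [
    (("Intro",), "Docling parses documents."),
    (("Intro",), "It exports structured output."),
    (("Usage",), "Call the chunker on a document."),
]
result = merge_peers(chunks, max_tokens=10)
for headings, text in result:
    print(headings, repr(text))
```

The two "Intro" chunks merge because they share headings and their combined length stays under the budget. The "Usage" chunk stays separate because its headings differ.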

The contextualize() method prepends the hierarchical section headings to the chunk text. Use its output, rather than the raw chunk.text, as the input to the embedding model in RAG applications.
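
As a rough picture of what contextualization produces, the sketch below joins a chunk's headings and text into one string. It is a toy version under assumptions: the standalone function contextualize, its delim parameter, and the newline delimiter are illustrative choices, not a statement of how Docling serializes the result.

```python
from typing import Sequence


def contextualize(headings: Sequence[str], text: str, delim: str = "\n") -> str:
    """Prepend section headings to the chunk text (toy version)."""
    return delim.join([*headings, text])


enriched = contextualize(
    ["Installation", "Requirements"], "Install the package with pip."
)
print(enriched)
```

Carrying the headings into the embedded text gives the embedding model the section context that the bare chunk text lacks.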
