chonkie-inc/chonkie

A pygmy hippo that chunks text at 100 GB/s

Chonkie wraps every text-splitting strategy you keep rewriting into one install-what-you-need Python library.

★ 4.1k stars · Python · RAG · Search · Data Tooling

Velocity (7d): +9.5 ★/day · Trend: steady

What it does

Chonkie is a Python chunking toolkit for RAG pipelines. It bundles nine chunkers, from naive token splitting to LLM-based “Slumber” chunking, plus refineries, vector-DB handshakes, and a self-hosted REST API. The default install is 505 KB; extras are opt-in so you don’t drag in half of PyTorch just to split a README.

The interesting bit

The Pipeline class lets you chain chunkers and refineries declaratively through a fluent API: recursive chunk at 2K tokens, semantic chunk at 512, add overlap, embed, and ship to Qdrant. Pipelines are also storable and reusable via the REST API’s SQLite-backed registry, which turns a Python library into a chunking microservice with chonkie serve.
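The chaining idea can be sketched without Chonkie itself. The snippet below is a minimal, stdlib-only illustration and not Chonkie's API: `MiniPipeline`, `split_by_words`, and `add_overlap` are hypothetical names, whitespace-separated words stand in for tokens, both passes are plain size-based splits (no semantic similarity or embedding is involved), and the sizes are scaled far down from the 2K/512 figures.

```python
from typing import Callable, List

# A step takes the current list of chunks and returns a new list of chunks.
Step = Callable[[List[str]], List[str]]


def split_by_words(max_words: int) -> Step:
    """Return a step that splits every chunk into pieces of at most max_words words."""
    def step(chunks: List[str]) -> List[str]:
        out: List[str] = []
        for chunk in chunks:
            words = chunk.split()
            for i in range(0, len(words), max_words):
                out.append(" ".join(words[i:i + max_words]))
        return out
    return step


def add_overlap(context_words: int) -> Step:
    """Return a step that prefixes each chunk with the tail of the previous chunk."""
    def step(chunks: List[str]) -> List[str]:
        out: List[str] = []
        for i, chunk in enumerate(chunks):
            if i == 0 or context_words <= 0:
                out.append(chunk)
            else:
                tail = chunks[i - 1].split()[-context_words:]
                out.append(" ".join(tail + [chunk]))
        return out
    return step


class MiniPipeline:
    """Hypothetical fluent pipeline: each call appends a step and returns self."""

    def __init__(self) -> None:
        self.steps: List[Step] = []

    def then(self, step: Step) -> "MiniPipeline":
        self.steps.append(step)
        return self

    def run(self, text: str) -> List[str]:
        chunks = [text]
        for step in self.steps:
            chunks = step(chunks)
        return chunks


pipeline = (
    MiniPipeline()
    .then(split_by_words(8))  # coarse pass
    .then(split_by_words(4))  # finer pass over the coarse chunks
    .then(add_overlap(2))     # carry 2 words of context forward
)
chunks = pipeline.run("one two three four five six seven eight nine ten")
for c in chunks:
    print(c)
```

Because every step has the same list-in, list-out shape, stages can be reordered or swapped without touching the others, which is the property a declarative pipeline depends on.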

Key highlights

FastChunker claims SIMD-accelerated, byte-based chunking at “100+ GB/s” on CPU
32+ integrations including 8 vector DB handshakes (Chroma, Pinecone, pgvector, etc.) and multiple tokenizer backends
Optional installs per component—chonkie[semantic] for embeddings, chonkie[tiktoken] for OpenAI token counting, etc.
Self-hosted API with Docker Compose support and interactive /docs
56-language support out of the box

Caveats

The “100+ GB/s” claim for FastChunker comes without reproducible benchmark details in the README; treat it as a marketing figure until verified
The README is truncated mid-sentence in the transformers tokenizer section, so full tokenizer coverage is unclear
chonkie[all] is explicitly “not recommended for production environments”
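
One way to sanity-check a throughput figure like “100+ GB/s” is to time a chunker over a known number of bytes. The sketch below is a stdlib-only harness under stated assumptions: `chunk_bytes` is a hypothetical stand-in chunker (it is not FastChunker), and `throughput_gb_per_s` is an illustrative helper. Swap the real chunker call in for `chunk_bytes` to measure it.

```python
import time
from typing import Callable, List


def chunk_bytes(data: bytes, target_size: int = 4096, delimiter: bytes = b"\n") -> List[bytes]:
    """Hypothetical stand-in: cut near target_size, preferring the last delimiter in the window."""
    chunks: List[bytes] = []
    start = 0
    n = len(data)
    while start < n:
        end = min(start + target_size, n)
        if end < n:
            # Look for the last delimiter inside the current window.
            cut = data.rfind(delimiter, start, end)
            if cut > start:
                end = cut + len(delimiter)
        chunks.append(data[start:end])
        start = end
    return chunks


def throughput_gb_per_s(fn: Callable[[bytes], List[bytes]], data: bytes, repeats: int = 5) -> float:
    """Best-of-N wall-clock throughput in GB/s, with 1 GB taken as 1e9 bytes."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - t0)
    # Guard against a zero reading from a coarse clock.
    return len(data) / max(best, 1e-12) / 1e9


sample = b"The quick brown fox jumps over the lazy dog.\n" * 200_000  # about 9 MB
rate = throughput_gb_per_s(chunk_bytes, sample)
print(f"{rate:.2f} GB/s")
```

A pure-Python stand-in will land far below a SIMD implementation, so the number it prints says nothing about FastChunker. Any figure is only meaningful alongside the hardware, input size, and chunk settings used to produce it.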

Verdict

Worth a look if you’re maintaining yet another bespoke chunking script and want one library with swap-in strategies. Skip it if you already have a deeply customized NLP pipeline that you trust. Chonkie is glue, not magic, and the mascot knows it.