huggingface/tokenizers

A high-performance Rust library providing fast tokenizers for NLP and language model preprocessing.

10.8k stars · Rust · Language Models · Data Tooling
tokenizers
Velocity · 7d: +4.5 ★ / day
Trend: steady

Provides implementations of today’s most widely used tokenizers with a focus on performance and versatility. The Rust implementation makes both training and tokenization extremely fast: according to the project README, tokenizing a gigabyte of text takes less than 20 seconds on a server CPU. Designed for both research and production use, it handles vocabulary training, text-to-token conversion, normalization with alignment tracking, and preprocessing tasks like truncation, padding, and special token insertion. Available via bindings for Python, Node.js, Ruby, and other languages.
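
To make the preprocessing steps named above concrete, here is a minimal plain-Python sketch of text-to-token conversion followed by truncation, special token insertion, and padding. It does not use the tokenizers library or its API: the `VOCAB` table, the `encode` function, and the `[CLS]`/`[SEP]`/`[PAD]`/`[UNK]` token names are hypothetical stand-ins chosen for illustration, and the whitespace splitting is far simpler than the algorithms the library implements.

```python
# Illustrative only: plain-Python stand-ins, NOT the tokenizers library API.
from typing import Dict, List

# Hypothetical toy vocabulary; a real one would be trained from a corpus.
VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "hello": 4, "world": 5, "fast": 6, "tokenizers": 7,
}


def encode(text: str, max_length: int) -> List[int]:
    """Convert text to exactly max_length token ids."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    # Normalization and pre-tokenization: lowercase, split on whitespace.
    words = text.lower().split()
    # Text-to-token conversion; unknown words map to [UNK].
    ids = [VOCAB.get(word, VOCAB["[UNK]"]) for word in words]
    # Truncation: reserve two slots for the special tokens.
    ids = ids[: max_length - 2]
    # Special token insertion.
    ids = [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]
    # Padding on the right up to the fixed length.
    return ids + [VOCAB["[PAD]"]] * (max_length - len(ids))


if __name__ == "__main__":
    print(encode("Hello world", 6))
    print(encode("Hello fast tokenizers world again", 5))
```

The short input is padded on the right, while the long one loses its trailing words so the special tokens still fit. Padding side and truncation strategy are design choices; this sketch fixes both for brevity.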
