Is tokenizers open source?

Yes — huggingface/tokenizers is open source, released under the Apache-2.0 license.

What language is tokenizers written in?

huggingface/tokenizers is primarily written in Rust.

How popular is tokenizers?

huggingface/tokenizers has 10.9k stars on GitHub.

Where can I find tokenizers?

huggingface/tokenizers is on GitHub at https://github.com/huggingface/tokenizers.


huggingface/tokenizers

Rust Tokenizers That Map Every Token Back to Source

Implements BPE, WordPiece, and Unigram tokenization in Rust, with cross-language bindings so NLP pipelines do not each have to reimplement tokenization in Python.

★ 10.9k stars · Rust · Language Models · Data Tooling


What it does

Hugging Face Tokenizers is a Rust library that trains and runs the most widely used NLP tokenization schemes (BPE, WordPiece, and Unigram) from a single shared core. It exposes bindings for Python, Node.js, and Ruby (the last via an external contribution), and handles the full pre-processing chain, including truncation, padding, and special-token injection. The README claims it can tokenize a gigabyte of text on a server CPU in under 20 seconds.
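To make the pre-processing chain concrete, here is a minimal stdlib-only Python sketch of truncation, special-token injection, and padding. It is not the library's implementation or API: the function names `prepare` and `prepare_batch` and the `[CLS]`/`[SEP]`/`[PAD]` token strings are assumptions chosen for illustration, and the sketch works on token strings rather than the integer ids a real pipeline would use.

```python
from typing import List

# Hypothetical special tokens, chosen only for this illustration.
CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"


def prepare(tokens: List[str], max_length: int) -> List[str]:
    """Truncate, wrap in special tokens, then right-pad to max_length."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    # Reserve two slots for the special tokens before truncating.
    body = tokens[: max_length - 2]
    wrapped = [CLS] + body + [SEP]
    # Right-pad so every sequence in a batch ends up the same length.
    return wrapped + [PAD] * (max_length - len(wrapped))


def prepare_batch(batch: List[List[str]], max_length: int) -> List[List[str]]:
    """Apply the same chain to every sequence in a batch."""
    return [prepare(tokens, max_length) for tokens in batch]


batch = prepare_batch(
    [["hello", "world"], ["a", "b", "c", "d", "e", "f"]],
    max_length=6,
)
for row in batch:
    print(row)
```

The short sequence is padded and the long one is truncated, so both rows come out with six entries. Doing all three steps in one place is what lets a tokenizer hand model-ready batches straight to downstream code.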

The interesting bit

Most tokenizer libraries split text and move on; this one tracks alignments through normalization, so you can always map a token back to the exact span of the original text it came from. That bookkeeping is the kind of tedious detail projects usually skip, but it is exactly what makes the library viable for both research debugging and production pipelines.
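The idea of tracking alignments through normalization can be sketched in a few lines of stdlib Python. This is an illustrative toy, not the library's Rust implementation: `normalize_with_alignment` and `tokenize_with_offsets` are hypothetical names, the normalizer (lowercasing, NFKD decomposition, accent stripping) is an assumed example, and tokens are split on whitespace only.

```python
import re
import unicodedata
from typing import List, Tuple


def normalize_with_alignment(text: str) -> Tuple[str, List[int]]:
    """Lowercase, decompose (NFKD), and strip accents.

    Returns the normalized string plus, for each normalized character,
    the index of the original character that produced it.
    """
    out_chars: List[str] = []
    origin: List[int] = []
    for i, ch in enumerate(text):
        for piece in unicodedata.normalize("NFKD", ch.lower()):
            if unicodedata.combining(piece):
                continue  # drop accent marks
            out_chars.append(piece)
            origin.append(i)
    return "".join(out_chars), origin


def tokenize_with_offsets(text: str) -> List[Tuple[str, Tuple[int, int]]]:
    """Whitespace-split the normalized text, reporting original spans."""
    normalized, origin = normalize_with_alignment(text)
    result = []
    for match in re.finditer(r"\S+", normalized):
        start = origin[match.start()]
        end = origin[match.end() - 1] + 1  # exclusive end in the original
        result.append((match.group(), (start, end)))
    return result


# "\ufb01" is the single-character "fi" ligature; "\u00e9" is "e" with an acute accent.
text = "The \ufb01nal Caf\u00e9"
for token, (start, end) in tokenize_with_offsets(text):
    print(token, (start, end), repr(text[start:end]))
```

Here normalization changes the string's length, because the ligature expands to two characters, yet each token's span still slices the original text correctly. The token `final` maps back to the four-character span that contains the ligature.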

Key highlights

Rust implementation with bindings for Python, Node.js, and Ruby.
Supports training new vocabularies from scratch, not just inference.
Normalization preserves alignment mappings to the original text.
Bundles pre-processing steps: truncation, padding, and special-token insertion.
Claims sub-20-second tokenization for 1 GB of text on server-grade CPUs.
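As an illustration of what training a vocabulary from scratch involves, the following stdlib-only Python sketch learns byte-pair-encoding merge rules from a tiny corpus. It is a simplified textbook version of BPE, not the library's trainer: `merge_pair` and `train_bpe` are hypothetical names, words are split on whitespace, and ties between equally frequent pairs are broken lexicographically purely to keep the result deterministic.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]


def merge_pair(symbols: Tuple[str, ...], pair: Pair) -> Tuple[str, ...]:
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus: List[str], num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merge rules from whitespace-split words."""
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    words = Counter(tuple(word) for line in corpus for word in line.split())
    merges: List[Pair] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the lexicographically smallest.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        new_words: Counter = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


print(train_bpe(["low low lower", "newest newest widest"], num_merges=3))
```

Each learned merge adds one new symbol to the vocabulary, so the number of merges directly controls vocabulary size. A production trainer adds details this sketch omits, such as special tokens and faster pair counting.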

Caveats

Performance numbers vary by hardware; the benchmark quoted in the README was run on an AWS g6 instance.
The Ruby binding lives in an external repository, so its update cadence is not guaranteed by the core team.
The README teases “more to come” for language bindings, which suggests the current set is not exhaustive.

Verdict

Anyone building or fine-tuning transformer models who is tired of maintaining custom tokenization scripts should look here. If you are only doing occasional, small-scale text processing and don’t care about training vocabularies, the added abstraction is probably unnecessary.
