Counting GPT tokens without the API round-trip
A fast, standalone BPE tokenizer that lets you preview exactly how OpenAI models see your text — and how much it'll cost.
What it does
tiktoken is OpenAI’s own byte-pair encoding tokenizer, packaged as a Python library. It converts text to tokens (and back) using the same vocabularies as GPT-4o, GPT-4, GPT-3.5, and earlier models — so you can count tokens locally instead of shipping text to an API just to find out how long it is.
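A minimal sketch of local counting, assuming tiktoken is installed. The helper names count_tokens and load_encode are illustrative and not part of the library; tiktoken.encoding_for_model and Encoding.encode are the real calls. The loader may need to download the vocabulary file the first time it runs.

```python
from typing import Callable, Sequence


def count_tokens(text: str, encode: Callable[[str], Sequence[int]]) -> int:
    """Return how many tokens `encode` produces for `text`."""
    return len(encode(text))


def load_encode(model: str = "gpt-4o") -> Callable[[str], Sequence[int]]:
    """Return the encode method of the tiktoken encoding for `model`.

    The import is lazy so count_tokens stays usable (and testable)
    in environments where tiktoken is not installed.
    """
    import tiktoken  # third-party: pip install tiktoken

    return tiktoken.encoding_for_model(model).encode
```

Typical use would be count_tokens(prompt, load_encode("gpt-4o")). Passing the encoder in as a callable also makes it easy to swap in a stub for tests.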
The interesting bit
The README claims it’s 3–6× faster than comparable open-source tokenizers, measured on 1 GB of text with GPT-2’s vocabulary. More unusually, it ships an _educational submodule that lets you train a miniature BPE tokenizer and visualize how the real GPT-4 encoder splits words — useful if you’ve ever wondered why “encoding” becomes “encod” + “ing”.
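To make the training idea concrete, here is a small byte-level BPE sketch in plain Python. It is an independent illustration, not the code from tiktoken's _educational module: the function names are hypothetical, and it skips the regex pre-splitting step that real encoders apply before merging.

```python
from collections import Counter
from typing import List, Tuple

Merge = Tuple[bytes, bytes]


def apply_merge(parts: List[bytes], left: bytes, right: bytes) -> List[bytes]:
    """Replace each adjacent (left, right) pair with left + right, scanning left to right."""
    merged = []
    i = 0
    while i < len(parts):
        if i + 1 < len(parts) and parts[i] == left and parts[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(parts[i])
            i += 1
    return merged


def train_bpe(text: str, num_merges: int) -> List[Merge]:
    """Learn up to `num_merges` merge rules, most frequent adjacent pair first."""
    # Start from single bytes so any input string is representable.
    parts = [bytes([b]) for b in text.encode("utf-8")]
    merges: List[Merge] = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(parts, parts[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so there is nothing worth merging
        merges.append((left, right))
        parts = apply_merge(parts, left, right)
    return merges


def encode(text: str, merges: List[Merge]) -> List[bytes]:
    """Split `text` into byte chunks by replaying the merges in learned order."""
    parts = [bytes([b]) for b in text.encode("utf-8")]
    for left, right in merges:
        parts = apply_merge(parts, left, right)
    return parts
```

Joining the chunks returned by encode gives back the original UTF-8 bytes, which is the lossless property in miniature. Frequent fragments such as "ing" end up as single chunks once enough merges have been learned from suitable training text.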
Key highlights
- Exact model parity: encoding_for_model("gpt-4o") returns the same tokenizer the API uses
- Extensible via namespace packages (tiktoken_ext) for custom encodings, or just instantiate Encoding directly
- Reversible and lossless: decode your tokens back to the original text byte-for-byte
- Handles arbitrary text, even unseen strings, with roughly 4 bytes per token on average
- Educational submodule includes train_simple_encoding() and visualization helpers
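The 4-bytes-per-token average also gives a quick back-of-the-envelope estimate when the tokenizer is not at hand. The sketch below is only a heuristic under that assumption; the function name and constant are illustrative, and real counts vary with language and content, so an exact figure still requires running the encoder.

```python
import math

# Rough average quoted in the highlights; not a guarantee for any given text.
AVG_BYTES_PER_TOKEN = 4


def rough_token_estimate(text: str, bytes_per_token: float = AVG_BYTES_PER_TOKEN) -> int:
    """Estimate the token count from the UTF-8 byte length, rounding up."""
    n_bytes = len(text.encode("utf-8"))
    return math.ceil(n_bytes / bytes_per_token)
```

Measuring bytes and not characters matters here: non-ASCII text takes more than one byte per character, so it tends to use more tokens than its character count suggests.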
Caveats
- The performance benchmark dates from tiktoken 0.2.0 and transformers 4.24.0; both libraries have moved on since, so the 3–6× figure may not hold for current releases
- The extension mechanism warns against editable installs and private-attribute access, suggesting the API is still settling
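Since the published numbers are dated, it may be worth re-measuring on the versions you actually run. The helper below is a hypothetical timing sketch, not an official benchmark harness: it accepts any encode callable (for example, the encode method of a tiktoken Encoding or of another tokenizer) and reports best-of-N throughput.

```python
import time
from typing import Callable


def throughput_mb_per_s(encode: Callable[[str], object], text: str, repeats: int = 3) -> float:
    """Best-of-`repeats` encoding throughput in MB (10**6 bytes of UTF-8) per second."""
    n_bytes = len(text.encode("utf-8"))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encode(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very small inputs or coarse timers.
    return n_bytes / max(best, 1e-9) / 1e6
```

Comparing two tokenizers fairly would also mean using the same vocabulary, the same input, and a corpus large enough that startup costs do not dominate.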
Verdict
Essential if you’re building cost estimators, prompt compressors, or anything that needs to know token counts before hitting OpenAI’s API. Skip it if you’re only using open-weight models with different vocabularies — this is tightly coupled to OpenAI’s encodings.
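For the cost-estimator case, the arithmetic on top of a token count is simple. This sketch assumes per-million-token pricing; the function name is illustrative and the rate is a parameter you supply, since actual prices vary by model and change over time.

```python
def estimate_cost_usd(token_count: int, usd_per_million_tokens: float) -> float:
    """Convert a token count into an estimated cost at the given rate."""
    if token_count < 0:
        raise ValueError("token_count must be non-negative")
    return token_count * usd_per_million_tokens / 1_000_000
```

Input and output tokens are often billed at different rates, so a fuller estimator would call this once per direction and add the results.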