openai/tiktoken

Counting GPT tokens without the API round-trip

A fast, standalone BPE tokenizer that lets you preview exactly how OpenAI models split your text, and estimate what a request will cost from the token count.

18.4k stars Python Other AI
Velocity (7d): +14 ★ / day · Trend: steady

What it does

tiktoken is OpenAI’s own byte-pair encoding tokenizer, packaged as a Python library. It converts text to tokens (and back) using the same vocabularies as GPT-4o, GPT-4, GPT-3.5, and earlier models — so you can count tokens locally instead of shipping text to an API just to find out how long it is.
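
As a rough illustration of counting locally, the sketch below wraps tiktoken's encoding_for_model() and encode() calls. The helper names load_encoding and count_tokens are hypothetical, and the import is deferred so the counting logic works with any object that exposes encode().

```python
from typing import Protocol


class SupportsEncode(Protocol):
    """Anything with encode(text) -> list of token ids, e.g. a tiktoken Encoding."""

    def encode(self, text: str) -> list[int]: ...


def load_encoding(model: str = "gpt-4o"):
    """Look up the tiktoken encoding for a model name.

    Requires `pip install tiktoken`. The vocabulary file is downloaded
    and cached the first time an encoding is used.
    """
    import tiktoken  # deferred: the rest of this sketch has no hard dependency

    return tiktoken.encoding_for_model(model)


def count_tokens(text: str, encoding: SupportsEncode) -> int:
    """Count tokens locally, with no API call."""
    return len(encoding.encode(text))
```

With tiktoken installed, count_tokens("hello world", load_encoding("gpt-4o")) gives the count under that model's vocabulary.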

The interesting bit

The README claims it’s 3–6× faster than a comparable open-source tokenizer (Hugging Face’s GPT2TokenizerFast), measured on 1 GB of text with the GPT-2 tokenizer. More unusually, it ships a tiktoken._educational submodule that lets you train a miniature BPE tokenizer and visualize how the real GPT-4 encoder splits words, which is useful if you’ve ever wondered why “encoding” often becomes “encod” + “ing”.
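
To make the idea of training a miniature BPE tokenizer concrete, here is a minimal pure-Python sketch. It is not tiktoken's implementation or its educational module: the function names are hypothetical, and it skips the regex pre-splitting that real encoders apply before merging.

```python
from collections import Counter


def train_toy_bpe(text: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to `num_merges` merge rules from the UTF-8 bytes of `text`."""
    # Start with one token per byte.
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges: list[tuple[bytes, bytes]] = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        if pair_counts[best] < 2:
            break  # nothing repeats, so another merge would not compress
        merges.append(best)
        tokens = apply_merge(tokens, best)
    return merges


def apply_merge(tokens: list[bytes], pair: tuple[bytes, bytes]) -> list[bytes]:
    """Replace each left-to-right occurrence of `pair` with the joined token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def encode_toy_bpe(text: str, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Tokenize by replaying the learned merges in training order."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens
```

Training on "abababab" with two merges first learns (a, b) and then (ab, ab). Frequent fragments earn their own tokens in the same way, which is how a suffix like “ing” ends up as a single unit.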

Key highlights

  • Exact model parity: encoding_for_model("gpt-4o") returns the same tokenizer the API uses
  • Extensible via namespace packages (tiktoken_ext) for custom encodings, or just instantiate Encoding directly
  • Reversible and lossless — decode your tokens back to the original text byte-for-byte
  • Works on arbitrary text, including strings unlike anything in its training data, and averages roughly 4 bytes per token
  • Educational submodule includes train_simple_encoding() and visualization helpers
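
The reversibility and bytes-per-token claims are easy to check on your own text. The sketch below assumes only an object with encode() and decode() methods, which a tiktoken Encoding provides. The helper names are hypothetical.

```python
def round_trips(text: str, encoding) -> bool:
    """True if decoding the encoded tokens reproduces `text` exactly."""
    return encoding.decode(encoding.encode(text)) == text


def bytes_per_token(text: str, encoding) -> float:
    """Average number of UTF-8 bytes represented by each token of `text`."""
    tokens = encoding.encode(text)
    if not tokens:
        raise ValueError("text produced no tokens")
    return len(text.encode("utf-8")) / len(tokens)
```

Expect the ratio to vary with language and content. Note that tiktoken's encode() raises by default when the input contains the literal text of a special token such as <|endoftext|>, so sanitize or explicitly allow those if your corpus may include them.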

Caveats

  • The performance benchmark compares tiktoken 0.2.0 against transformers 4.24.0; both projects have moved on since, so the 3–6× figure may not hold for current versions
  • The extension mechanism warns against editable installs and private-attribute access, suggesting the API is still settling

Verdict

Essential if you’re building cost estimators, prompt compressors, or anything that needs to know token counts before hitting OpenAI’s API. Skip it if you’re only using open-weight models with different vocabularies — this is tightly coupled to OpenAI’s encodings.
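
For the cost-estimator case, the arithmetic on top of a token count is small. This is a minimal sketch with hypothetical helper names. The per-million-token price is a parameter you supply from current pricing, not a value taken from tiktoken.

```python
def estimate_cost_usd(num_tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count into an estimated cost at a caller-supplied price."""
    if num_tokens < 0 or usd_per_million_tokens < 0:
        raise ValueError("token count and price must be non-negative")
    return num_tokens / 1_000_000 * usd_per_million_tokens


def fits_budget(num_tokens: int, max_tokens: int, reserved_for_output: int = 0) -> bool:
    """Check that a prompt leaves room for the reply inside a token limit."""
    return num_tokens + reserved_for_output <= max_tokens
```

Feed these the length of a locally encoded prompt to reject or trim oversized requests before they are sent.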
