karpathy/minbpe
Minimal, clean implementation of Byte Pair Encoding for LLM tokenization.

Velocity (7d): +13 ★/day · Trend: steady
This repository provides minimal, clean code for the Byte Pair Encoding (BPE) algorithm used in LLM tokenization. It includes three tokenizer implementations (BasicTokenizer, RegexTokenizer, and GPT4Tokenizer) that handle the three primary tokenizer operations: training the vocabulary and merges, encoding text into tokens, and decoding tokens back into text. This is the tokenization approach popularized for LLMs by GPT-2 and still in use through GPT-4.
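To make the three operations concrete, here is a minimal sketch of byte-level BPE in plain Python. It is not the repository's code: the function names (`pair_counts`, `merge`, `train`, `encode`, `decode`) are hypothetical, chosen for illustration, and the sketch assumes the simplest variant, with no regex pre-splitting and no special tokens.

```python
from collections import Counter


def pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, vocab_size):
    """Learn merges from `text`; ids 0..255 are the raw bytes."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new id, kept in training order
    for new_id in range(256, vocab_size):
        counts = pair_counts(ids)
        if not counts:
            break  # nothing left to merge
        pair = counts.most_common(1)[0][0]
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """Apply learned merges to new text, earliest-learned merge first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        # Among the pairs present, pick the one learned earliest.
        pair = min(set(zip(ids, ids[1:])),
                   key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no learned merge applies any more
        ids = merge(ids, pair, merges[pair])
    return ids


def decode(ids, merges):
    """Map token ids back to bytes, then to text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # training order
        vocab[new_id] = vocab[a] + vocab[b]
    data = b"".join(vocab[i] for i in ids)
    return data.decode("utf-8", errors="replace")
```

A quick round trip under these assumptions: `merges = train("aaabdaaabac", 259)`, then `decode(encode("aaabdaaabac", merges), merges)` returns the original string, and the encoded sequence is shorter than the raw byte sequence. Keeping `merges` in training order is what lets `decode` rebuild each merged token from tokens that already exist in the vocabulary.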