
karpathy/minbpe

Minimal, clean implementation of Byte Pair Encoding for LLM tokenization.

10.5k stars · Python · Language Models
Velocity (7d): +13 ★ / day · Trend: steady

This repository provides minimal, clean code implementing the Byte Pair Encoding (BPE) algorithm for tokenization. It includes several tokenizer implementations (BasicTokenizer, RegexTokenizer, and GPT4Tokenizer) built around the three primary tokenizer functions: training the vocabulary and merges, encoding text to tokens, and decoding tokens back to text. BPE is the tokenization approach popularized for LLMs by GPT-2 and still used through GPT-4.
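To make the three functions concrete, here is a minimal sketch of byte-level BPE in plain Python. It is an illustration written for this summary, not code taken from the repository: the function names (`get_pair_counts`, `merge`, `train`, `encode`, `decode`) and their signatures are assumptions chosen for clarity, and the sketch omits the regex pre-splitting and special-token handling a production tokenizer would need.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, vocab_size):
    """Learn merges until the vocabulary reaches `vocab_size` (hypothetical API)."""
    assert vocab_size >= 256, "the 256 raw byte values are always in the vocabulary"
    ids = list(text.encode("utf-8"))
    merges = {}                                   # (id, id) -> new id
    vocab = {i: bytes([i]) for i in range(256)}   # id -> bytes
    for new_id in range(256, vocab_size):
        counts = get_pair_counts(ids)
        if not counts:
            break                                 # nothing left to merge
        pair = max(counts, key=counts.get)        # most frequent adjacent pair
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
    return merges, vocab


def encode(text, merges):
    """Apply learned merges to new text, earliest-learned merge first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        counts = get_pair_counts(ids)
        # Pick the pair that was learned earliest (lowest merge id).
        pair = min(counts, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break                                 # no mergeable pair remains
        ids = merge(ids, pair, merges[pair])
    return ids


def decode(ids, vocab):
    """Map token ids back to bytes, then to text."""
    data = b"".join(vocab[i] for i in ids)
    return data.decode("utf-8", errors="replace")


# Usage: three merges on a toy string, then a round trip.
text = "aaabdaaabac"
merges, vocab = train(text, vocab_size=259)
tokens = encode(text, merges)
assert decode(tokens, vocab) == text
```

Because the base vocabulary covers all 256 byte values, any UTF-8 string can be encoded and decoded losslessly, even if it contains no sequences seen during training; the learned merges only affect how compactly it is represented.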
