Korean NLP finally gets its GLUE
A proper benchmark for Korean language models, because comparing models on vibes wasn't cutting it anymore.

What it does
KLUE (Korean Language Understanding Evaluation) is an 8-task benchmark for Korean NLP: think GLUE or SuperGLUE, but for a language that existing benchmarks mostly ignore. It ships with datasets, evaluation metrics, fine-tuning recipes, and two pretrained model families (KLUE-BERT and KLUE-RoBERTa) so you can actually reproduce baselines instead of guessing.
The interesting bit
The project was built with unusual care for a benchmark: explicit design principles around accessibility, annotation quality, and even AI ethics. The baseline table is refreshingly honest. KLUE’s own models don’t sweep every category, and you can see exactly where XLM-R-large still wins or where KoELECTRA edges them out.
Key highlights
- 8 tasks covering topic classification, semantic textual similarity, natural language inference, NER, relation extraction, dependency parsing, reading comprehension, and dialogue state tracking
- Four pretrained checkpoints on Hugging Face Hub: one BERT plus RoBERTa in three sizes (including a deliberately small RoBERTa for resource-constrained work)
- CC BY-SA 4.0 license—actually open, not “open” with a 47-page clickthrough
- Active leaderboard with submission guidelines
- Backed by a small consortium of Korean industry and academia (Upstage, NAVER, KAIST, NYU, etc.)
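For a quick start, here is a minimal Python sketch of pulling one task and one checkpoint from the Hub. It assumes the eight tasks are mirrored as the `klue` dataset on Hugging Face under the config names shown, and that the checkpoints live under the `klue/` namespace. The helper names `load_task` and `load_encoder` are hypothetical, and none of this is taken from the README itself.

```python
# Sketch only: the Hub IDs and config names below are assumptions based on the
# public Hugging Face Hub, not something stated in the KLUE README.

# Assumed `datasets` config name -> task, in the order the highlights list them.
KLUE_TASKS = {
    "ynat": "topic classification",
    "sts": "semantic textual similarity",
    "nli": "natural language inference",
    "ner": "named entity recognition",
    "re": "relation extraction",
    "dp": "dependency parsing",
    "mrc": "machine reading comprehension",
    "wos": "dialogue state tracking",
}

# Assumed Hub IDs for the pretrained checkpoints.
KLUE_CHECKPOINTS = (
    "klue/bert-base",
    "klue/roberta-small",
    "klue/roberta-base",
    "klue/roberta-large",
)


def load_task(config):
    """Download one KLUE task (needs the `datasets` package and network access)."""
    if config not in KLUE_TASKS:
        raise ValueError(f"unknown KLUE config {config!r}; expected one of {sorted(KLUE_TASKS)}")
    # Imported lazily so the lookup tables above work without the dependency.
    from datasets import load_dataset
    return load_dataset("klue", config)


def load_encoder(checkpoint="klue/roberta-small"):
    """Download a tokenizer and encoder (needs `transformers`, `torch`, and network access)."""
    if checkpoint not in KLUE_CHECKPOINTS:
        raise ValueError(f"unknown checkpoint {checkpoint!r}; expected one of {KLUE_CHECKPOINTS}")
    from transformers import AutoModel, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    return tokenizer, model
```

Nothing touches the network until you call `load_task("ynat")` or `load_encoder()`, so the two lookup tables are safe to import anywhere. The default is the small RoBERTa because it is the cheapest one to try first.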
Caveats
- The README is sparse on dataset construction details; the paper is the real source of truth
- No code visible in the repo itself—this appears to be a documentation and results hub, not an implementation
Verdict
Worth bookmarking if you work on Korean NLP or need to evaluate multilingual models fairly on Korean. Skip it if you’re looking for novel architectures or training code—this is infrastructure, not invention.