Teaching BERT Mandarin with 32 examples and a prayer
A Chinese benchmark that measures how well language models learn when training data is basically a postcard.

What it does
FewCLUE is a Chinese few-shot learning benchmark with nine NLP tasks—from sentiment analysis to idiom reading comprehension—where models train on as few as 32 labeled examples. It ships with baselines for fine-tuning, PET, P-tuning, and zero-shot GPT evaluation, plus a public leaderboard. The project grew out of CLUE, the established Chinese NLP benchmark, but narrows the focus to the “not enough data” regime that mirrors real business constraints.
The interesting bit
The human baseline is brutal: annotators score 82.49 overall, while the best model limps to 54.34. That is a gap of about 28 points, and on some tasks the models barely beat random guessing. Yet zero-shot RoBERTa on a 119-class app-description task scores 27.7, just 2 points below fine-tuning with actual data. The benchmark is explicitly designed to be unstable and annoying: multiple sampled training splits, severe class imbalance, and tasks with 50+ categories that even humans only just pass.
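To see why multiple sampled training splits make results unstable, here is a minimal sketch of drawing several 32-example training sets from an imbalanced pool. It is not FewCLUE's actual sampling code: the function name `sample_few_shot_splits`, the toy pool, and the 3:1 label ratio are all assumptions made for illustration.

```python
import random
from collections import Counter


def sample_few_shot_splits(pool, k=32, n_splits=5, seed=0):
    """Draw n_splits independent k-example training sets from a labeled pool.

    Hypothetical helper: plain random sampling without replacement,
    seeded so the same splits come back on every call.
    """
    rng = random.Random(seed)
    return [rng.sample(pool, k) for _ in range(n_splits)]


# Toy imbalanced pool of (text, label) pairs: one "pos" for every three "neg".
pool = [(f"example {i}", "pos" if i % 4 == 0 else "neg") for i in range(400)]

splits = sample_few_shot_splits(pool)
for i, split in enumerate(splits):
    # Label counts differ from split to split, which is one source of
    # run-to-run variance when only 32 examples are available.
    print(i, dict(Counter(label for _, label in split)))
```

With so few examples, a rare class can be nearly absent from one split and well represented in another, so reporting results across several splits is more informative than a single run.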
Key highlights
- Nine tasks spanning single-sentence classification, sentence pairs, and reading comprehension, with unlabeled data provided for semi-supervised experiments
- Human evaluation protocol: 30 training examples, group discussion, then 100-sample validation with majority voting
- Baselines cover fine-tuning, PET, P-tuning (RoBERTa and GPT variants), ADAPET, EFL, LM-BFF, and zero-shot for both RoBERTa and Chinese GPT
- Includes a live submission leaderboard and was the basis for an NLPCC 2021 shared task
- One-click run scripts for TensorFlow/Keras with chinese_roberta_wwm_ext as the default backbone
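The majority-voting step of the human evaluation protocol can be sketched roughly as follows. This is an illustrative assumption about how votes might be aggregated, not the project's own script: `majority_vote`, `human_accuracy`, and the toy annotations are hypothetical, and ties are broken by whichever label appears first.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def human_accuracy(annotations, gold):
    """Accuracy of majority-voted annotator labels against gold labels.

    annotations: one list of annotator votes per validation sample.
    gold: the reference label for each sample, in the same order.
    """
    voted = [majority_vote(votes) for votes in annotations]
    correct = sum(v == g for v, g in zip(voted, gold))
    return correct / len(gold)


# Toy data: three annotators, three validation samples.
annotations = [["a", "a", "b"], ["b", "b", "a"], ["a", "b", "b"]]
gold = ["a", "b", "a"]
score = human_accuracy(annotations, gold)
```

In the protocol described above, the same computation would run over the 100 validation samples per task.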
Caveats
- Setup is dated: TensorFlow 1.14+, Keras 2.3.1, and bert4keras, not the modern PyTorch/Hugging Face stack most researchers use now
- The README’s “UPDATE” section stops in July 2021; maintenance status is unclear
- CHID scores were excluded from final rankings “temporarily” in 2021 and may still be excluded
Verdict
Worth a look if you’re doing Chinese low-resource NLP research or need a rigorous reality check on your prompt-tuning paper. Skip it if you want a plug-and-play Hugging Face dataset; this is academic infrastructure with 2021-era dependencies.