Is FewCLUE open source?

Yes — CLUEbenchmark/FewCLUE is an open-source project tracked on heatdrop.

What language is FewCLUE written in?

CLUEbenchmark/FewCLUE is primarily written in Python.

How popular is FewCLUE?

CLUEbenchmark/FewCLUE has 517 stars on GitHub.

Where can I find FewCLUE?

CLUEbenchmark/FewCLUE is on GitHub at https://github.com/CLUEbenchmark/FewCLUE.

CLUEbenchmark/FewCLUE

Teaching BERT Mandarin with 32 examples and a prayer

A Chinese benchmark that measures how well language models learn when training data is basically a postcard.

★517 stars Python LLMOps · Eval Language Models

What it does

FewCLUE is a Chinese few-shot learning benchmark with nine NLP tasks—from sentiment analysis to idiom reading comprehension—where models train on as few as 32 labeled examples. It ships with baselines for fine-tuning, PET, P-tuning, and zero-shot GPT evaluation, plus a public leaderboard. The project grew out of CLUE, the established Chinese NLP benchmark, but narrows the focus to the “not enough data” regime that mirrors real business constraints.

The interesting bit

The human baseline is brutal: annotators score 82.49 overall, while the best model limps to 54.34. The gap is about 28 points, and on some tasks the models barely beat random guessing. Yet zero-shot RoBERTa on a 119-class app-description task scores 27.7, just 2 points below fine-tuning with actual data. The benchmark is explicitly designed to be unstable and annoying: multiple sampled training splits, severe class imbalance, and tasks with 50+ categories that even human annotators only just pass.
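For context, the figures quoted above can be checked with a few lines of arithmetic. This is a rough sketch, not part of the FewCLUE tooling: it assumes the scores are accuracy-style percentages and that "random guessing" means picking uniformly among classes, which the class imbalance mentioned above would make only an approximation.

```python
# Figures quoted in the text above.
human_overall = 82.49
best_model_overall = 54.34
zero_shot_119_class = 27.7

# Gap between human annotators and the best model, in points.
gap = round(human_overall - best_model_overall, 2)

# Chance level for a 119-class task, ASSUMING uniform guessing over
# balanced classes (the benchmark itself is described as imbalanced).
num_classes = 119
chance_pct = round(100 / num_classes, 2)

print(f"human-model gap: {gap} points")
print(f"uniform-chance score over {num_classes} classes: {chance_pct}%")
print(f"zero-shot score relative to chance: {zero_shot_119_class / chance_pct:.1f}x")
```

Under those assumptions, the 119-class zero-shot result sits well above uniform chance even though it looks low in absolute terms.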

Key highlights

Nine tasks spanning single-sentence classification, sentence pairs, and reading comprehension, with unlabeled data provided for semi-supervised experiments
Human evaluation protocol: 30 training examples, group discussion, then 100-sample validation with majority voting
Baselines cover fine-tuning, PET, P-tuning (RoBERTa and GPT variants), ADAPET, EFL, LM-BFF, and zero-shot for both RoBERTa and Chinese GPT
Includes a live submission leaderboard and was the basis for an NLPCC 2021 shared task
One-click run scripts for TensorFlow/Keras with chinese_roberta_wwm_ext as the default backbone
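The majority-voting step in the human evaluation protocol can be sketched as follows. This is an illustrative sketch only: the function names, the tie-breaking rule, and the toy labels are assumptions, not taken from the FewCLUE repository.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def majority_vote_accuracy(annotations, gold):
    """Percentage of samples where the majority label matches the gold label.

    annotations: one list of annotator labels per sample.
    gold: one reference label per sample.
    """
    if len(annotations) != len(gold):
        raise ValueError("annotations and gold must have the same length")
    correct = sum(
        majority_vote(votes) == answer
        for votes, answer in zip(annotations, gold)
    )
    return 100.0 * correct / len(gold)


# Toy data (made up): three annotators labelling three samples.
toy_annotations = [
    ["pos", "pos", "neg"],
    ["neg", "pos", "neg"],
    ["pos", "neg", "neg"],
]
toy_gold = ["pos", "neg", "pos"]

toy_accuracy = majority_vote_accuracy(toy_annotations, toy_gold)
print(f"majority-vote accuracy: {toy_accuracy:.2f}%")
```

In the protocol described above, the same calculation would run over the 100-sample validation set instead of three toy samples.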

Caveats

Setup is dated: TensorFlow 1.14+, Keras 2.3.1, and bert4keras—not the modern PyTorch/Hugging Face stack most researchers use now
The README’s “UPDATE” section stops in July 2021; maintenance status is unclear
CHID scores were excluded from final rankings “temporarily” in 2021 and may still be excluded
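To see why excluding a task matters for a ranking, here is a minimal sketch of an overall score computed with and without one task. It assumes the overall score is an unweighted mean of per-task scores, which the source does not confirm; `overall_score`, `task_a`, `task_b`, and all the numbers are hypothetical placeholders, not real leaderboard results.

```python
from statistics import mean


def overall_score(task_scores, excluded=()):
    """Unweighted mean of per-task scores, skipping any excluded tasks.

    ASSUMPTION: a plain average; the real leaderboard formula may differ.
    """
    kept = [score for task, score in task_scores.items() if task not in excluded]
    if not kept:
        raise ValueError("no tasks left after exclusion")
    return mean(kept)


# Placeholder scores, invented purely for illustration.
placeholder_scores = {"task_a": 60.0, "task_b": 50.0, "CHID": 40.0}

with_chid = overall_score(placeholder_scores)
without_chid = overall_score(placeholder_scores, excluded={"CHID"})

print(f"overall with CHID:    {with_chid:.2f}")
print(f"overall without CHID: {without_chid:.2f}")
```

When comparing a result against published numbers, it is worth checking which set of tasks each overall score covers.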

Verdict

Worth a look if you’re doing Chinese low-resource NLP research or need a rigorous reality check on your prompt-tuning paper. Skip it if you want a plug-and-play Hugging Face dataset; this is academic infrastructure with 2021-era dependencies.
