A benchmark that makes few-shot learning actually prove itself
Google Research built a meta-learning stress test from ten real-world datasets, because classifying 5 images of a dog isn't a career.

What it does
Meta-Dataset is a benchmark and data pipeline for few-shot learning: training models to classify new categories from a handful of examples. It bundles ten diverse visual datasets (ImageNet, Omniglot, Aircraft, Birds, Textures, QuickDraw, Fungi, VGG Flower, Traffic Signs, MSCOCO) into a single evaluation framework built on standardized “episodes”: sampled tasks where you get, say, 5 labeled examples for each of 5 new classes and must then classify held-out test images from those same classes.
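To make the episode idea concrete, here is a minimal sketch of fixed-way, fixed-shot sampling in plain Python. It is not the repository's sampler: the function name `sample_episode`, its parameters, and the toy label list are all hypothetical, and it assumes the data is just a list of integer class labels.

```python
import random
from collections import defaultdict


def sample_episode(labels, n_way=5, k_shot=5, n_query=10, seed=None):
    """Sample one N-way, K-shot episode (hypothetical helper, not the repo's API).

    `labels` is a list of class labels, one per example.
    Returns (support, query): lists of (example_index, episode_label) pairs,
    where episode_label is relabelled to 0..n_way-1 for this episode.
    """
    rng = random.Random(seed)

    # Group example indices by their original class.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Keep only classes with enough examples for both support and query sets.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= k_shot + n_query]
    if len(eligible) < n_way:
        raise ValueError("not enough classes with sufficient examples")

    support, query = [], []
    for episode_label, cls in enumerate(rng.sample(eligible, n_way)):
        # Draw without replacement so support and query never overlap.
        picked = rng.sample(by_class[cls], k_shot + n_query)
        support += [(i, episode_label) for i in picked[:k_shot]]
        query += [(i, episode_label) for i in picked[k_shot:]]
    return support, query


# Toy data: 20 classes with 30 examples each.
toy_labels = [c for c in range(20) for _ in range(30)]
support, query = sample_episode(toy_labels, seed=0)
print(len(support), len(query))  # 25 50
```

A model is then fit (or adapted) on `support` and scored on `query`; repeating this over many episodes gives the per-dataset accuracy that a benchmark like this reports.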
The repository includes the full data conversion pipeline, training scripts, and reference implementations of several meta-learning baselines (MAML, Prototypical Networks, Matching Networks), plus two follow-up methods: CrossTransformers (a spatially aware Transformer, SOTA for ImageNet-only training as of NeurIPS 2020) and FLUTE (a “universal template” approach built on FiLM parameters, SOTA for training on all datasets as of ICML 2021).
The interesting bit
Most few-shot benchmarks use a single dataset with held-out classes. Meta-Dataset forces models to generalize across datasets: a model trained on natural images must also handle sketches, textures, or traffic signs. The leaderboard shows how hard this is, with even strong methods collapsing on out-of-distribution datasets. The project also tracks a subtle bug (#54) in which Traffic Sign evaluation needed shuffled samples, which suggests the maintainers actually care about measurement integrity.
Key highlights
- TFDS-based input pipeline released for both original (MD-v1) and updated VTAB+MD (MD-v2) protocols
- Pre-trained checkpoints available for CrossTransformers (three variants) and FLUTE
- Leaderboard with confidence intervals and per-dataset breakdowns, not just aggregate scores
- Includes an introductory Jupyter notebook demonstrating episode sampling
- Code and configs preserved for arXiv v1; v2 reproduction in active development on the arxiv_v2_dev branch
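For readers wondering what a confidence interval over episodes looks like in practice, the following is a rough sketch that turns a list of per-episode accuracies into a mean and a 95% half-width. The normal approximation (1.96 standard errors) is an assumption made here for illustration, not a confirmed description of how the leaderboard computes its intervals, and `mean_and_ci95` is a hypothetical helper name.

```python
import math
import statistics


def mean_and_ci95(episode_accuracies):
    """Return (mean, half_width) of a 95% interval over per-episode accuracies.

    Assumes a normal approximation: half_width = 1.96 * sample_std / sqrt(n).
    """
    n = len(episode_accuracies)
    if n < 2:
        raise ValueError("need at least two episodes to estimate a spread")
    mean = statistics.fmean(episode_accuracies)
    # statistics.stdev uses the sample (n - 1) estimator.
    stderr = statistics.stdev(episode_accuracies) / math.sqrt(n)
    return mean, 1.96 * stderr


# Made-up accuracies from six evaluation episodes on one dataset.
accs = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69]
mean, half_width = mean_and_ci95(accs)
print(f"{mean:.3f} +/- {half_width:.3f}")
```

Reporting the interval per dataset, rather than one pooled number, is what lets a reader tell a real gain on one domain from episode-to-episode noise.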
Caveats
- Not an officially supported Google product; maintenance appears research-driven
- Instructions for reproducing arXiv v2 results are still in progress
- Heavy TensorFlow/TFDS dependency; PyTorch users are on their own for porting
Verdict
Worth your time if you’re doing meta-learning research and need a rigorous benchmark that punishes dataset overfitting. Skip it if you want plug-and-play few-shot learning for a product — this is a measurement tool, not a library.