One benchmark to fuse them all: 10 modalities, 20 tasks, 1 codebase
MultiBench tries to stop every multimodal paper from reinventing its own data loader and calling it a contribution.

What it does
MultiBench is a standardized benchmarking suite and code framework for multimodal deep learning. It bundles 15 datasets across affective computing, healthcare, robotics, finance, and HCI into a single automated pipeline with consistent data loading, training, and evaluation. The companion “MultiZoo” module provides 20 implemented algorithms—unimodal baselines, fusion paradigms (early, late, tensor-based), and training structures—designed to be composed and extended.
The interesting bit
The project explicitly targets three problems the README claims are understudied: generalization across domains, training/inference complexity, and robustness to missing or noisy modalities. Most benchmark papers measure accuracy and stop; MultiBench at least attempts to measure the boring stuff that actually matters in production.
Key highlights
- 15 datasets, 10 modalities (video, audio, text, time-series, tactile, etc.), 20 prediction tasks
- Modular algorithm zoo: swap fusion methods, objective functions, or training structures without rewriting scaffolding
- Automated pipeline standardizes preprocessing, experimental setup, and evaluation
- Extensible: add a dataset by writing one
get_dataloaderfunction; add an algorithm by dropping a module inunimodals/,fusions/,objective_functions/, ortraining_structures/ - Published at NeurIPS 2021 Datasets and Benchmarks; companion software paper in JMLR 2022
Caveats
- Several datasets require manual credentialing or Google Drive downloads (MIMIC, MuJoCo Push) and the README notes automatic downloads “may fail for various reasons”
- The “robustness to noisy and missing modalities” is claimed as a design goal; the actual robustness evaluation details are not visible in the truncated README
Verdict
Worth a look if you’re doing multimodal research and tired of writing boilerplate. Skip it if you only care about one modality—this is explicitly built for cross-domain comparison, not single-task speed.