A test kitchen for teaching machines to read faces, voices, and text
MMSA corrals 18 sentiment-analysis models into one pip-installable framework so you can stop rewriting boilerplate and start arguing about which fusion architecture actually matters.

What it does
MMSA is a Python framework that trains and benchmarks multimodal sentiment-analysis models. You feed it video clips (or pre-extracted features), and it handles the plumbing for 18 different architectures—everything from 2017’s Tensor Fusion Network to 2023’s ALMT. It supports three datasets out of the box: MOSI, MOSEI, and the Chinese CH-SIMS. You can run it via a one-liner Python API, a command-line tool, or by cloning and hacking the source directly.
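As a rough sketch of the Python route: the project README shows an MMSA_run entry point that takes a model name, a dataset name, a list of seeds, and GPU ids. The plan_runs and run_all helpers below are hypothetical wrappers, the lowercase model and dataset names are assumptions, and the exact keyword arguments should be checked against the installed version.

```python
from itertools import product


def plan_runs(models, datasets, seeds):
    """Build one job description per (model, dataset) pair."""
    return [
        {"model_name": m, "dataset_name": d, "seeds": list(seeds)}
        for m, d in product(models, datasets)
    ]


def run_all(jobs, gpu_ids=(0,)):
    """Run every planned job through MMSA (needs MMSA installed)."""
    # Imported lazily so the planning step works without MMSA present.
    from MMSA import MMSA_run

    for job in jobs:
        # Assumption: positional (model, dataset) plus seeds/gpu_ids keywords,
        # as in the README; verify against your installed version.
        MMSA_run(job["model_name"], job["dataset_name"],
                 seeds=job["seeds"], gpu_ids=list(gpu_ids))


# Assumed identifiers for TFN and MulT on MOSI.
jobs = plan_runs(["tfn", "mult"], ["mosi"], seeds=[1111, 1112, 1113])
print(len(jobs), "jobs planned")
```

Calling run_all(jobs) would then train each model in turn; keeping the plan separate from the run makes it easy to review a sweep before spending GPU time on it.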
The interesting bit
The real value isn’t any single model; it’s the standardization. MMSA forces every architecture into the same feature format and evaluation loop, so you can compare TFN against a transformer-based MulT without debugging six different data loaders. The authors even ship SHA-256 checksums for the pre-extracted feature files, which is the kind of rigor you rarely see in academic code releases.
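Checking a download against a published SHA-256 digest takes only a few lines of standard-library Python. This is a minimal sketch, not part of MMSA; the file name and the "published" digest here are stand-ins created for the demo.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large feature files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True when the file's digest matches the published one."""
    return sha256_of(path) == expected_hex.strip().lower()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "features.pkl"  # hypothetical file name
    payload = b"stand-in for a downloaded feature file"
    sample.write_bytes(payload)
    # Stands in for the checksum the authors publish alongside the file.
    published = hashlib.sha256(payload).hexdigest()
    print(verify(sample, published))  # True
```

In practice you would paste the digest from the project's download page as expected_hex and point path at the real file.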
Key highlights
- 18 models supported, split cleanly between single-task and multi-task variants (including several from the authors’ own ACL/AAAI papers)
- Three datasets with pre-extracted BERT text, audio, and vision features, downloadable from Baidu or Google Drive
- pip install MMSA and go; or clone, edit, and reinstall locally
- Companion toolkit MMSA-FET for extracting custom multimodal features if you want to move beyond the provided pickles
- Version 2.0 is PyPI-packaged; a v_1.0 branch remains for those who preferred the old layout
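Before training on a feature pickle, it helps to look at what is inside. The sketch below walks any nested dictionary and reports lengths or array shapes; it makes no assumption about MMSA's actual key names, and the keys in the demo data are hypothetical. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.

```python
import pickle


def describe(value):
    """One-line description of a leaf or container."""
    shape = getattr(value, "shape", None)  # NumPy arrays expose .shape
    if shape is not None:
        return f"{type(value).__name__} shape={tuple(shape)}"
    if hasattr(value, "__len__"):
        return f"{type(value).__name__} len={len(value)}"
    return type(value).__name__


def summarize(obj, depth=0, max_depth=3):
    """Return indented lines describing a nested dict of features."""
    lines = []
    if isinstance(obj, dict) and depth < max_depth:
        pad = "  " * depth
        for key, value in obj.items():
            lines.append(f"{pad}{key}: {describe(value)}")
            lines.extend(summarize(value, depth + 1, max_depth))
    return lines


# Hypothetical layout, round-tripped through pickle to mimic a feature file.
fake = {"train": {"text": [[0.1, 0.2]] * 4, "labels": [0.5, -1.0, 0.0, 2.0]}}
loaded = pickle.loads(pickle.dumps(fake))
print("\n".join(summarize(loaded)))
```

For a real file, replace the round trip with pickle.load on an opened file handle and read the printed outline to learn the splits and modalities it contains.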
Caveats
- BBFN is marked “Work in Progress” in the model table
- The README notes classification labels are deprecated as of v2.0; regression labels are the path forward, though this isn’t explained in detail
- Re-installing after local edits requires an explicit pip uninstall cycle, which feels clunky
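On the move to regression labels: a common convention in this literature is to score continuous predictions with mean absolute error and Pearson correlation, and to derive binary accuracy by thresholding both prediction and label. The sketch below illustrates that convention only; it is not MMSA's evaluation code, and the choice to count a score of exactly zero as positive is an assumption that differs between papers.

```python
import math


def mae(preds, labels):
    """Mean absolute error between predicted and true sentiment scores."""
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)


def pearson(preds, labels):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(labels)
    mp, my = sum(preds) / n, sum(labels) / n
    cov = sum((p - mp) * (y - my) for p, y in zip(preds, labels))
    sp = math.sqrt(sum((p - mp) ** 2 for p in preds))
    sy = math.sqrt(sum((y - my) ** 2 for y in labels))
    return cov / (sp * sy)


def binary_accuracy(preds, labels, threshold=0.0):
    """Share of samples where prediction and label fall on the same side."""
    hits = sum((p >= threshold) == (y >= threshold)
               for p, y in zip(preds, labels))
    return hits / len(labels)


preds = [1.0, -0.5, 0.5, -2.0]   # made-up model outputs
labels = [2.0, -1.0, -0.5, -1.0]  # made-up regression labels
print(mae(preds, labels))              # 0.875
print(binary_accuracy(preds, labels))  # 0.75
print(round(pearson(preds, labels), 3))
```

Because the class is derived from the score, a single regression head can report both kinds of metric, which may be part of why separate classification labels became redundant.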
Verdict
Grab this if you’re doing research in multimodal sentiment analysis and need a sane baseline to beat. Skip it if you’re looking for end-to-end video processing—MMSA expects pre-extracted features, not raw pixels and waveforms.