Multi-person mocap from phones, no suits required
MAMMA turns synchronized multi-view video into articulated SMPL-X body meshes for several people at once, using off-the-shelf cameras instead of motion-capture suits.

What it does
MAMMA is a complete pipeline that takes calibrated multi-view footage and produces per-person 3D body shapes and poses. It segments people with SAM and YOLO, detects 2D landmarks with its own MammaNet, then fits SMPL-X models across camera views. The output is an articulated mesh sequence you can inspect in an interactive 3D viewer or overlay back onto source frames.
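
The summary above does not say how MAMMA's multi-view fit is formulated. As a rough illustration only, the sketch below computes reprojection error, a quantity that multi-view fitting methods commonly minimize: project a 3D joint into each calibrated camera and measure how far it lands from that view's 2D detection. The function names, the toy camera matrices, and the joint position are all assumptions made for this example and are not taken from the MAMMA codebase.

```python
import math

def project(P, X):
    """Project a 3D point X = (x, y, z) through a 3x4 camera matrix P to pixels (u, v)."""
    xh = (X[0], X[1], X[2], 1.0)  # homogeneous coordinates
    u, v, w = (sum(row[i] * xh[i] for i in range(4)) for row in P)
    return u / w, v / w

def mean_reprojection_error(cameras, point_3d, observations):
    """Average pixel distance between the projected point and each view's 2D detection."""
    errors = []
    for P, (u_obs, v_obs) in zip(cameras, observations):
        u, v = project(P, point_3d)
        errors.append(math.hypot(u - u_obs, v - v_obs))
    return sum(errors) / len(errors)

# Two toy cameras (assumed values): focal length 1000 px, principal point (640, 360).
# cam_a sits at the world origin; cam_b is shifted 1 m along the x axis.
cam_a = [[1000, 0, 640, 0], [0, 1000, 360, 0], [0, 0, 1, 0]]
cam_b = [[1000, 0, 640, -1000], [0, 1000, 360, 0], [0, 0, 1, 0]]

joint = (0.5, 0.0, 4.0)  # hypothetical 3D joint position, in metres
detections = [project(cam_a, joint), project(cam_b, joint)]  # ideal 2D landmarks

error = mean_reprojection_error([cam_a, cam_b], joint, detections)
```

With ideal detections the error is zero; moving the joint or perturbing a detection raises it. An optimizer would adjust the body model's parameters to push this kind of error down across all joints and views at once.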
The interesting bit
The project ships as a packaged tool rather than a loose collection of scripts: it includes a browser-based GUI, preset pipeline configurations, and even an iPhone-outdoor calibration profile. That suggests the authors want people to run it on their own footage, not just reproduce a paper figure.
Key highlights
- End-to-end pipeline from raw video to SMPL-X meshes: capture, segmentation, 2D landmarks, multi-view optimization, and visualization.
- Built-in web UI (Flask + React) for submitting runs and inspecting results without touching a terminal.
- Ships with a 4-camera example dataset and preset configs for quick smoke tests versus full-frame processing.
- Training and inference code both released; datasets (dance, multi-person, iPhone, synthetic) available behind a free registration wall.
- CVPR 2026 Oral.
Caveats
- Evaluation scripts and processed benchmark data are still on the TODO list, so quantitative comparisons aren’t yet possible out of the box.
- License is strictly non-commercial research; product teams need not apply.
- You must supply camera calibration and synchronized multi-view footage; this isn’t monocular magic.
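
To make the synchronization requirement concrete, here is a minimal sanity check one might run before processing. It assumes you already have per-frame timestamps for each camera. The data layout, the function name, and the half-frame tolerance are hypothetical choices for this sketch and say nothing about MAMMA's actual input format.

```python
def max_sync_offset(timestamps_by_camera):
    """Return the largest spread (in seconds) between cameras at any shared frame index."""
    n_frames = min(len(ts) for ts in timestamps_by_camera.values())
    worst = 0.0
    for i in range(n_frames):
        frame_times = [ts[i] for ts in timestamps_by_camera.values()]
        worst = max(worst, max(frame_times) - min(frame_times))
    return worst

# Hypothetical timestamps (seconds) for three cameras recording at roughly 30 fps.
timestamps = {
    "cam0": [0.000, 0.033, 0.067],
    "cam1": [0.001, 0.034, 0.066],
    "cam2": [0.000, 0.040, 0.067],
}

offset = max_sync_offset(timestamps)
# Assumed tolerance: half a frame interval at 30 fps.
in_sync = offset < 0.5 / 30
```

If the worst offset exceeds the tolerance, the same frame index shows the subject at different moments in different views, and any cross-view fit will be working from inconsistent observations.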
Verdict
Computer-vision researchers and animators with a camera rig should take a close look. If you’re hoping to drop in a single GoPro video and get studio-grade data, this isn’t your tool.