Motion capture from a webcam, no studio required
A pure C++ pipeline that turns monocular video into multi-person BVH files ready for Blender.

What it does
SAM3DBody-cpp ingests a single camera feed—webcam, video file, or image—and spits out per-person 3D body poses, 70 keypoints with hands, and optionally full mesh vertices. The standout output is standard BVH motion-capture files per detected person, identity-tracked across frames, with bone lengths auto-fitted to each actor. Drop p_0.bvh into Blender and animate a MakeHuman rig without touching a mocap suit.
The interesting bit
This is not 2D-to-3D lifting. A DINOv2-ViT-H backbone extracts features, a transformer decoder compresses them into a pose token, and tiny ggml FFN heads regress 519 body model parameters directly—no depth sensor, no stereo, no Python at runtime. The network learns metric depth from body proportions alone, same trick SMPL-family models use, but the entire inference engine is standalone C++ with ONNX Runtime and a native C linear blend skinning implementation.
Key highlights
- Zero Python dependency at runtime; Python frontends are thin ctypes wrappers around a compiled shared library
- Multi-person BVH export with IoU-based identity tracking and auto-resized joint offsets per actor
- ~4.8 GB DINOv3-ViT-H/14+ encoder (BF16, CUDA EP) plus YOLO11m-pose detector, 6-layer decoder, and ggml heads
- CPU fallback exists via float32 backbone, but a single forward pass takes 5–15 seconds on a modern laptop—impractical for video
- Bundled Blender plugin and CSV exporter for the 70 MHR keypoints
Caveats
- The standard backbone requires CUDA; CPU inference is technically possible but glacial for real-time use
body_model.onnxexport is blocked on PyTorch ≥ 2.x, so the native C LBS path is currently required for mesh output- CMake will warn if the ~5 GB of model files haven’t been downloaded from HuggingFace
Verdict
Worth a look if you need offline or real-time monocular mocap without Python infrastructure headaches. Skip it if you’re CPU-only and need speed, or if you require floor-plane recovery (that’s downstream).