An audio labeler that isn't CVAT, but is built on it
Audino wraps CVAT's backend with a React frontend to make speech annotation slightly less painful.

What it does
Audino v2.0 is a browser-based tool for annotating audio: think transcription, speaker diarization, voice activity detection, and emotion tagging. It packages a React frontend around a CVAT-derived backend, deploys via Docker Compose, and exports data in formats meant to play nice with downstream ML pipelines.
The interesting bit
The project is essentially a specialized skin over CVAT’s infrastructure. The README’s development guide has you installing CVAT dependencies, running cvat_server, and browsing CVAT docs — which is either pragmatic reuse or an admission that building annotation UIs from scratch is a slog. The emoji support in labels is a small but humanizing touch in an otherwise utilitarian space.
Key highlights
- Docker-first deployment; local dev requires Ubuntu 22.04/20.04, Python 3.10+, Node 20, and patience
- User-level project/task/job hierarchy with role-based access (superuser setup required out of the box)
- Multi-language and emoji-capable labels
- Sponsored by Human Protocol, which uses it as an annotation service layer
- CC BY-NC 4.0 license — commercial use needs a conversation
Caveats
- v2.0 is “actively under development” and the migration from the original Audino is incomplete
- New users register with zero permissions by default; admin intervention is required before anyone can view tasks
- The README’s feature list is vague on which export formats are actually supported
Verdict
Worth a look if you need a self-hosted audio annotation layer and already tolerate CVAT’s complexity. Skip it if you want something that works out of the box for non-technical annotators, or if the non-commercial license is a dealbreaker.