Stream's toolkit for agents that actually watch the video feed
A Python framework for wiring real-time video, voice, and vision models into interactive agents, with a claimed sub-30ms audio/video latency.

What it does
Vision Agents is Stream’s open-source Python kit for building multi-modal AI agents that process live video, audio, and text in real time. It handles WebRTC streaming, pluggable computer vision pipelines (YOLO, Roboflow, custom PyTorch/ONNX), speech-to-text and text-to-speech, turn detection, tool calling via MCP, and phone integration through Twilio. The stack includes a built-in HTTP server, Prometheus metrics, and Kubernetes deployment configs.
The interesting bit
The framework treats video as a first-class pipeline stage, not an afterthought. You can run pose detection or object segmentation on frames ahead of the LLM call, then feed those annotations into Gemini Live or OpenAI Realtime as structured context. The “text back-channel” is a nice touch: silent coaching messages injected mid-call without interrupting the voice flow.
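To make the "annotations as structured context" idea concrete, here is a minimal, library-free sketch of the pattern. It does not use the Vision Agents API: `Detection`, `stub_detector`, and `build_context` are hypothetical names invented for illustration, and the stub stands in for a real model such as YOLO. The JSON shape is likewise an assumption, not the format the framework sends to the LLM.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Detection:
    """One object found in a frame (hypothetical structure)."""
    label: str
    confidence: float
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


# A frame is modelled as a nested list of pixel values to keep the sketch dependency-free.
Frame = List[List[int]]
Detector = Callable[[Frame], List[Detection]]


def stub_detector(frame: Frame) -> List[Detection]:
    """Hypothetical stand-in for a real vision model; returns fixed detections."""
    return [
        Detection("person", 0.91, (10, 20, 110, 220)),
        Detection("ball", 0.42, (50, 60, 70, 80)),
    ]


def build_context(
    frame_index: int,
    frame: Frame,
    detector: Detector,
    min_confidence: float = 0.5,
) -> str:
    """Run the detector first, then serialize confident detections as JSON
    that could be attached to an LLM request as structured context."""
    detections = [d for d in detector(frame) if d.confidence >= min_confidence]
    payload = {
        "frame": frame_index,
        "objects": [
            {"label": d.label, "confidence": d.confidence, "box": list(d.box)}
            for d in detections
        ],
    }
    return json.dumps(payload)


# The vision stage runs before any LLM call; only its summary travels onward.
context = build_context(0, [[0]], stub_detector)
print(context)
```

The point of the design is that the LLM receives a compact description of the frame rather than raw pixels, and low-confidence detections are dropped before they can mislead it.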
Key highlights
- Native SDK wrappers for OpenAI (create response), Gemini (generate), and Claude (create message), so there is no abstraction lag on new model features
- Claims 500ms join time and <30ms A/V latency over Stream’s edge network; also works with other video infrastructure
- 333,000 free participant-minutes/month via Stream’s Maker Program
- Out-of-the-box integrations span 6+ LLM providers, 5+ realtime APIs, 6+ STT engines, 7+ TTS voices, and vision models including Ultralytics, Moondream, and NVIDIA Cosmos
- Cross-platform client SDKs: React, Android, iOS, Flutter, React Native, Unity
Caveats
- The “works with any video edge network” claim is stated but thinly documented; most examples assume Stream’s infrastructure
- Heavy integration surface means version drift across providers is a real maintenance risk
Verdict
Worth a look if you’re building real-time coaching, moderation, or interactive video experiences and want to avoid gluing WebRTC, CV pipelines, and LLM APIs from scratch. Skip it if your use case is batch video analysis or you need to avoid vendor-specific infrastructure entirely.