GetStream/Vision-Agents

Stream's toolkit for agents that actually watch the video feed

A Python framework for wiring real-time video, voice, and vision models into interactive agents, with a claimed audio/video latency under 30 ms.

★ 8k stars · Python · Agents · Image · Video · Audio

What it does

Vision Agents is Stream’s open-source Python kit for building multi-modal AI agents that process live video, audio, and text in real time. It handles WebRTC streaming, pluggable computer vision pipelines (YOLO, Roboflow, custom PyTorch/ONNX), speech-to-text and text-to-speech, turn detection, tool calling via MCP, and phone integration through Twilio. The stack includes a built-in HTTP server, Prometheus metrics, and Kubernetes deployment configs.

The interesting bit

The framework treats video as a first-class pipeline stage, not an afterthought. You can run pose detection or object segmentation on frames ahead of the LLM call, then feed the resulting annotations into Gemini Live or OpenAI Realtime as structured context. The “text back-channel” is a nice touch: it injects silent coaching messages mid-call without interrupting the voice flow.
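
To make the "annotations as structured context" idea concrete, here is a minimal sketch in plain Python. It does not use the Vision Agents API: `Detection`, `fake_pose_detector`, `annotate`, and `build_context` are hypothetical names invented for illustration, and the detector is a stand-in that returns a fixed result rather than running a real model.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, List, Tuple


@dataclass
class Detection:
    label: str
    confidence: float
    box: Tuple[int, int, int, int]  # (x, y, width, height) in pixels


# Hypothetical interface: a processor maps one raw frame to detections.
Processor = Callable[[bytes], List[Detection]]


def fake_pose_detector(frame: bytes) -> List[Detection]:
    """Stand-in for a real vision model; always reports one person."""
    return [Detection("person", 0.93, (40, 20, 200, 380))]


def annotate(frame: bytes, processors: List[Processor]) -> List[Detection]:
    """Run every processor on the frame before any LLM call happens."""
    detections: List[Detection] = []
    for processor in processors:
        detections.extend(processor(frame))
    return detections


def build_context(frame_index: int, detections: List[Detection],
                  min_confidence: float = 0.5) -> str:
    """Serialize confident detections as JSON to send alongside the prompt."""
    kept = [asdict(d) for d in detections if d.confidence >= min_confidence]
    return json.dumps({"frame": frame_index, "detections": kept}, sort_keys=True)


context = build_context(0, annotate(b"\x00", [fake_pose_detector]))
print(context)
```

The design point is that the model receives compact, filtered structure (labels, boxes, confidences) rather than having to infer everything from raw pixels. How the real framework passes this context to a realtime model is not covered by this sketch.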

Key highlights

Native SDK wrappers for OpenAI (create response), Gemini (generate), and Claude (create message), so new model features are usable without waiting on an abstraction layer
Claims 500ms join time and <30ms A/V latency over Stream’s edge network; also works with other video infrastructure
333,000 free participant-minutes/month via Stream’s Maker Program
Out-of-the-box integrations span 6+ LLM providers, 5+ realtime APIs, 6+ STT engines, 7+ TTS voices, and vision models including Ultralytics, Moondream, and NVIDIA Cosmos
Cross-platform client SDKs: React, Android, iOS, Flutter, React Native, Unity
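
For a rough sense of what the free tier covers, the participant-minute figure can be turned into session counts. This is a back-of-the-envelope sketch: `sessions_covered` is a hypothetical helper, and the assumption that the agent counts as a billable participant alongside the user is not confirmed by the source, so check Stream's pricing terms before relying on it.

```python
FREE_PARTICIPANT_MINUTES = 333_000  # monthly figure quoted in the highlights


def sessions_covered(session_minutes: float, participants: int,
                     quota: int = FREE_PARTICIPANT_MINUTES) -> int:
    """Return how many whole sessions fit in the monthly quota.

    Each session consumes session_minutes * participants participant-minutes.
    """
    if session_minutes <= 0 or participants <= 0:
        raise ValueError("session_minutes and participants must be positive")
    per_session = session_minutes * participants
    return int(quota // per_session)


# Assumption: one user plus one agent, both counted as participants.
print(sessions_covered(session_minutes=10, participants=2))  # 16650
print(FREE_PARTICIPANT_MINUTES / 60)  # 5550.0 hours for a single participant
```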

Caveats

The “works with any video edge network” claim is stated but thinly documented; most examples assume Stream’s infrastructure
Heavy integration surface means version drift across providers is a real maintenance risk

Verdict

Worth a look if you’re building real-time coaching, moderation, or interactive video experiences and want to avoid gluing WebRTC, CV pipelines, and LLM APIs from scratch. Skip it if your use case is batch video analysis or you need to avoid vendor-specific infrastructure entirely.

Frequently asked

What is GetStream/Vision-Agents? A Python framework for wiring real-time video, voice, and vision models into interactive agents, with a claimed audio/video latency under 30 ms.
Is Vision-Agents open source? Yes, GetStream/Vision-Agents is open source, released under the Apache-2.0 license.
What language is Vision-Agents written in? GetStream/Vision-Agents is primarily written in Python.
How popular is Vision-Agents? GetStream/Vision-Agents has 8k stars on GitHub.
Where can I find Vision-Agents? GetStream/Vision-Agents is on GitHub at https://github.com/GetStream/Vision-Agents.