A plumbing kit for voice AI that actually handles the pipes
Pipecat wires together speech recognition, LLMs, and text-to-speech so you can build real-time conversational agents without drowning in WebRTC boilerplate.

What it does
Pipecat is a Python framework for building real-time voice and multimodal AI agents. It connects speech-to-text, LLMs, text-to-speech, and transport layers (WebRTC, WebSockets) into pipelines you can run locally or distribute across machines. The project also ships client SDKs for JavaScript, React, Swift, Kotlin, C++, and even ESP32.
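To make the pipeline idea concrete, here is a minimal sketch in plain Python. It does not use Pipecat's API; the stage functions (fake_stt, fake_llm, fake_tts) and run_pipeline are hypothetical stand-ins that only illustrate how a frame might pass through speech-to-text, an LLM, and text-to-speech in order.

```python
import asyncio
from typing import Awaitable, Callable, List

# Hypothetical stage type (not Pipecat's): takes one frame, returns one frame.
# Frames are plain strings here to keep the sketch self-contained.
Stage = Callable[[str], Awaitable[str]]


async def fake_stt(frame: str) -> str:
    # Stand-in for speech-to-text: strip the "audio:" tag to get a transcript.
    return frame.replace("audio:", "")


async def fake_llm(frame: str) -> str:
    # Stand-in for an LLM call: produce a canned reply from the transcript.
    return f"You said '{frame}'."


async def fake_tts(frame: str) -> str:
    # Stand-in for text-to-speech: tag the reply as synthesized speech.
    return f"speech:{frame}"


async def run_pipeline(stages: List[Stage], frame: str) -> str:
    # Pass the frame through each stage in order, awaiting each result.
    for stage in stages:
        frame = await stage(frame)
    return frame


def demo() -> str:
    # Run one frame through the three toy stages.
    return asyncio.run(run_pipeline([fake_stt, fake_llm, fake_tts], "audio:hello"))
```

A real framework would stream many frames concurrently and attach a transport at each end; this sketch processes a single frame to show only the ordering of stages.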
The interesting bit
The framework treats each pipeline as an agent, then lets you compose them with handoffs, parallel fan-out, and sidecar workers over a shared bus. That’s the harder part of multi-agent voice systems: not the LLM call, but orchestrating who speaks when and how audio flows between services without adding perceptible latency.
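The fan-out pattern can be sketched with asyncio queues. This is an illustration under stated assumptions, not Pipecat's implementation: the Bus class, the worker coroutine, and the agent names "billing" and "logger" are all hypothetical.

```python
import asyncio
from typing import Dict, List


class Bus:
    """Toy in-process bus (hypothetical): each subscriber gets its own queue."""

    def __init__(self) -> None:
        self._queues: Dict[str, asyncio.Queue] = {}

    def subscribe(self, name: str) -> asyncio.Queue:
        # Register a named subscriber and return its private inbox.
        self._queues[name] = asyncio.Queue()
        return self._queues[name]

    async def publish(self, message: str) -> None:
        # Fan out: deliver a copy of the message to every subscriber.
        for queue in self._queues.values():
            await queue.put(message)


async def worker(name: str, inbox: asyncio.Queue, results: List[str]) -> None:
    # Each worker waits for one message, then records that it handled it.
    message = await inbox.get()
    results.append(f"{name} handled {message}")


async def fan_out(message: str) -> List[str]:
    bus = Bus()
    results: List[str] = []
    # Subscriptions are registered before anything is published.
    tasks = [
        asyncio.create_task(worker(name, bus.subscribe(name), results))
        for name in ("billing", "logger")
    ]
    await bus.publish(message)
    await asyncio.gather(*tasks)
    # Sort so the result does not depend on task scheduling order.
    return sorted(results)
```

Giving each subscriber its own queue means a slow sidecar worker cannot block delivery to the others, which is one common reason to route messages over a bus rather than chain agents directly.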
Key highlights
- Broad service coverage: 18+ STT providers, 15+ TTS options, 10+ LLM integrations, and multiple transport layers
- Ecosystem includes structured conversation state management (Pipecat Flows), a UI component kit, CLI scaffolding, and a real-time debugger called Whisker
- Supports multimodal inputs including vision (example shows MoonDream integration)
- pipecat init quickstart generates a working project in under a minute
- 12,618 stars and active CI with codecov tracking
Caveats
- The README is heavy on ecosystem marketing and light on architectural specifics — you’ll need to dig into docs for pipeline internals
- “Ultra-low latency” is claimed but no concrete latency benchmarks are provided in the sources
Verdict
Worth a look if you’re building production voice agents and tired of hand-rolling WebRTC + STT + TTS glue. Skip it if you just need a simple chatbot wrapper around OpenAI’s realtime API — this is more framework than you need for that.