Is deepeval open source?

Yes — confident-ai/deepeval is open source, released under the Apache-2.0 license.

What language is deepeval written in?

confident-ai/deepeval is primarily written in Python.

How popular is deepeval?

confident-ai/deepeval has 17.1k stars on GitHub, and its star growth is currently accelerating (+26 stars per day over the past 7 days).

Where can I find deepeval?

confident-ai/deepeval is on GitHub at https://github.com/confident-ai/deepeval.


confident-ai/deepeval

Pytest for LLMs, with opinions on hallucination

An open-source evaluation framework that treats LLM quality as a testable property, not a vibe.

★ 17.1k stars · Python · LLMOps · Eval


Velocity (7d): +26 ★ / day · Trend: accelerating

What it does

DeepEval is a Python testing framework for LLM applications. It wraps research-backed metrics (hallucination detection, RAG faithfulness, agent task completion, bias checks) into assert-style tests you can run in CI. Think Pytest, but your “unit” is a prompt-response pair and your “oracle” is another LLM or an on-device NLP model.
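To make the assert-style idea concrete, here is a minimal sketch using only the Python standard library. It is not DeepEval's API: `PromptResponseCase`, `keyword_overlap_judge`, and `assert_metric` are hypothetical names invented for illustration, and the word-overlap "judge" is a crude stand-in for the LLM or NLP model a real metric would call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PromptResponseCase:
    """Hypothetical test 'unit': one prompt, the response under test, and its context."""
    prompt: str
    response: str
    context: list[str]


def keyword_overlap_judge(case: PromptResponseCase) -> float:
    """Stand-in oracle: fraction of response words that also appear in the context.

    A real setup would call an LLM judge or an NLP model here instead.
    """
    context_words = {
        word.lower().strip(".,") for doc in case.context for word in doc.split()
    }
    response_words = [word.lower().strip(".,") for word in case.response.split()]
    if not response_words:
        return 0.0
    supported = sum(1 for word in response_words if word in context_words)
    return supported / len(response_words)


def assert_metric(
    case: PromptResponseCase,
    judge: Callable[[PromptResponseCase], float],
    threshold: float,
) -> float:
    """Fail like an ordinary assertion when the judged score is below the threshold."""
    score = judge(case)
    assert score >= threshold, f"score {score:.2f} below threshold {threshold:.2f}"
    return score


def test_refund_answer_is_grounded():
    # Pytest collects this like any other test function.
    case = PromptResponseCase(
        prompt="What is the refund window?",
        response="Refunds are accepted within 30 days.",
        context=["Refunds are accepted within 30 days of purchase."],
    )
    assert_metric(case, keyword_overlap_judge, threshold=0.8)
```

Because the failure surfaces as a plain `AssertionError`, a test runner or CI job can treat a quality regression the same way it treats a broken unit test.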

The interesting bit

The breadth is almost comically thorough. It covers not just single-turn Q&A but multi-turn conversation memory, MCP server usage, and even multimodal coherence (text-to-image alignment). The DAG metric builder is a nice touch: deterministic graphs for judgment criteria that don’t drift with temperature settings. And yes, it runs local models if you don’t want your eval budget to exceed your inference budget.
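The appeal of a graph-shaped metric is easier to see in code. The sketch below is a hypothetical, stdlib-only illustration of the general idea (a fixed decision graph of yes/no checks ending in fixed scores), not DeepEval's DAG builder; `Check`, `Verdict`, and `evaluate` are invented names. The predicates here are plain functions; in an LLM-judged setup each node would presumably be a narrowly scoped yes/no question instead.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Verdict:
    """Leaf node: a fixed score and the reason it was assigned."""
    score: float
    reason: str


@dataclass
class Check:
    """Inner node: a yes/no question that routes to one of two children."""
    question: str
    predicate: Callable[[str], bool]
    if_yes: Union["Check", Verdict]
    if_no: Union["Check", Verdict]


def evaluate(node: Union[Check, Verdict], output: str) -> Verdict:
    """Walk the graph from the root until a Verdict leaf is reached."""
    while isinstance(node, Check):
        node = node.if_yes if node.predicate(output) else node.if_no
    return node


# Example criteria: the output must start with "Summary:" and stay short.
summary_graph = Check(
    question="Does the output start with 'Summary:'?",
    predicate=lambda text: text.startswith("Summary:"),
    if_yes=Check(
        question="Is it at most 20 words?",
        predicate=lambda text: len(text.split()) <= 20,
        if_yes=Verdict(1.0, "correct format and length"),
        if_no=Verdict(0.5, "correct format, too long"),
    ),
    if_no=Verdict(0.0, "missing 'Summary:' prefix"),
)

result = evaluate(summary_graph, "Summary: the release fixes two login bugs.")
print(result.score, result.reason)
```

Since the scores live on the leaves rather than in a free-form judgment, the same path through the graph always yields the same score; any remaining variability is confined to the individual yes/no decisions.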

Key highlights

20+ ready-made metrics across RAG, agents, chatbots, and multimodal pipelines
LLM-as-a-judge with any provider, plus local NLP models for cost-sensitive runs
Synthetic dataset generation for when your ground truth is mostly hope
Integrations with OpenAI Agents, LangChain, LangGraph, Pydantic AI, CrewAI, Anthropic
Benchmarking wrapper for MMLU, HellaSwag, HumanEval, GSM8K, etc.
Optional hosted platform (Confident AI) for team dashboards and iteration comparison
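
For the benchmarking item above, it helps to remember what a multiple-choice benchmark such as MMLU ultimately reduces to: prompt the model with lettered options and count exact matches. A minimal sketch, assuming a hypothetical `model` callable that maps a prompt string to a reply string; `format_prompt` and `multiple_choice_accuracy` are illustrative names, not part of DeepEval.

```python
from typing import Callable, Sequence, Tuple

# (question, answer choices, correct letter), e.g. ("2 + 2 = ?", ["3", "4"], "B")
Item = Tuple[str, Sequence[str], str]


def format_prompt(question: str, choices: Sequence[str]) -> str:
    """Render a question with lettered options (A, B, C, ...)."""
    letters = "ABCDEFGHIJ"
    lines = [question]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def multiple_choice_accuracy(items: Sequence[Item], model: Callable[[str], str]) -> float:
    """Return the fraction of items where the reply's first letter matches the answer."""
    if not items:
        return 0.0
    correct = 0
    for question, choices, answer in items:
        reply = model(format_prompt(question, choices)).strip().upper()
        if reply[:1] == answer.upper():
            correct += 1
    return correct / len(items)


# Usage with a stub model that always answers "B".
items = [
    ("2 + 2 = ?", ["3", "4"], "B"),
    ("Capital of France?", ["Paris", "Rome"], "A"),
]
print(multiple_choice_accuracy(items, lambda prompt: "B"))
```

A wrapper's value lies mostly in the parts this sketch skips: loading the official datasets, few-shot prompting, and parsing messier replies.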

Caveats

The “under 10 lines of code” benchmark claims are hard to verify without trying; the README is confident but light on actual code samples
Heavy upsell to the paid Confident AI platform throughout the docs
Some newer metric categories (MCP, multimodal) have sparse real-world battle scars compared to the core RAG/agent set

Verdict

Worth a look if you’re shipping LLM features and currently evaluating quality by “seems fine in Slack.” Skip if you need rigorous, human-validated ground truth out of the box: DeepEval gives you the scaffolding, not the gospel.

Frequently asked

What is confident-ai/deepeval?: An open-source evaluation framework that treats LLM quality as a testable property, not a vibe.