relari-ai/continuous-eval

Stop eyeballing your RAG pipeline

A Python toolkit that breaks LLM evaluation into per-module metrics you can actually test in CI.

516 stars · Python · LLMOps · Eval · RAG · Search
Velocity (7d): +0.6 ★/day · Trend: steady

What it does

continuous-eval is a Python package for evaluating LLM-powered applications by splitting them into discrete modules—retriever, reranker, generator—and attaching specific metrics to each. It ships with deterministic, semantic, and LLM-as-a-judge metrics for RAG, code generation, classification, and agent tool use. You define a pipeline graph, run evaluation across a dataset, and set threshold-based tests that can fail a build.
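To make the per-module idea concrete, here is a minimal, library-free sketch. It does not use continuous-eval's API; every name in it (`retrieval_recall`, `exact_match`, `ThresholdTest`, `run_tests`, `failed_modules`, and the dataset fields) is a hypothetical stand-in chosen for illustration. It only shows the shape of the workflow: log each module's output separately, score each with its own metric, and compare the mean score against a threshold.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical record layout: each module's output is stored under its own key.
Sample = Dict[str, Any]


def retrieval_recall(sample: Sample) -> float:
    """Fraction of ground-truth documents that the retriever returned."""
    retrieved = set(sample["retriever"])
    relevant = set(sample["ground_truth_context"])
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


def exact_match(sample: Sample) -> float:
    """1.0 if the generated answer equals the reference, ignoring case and padding."""
    got = sample["generator"].strip().lower()
    want = sample["ground_truth_answer"].strip().lower()
    return float(got == want)


@dataclass
class ThresholdTest:
    """A pass/fail check on one module's mean metric score (illustrative only)."""
    module: str
    metric: Callable[[Sample], float]
    min_mean: float

    def run(self, dataset: List[Sample]) -> Tuple[float, bool]:
        score = mean(self.metric(s) for s in dataset)
        return score, score >= self.min_mean


def run_tests(tests: List[ThresholdTest], dataset: List[Sample]) -> Dict[str, Tuple[float, bool]]:
    """Run every test and key the (score, passed) result by module name."""
    return {t.module: t.run(dataset) for t in tests}


def failed_modules(results: Dict[str, Tuple[float, bool]]) -> List[str]:
    """Names of modules whose test did not pass."""
    return [name for name, (_, passed) in results.items() if not passed]


# Made-up two-sample dataset, purely for demonstration.
dataset = [
    {"retriever": ["d1", "d2"], "ground_truth_context": ["d1"],
     "generator": "Paris", "ground_truth_answer": "paris"},
    {"retriever": ["d3"], "ground_truth_context": ["d4"],
     "generator": "Berlin", "ground_truth_answer": "Rome"},
]

tests = [
    ThresholdTest("retriever", retrieval_recall, min_mean=0.8),
    ThresholdTest("generator", exact_match, min_mean=0.5),
]

results = run_tests(tests, dataset)
print(results)
print("failed:", failed_modules(results))
```

In a CI job, a script like this would exit with a non-zero status whenever `failed_modules(results)` is non-empty, which is what lets a per-stage regression fail the build.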

The interesting bit

Most evaluation frameworks treat the whole pipeline as a single black box. This one forces you to name each stage and extract its outputs explicitly (via ModuleOutput(page_content) or similar), so you can’t pretend a bad retriever is the LLM’s fault. The probabilistic metrics and the custom LLM-judge builder are where it departs from standard accuracy/F1 reporting.

Key highlights

  • Modular pipeline evaluation: retriever → reranker → LLM, each with its own metrics and pass/fail tests
  • Metric library spans deterministic (Precision/Recall), semantic, and LLM-based judges
  • CustomMetric class for spinning up LLM-as-a-Judge evaluators with structured rubrics
  • EvaluationRunner handles parallel execution; results aggregate per-module and pipeline-wide
  • Example datasets and end-to-end examples repo available for bootstrapping

Caveats

  • LLM-based metrics require API keys; the README asks for “at least one” but doesn’t say which providers are supported
  • Multiprocessing requires wrapping code in if __name__ == "__main__" guards, which is noted but easy to miss
  • Telemetry is on by default; you must set CONTINUOUS_EVAL_DO_NOT_TRACK=true to disable
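The last two caveats can be handled in the entry-point script. The sketch below is illustrative only: it uses the standard library's `multiprocessing` rather than continuous-eval itself, and `score_row` and the sample rows are hypothetical placeholders for real evaluation work. Setting the variable before importing the library is an assumption about when the setting is read; exporting it in the shell or CI environment is an alternative.

```python
import os

# Opt out of telemetry. Doing this before any library import is an assumption
# about when the variable is read; exporting it in the shell works as well.
os.environ["CONTINUOUS_EVAL_DO_NOT_TRACK"] = "true"

from multiprocessing import Pool


def score_row(row):
    """Hypothetical per-sample scorer: 1.0 if the answer matches the reference."""
    answer, expected = row
    return float(answer == expected)


def main():
    # Made-up (answer, expected) pairs standing in for an evaluation dataset.
    rows = [("a", "a"), ("b", "c"), ("d", "d")]
    # Worker processes may re-import this module, so the pool must only be
    # created from code that runs under the __main__ guard below.
    with Pool(processes=2) as pool:
        scores = pool.map(score_row, rows)
    print("mean score:", sum(scores) / len(scores))


if __name__ == "__main__":
    main()
```

Without the guard, platforms that start workers by spawning a fresh interpreter would re-run the top-level code in each worker, which typically ends in an error or a runaway chain of processes.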

Verdict

Worth a look if you’re running RAG or multi-step LLM pipelines in production and want per-stage regression tests. Probably overkill if you’re just spot-checking a single prompt with OpenAI’s evals.
