← all repositories
PacificAI/langtest

One-liner stress tests for language models

LangTest wraps 60+ model evaluation patterns into a single Python call, from toxicity checks to sycophancy detection.

560 stars Python LLMOps · EvalLanguage Models
langtest
Velocity · 7d
+0.4
★ / day
Trend
steady
collecting data…
star history

What it does LangTest is a Python toolkit that generates, runs, and reports on model tests through a Harness object. You instantiate it with a task (NER, QA, summarization, etc.) and a model source—Hugging Face, OpenAI, Azure, Spark NLP—then call generate().run().report() to surface failures in robustness, bias, fairness, or factuality. It also bundles curated benchmark datasets and can augment training data for some model types based on test results.

The interesting bit The breadth is the point: instead of cobbling together separate tools for toxicity, sycophancy, clinical safety, and stereotype bias, LangTest treats them as test categories in the same harness. That makes it feasible to run a battery of checks on a new model without writing custom evaluation code for each failure mode.

Key highlights

  • 60+ test types accessible through one API call
  • Supports both traditional NLP (NER, text classification, translation) and LLM evaluation (OpenAI, Cohere, AI21, Azure-OpenAI)
  • Built-in benchmark datasets for QA, summarization, and bias testing
  • Automated data augmentation for select models based on test failures
  • Integrates with MLflow for experiment tracking

Caveats

  • The “one line of code” claim is technically true for the happy path, but real evaluation likely requires tuning test configurations and interpreting category-specific reports
  • Data augmentation is noted as “for select models”—which ones is not specified in the README
  • 559 stars suggests adoption is still modest; longevity depends on Pacific AI / John Snow Labs continued maintenance

Verdict Worth a look if you’re shipping LLMs or NLP pipelines and need a structured way to check for bias, safety, and robustness without building an evaluation framework from scratch. Less useful if you already have bespoke eval infrastructure or need deep customization beyond the built-in test catalog.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.