Is checklist open source?

Yes. marcotcr/checklist is open source, released under the MIT license.

What language is checklist written in?

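GitHub lists Jupyter Notebook as the primary language of marcotcr/checklist, which reflects the notebooks in the repository. The library itself is a Python package.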

How popular is checklist?

marcotcr/checklist has 2k stars on GitHub.

Where can I find checklist?

marcotcr/checklist is on GitHub at https://github.com/marcotcr/checklist.

marcotcr/checklist

Synthetic Unit Tests for NLP Models, Courtesy of RoBERTa

CheckList generates synthetic behavioral tests for NLP models to surface failure modes that aggregate accuracy metrics routinely hide.

★2k stars Jupyter Notebook LLMOps · Eval

What it does

CheckList is a behavioral testing framework for NLP models. It generates synthetic test cases through fill-in-the-blank templates and masked language model suggestions, then runs structured test suites to expose specific failure modes that a single accuracy score would miss.
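The template half of that description can be illustrated without the library. The sketch below is a plain-Python stand-in, not CheckList's own API: `expand_template` and the example word lists are hypothetical names chosen for illustration, and the masked language model suggestion step is left out entirely.

```python
from itertools import product


def expand_template(template, **slots):
    """Fill every placeholder with every combination of slot values.

    Hypothetical helper for illustration; not part of the checklist package.
    """
    names = list(slots)
    cases = []
    for combo in product(*(slots[name] for name in names)):
        cases.append(template.format(**dict(zip(names, combo))))
    return cases


# Illustrative lexicons; a real suite would use much larger ones.
cases = expand_template(
    "{name} thinks the {noun} was {adj}.",
    name=["Maria", "Ahmed"],
    noun=["flight", "service"],
    adj=["great", "terrible"],
)

print(len(cases))  # 2 names x 2 nouns x 2 adjectives
for case in cases[:3]:
    print(case)
```

One short template with three small word lists already yields eight test sentences, which is why template-based generation scales well for targeted tests.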

The interesting bit

The framework treats model evaluation like software QA: you define expectations and generate inputs to stress-test them. It even enlists RoBERTa and several multilingual masked language models as creative partners that suggest test phrases, turning a language model into an ad-hoc test engineer.
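A rough sketch of the "define expectations, then stress-test them" idea, assuming a deliberately naive stand-in model. Everything here (`toy_sentiment`, `run_expectation_test`, the negation cases) is hypothetical and written for illustration; it is not the library's interface.

```python
def toy_sentiment(text):
    """Stand-in model: 'pos' if the text contains a known positive word, else 'neg'."""
    positive = {"good", "great", "excellent"}
    words = text.lower().replace(".", "").split()
    return "pos" if any(w in positive for w in words) else "neg"


def run_expectation_test(model, cases):
    """Run (input, expected_label) pairs and return the failure rate and failures."""
    failures = []
    for text, expected in cases:
        got = model(text)
        if got != expected:
            failures.append((text, expected, got))
    return len(failures) / len(cases), failures


# Negation cases: one narrow capability, tested in isolation.
cases = [
    ("The food was good.", "pos"),
    ("The food was not good.", "neg"),
    ("The staff was great.", "pos"),
    ("The staff was not great.", "neg"),
]

rate, failures = run_expectation_test(toy_sentiment, cases)
print(f"failure rate: {rate:.0%}")
for text, expected, got in failures:
    print(f"expected {expected}, got {got}: {text}")
```

The keyword-matching model handles the plain sentences but fails both negated ones, the kind of specific weakness a behavioral test is meant to isolate.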

Key highlights

Generates test data via templates and masked-LM suggestions (RoBERTa, XLM-RoBERTa, FlauBERT, German BERT).
Ships with ready-made test suites and pre-computed predictions for sentiment analysis, QQP, and SQuAD.
Includes multilingual lexicons for names and locations sourced from Wikidata, though with noted Wikipedia bias.
Supports Invariance (INV) and Directional Expectation (DIR) tests through built-in data perturbation tools.
Interactive visualizations and suggestion widgets require classic Jupyter Notebook; they do not work in JupyterLab or Colab.
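To make the INV and DIR test types in the list above concrete, here is a minimal plain-Python sketch. It assumes a toy word-counting scorer and hand-written perturbations; `toy_score`, `swap_name`, and the two check functions are hypothetical illustrations rather than CheckList's perturbation tools.

```python
import re

POSITIVE = {"good", "great", "joy"}
NEGATIVE = {"bad", "awful", "hate"}


def toy_score(text):
    """Stand-in model: positive-word count minus negative-word count."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def swap_name(text, old, new):
    """Perturbation that should not change sentiment: swap a person's name."""
    return text.replace(old, new)


def invariance_failures(model, texts, perturb):
    """INV: the prediction should stay the same after the perturbation."""
    return [t for t in texts if model(t) != model(perturb(t))]


def directional_failures(model, texts, perturb):
    """DIR: here, the score should not increase after the perturbation."""
    return [t for t in texts if model(perturb(t)) > model(t)]


texts = ["Anna said the hotel was great.", "Anna said the food was bad."]

inv_maria = invariance_failures(toy_score, texts, lambda t: swap_name(t, "Anna", "Maria"))
inv_joy = invariance_failures(toy_score, texts, lambda t: swap_name(t, "Anna", "Joy"))
dir_fail = directional_failures(toy_score, texts, lambda t: t + " I hate it.")

print("INV failures (Anna -> Maria):", inv_maria)
print("INV failures (Anna -> Joy):", inv_joy)
print("DIR failures (append negative sentence):", dir_fail)
```

In this toy setup, swapping in the name "Joy" shifts the score because the scorer mistakes the name for a sentiment word, so the invariance check flags both sentences. Appending a negative sentence lowers the score as expected, so the directional check passes.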

Caveats

Interactive visualizations are ipywidgets that break outside classic Jupyter Notebook.
Multilingual suggestion quality varies; the authors explicitly note they “can’t vouch” for non-English output.
Built-in lexicons carry a bias toward Wikipedia-notable names and locations.

Verdict

NLP practitioners and researchers who want to move beyond leaderboard accuracy and systematically probe for model failures should look here. If you are looking for a standard training or inference pipeline, this is not it.

Frequently asked

What is marcotcr/checklist?: CheckList generates synthetic behavioral tests for NLP models to surface failure modes that aggregate accuracy metrics routinely hide.