Finally, a Python NER evaluator that doesn't require Perl
seqeval replaces the venerable conlleval script with native Python metrics for sequence labeling tasks.

What it does
seqeval computes standard classification metrics (accuracy, precision, recall, F1, and a full classification report) for sequence labeling tasks such as named-entity recognition and POS tagging. It accepts the list-of-lists format you likely already use for token-level labels: one inner list of tag strings per sentence.
The interesting bit
The library has two personalities. “Default” mode deliberately mimics the original Perl conlleval script, warts and all, so your numbers stay comparable to twenty years of published NER papers. “Strict” mode validates predictions against a tagging scheme (IOB2, IOBES, BILOU, etc.), catching invalid sequences that default mode would silently score as correct. The README’s minimal example is telling: a prediction that starts with I-NP instead of B-NP scores a perfect 1.00 in default mode and 0.00 in strict mode.
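To see why the two modes can disagree so sharply, here is a minimal pure-Python sketch of the idea. It is not seqeval's implementation: the helper names (extract_spans, entity_f1) are hypothetical, and the "strict" branch only approximates IOB2 validation by refusing to open an entity on anything other than a B- tag.

```python
from typing import List, Set, Tuple

# (sentence index, entity type, start token, end token inclusive)
Span = Tuple[int, str, int, int]


def extract_spans(tags: List[str], sent_id: int, strict: bool) -> Set[Span]:
    """Collect entity spans from one sentence of B-/I-/O tags.

    strict=False: lenient, conlleval-like; a stray I- tag may open an entity.
    strict=True:  simplified IOB2; only a B- tag may open an entity.
    (Illustrative helper, not part of seqeval.)
    """
    spans: Set[Span] = set()
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing entity
        prefix, _, label = tag.partition("-")
        # Close the open entity unless this token continues it.
        if current is not None and (prefix != "I" or label != current):
            spans.add((sent_id, current, start, i - 1))
            start, current = None, None
        opens = prefix == "B" or (prefix == "I" and not strict)
        if opens and current is None:
            start, current = i, label
    return spans


def entity_f1(y_true: List[List[str]], y_pred: List[List[str]], strict: bool) -> float:
    """Entity-level F1 over exact span matches."""
    true_spans: Set[Span] = set()
    pred_spans: Set[Span] = set()
    for sent_id, (t, p) in enumerate(zip(y_true, y_pred)):
        true_spans |= extract_spans(t, sent_id, strict)
        pred_spans |= extract_spans(p, sent_id, strict)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# The README-style case: the prediction starts with I-NP instead of B-NP.
y_true = [["B-NP", "I-NP", "O"]]
y_pred = [["I-NP", "I-NP", "O"]]

print(entity_f1(y_true, y_pred, strict=False))  # lenient: the stray I-NP opens a chunk
print(entity_f1(y_true, y_pred, strict=True))   # strict: no valid entity is predicted
```

In the lenient pass the malformed prediction still yields the same NP span as the gold labels, so it counts as a match. In the strict pass it yields no span at all, so nothing can match.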
Key highlights
- Drop-in sklearn-style API: f1_score(y_true, y_pred) and classification_report()
- Six tagging schemes supported, though IOBES and BILOU only work in strict mode
- Self-described as “well-tested” against the original Perl conlleval
- One-line install: pip install seqeval
Caveats
- The README doesn’t specify how the “well-tested” claim was validated—no test coverage stats, no continuous integration badges
- Strict mode requires you to pass both mode='strict' and a scheme argument; forget one and it presumably falls back to default behavior
Verdict
Worth adopting for anyone training NER or chunking models in Python who needs conlleval-compatible numbers without spawning a Perl process. If you’re doing non-BIO sequence tasks or already have a working evaluation pipeline, this is just another dependency.