MedCAT: NLP for health records that actually ships models
A Python toolkit that extracts medical concepts from clinical text and links them to SNOMED-CT and UMLS. Pre-trained models are included, licensing warts and all.

What it does
MedCAT runs named entity recognition on electronic health records, then links the extracted terms to concepts in biomedical ontologies such as SNOMED-CT and UMLS. It comes with four public model packs, including a UMLS Full model that covers more than 4 million concepts and was trained on MIMIC-III, so you’re not starting from zero. The project has since moved to CogStack/cogstack-nlp, and MedCAT v2 is on the way.
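The extract-then-link flow can be sketched roughly as follows. This is a minimal sketch, assuming the MedCAT v1 Python API (`CAT.load_model_pack` and `get_entities`) and the entity dictionary layout it returns. The helper names, the model pack path, and `SAMPLE_RESULT` are illustrative only; the sample is hand-written with placeholder concept IDs, not real model output.

```python
from typing import Any, Dict, List, Tuple


def annotate(text: str, model_pack_path: str) -> Dict[str, Any]:
    """Run MedCAT over `text` using a downloaded model pack (v1-style API)."""
    # Imported lazily so the helper below works without medcat installed.
    from medcat.cat import CAT

    cat = CAT.load_model_pack(model_pack_path)  # path to a model pack .zip
    return cat.get_entities(text)


def linked_concepts(
    result: Dict[str, Any], min_acc: float = 0.0
) -> List[Tuple[str, str, str]]:
    """Flatten a get_entities-style result into (source text, CUI, name) rows."""
    rows = []
    for ent in result.get("entities", {}).values():
        # "acc" is assumed to hold the linking confidence for the entity.
        if ent.get("acc", 0.0) >= min_acc:
            rows.append((ent["source_value"], ent["cui"], ent["pretty_name"]))
    return rows


# Hand-written stand-in for a real result; CUIs are placeholders.
SAMPLE_RESULT = {
    "entities": {
        0: {"source_value": "kidney failure", "cui": "C0000001",
            "pretty_name": "Kidney Failure", "acc": 0.93},
        1: {"source_value": "cold", "cui": "C0000002",
            "pretty_name": "Common Cold", "acc": 0.31},
    },
    "tokens": [],
}

for row in linked_concepts(SAMPLE_RESULT, min_acc=0.5):
    print(row)
```

With a real model pack, `linked_concepts(annotate(note_text, "path/to/model_pack.zip"))` would replace the sample; the confidence cutoff is a design choice to tune per use case.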
The interesting bit
The maintainers ship actual downloadable models, not just an architecture, which is rarer than it should be in medical NLP. The catch: you need NIH/UMLS credentials to get them, and the project now uses the Elastic License 2.0, which is not OSI-approved. The Dutch model pack even bundles a separate negation detection model, which suggests the tool can tell a mentioned condition from a ruled-out one rather than only extracting terms.
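To illustrate how a negation model's output might be used downstream, here is a minimal sketch that filters entities by a meta-annotation. It assumes each entity carries a `meta_anns` dictionary keyed by task name with a `value` field, as in MedCAT v1 output. The task name `"Negation"` and the label `"negated"` are hypothetical; the actual names depend on the model pack.

```python
from typing import Any, Dict, List


def drop_by_meta(
    entities: Dict[Any, Dict[str, Any]], task: str, excluded_value: str
) -> List[Dict[str, Any]]:
    """Keep entities whose meta-annotation `task` is not `excluded_value`.

    Entities without that meta-annotation are kept unchanged.
    """
    kept = []
    for ent in entities.values():
        meta = ent.get("meta_anns", {}).get(task)
        if meta is not None and meta.get("value") == excluded_value:
            continue  # e.g. a condition the note explicitly rules out
        kept.append(ent)
    return kept


# Hand-written example entities; task and label names are assumptions.
example_entities = {
    0: {"pretty_name": "Pneumonia",
        "meta_anns": {"Negation": {"value": "negated", "confidence": 0.97}}},
    1: {"pretty_name": "Fever",
        "meta_anns": {"Negation": {"value": "not negated", "confidence": 0.97}}},
    2: {"pretty_name": "Cough"},  # no meta-annotations at all
}

affirmed = drop_by_meta(example_entities, task="Negation", excluded_value="negated")
print([ent["pretty_name"] for ent in affirmed])
```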
Key highlights
- Pre-trained models for UMLS (small and full) and SNOMED International, plus a Dutch variant
- Built on spaCy v3 with optional Hugging Face Transformers integration
- CPU-only install path available (saves ~10 GB vs. default GPU dependencies)
- Live demo trained on full SNOMED-CT + MIMIC-III (spins up on demand, so first load is slow)
- Logging disabled by default—library users control their own noise
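Because logging is off by default, an application that wants MedCAT's messages has to opt in. The sketch below uses only the standard `logging` module and assumes the library's loggers live under the `medcat` namespace; the helper name `enable_medcat_logging` is made up for illustration.

```python
import logging
import sys


def enable_medcat_logging(level: int = logging.INFO) -> logging.Logger:
    """Attach a stderr handler to the 'medcat' logger namespace (assumed name)."""
    logger = logging.getLogger("medcat")
    logger.setLevel(level)
    # Avoid stacking duplicate handlers if this is called more than once.
    if not any(isinstance(h, logging.StreamHandler) for h in logger.handlers):
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


medcat_logger = enable_medcat_logging(logging.DEBUG)
```

Keeping this in application code rather than in the library means each caller decides the level and destination of the output.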
Caveats
- Repository is deprecated; active development moved to CogStack/cogstack-nlp
- MedCAT v1.16.x is the final v1 release; v2 is “soon” per the README, with no date
- Model downloads require UMLS/NIH authentication—no anonymous grab-and-go
Verdict
Worth a look if you’re doing clinical NLP and need ontology-linked output without training from scratch. Skip if you need fully open licensing or can’t navigate UMLS credentialing. Check the new repo first—this one’s a redirect with history.