Microsoft's battle-tested entity parser you probably already use
Extract numbers, dates, units, and sequences from messy human text across 14 languages — the same engine behind LUIS and Bot Framework.

What it does
Microsoft.Recognizers.Text turns unstructured text into structured entities: cardinal numbers, ordinals, percentages, currency, dimensions, temperatures, ages, date/time expressions, plus sequences like emails, URLs, phone numbers, and GUIDs. It handles the messy reality of human language (“next Tuesday at 3pm”, “fifty bucks”, “1.5 meters”) and normalizes each expression into machine-readable form.
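To make "structured entities" concrete, here is a minimal Python sketch of what calling the library might look like. It assumes the PyPI package `recognizers-text-suite` is installed and that it exposes `Culture`, `recognize_number`, and `recognize_datetime`, with results carrying `text`, `type_name`, and `resolution` attributes. The Python port is alpha, so treat those names as assumptions to check against the version you install. The `summarize` helper, the `demo` function, and the stand-in `sample` object are illustrative and not part of the library.

```python
from types import SimpleNamespace


def summarize(results):
    """Flatten recognizer results into plain dicts for logging or storage."""
    return [
        {"text": r.text, "type": r.type_name, "resolution": r.resolution}
        for r in results
    ]


def demo():
    # Assumption: `recognizers-text-suite` is installed and exports these names.
    # The import is local so the helper above still works without the package.
    from recognizers_suite import Culture, recognize_number, recognize_datetime

    text = "Send fifty invoices next Tuesday at 3pm"
    found = list(recognize_number(text, Culture.English))
    found += list(recognize_datetime(text, Culture.English))
    return summarize(found)


# Stand-in object (hypothetical values) so the helper can be exercised
# without installing the library.
sample = SimpleNamespace(text="fifty", type_name="number", resolution={"value": "50"})
print(summarize([sample]))
```

Call `demo()` once the package is installed to see what the recognizers actually return for your own text; the exact shape of each `resolution` depends on the entity type.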
The interesting bit
This isn’t a research prototype quietly rusting in a repo. It’s the actual extraction engine powering LUIS, Power Virtual Agents, Microsoft Bot Framework, and Text Analytics Cognitive Service. The project ships for .NET, JavaScript/TypeScript, Python (alpha), and Java (in progress), with .NET as the primary target where new features land first.
Key highlights
- Full support for 10 languages: Chinese, English, French, Spanish, Portuguese, German, Italian, Turkish, Hindi, Dutch
- Partial support for Japanese, Korean, Arabic, Swedish; Bulgarian has boolean support only
- 15 entity types with varying depth — from generic regex sequences (emails, GUIDs) to fully resolved date/time with subtypes
- NuGet, NPM, and PyPI packages available; academic citation BibTeX provided (a nice touch for the rare open-source project that expects to be cited)
- Active contribution paths: open issues, NotSupported spec cases, and translating English test specs to new languages
Caveats
- Support matrix is genuinely lopsided: Korean DateTime is “specs-only” (tests written, code pending); Arabic units are entirely unsupported; Swedish phone numbers are a no-go
- Python and Java ports lag behind .NET, so cross-platform parity is aspirational, not guaranteed
- README warns that contribution guides “may have become a little out-of-date”
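Because coverage is this uneven, it can help to check support before routing text to a recognizer. The sketch below encodes only the gaps named in this write-up. The entity keys (`"datetime"`, `"units"`, `"phone_number"`, `"boolean"`) and the returned labels are made up for illustration, and the project's own support table remains the authoritative source.

```python
# Languages listed above as fully supported.
FULL = {
    "Chinese", "English", "French", "Spanish", "Portuguese",
    "German", "Italian", "Turkish", "Hindi", "Dutch",
}
# Languages listed above as partially supported.
PARTIAL = {"Japanese", "Korean", "Arabic", "Swedish"}
# Specific gaps called out in the caveats (entity keys are hypothetical).
KNOWN_GAPS = {
    ("Korean", "datetime"): "specs-only",
    ("Arabic", "units"): "unsupported",
    ("Swedish", "phone_number"): "unsupported",
}


def support_level(language, entity):
    """Return a coarse support label for a language/entity pair."""
    if (language, entity) in KNOWN_GAPS:
        return KNOWN_GAPS[(language, entity)]
    if language in FULL:
        return "full"
    if language == "Bulgarian":
        # Bulgarian has boolean support only.
        return "supported" if entity == "boolean" else "unsupported"
    if language in PARTIAL:
        # Partial languages vary by entity; consult the upstream table.
        return "check-readme"
    return "unknown"


print(support_level("Korean", "datetime"))
```

A caller could fall back to English, or skip extraction entirely, whenever the label is anything other than "full" or "supported".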
Verdict
Grab this if you’re building chatbots, parsing forms, or doing any NLP that needs reliable entity extraction without training your own model. Skip it if you need bleeding-edge language coverage (especially Arabic, Korean, or Bulgarian) or if you require guaranteed parity across all four platform ports.