microsoft/Recognizers-Text

Microsoft's battle-tested entity parser you probably already use

Extract numbers, dates, units, and sequences from messy human text across 14 languages — the same engine behind LUIS and Bot Framework.

Star velocity (7d): +0.5 ★ / day · Trend: steady

What it does

Microsoft.Recognizers.Text turns unstructured text into structured entities: cardinal numbers, ordinals, percentages, currency, dimensions, temperatures, ages, date/time expressions, plus sequences like emails, URLs, phone numbers, and GUIDs. It handles the messy reality of human language — “next Tuesday at 3pm”, “fifty bucks”, “1.5 meters” — and normalizes them into machine-readable form.
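To make "extract, then normalize" concrete, here is a toy sketch in Python that uses only the standard library. It is not the library's implementation or its API: the `Entity` class, the `recognize` function, the tiny number-word list, and the two hard-coded units are all assumptions invented for illustration. The real project covers far more patterns, languages, and edge cases.

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical result shape: the matched span plus a normalized value."""
    text: str         # the span exactly as written in the input
    start: int        # index of the first matched character
    end: int          # index one past the last matched character
    type_name: str    # "number", "currency", or "dimension" in this toy
    resolution: dict  # machine-readable form of the span


# Deliberately tiny lookup tables (assumptions, not the library's data).
WORD_VALUES = {"one": 1, "two": 2, "three": 3, "five": 5, "ten": 10, "fifty": 50}
UNITS = {"meter": ("dimension", "Meter"), "buck": ("currency", "Dollar")}

# A digit or number word, optionally followed by one of the known units.
PATTERN = re.compile(
    r"\b(?P<num>\d+(?:\.\d+)?|" + "|".join(WORD_VALUES) + r")\b"
    r"(?:\s+(?P<unit>meters?|bucks?)\b)?",
    re.IGNORECASE,
)


def recognize(text):
    """Return every number-like span in `text` with a normalized resolution."""
    entities = []
    for m in PATTERN.finditer(text):
        raw = m.group("num").lower()
        if raw in WORD_VALUES:
            value = WORD_VALUES[raw]
        else:
            value = float(raw) if "." in raw else int(raw)
        type_name, resolution = "number", {"value": value}
        if m.group("unit"):
            # Strip the plural "s" so "bucks" and "buck" share one entry.
            type_name, unit = UNITS[m.group("unit").lower().rstrip("s")]
            resolution["unit"] = unit
        entities.append(Entity(m.group(), m.start(), m.end(), type_name, resolution))
    return entities


if __name__ == "__main__":
    for entity in recognize("Send fifty bucks and 1.5 meters of cable"):
        print(entity)
```

The point of the sketch is the output shape: each hit keeps its original text and position alongside a normalized resolution, so downstream code never has to re-parse "fifty bucks".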

The interesting bit

This isn’t a research prototype quietly rusting in a repo. It’s the actual extraction engine powering LUIS, Power Virtual Agents, Microsoft Bot Framework, and Text Analytics Cognitive Service. The project ships for .NET, JavaScript/TypeScript, Python (alpha), and Java (in progress), with .NET as the primary target where new features land first.

Key highlights

  • Full support for 10 languages: Chinese, English, French, Spanish, Portuguese, German, Italian, Turkish, Hindi, Dutch
  • Partial support for Japanese, Korean, Arabic, Swedish; Bulgarian has boolean support only
  • 15 entity types with varying depth — from generic regex sequences (emails, GUIDs) to fully resolved date/time with subtypes
  • NuGet, npm, and PyPI packages are available, and a BibTeX entry is provided for academic citation (a nice touch for the rare open-source project that expects to be cited)
  • Active contribution paths: open issues, NotSupported spec cases, and translating English test specs to new languages

Caveats

  • Support matrix is genuinely lopsided: Korean DateTime is “specs-only” (tests written, code pending); Arabic units are entirely unsupported; Swedish phone numbers are a no-go
  • Python and Java ports lag behind .NET, so cross-platform parity is aspirational, not guaranteed
  • README warns that contribution guides “may have become a little out-of-date”
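One way to live with a lopsided support matrix is to check for known gaps before calling a recognizer and fall back to something else when one applies. The sketch below encodes only the gaps named in this write-up, not the full README matrix. The `known_gap` helper and the entity labels (`"DateTime"`, `"Units"`, `"PhoneNumber"`, `"Boolean"`) are hypothetical names chosen for illustration.

```python
# Gaps named in the caveats above; NOT the complete README support matrix.
KNOWN_GAPS = {
    ("Korean", "DateTime"): "specs only: tests written, code pending",
    ("Arabic", "Units"): "unsupported",
    ("Swedish", "PhoneNumber"): "unsupported",
}

# Languages where only boolean recognition is listed as supported.
BOOLEAN_ONLY = {"Bulgarian"}


def known_gap(language, entity_type):
    """Return a short note if this pair is a known gap, otherwise None.

    None means "not listed as a gap here", not "guaranteed to work".
    """
    if language in BOOLEAN_ONLY and entity_type != "Boolean":
        return "only boolean recognition is supported"
    return KNOWN_GAPS.get((language, entity_type))


if __name__ == "__main__":
    for pair in [("Korean", "DateTime"), ("English", "DateTime")]:
        print(pair, "->", known_gap(*pair))
```

A real project would want to populate such a table from the README's matrix for the version it pins, since the gaps can close between releases.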

Verdict

Grab this if you’re building chatbots, parsing forms, or doing any NLP that needs reliable entity extraction without training your own model. Skip it if you need complete coverage for the partially supported languages (especially Arabic, Korean, or Bulgarian) or if you require guaranteed parity across all four platform ports.
