A field guide to German NLP: 522 stars, zero code
Someone finally catalogued the chaos of German-language NLP resources so you don't have to hunt through CLARIN portals at 2am.

What it does
This is a curated awesome-list of open-access German NLP resources—corpora, tools, models, and datasets—maintained with a deliberate bias toward things that actually work without a PhD in computational linguistics. Think of it as a well-organized bookmark folder for a niche where “just use English” has been the default answer for too long.
The interesting bit
The curation philosophy is unusually honest: usability over completeness, active maintenance over historical significance. The list also covers genuinely hard problems specific to German (historical text variants from 750–1800, Swiss German dialects, learner error corpora) that English-centric NLP benchmarks simply ignore.
Key highlights
- Corpus breadth: General web corpora, parliamentary records (GermaParl, GerParCor), legal texts, fashion image descriptions (Feidegger), even a football linguistics corpus
- Historical depth: Dedicated section spanning Old High German (750–1050) to Early New High German (1350–1650), with reference corpora for each period
- Swiss German: Separate subsection for dialect resources including ArchiMob and SMS corpora—low-resource even within a “high-resource” language
- Practical tooling: Data acquisition tools (trafilatura, news-please, german-reddit) and preprocessing pipelines (tokenization, lemmatization, POS tagging) listed alongside academic resources
- Community maintenance: Explicit call for PRs to keep the list current; contributing guidelines provided
Caveats
- The README is a table of contents with links; there are no quality ratings, compatibility notes, or last-updated timestamps for individual entries
- “Currently maintained” is a stated criterion, but the list gives no indication of how, or whether, it is checked, so some linked resources may be stale
Verdict
Worth bookmarking if you’re building German NLP pipelines, training models on historical German, or need to justify to your PM why Swiss German requires separate handling. Skip if you’re looking for runnable code or comparative benchmarks; this is pure reference material, well-organized but not evaluated.