adbar/German-NLP

A field guide to German NLP: 522 stars, zero code

Someone finally catalogued the chaos of German-language NLP resources so you don't have to hunt through CLARIN portals at 2am.

What it does

This is a curated awesome-list of open-access German NLP resources—corpora, tools, models, and datasets—maintained with a deliberate bias toward things that actually work without a PhD in computational linguistics. Think of it as a well-organized bookmark folder for a niche where “just use English” has been the default answer for too long.

The interesting bit

The curation philosophy is unusually honest: usability over completeness, active maintenance over historical significance. The list also covers genuinely hard problems specific to German that English-centric NLP benchmarks simply ignore: historical text variants from 750–1800, Swiss German dialects, and learner error corpora.

Key highlights

  • Corpus breadth: General web corpora, parliamentary records (GermaParl, GerParCor), legal texts, fashion image descriptions (Feidegger), even a football linguistics corpus
  • Historical depth: Dedicated section spanning Old High German (750–1050) to Early New High German (1350–1650), with reference corpora for each period
  • Swiss German: Separate subsection for dialect resources including ArchiMob and SMS corpora—low-resource even within a “high-resource” language
  • Practical tooling: Data acquisition tools (trafilatura, news-please, german-reddit) and preprocessing pipelines (tokenization, lemmatization, POS tagging) listed alongside academic resources
  • Community maintenance: Explicit call for PRs to keep the list current; contributing guidelines provided

Caveats

  • The README is a table of contents with links; there are no quality ratings, compatibility notes, or last-updated timestamps for individual entries
  • “Currently maintained” is a stated criterion but not verified in practice—some linked resources may be stale

Verdict

Worth bookmarking if you’re building German NLP pipelines, training models on historical German, or need to justify to your PM why Swiss German requires separate handling. Skip if you’re looking for runnable code or comparative benchmarks; this is pure reference material, well-organized but not evaluated.
