stefan-it/turkish-bert

Turkish NLP's model zoo, built by committee

Community-sourced data, a crowdsourced name, and more transformer variants than you can shake a kebab at.

576 stars · Python · Language Models
Velocity (7d): +0.2 ★/day · Trend: steady

What it does

BERTurk is a family of pre-trained Turkish language models (BERT, DistilBERT, ELECTRA, ConvBERT, and T5 variants) trained on filtered web corpora, Wikipedia, and community-contributed datasets. All models are hosted on Hugging Face and benchmarked on standard Turkish NLP tasks.

The interesting bit

The project is genuinely community-driven: the training data, the “BERTurk” name, and even the logo (by Merve Noyan) came from the Turkish NLP community rather than a single lab. The README also doubles as a changelog stretching back to 2020, which makes the evolution oddly transparent—you can watch the model zoo grow from a single BERT checkpoint to a 1.42B-parameter T5 variant trained on FineWeb2.

Key highlights

  • 13 model variants with training corpus sizes from 7GB (distilled) to 262GB (BERT5urk)
  • Two vocab sizes for BERT models: standard 32k and expanded 128k
  • ELECTRA and ConvBERT models trained on both the original 35GB corpus and the larger mC4 (242GB)
  • BERT5urk was pre-trained with the UL2 objective using T5X, for 2M steps on a v3-32 TPU pod
  • Evaluation tables with actual numbers: PoS tagging accuracy in the 93-95% range across variants
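
To make the two BERT vocabulary sizes concrete, here is a minimal sketch of how you might pick and load a checkpoint with the Hugging Face transformers library. The Hub ids in the table, and the helper names checkpoint_id and load_berturk, are assumptions for illustration and are not taken from this write-up; check the repository's README or the Hub for the current ids before relying on them.

```python
# Assumed Hub ids for the BERT variants (32k vs. 128k vocabulary, cased vs.
# uncased). These are NOT stated in this write-up; verify them on the Hub.
BERTURK_CHECKPOINTS = {
    ("32k", "cased"): "dbmdz/bert-base-turkish-cased",
    ("32k", "uncased"): "dbmdz/bert-base-turkish-uncased",
    ("128k", "cased"): "dbmdz/bert-base-turkish-128k-cased",
    ("128k", "uncased"): "dbmdz/bert-base-turkish-128k-uncased",
}


def checkpoint_id(vocab="32k", casing="cased"):
    """Return the (assumed) Hub id for a vocab size / casing combination."""
    try:
        return BERTURK_CHECKPOINTS[(vocab, casing)]
    except KeyError:
        raise ValueError(
            f"unknown combination: vocab={vocab!r}, casing={casing!r}"
        ) from None


def load_berturk(vocab="32k", casing="cased"):
    """Download (on first use) and return a (tokenizer, model) pair."""
    # Imported lazily so that checkpoint_id() works without transformers.
    from transformers import AutoModel, AutoTokenizer

    name = checkpoint_id(vocab, casing)
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    return tokenizer, model
```

Calling load_berturk("128k", "cased") needs network access and the transformers package the first time it runs; the returned tokenizer and model can then be used like any other BERT checkpoint. Keeping the id lookup separate from the download step means the mapping can be checked or corrected without fetching any weights.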

Caveats

  • The NER and sentiment sections were truncated in the copy of the README reviewed here, so downstream performance beyond PoS tagging is unclear
  • No explicit comparison to non-community Turkish models (e.g., from major cloud providers)

Verdict

Worth bookmarking if you work on Turkish NLP and want battle-tested, openly documented baselines. Skip if you need multilingual coverage—this is Turkish-only by design.
