Turkish NLP's model zoo, built by committee
Community-sourced data, a crowdsourced name, and more transformer variants than you can shake a kebab at.

What it does
BERTurk is a family of pre-trained Turkish language models (BERT, DistilBERT, ELECTRA, ConvBERT, and T5 variants) trained on filtered web corpora, Wikipedia, and community-contributed datasets. Everything is hosted on Hugging Face and benchmarked on standard Turkish NLP tasks.
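Since the checkpoints live on the Hugging Face Hub, loading one should take only a few lines with the `transformers` library. The snippet below is a minimal sketch, not the project's official usage: the model ID `dbmdz/bert-base-turkish-cased` is an assumption (it is not named in the text above, so check the README for the exact identifiers), and `load_berturk` and `embed` are hypothetical helper names.

```python
from typing import Tuple

# Assumed Hub ID for the base cased BERT checkpoint; verify against the README.
DEFAULT_MODEL_ID = "dbmdz/bert-base-turkish-cased"


def load_berturk(model_id: str = DEFAULT_MODEL_ID) -> Tuple[object, object]:
    """Download (or load from the local cache) a tokenizer and encoder."""
    # Imported lazily so this module imports even without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model


def embed(text: str, tokenizer, model):
    """Return the final-layer hidden states for a single sentence."""
    import torch

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    # Tensor of shape (1, num_tokens, hidden_size)
    return outputs.last_hidden_state
```

Calling `tokenizer, model = load_berturk()` followed by `embed("Merhaba dünya", tokenizer, model)` needs network access on first use. Swapping in another variant should only require a different model ID, assuming it is published under the same `Auto*` classes.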
The interesting bit
The project is genuinely community-driven: the training data, the “BERTurk” name, and even the logo (by Merve Noyan) came from the Turkish NLP community rather than a single lab. The README also doubles as a changelog stretching back to 2020, which makes the evolution oddly transparent: you can watch the model zoo grow from a single BERT checkpoint to a 1.42B-parameter T5 variant trained on FineWeb2.
Key highlights
- 13 model variants with training corpus sizes from 7GB (distilled) to 262GB (BERT5urk)
- Two vocab sizes for BERT models: standard 32k and expanded 128k
- ELECTRA and ConvBERT models trained on both the original 35GB corpus and the larger Turkish portion of mC4 (242GB)
- BERT5urk uses the UL2 objective in T5X for 2M steps on a v3-32 TPU pod
- Evaluation tables with actual numbers: PoS tagging accuracy in the 93-95% range across variants
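To get a feel for what the 32k versus 128k vocabulary choice means in practice, you can tokenize the same text with both BERT variants and compare how many pieces each produces. This is a rough sketch under assumptions: the two Hub IDs are not named in the text above, and `compare_tokenizations` is a hypothetical helper. The `loader` parameter exists only so the function can be exercised without downloading anything.

```python
from typing import Callable, Dict, List, Optional

# Assumed Hub IDs for the 32k and 128k cased BERT checkpoints; verify in the README.
VOCAB_VARIANTS = {
    "32k": "dbmdz/bert-base-turkish-cased",
    "128k": "dbmdz/bert-base-turkish-128k-cased",
}


def compare_tokenizations(
    text: str, loader: Optional[Callable] = None
) -> Dict[str, List[str]]:
    """Tokenize `text` with each vocabulary variant and return the pieces."""
    if loader is None:
        # Default to the real Hub tokenizers (requires transformers + network).
        from transformers import AutoTokenizer

        loader = AutoTokenizer.from_pretrained
    return {
        label: loader(model_id).tokenize(text)
        for label, model_id in VOCAB_VARIANTS.items()
    }
```

A call such as `compare_tokenizations("kitaplarımızdan")` returns one token list per variant. A larger vocabulary would generally be expected to split agglutinative Turkish words into fewer pieces, though the actual splits depend on the trained vocabularies.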
Caveats
- The NER and sentiment sections are truncated in the provided README, so downstream performance beyond PoS tagging is unclear
- No explicit comparison to non-community Turkish models (e.g., from major cloud providers)
Verdict
Worth bookmarking if you work on Turkish NLP and want battle-tested, openly documented baselines. Skip if you need multilingual coverage: this is Turkish-only by design.