BERT learns Spanish, beats Google at its own game
A Chilean research group trained a monolingual Spanish BERT that outperforms Multilingual BERT on several NLP benchmarks—using Whole Word Masking and a 31k SentencePiece vocabulary.

What it does
BETO is a BERT-Base-sized model trained from scratch on a large Spanish corpus. It comes in cased and uncased flavors, uses Whole Word Masking, and drops straight into the HuggingFace Transformers ecosystem via two model IDs. The training ran for 2M steps with a ~31k BPE subword vocabulary built with SentencePiece.
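As a rough illustration of what "drops straight into" Transformers can look like, here is a minimal sketch that loads the cased checkpoint and fills a masked token. It assumes `transformers` and `torch` are installed and that the model can be downloaded on first use. The helper names `load_beto` and `top_predictions` are hypothetical, chosen for this example; only the model ID comes from the article.

```python
# Sketch: load BETO through Hugging Face Transformers.
# Assumes `pip install transformers torch` and network access on first run.

MODEL_ID = "dccuchile/bert-base-spanish-wwm-cased"  # cased ID named in this article


def load_beto(model_id: str = MODEL_ID):
    """Return (tokenizer, model) for masked-language-model use."""
    # Imported lazily so the module can be inspected without the dependency.
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)
    return tokenizer, model


def top_predictions(text: str, k: int = 5):
    """Fill the first [MASK] token in `text` and return the k likeliest tokens."""
    import torch

    tokenizer, model = load_beto()
    inputs = tokenizer(text, return_tensors="pt")
    # Locate the mask token among the input ids.
    is_mask = inputs["input_ids"][0] == tokenizer.mask_token_id
    mask_positions = is_mask.nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: [1, seq_len, vocab_size]
    top_ids = logits[0, mask_positions[0]].topk(k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)
```

Calling `top_predictions("Santiago es la capital de [MASK].")` should return five candidate tokens; the exact predictions depend on the checkpoint, so none are shown here.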
The interesting bit
The name is a pun: “BETO” picks out the capitalized letters of “BErT en españOl” and also nods to a common Spanish nickname. More substantively, the benchmarks show a monolingual model beating Google’s Multilingual BERT on POS tagging, NER, MLDoc, and XNLI, sometimes by substantial margins. The cased model wins on POS and NER; the uncased model takes MLDoc. It’s a clean demonstration that language-specific pretraining still matters even when multilingual models exist.
Key highlights
- Available via HuggingFace as dccuchile/bert-base-spanish-wwm-cased and ...uncased
- Trained with Whole Word Masking (Spanish-specific, not just multilingual BERT repackaged)
- Benchmarks cover POS, NER, document classification (MLDoc), paraphrase detection (PAWS-X), and NLI (XNLI)
- Includes a Colab notebook for quick experimentation
- Academic paper published at PML4DC @ ICLR 2020
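For readers unfamiliar with Whole Word Masking, the idea is that when a word is chosen for masking, all of its subword pieces are masked together rather than independently. The sketch below illustrates the general technique only; it is not BETO's training code. It assumes BERT-style pieces where continuations start with "##", uses a 15% masking rate as an illustrative default, and skips details such as the random-replacement and keep-original cases used in BERT pretraining. The function names and the sample tokenization are hypothetical.

```python
import random


def group_whole_words(tokens):
    """Group token indices so each group covers one whole word.

    Assumes BERT-style WordPiece output, where a continuation piece
    starts with "##". That convention is an assumption of this sketch.
    """
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)  # continuation: extend the current word
        else:
            groups.append([i])    # start of a new word
    return groups


def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Return a copy of `tokens` where selected words are fully masked."""
    rng = rng or random.Random()
    masked = list(tokens)
    for group in group_whole_words(tokens):
        # One draw per word, so all pieces of a word share the same fate.
        if rng.random() < mask_prob:
            for i in group:
                masked[i] = mask_token
    return masked


# Hypothetical tokenization, for illustration only.
example = ["el", "desa", "##yuno", "es", "impor", "##tante"]
print(group_whole_words(example))
print(whole_word_mask(example, rng=random.Random(0)))
```

The design point is the single random draw per word group: token-level masking would draw once per piece and could leave "desa" visible while hiding "##yuno", which makes the prediction task easier than intended.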
Caveats
- The license situation is murky: the authors want CC BY 4.0 but explicitly warn that the training data’s original licenses may not permit commercial use
- Benchmark table is frozen as of October 2019; no updates shown for newer multilingual competitors
- PAWS-X results lag Multilingual BERT (89.05 vs 90.70), so it’s not a clean sweep
Verdict
Worth grabbing if you’re doing Spanish NLP and want a battle-tested monolingual encoder without training your own. Skip if you need guaranteed commercial licensing clarity or if you’re already committed to newer architectures beyond BERT.