dccuchile/beto

BERT learns Spanish, beats Google at its own game

A Chilean research group trained a monolingual Spanish BERT with Whole Word Masking and a ~31k-subword SentencePiece vocabulary. It outperforms Multilingual BERT on several NLP benchmarks.

504 stars Language Models

What it does

BETO is a BERT-Base-sized model trained from scratch on a large Spanish corpus. It comes in cased and uncased flavors, uses Whole Word Masking, and drops straight into the HuggingFace Transformers ecosystem via two model IDs. Training ran for 2M steps with a vocabulary of about 31k BPE subwords built with SentencePiece.
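
For a sense of what loading it through Transformers might look like, here is a minimal Python sketch. It assumes the `transformers` library and PyTorch are installed and that the HuggingFace Hub is reachable. Only the cased model ID comes from the repository; the helper names `load_beto` and `encode` are made up for illustration.

```python
# Cased model ID as listed by the repository.
BETO_CASED = "dccuchile/bert-base-spanish-wwm-cased"


def load_beto(model_id: str = BETO_CASED):
    """Download (or load from cache) a BETO tokenizer and encoder."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model


def encode(text: str, tokenizer, model):
    """Tokenize one sentence and return the encoder's last-layer hidden states."""
    # return_tensors="pt" asks the tokenizer for PyTorch tensors.
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs)
    # Shape: (1, number_of_tokens, hidden_size)
    return outputs.last_hidden_state
```

Calling `tokenizer, model = load_beto()` followed by `encode("Hola, mundo.", tokenizer, model)` should return one vector per token. For pure inference you would typically wrap the call in `torch.no_grad()`.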

The interesting bit

The name is a pun: “Beto” sounds like “BErT en españOl” and also nods to a common Spanish nickname. More substantively, the benchmarks show a monolingual model beating Google’s Multilingual BERT on POS tagging, NER, and XNLI, sometimes by substantial margins. The cased model wins on POS and NER; the uncased model takes MLDoc. It’s a clean demonstration that language-specific pretraining still matters even when multilingual models exist.

Key highlights

  • Available via HuggingFace as dccuchile/bert-base-spanish-wwm-cased and ...uncased
  • Trained from scratch on Spanish text with Whole Word Masking (not just Multilingual BERT repackaged)
  • Benchmarks cover POS, NER, document classification (MLDoc), paraphrase detection (PAWS-X), and NLI (XNLI)
  • Includes a Colab notebook for quick experimentation
  • Academic paper published at PML4DC @ ICLR 2020
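
Whole Word Masking, mentioned in the highlights above, is easy to picture with a toy example. The Python sketch below is a simplified illustration, not the authors' training code. It assumes WordPiece-style tokens where a leading `##` marks a continuation piece, uses the 15% masking rate from the original BERT recipe as a default, and skips details such as special tokens and random-token replacement. The function names and the example token split are hypothetical.

```python
import random


def group_whole_words(tokens):
    """Group token indices so that each group covers one whole word.

    Assumption: a leading "##" marks a piece that continues the previous word.
    """
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)  # continuation piece joins the current word
        else:
            groups.append([i])  # a new word starts here
    return groups


def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Select words at random and mask every piece of each selected word."""
    rng = random.Random(seed)
    masked = list(tokens)
    for group in group_whole_words(tokens):
        # The decision is made once per word, never per subword piece.
        if rng.random() < mask_prob:
            for i in group:
                masked[i] = mask_token
    return masked


# Hypothetical subword split of "el desarrollo sostenible".
example = ["el", "des", "##arrollo", "sos", "##tenible"]
masked_example = whole_word_mask(example, mask_prob=0.5, seed=0)
```

The point of the design is that `des` and `##arrollo` are always masked or kept together, so the model cannot recover a masked piece just by looking at the rest of the same word.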

Caveats

  • The license situation is murky: the authors intend CC BY 4.0 but explicitly warn that the training data’s original licenses may not permit commercial use
  • Benchmark table is frozen as of October 2019; no updates shown for newer multilingual competitors
  • PAWS-X results lag Multilingual BERT (89.05 vs 90.70), so it’s not a clean sweep

Verdict

Worth grabbing if you’re doing Spanish NLP and want a battle-tested monolingual encoder without training your own. Skip if you need guaranteed commercial licensing clarity or if you’re already committed to newer architectures beyond BERT.
