dccuchile/beto

BERT learns Spanish, beats Google at its own game

A Chilean research group trained a monolingual Spanish BERT that outperforms Multilingual BERT on several NLP benchmarks—using Whole Word Masking and a 31k SentencePiece vocabulary.

★ 502 stars · Language Models

What it does

BETO is a BERT-Base-sized model trained from scratch on a large Spanish corpus. It comes in cased and uncased flavors, uses Whole Word Masking, and drops straight into the HuggingFace Transformers ecosystem via two model IDs. The training ran for 2M steps with a ~31k BPE subword vocabulary built with SentencePiece.
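
As a rough illustration of what "drops straight into" HuggingFace Transformers can look like, here is a minimal sketch that loads the cased checkpoint and asks it to fill in a masked word. The helper names `load_beto` and `top_predictions` are made up for this example, and it assumes the `transformers` and `torch` packages are installed and the model can be downloaded.

```python
# Model ID taken from the project listing; everything else is illustrative.
MODEL_ID = "dccuchile/bert-base-spanish-wwm-cased"


def load_beto(model_id: str = MODEL_ID):
    """Load the tokenizer and a masked-LM head (hypothetical helper).

    Imports are done lazily so this file can be imported without
    transformers installed; the call itself needs network or a local cache.
    """
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def top_predictions(text: str, tokenizer, model, k: int = 5):
    """Return the k most likely tokens for the single mask token in text."""
    import torch

    inputs = tokenizer(text, return_tensors="pt")
    is_mask = inputs["input_ids"][0] == tokenizer.mask_token_id
    positions = is_mask.nonzero(as_tuple=True)[0]
    if len(positions) != 1:
        raise ValueError("text must contain exactly one mask token")

    with torch.no_grad():
        logits = model(**inputs).logits

    # Scores over the vocabulary at the masked position.
    top_ids = logits[0, positions[0]].topk(k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)
```

A typical call would be `tokenizer, model = load_beto()` followed by `top_predictions(f"Santiago es la capital de {tokenizer.mask_token}.", tokenizer, model)`. The predictions depend on the checkpoint, so no expected output is shown here.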

The interesting bit

The name is a pun: “Beto” sounds like “BErT en españOl” and also nods to a common Spanish nickname. More substantively, the benchmarks show a monolingual model beating Google’s Multilingual BERT on POS tagging, NER, MLDoc, and XNLI, in some cases by substantial margins. The cased model wins on POS and NER; the uncased model takes MLDoc. It’s a clean demonstration that language-specific pretraining still matters even when multilingual models exist.

Key highlights

Available via HuggingFace as dccuchile/bert-base-spanish-wwm-cased and ...uncased
Trained from scratch on Spanish text with Whole Word Masking (not just multilingual BERT repackaged)
Benchmarks cover POS, NER, document classification (MLDoc), paraphrase detection (PAWS-X), and NLI (XNLI)
Includes a Colab notebook for quick experimentation
Academic paper published at PML4DC @ ICLR 2020
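
The Whole Word Masking highlight above can be sketched in a few lines. This is a simplified illustration, not BETO's actual pretraining code: it assumes BERT-style tokens where continuation pieces start with `##`, masks every piece of a chosen word together, and skips the usual 80/10/10 replacement scheme. The function names and the example tokenization are hypothetical.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def group_whole_words(tokens):
    """Group token indices so that each group covers one whole word.

    Assumption: a token starting with '##' continues the previous word.
    """
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Mask whole words: if a word is picked, all of its pieces are masked."""
    rng = rng or random.Random()
    # Special tokens are never candidates for masking.
    candidates = [
        g for g in group_whole_words(tokens) if tokens[g[0]] not in SPECIAL_TOKENS
    ]
    masked = list(tokens)
    if not candidates:
        return masked

    # Mask at least one word, and never more than are available.
    num_to_mask = min(len(candidates), max(1, round(len(candidates) * mask_prob)))
    for group in rng.sample(candidates, num_to_mask):
        for i in group:
            masked[i] = mask_token
    return masked


# Hypothetical tokenization, for illustration only.
example = ["[CLS]", "los", "investiga", "##dores", "chile", "##nos", "[SEP]"]
print(whole_word_mask(example, mask_prob=0.5, rng=random.Random(0)))
```

The contrast is with per-token masking, where `investiga` could be masked while `##dores` stays visible, which makes the prediction task much easier.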

Caveats

The license situation is murky: the authors say CC BY 4.0 best describes their intent, but caution that the licenses of the training datasets may not all be compatible with it, particularly for commercial use
Benchmark table is frozen as of October 2019; no updates shown for newer multilingual competitors
PAWS-X results lag Multilingual BERT (89.05 vs 90.70), so it’s not a clean sweep

Verdict

Worth grabbing if you’re doing Spanish NLP and want a battle-tested monolingual encoder without training your own. Skip if you need guaranteed commercial licensing clarity or if you’re already committed to newer architectures beyond BERT.

Frequently asked

What is dccuchile/beto?: A Chilean research group trained a monolingual Spanish BERT that outperforms Multilingual BERT on several NLP benchmarks—using Whole Word Masking and a 31k SentencePiece vocabulary.
Is beto open source?: Yes. dccuchile/beto is released under the CC-BY-4.0 license, though the authors caution that the training data’s original licenses may not permit commercial use.
How popular is beto?: dccuchile/beto has 502 stars on GitHub.
Where can I find beto?: dccuchile/beto is on GitHub at https://github.com/dccuchile/beto.