Ethan-yt/guwenbert

RoBERTa, Retrained on 15,000 Ancient Texts

Because modern Chinese BERT models flounder on classical literature, this project continues pre-training RoBERTa on 1.7 billion characters of ancient text to build a dedicated classical Chinese embedding.

What it does

GuwenBERT is a RoBERTa model continually pre-trained on the 殆知阁古代文献 corpus—15,694 classical Chinese books totaling 1.7 billion characters, all converted to simplified Chinese. It ships with a custom 23,292-character vocabulary built from high-frequency classical tokens, and is packaged for HuggingFace Transformers in base and large variants. The goal is to provide embeddings that actually understand literary Chinese rather than treating it as broken modern Mandarin.

The interesting bit

Instead of training from scratch, the team started from a modern Chinese RoBERTa and trained in two stages. For the first 20K steps they froze the Transformer layers so the model could learn a classical embedding space, then unfroze everything for another 100K steps, carrying the model's modern Chinese syntactic knowledge over to classical text. On a classical NER benchmark, the authors report that this two-stage approach outperformed the most popular modern Chinese RoBERTa by 6.3% and reached competitive accuracy in just 300 steps, which matters when annotated classical datasets are tiny.
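
The freeze-then-unfreeze schedule can be sketched as below. This is a minimal illustration, not the authors' training code: the helper name set_trainable_for_step is hypothetical, the 20,000-step boundary is taken from the description above, and matching parameters by the substring "encoder" assumes the naming used by Hugging Face RoBERTa models (for example roberta.encoder.layer.0...). It works on any object exposing named_parameters(), such as a PyTorch module.

```python
def set_trainable_for_step(model, step, embedding_only_steps=20_000):
    """Freeze encoder (Transformer) layers in stage one, train everything after.

    Assumption: encoder parameters have ".encoder." in their dotted name
    (or start with "encoder."), as in Hugging Face RoBERTa checkpoints.
    """
    stage_one = step < embedding_only_steps
    for name, param in model.named_parameters():
        # Prefix a dot so that both "encoder.x" and "roberta.encoder.x" match.
        is_encoder = ".encoder." in f".{name}"
        # Stage one: encoder frozen, embeddings and other parameters trainable.
        # Stage two: every parameter trainable.
        param.requires_grad = not (stage_one and is_encoder)
    return "embeddings-only" if stage_one else "full"
```

In a training loop it would be enough to call the helper once at step 0 and once at the stage boundary, since the flags persist between steps.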

Key highlights

Continual pre-training from hfl/chinese-roberta-wwm-ext on 1.7B classical characters.
Custom tokenizer with 23,292 classical characters; AutoTokenizer loads BertTokenizer because RoBERTa’s BPE is ill-suited for Chinese.
Available on HuggingFace Hub as ethanyt/guwenbert-base and ethanyt/guwenbert-large.
Achieved second place in the 2020 “古联杯” classical Chinese NER competition using a simple BERT+CRF pipeline.
Includes Baidu Pan download mirrors for users inside mainland China.
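
Loading either variant through the Auto classes might look like the sketch below. The Hub IDs come from the list above; the helper names load_guwenbert and embed are hypothetical, and the code assumes the transformers and torch packages are installed and that the Hub is reachable on first use.

```python
MODEL_IDS = {
    "base": "ethanyt/guwenbert-base",
    "large": "ethanyt/guwenbert-large",
}

def load_guwenbert(variant="base"):
    """Return (tokenizer, model) for the requested GuwenBERT variant."""
    if variant not in MODEL_IDS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(MODEL_IDS)}")
    # Imported lazily so the module can be inspected without transformers installed.
    from transformers import AutoModel, AutoTokenizer
    model_id = MODEL_IDS[variant]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model

def embed(text, tokenizer, model):
    """Return one contextual vector per token for a single sentence."""
    import torch
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    # Shape: (number of tokens including special tokens, hidden size).
    return outputs.last_hidden_state[0]
```

A call such as embed("晋太元中，武陵人捕鱼为业。", *load_guwenbert("base")) would download the weights on first use and return the per-token vectors.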

Caveats

The underlying paper had not been published when the README was written, so for now the project can only be cited via a footnote.
The authors note that reported results reflect specific datasets and hyperparameters, and may shift with different random seeds or hardware.
All classical characters were converted to simplified Chinese during preprocessing, which may affect tasks requiring traditional orthography.
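
Because the training corpus was simplified, input in traditional script would likely need the same normalization before it reaches the tokenizer. The sketch below only illustrates that step: the four-character table is a hand-written placeholder, not a real conversion resource, and the helper name to_simplified is hypothetical. A real pipeline would rely on a full converter such as OpenCC.

```python
# Deliberately tiny, hand-written table for illustration only.
# A real pipeline needs a complete traditional-to-simplified mapping.
TRAD_TO_SIMP = str.maketrans({
    "論": "论",
    "語": "语",
    "漢": "汉",
    "書": "书",
})

def to_simplified(text):
    """Map the traditional characters in the table to simplified forms."""
    # Characters missing from the table pass through unchanged.
    return text.translate(TRAD_TO_SIMP)
```

Note that the mapping is lossy in the other direction, so tasks that depend on traditional orthography cannot recover it from the model's input.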

Verdict

Digital humanists and NLP researchers working with classical Chinese texts should try this before wrestling with modern Chinese embeddings. If your corpus is strictly modern Mandarin, it is probably the wrong tool.

Frequently asked

What is Ethan-yt/guwenbert?
A RoBERTa model for classical Chinese. Because modern Chinese BERT models flounder on classical literature, the project continues pre-training RoBERTa on 1.7 billion characters of ancient text to build a dedicated classical Chinese embedding.

Is guwenbert open source?
Yes. Ethan-yt/guwenbert is open source, released under the Apache-2.0 license.

How popular is guwenbert?
Ethan-yt/guwenbert has 566 stars on GitHub.

Where can I find guwenbert?
Ethan-yt/guwenbert is on GitHub at https://github.com/Ethan-yt/guwenbert.