Ethan-yt/guwenbert

A RoBERTa-based pre-trained language model for Classical Chinese texts and NLP tasks

563 stars · Language Models
GuwenBERT is a pre-trained language model for Classical Chinese (Literary Chinese). It is built on the RoBERTa architecture and was trained on a corpus of approximately 1.7 billion characters drawn from 15,694 books of ancient Chinese text. Its vocabulary has 23,296 entries built from high-frequency Classical Chinese characters, and the model targets Classical Chinese NLP tasks such as sentence segmentation, punctuation prediction, and named entity recognition.
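
To make the idea of a vocabulary built from high-frequency characters concrete, here is a minimal Python sketch. It is not GuwenBERT's actual vocabulary-building code, which this page does not describe. The function names `build_char_vocab` and `encode`, the special-token list, and the four-line toy corpus are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical special tokens; the real model's list may differ.
SPECIALS = ("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")


def build_char_vocab(texts, max_size, specials=SPECIALS):
    """Map the most frequent characters in `texts` to integer ids."""
    # Count every non-whitespace character across the corpus.
    counts = Counter(ch for text in texts for ch in text if not ch.isspace())
    # Rank by descending frequency; break ties by code point for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Reserve room for the special tokens, then keep the top characters.
    budget = max(0, max_size - len(specials))
    tokens = list(specials) + [ch for ch, _ in ranked[:budget]]
    return {tok: idx for idx, tok in enumerate(tokens)}


def encode(text, vocab):
    """Convert text to ids, sending out-of-vocabulary characters to [UNK]."""
    unk = vocab["[UNK]"]
    return [vocab.get(ch, unk) for ch in text if not ch.isspace()]


# Toy corpus (illustrative only), with a deliberately tiny vocabulary.
corpus = ["學而時習之", "不亦說乎", "有朋自遠方來", "不亦樂乎"]
vocab = build_char_vocab(corpus, max_size=10)
ids = encode("不亦樂乎", vocab)
```

With a cap of 10 entries, rare characters such as 樂 fall outside the vocabulary and are encoded as `[UNK]`. A larger cap, like the 23,296 entries quoted above, trades a bigger embedding table for fewer unknown characters.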