Ethan-yt/guwenbert
A RoBERTa-based pre-trained language model for Classical Chinese texts and NLP tasks

GuwenBERT is a pre-trained language model for Classical Chinese (literary Chinese). It is built on the RoBERTa architecture and trained on a corpus of ancient Chinese texts totaling approximately 1.7 billion characters from 15,694 books. Its vocabulary of 23,296 entries is built from high-frequency Classical Chinese characters. The model targets downstream Classical Chinese NLP tasks such as sentence segmentation, punctuation prediction, and named entity recognition.
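As a rough illustration of how the model might be used for one of the tasks above, the sketch below frames punctuation prediction as per-character labeling, which is one common formulation and not necessarily the author's. The helper names `to_char_labels` and `load_guwenbert`, the punctuation set, and the `"O"` label are all hypothetical. The Hub ID `ethanyt/guwenbert-base` is an assumption not stated in this description, so check it against the model page before relying on it.

```python
# Minimal sketch: build per-character labels for punctuation prediction.
# Assumption: each character is labeled with the punctuation mark that
# follows it, or "O" when none follows. This framing is illustrative only.

PUNCTUATION = set("，。、；：？！")  # assumed label set, extend as needed


def to_char_labels(punctuated, no_punct="O"):
    """Strip punctuation and return (characters, labels) of equal length."""
    chars, labels = [], []
    for ch in punctuated:
        if ch in PUNCTUATION:
            # Attach the mark to the preceding character. Leading punctuation
            # is dropped; of several consecutive marks, the last one wins.
            if labels:
                labels[-1] = ch
        else:
            chars.append(ch)
            labels.append(no_punct)
    return chars, labels


def load_guwenbert(model_id="ethanyt/guwenbert-base"):
    """Load tokenizer and encoder from the Hugging Face Hub.

    The model ID is an assumption. The import is done lazily so the label
    helper above works without the transformers package installed.
    """
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model


if __name__ == "__main__":
    chars, labels = to_char_labels("學而時習之，不亦說乎？")
    for ch, label in zip(chars, labels):
        print(ch, label)
```

The character and label lists produced this way could serve as training targets for a token-classification head on top of the encoder; sentence segmentation can be treated the same way with a single "break" label.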