AstarLight/CPS-OCR-Engine

When Tesseract fails and Baidu bills you, build your own

An OCR engine for printed Chinese characters, born from frustration with existing tools and built as a side project for a university finance office.

1.1k stars · Python · Computer Vision · Data Tooling
Velocity (7d): +0.3 ★/day · Trend: steady

What it does

CPS-OCR-Engine recognizes the 3,755 printed Chinese characters of the Level 1 character set in scanned documents, IDs, and invoices. It trains on synthetically generated data and runs inference on images placed in a tmp directory. The author built it to power an intelligent bill-processing system for their university’s finance office.

The interesting bit

The synthetic data pipeline is the quiet workhorse: gen_printed_char.py renders training samples from Chinese font files with configurable rotation, margins, and sizes. No manual labeling required. The author claims top-1 accuracy of 0.99826 and top-5 of 0.99989, though the benchmark source and test conditions are unspecified.
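For readers unfamiliar with the top-1 and top-5 figures, here is a minimal sketch of how top-k accuracy is usually computed. It is not taken from the repository: the function name `top_k_accuracy` and the toy scores are illustrative assumptions, and the real evaluation presumably runs over all 3,755 classes inside the training framework.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: list of per-class score lists, one list per sample (hypothetical layout).
    labels: list of true class indices, one per sample.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one sample")
    if k < 1:
        raise ValueError("k must be at least 1")

    hits = 0
    for row, label in zip(scores, labels):
        # Rank class indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy data: 4 samples, 3 classes (made up for illustration only).
scores = [
    [0.1, 0.7, 0.2],  # true class 1: ranked first
    [0.5, 0.3, 0.2],  # true class 1: ranked second
    [0.2, 0.3, 0.5],  # true class 0: ranked third
    [0.6, 0.3, 0.1],  # true class 0: ranked first
]
labels = [1, 1, 0, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

Top-5 is always at least as high as top-1, since a correct first-place prediction also counts as a top-5 hit. The gap between the two claimed numbers therefore reflects cases where the right character was among the top five candidates but not ranked first.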

Key highlights

  • Synthetic training data generation from fonts with rotation up to 30 degrees
  • Single-script workflow: train, validate, and infer through Chinese_OCR.py modes
  • Pre-trained model distributed via Baidu Pan (link + password in README)
  • Focused scope: printed Chinese only, not handwritten or multi-language
  • Accompanying blog post with implementation details (Chinese language)
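To make the font-based generation idea concrete, the sketch below plans which samples such a pipeline might render: every character, in every font, at a range of rotation angles. It is only an illustration under stated assumptions. `RenderJob`, `plan_render_jobs`, the symmetric angle range, the step size, and the font paths are all hypothetical and do not reflect the actual options of gen_printed_char.py. The rendering itself, which would need an imaging library, is left out.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderJob:
    """One synthetic sample to render (hypothetical structure)."""
    char: str
    label: int       # class index, derived from the character's position
    font_path: str
    angle_deg: int
    margin_px: int
    size_px: int


def plan_render_jobs(chars, font_paths, max_rotation=30, rotation_step=10,
                     margin_px=4, size_px=64):
    """Enumerate char x font x angle combinations.

    Assumes rotations are sampled symmetrically in [-max_rotation, +max_rotation].
    """
    if max_rotation < 0 or rotation_step <= 0:
        raise ValueError("max_rotation must be >= 0 and rotation_step > 0")

    # Labels come for free: each character's index is its class.
    label_of = {ch: i for i, ch in enumerate(chars)}
    angles = range(-max_rotation, max_rotation + 1, rotation_step)

    return [
        RenderJob(ch, label_of[ch], font, angle, margin_px, size_px)
        for ch, font, angle in itertools.product(chars, font_paths, angles)
    ]


# Two characters, two placeholder font paths, angles -30..30 in steps of 10.
jobs = plan_render_jobs(["一", "二"], ["fonts/a.ttf", "fonts/b.ttf"])
```

Because the label is known at render time, every generated image is labeled by construction, which is why this kind of pipeline needs no manual annotation.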

Caveats

  • README is entirely in Chinese; code comments and CLI help may be too
  • Pre-trained model hosted on Baidu Pan, which requires an account and is region-restricted
  • No mention of framework version, dependencies, or installation steps
  • Character recognition requires pre-segmented single-character images; no line or paragraph detection shown

Verdict

Worth a look if you need printed Chinese OCR and can read Chinese documentation or don’t mind spelunking through the code. Skip if you need multilingual support, handwriting recognition, or a batteries-included pipeline with text detection and layout analysis.
