
rkcosmos/deepcut

Thai word tokenization library using a CNN to predict word boundaries by classifying whether characters are word beginnings.

427 stars · Python · Data Tooling · Language Models
Velocity (7d): +0.1 ★ / day · Trend: steady

Deepcut is a deep learning-based tokenization library for Thai text. It uses a convolutional neural network trained on the NECTEC BEST corpus to classify each character as either the beginning of a word or not, and word boundaries follow directly from those per-character predictions. The library can be installed with pip or run via Docker, and a JavaScript port (DeepcutJS) supports browser-based tokenization. The model achieved 98.1% F1 on the test set.
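To make the word-beginning formulation concrete, here is a minimal sketch of the decoding step only: turning one binary label per character into a list of words. It does not call Deepcut or run any model. The function name `labels_to_words` and the hand-written labels are hypothetical and chosen purely for illustration; in practice the labels would come from the CNN's predictions.

```python
from typing import List, Sequence


def labels_to_words(text: str, is_word_start: Sequence[int]) -> List[str]:
    """Group characters into words from per-character word-start labels.

    Hypothetical helper for illustration; not part of Deepcut's API.
    A truthy label means the character at that position begins a new word.
    """
    if len(text) != len(is_word_start):
        raise ValueError("text and labels must have the same length")

    words: List[str] = []
    for ch, start in zip(text, is_word_start):
        # The first character always opens a word, even if its label is 0.
        if start or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


# Hand-written labels (an assumption, not model output) for the
# five code points of "ตัดคำ": ต ั ด | ค ำ
example = labels_to_words("ตัดคำ", [1, 0, 0, 1, 0])
print(example)
```

Predicting only word beginnings keeps the task a single binary decision per character, and the segmentation is recovered with one linear pass like the one above.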
