rkcosmos/deepcut
Thai word tokenization library using a CNN to predict word boundaries by classifying whether characters are word beginnings.

Deepcut is a deep learning tokenization library for Thai text. It uses a convolutional neural network trained on the NECTEC BEST corpus to classify each character as either the beginning of a word or not; word boundaries follow directly from those per-character predictions. The library can be installed with pip or run through Docker, and a JavaScript port (DeepcutJS) supports tokenization in the browser. It achieved a 98.1% F1 score on the test set.
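To illustrate how per-character "word beginning" predictions become tokens, here is a minimal sketch in plain Python. It is not Deepcut's code: the function name `labels_to_tokens` is hypothetical, and the labels are written by hand rather than produced by a model. A Latin-script string stands in for Thai text so the character positions are easy to follow.

```python
from typing import List, Sequence


def labels_to_tokens(text: str, is_word_start: Sequence[int]) -> List[str]:
    """Split text into tokens given one binary label per character.

    A label of 1 means "this character begins a word"; 0 means it
    continues the current word. (Hypothetical helper for illustration.)
    """
    if len(text) != len(is_word_start):
        raise ValueError("need exactly one label per character")

    tokens: List[str] = []
    current = ""
    for ch, start in zip(text, is_word_start):
        # A predicted word start closes the token collected so far.
        if start and current:
            tokens.append(current)
            current = ""
        current += ch
    if current:
        tokens.append(current)
    return tokens


# Hand-written labels standing in for classifier output.
text = "thecatsat"
labels = [1, 0, 0, 1, 0, 0, 1, 0, 0]
tokens = labels_to_tokens(text, labels)
print(tokens)  # ['the', 'cat', 'sat']
```

Framing segmentation this way means the classifier only has to make one yes/no decision per character, and the token list is recovered with a single linear pass over the text.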