Data Tooling

heavyweights · gaining speed

+0.1 ★/day→steady

A PHP-native library that brings text analysis, sentiment scoring, and document classification to codebases that can't justify a Python microservice.

★ 533 PHP Language Models · explained

AdolfVonKleist/Phonetisaurus

+0.1 ★/day→steady

Trains models that guess how words sound, because you can't ship a pronunciation dictionary for every proper noun the user will invent.

★ 516 Shell Data Tooling · explained

dbpedia-spotlight/dbpedia-spotlight

+0.1 ★/day→steady

The original DBpedia Spotlight entity linker still works, but the maintainers have packed up and left for a cleaner, Apache-licensed rewrite.

★ 759 Scala RAG · Search · explained

FerreroJeremy/ln2sql

+0.1 ★/day→steady

A Python tool that parses natural language questions and turns them into executable SQL using only a database dump—no live connection required.

★ 521 Python Language Models · explained

openml/OpenML

+0.2 ★/day→steady

A 2013-vintage open-science platform for sharing ML experiments, datasets, and results—now being retired in favor of a FastAPI rewrite.

★ 741 PHP Data Tooling · explained

synyi/poplar

+0.1 ★/day→steady

A browser-native NLP annotation component for when you need to label text without leaving the DOM.

★ 529 TypeScript Data Tooling · explained

anoopkunchukuttan/indic_nlp_library

+0.1 ★/day→steady

A Python library that treats Hindi, Tamil, Bengali, and friends as a family rather than isolated problems.

★ 638 Python Language Models · explained

mlcommons/ck

+0.2 ★/day→steady

A community-built automation framework trying to make ML benchmarking reproducible across the chaos of GPUs, containers, and constantly shifting software stacks.

★ 647 Python LLMOps · Eval · explained

hackingmaterials/matminer

+0.2 ★/day→steady

Matminer collects scattered materials-science datasets and featurizers into one library so researchers can stop writing the same data-prep scripts.

★ 601 HTML Domain Apps · explained

meta-toolkit/meta

+0.2 ★/day→steady

MeTA bundles tokenization, search indexes, topic models, and CRFs into one compiled toolkit for researchers who'd rather fight algorithms than package managers.

★ 714 C++ Language Models · explained

nonamestreet/weixin_public_corpus

+0.2 ★/day→steady

A scraped, cleaned corpus of WeChat public account articles in JSON format, released for research use.

★ 594 Data Tooling · explained

fhamborg/Giveme5W1H

+0.2 ★/day→steady

A Python library that reverse-engineers the 5W1H structure from news articles, because someone finally decided to treat reporters' training as a spec.

★ 533 HTML Data Tooling · explained

ryouchinsa/Rectlabel-support

+0.2 ★/day→steady

RectLabel is a commercial macOS app whose support repo reveals an unusually deep stack of offline ML models for labeling images and video.

★ 553 Jupyter Notebook Data Tooling · explained

explosion/prodigy-recipes

+0.2 ★/day→steady

A public repo of commented, tweakable scripts for Explosion's commercial annotation tool.

★ 507 Jupyter Notebook Data Tooling · explained

CornellNLP/ConvoKit

+0.2 ★/day→steady

A scikit-learn-flavored toolkit that turns messy conversations into measurable social signals.

★ 635 Jupyter Notebook Data Tooling · explained

atilika/kuromoji

+0.2 ★/day→steady

A self-contained Java morphological analyzer that ships its own dictionaries so you don't have to wrestle with MeCab.

★ 1k Java Data Tooling · explained

mint-lab/awesome-robotics-datasets

+0.2 ★/day→steady

A curated list of robotics datasets, because finding the right one shouldn't require a PhD in search-engine optimization.

★ 510 Data Tooling · explained

Jakobovski/free-spoken-digit-dataset

+0.2 ★/day→steady

A tidy, versioned dataset of 3,000 spoken-digit recordings for when your model needs to learn what "seven" sounds like at 8 kHz.

★ 677 Python Data Tooling · explained

adbar/German-NLP

+0.2 ★/day→steady

Someone finally catalogued the chaos of German-language NLP resources so you don't have to hunt through CLARIN portals at 2am.

★ 524 Learning · explained

davidsbatista/Annotated-Semantic-Relationships-Datasets

+0.2 ★/day→steady

Someone finally collected all the scattered NLP relation-extraction datasets into one repo so you don't have to hunt through decade-old conference websites.

★ 707 Data Tooling · explained
