Docta-ai/docta
A Python tool that diagnoses and cures data quality issues for ML models, including fixing label errors in LLM alignment datasets.

Docta is a data-centric AI platform that detects and corrects issues in training data. It supports tabular, text, and image data, as well as embeddings from pre-trained models. The open-source version offers training-free data diagnosis, curation, and nutrition services. One key demo shows how to fix human annotation errors in LLM responses from Anthropic’s red-teaming dataset (hh-rlhf), which makes Docta particularly relevant to RLHF pipelines.
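To give a feel for what "training-free" diagnosis over pre-trained embeddings can look like, here is a minimal sketch in plain NumPy. It is not Docta's API or its actual algorithm: the function name `flag_suspect_labels`, the parameter `k`, and the nearest-neighbour majority vote are all assumptions chosen for illustration. The idea is that a sample whose label disagrees with most of its nearest neighbours in embedding space is a candidate label error, and no model has to be trained to find it.

```python
import numpy as np

def flag_suspect_labels(embeddings, labels, k=5):
    """Hypothetical helper (not part of Docta): return indices of samples
    whose label differs from the majority label of their k nearest
    neighbours, measured by cosine similarity."""
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)

    # Normalise rows so that the dot product equals cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)
    sim = Xn @ Xn.T

    # A sample must not vote for itself.
    np.fill_diagonal(sim, -np.inf)

    # Indices of the k most similar samples for each row.
    neighbours = np.argsort(-sim, axis=1)[:, :k]

    suspects = []
    for i, neigh in enumerate(neighbours):
        values, counts = np.unique(y[neigh], return_counts=True)
        majority = values[np.argmax(counts)]
        if majority != y[i]:
            suspects.append(i)
    return suspects
```

In practice the embeddings would come from a pre-trained encoder, and flagged samples would be reviewed or relabelled rather than dropped automatically. A majority vote over a fixed `k` is only one simple choice; it will miss errors in regions where mislabelled samples cluster together.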