
voidful/awesome-chatgpt-dataset

A curated collection of datasets for training large language models, with scripts to merge and upload them to Hugging Face Hub.

763 stars · Python · Data Tooling · Learning
Velocity (7d): +0.7 ★ / day · Trend: steady

This repository is an awesome list of datasets for training ChatGPT-like LLMs. It catalogs datasets by purpose, including alignment (LIMA), safety training (WildGuardMix), function calling (Berkeley Function Calling Leaderboard), and multi-turn conversation (Puffin). It also includes a preprocessing script that lets users select datasets, merge them, and upload the combined dataset directly to the Hugging Face Hub.
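The select, merge, and upload flow described above might look roughly like the sketch below. It is not the repository's actual script: the function names (`merge_datasets`, `upload_to_hub`), the common `source`/`prompt`/`response` schema, and the toy records are all assumptions made for illustration. The upload step assumes the Hugging Face `datasets` library is installed and that you are already authenticated with the Hub.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def merge_datasets(
    selected: Dict[str, List[dict]],
    columns: Dict[str, Tuple[str, str]],
) -> List[Row]:
    """Merge the selected datasets into one list with a shared schema.

    `columns` maps each dataset name to its (prompt_key, response_key),
    because different datasets name these fields differently.
    """
    merged: List[Row] = []
    for name, records in selected.items():
        prompt_key, response_key = columns[name]
        for record in records:
            merged.append(
                {
                    "source": name,  # remember where each row came from
                    "prompt": str(record[prompt_key]),
                    "response": str(record[response_key]),
                }
            )
    return merged


def upload_to_hub(rows: List[Row], repo_id: str) -> None:
    """Push merged rows to the Hugging Face Hub (needs `datasets` and a login)."""
    # Imported lazily so the merge step works without the library installed.
    from datasets import Dataset

    Dataset.from_list(rows).push_to_hub(repo_id)


# Toy stand-ins for two selected datasets (made-up records, not real data).
toy_a = [{"instruction": "Say hi.", "output": "Hi!"}]
toy_b = [{"question": "2 + 2?", "answer": "4"}]

merged = merge_datasets(
    {"toy_a": toy_a, "toy_b": toy_b},
    {"toy_a": ("instruction", "output"), "toy_b": ("question", "answer")},
)
print(len(merged), "rows merged")

# Hypothetical repo id; uncomment only after authenticating with the Hub.
# upload_to_hub(merged, "your-username/merged-chat-dataset")
```

Mapping every dataset onto one schema before concatenating is a design choice of this sketch: it keeps the merged rows uniform, and the `source` field makes it possible to filter or rebalance by origin later.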
