voidful/awesome-chatgpt-dataset
A curated collection of datasets for training large language models, with scripts to merge and upload them to Hugging Face Hub.

This repository is an "awesome list" of datasets for training ChatGPT-like LLMs. It catalogs datasets by purpose, including alignment (LIMA), safety training (WildGuardMix), function calling (Berkeley Function Calling Leaderboard), and multi-turn conversation (Puffin). It also includes a preprocessing script that lets users select datasets, merge them, and upload the combined dataset directly to Hugging Face Hub.
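
The select, merge, and upload flow described above could look roughly like the sketch below. This is not the repository's actual script: the shared schema (`instruction`, `input`, `output`, `source`), the helper names `normalize`, `merge_datasets`, and `upload`, the column mappings, and the toy records are all assumptions made for illustration. Only the upload step relies on the Hugging Face `datasets` library (`Dataset.from_list` and `Dataset.push_to_hub`), and it is imported lazily so the merge logic runs without it.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical shared schema; the real script may use different column names.
SCHEMA = ("instruction", "input", "output")


def normalize(record: dict, source: str, mapping: Dict[str, str]) -> dict:
    """Rename one record's source-specific columns to the shared schema."""
    row = {field: "" for field in SCHEMA}
    for field, column in mapping.items():
        if field not in SCHEMA:
            raise KeyError(f"unknown target field: {field}")
        # Missing columns become empty strings; values are stored as text.
        row[field] = str(record.get(column, ""))
    row["source"] = source  # remember which dataset the row came from
    return row


def merge_datasets(
    selected: Dict[str, Tuple[Iterable[dict], Dict[str, str]]]
) -> List[dict]:
    """Concatenate the selected datasets after normalizing every record."""
    merged: List[dict] = []
    for name, (records, mapping) in selected.items():
        merged.extend(normalize(record, name, mapping) for record in records)
    return merged


def upload(rows: List[dict], repo_id: str) -> None:
    """Push merged rows to Hugging Face Hub (needs `datasets` and a Hub login)."""
    from datasets import Dataset  # imported lazily; optional dependency

    Dataset.from_list(rows).push_to_hub(repo_id)


if __name__ == "__main__":
    # Toy, made-up records standing in for two selected datasets.
    selected = {
        "chat_toy": (
            [{"prompt": "Hi", "reply": "Hello"}],
            {"instruction": "prompt", "output": "reply"},
        ),
        "qa_toy": (
            [{"q": "2+2?", "ctx": "math", "a": 4}],
            {"instruction": "q", "input": "ctx", "output": "a"},
        ),
    }
    rows = merge_datasets(selected)
    print(f"merged {len(rows)} rows")
    # upload(rows, "your-username/your-merged-dataset")  # placeholder repo id
```

Keeping a `source` column on every row makes it possible to filter or reweight individual datasets after merging, which is one reason a sketch like this normalizes before concatenating.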