yaodongC/awesome-instruction-dataset
A curated list of open-source instruction-tuning and RLHF datasets for training text and multi-modal instruction-following LLMs.

This repository aggregates publicly available datasets for training and fine-tuning instruction-following large language models. It categorizes datasets by modality (text, visual), generation method (human-generated, self-instruct, mixed), language, and task type. The collection covers resources for training chat-based models in the style of Alpaca, LLaMA, ChatGPT, and GPT-4, as well as red-teaming and RLHF datasets used for LLM alignment.
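
As a rough illustration of the categorization axes described above, the following minimal Python sketch models a catalog entry and filters by category. Everything in it is an assumption made for illustration: the `DatasetEntry` class, the `filter_entries` helper, the placeholder entry names, and the language and task values (`"en"`, `"multi-task"`, and so on) are hypothetical and are not the repository's actual tags or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical record for one dataset in a curated list."""
    name: str        # placeholder name, not a real dataset
    modality: str    # "text" or "visual"
    generation: str  # "human-generated", "self-instruct", or "mixed"
    language: str    # assumed values, e.g. "en", "multilingual"
    task: str        # assumed values, e.g. "multi-task", "task-specific"


def filter_entries(entries, **criteria):
    """Return the entries whose fields equal every given criterion."""
    return [
        entry for entry in entries
        if all(getattr(entry, field) == value for field, value in criteria.items())
    ]


# Placeholder catalog; these entries are invented for the example.
CATALOG = [
    DatasetEntry("example-a", "text", "self-instruct", "en", "multi-task"),
    DatasetEntry("example-b", "text", "human-generated", "multilingual", "multi-task"),
    DatasetEntry("example-c", "visual", "mixed", "en", "task-specific"),
]

english_text = filter_entries(CATALOG, modality="text", language="en")
print([entry.name for entry in english_text])
```

Keeping each axis as a separate field lets a reader combine filters freely, for example selecting only human-generated text datasets, without encoding the categories into the dataset name.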