lmmlzn/Awesome-LLMs-Datasets
A curated survey cataloging representative datasets for training, fine-tuning, and evaluating large language models.

Velocity (7d): +1.7 ★/day · Trend: steady
This repository aggregates existing LLM datasets and categorizes them along five dimensions: pre-training corpora, instruction fine-tuning datasets, preference datasets, evaluation datasets, and traditional NLP datasets. It also covers emerging categories such as multimodal LLM datasets and RAG datasets. The collection references a survey paper covering 444 datasets across 8 languages and 32 domains.