lmmlzn/Awesome-LLMs-Datasets

A curated survey cataloging representative datasets for training, fine-tuning, and evaluating large language models.

Velocity (7d): +1.7 ★ / day · Trend: steady

This repository aggregates existing LLM datasets and organizes them along five dimensions: pre-training corpora, instruction fine-tuning datasets, preference datasets, evaluation datasets, and traditional NLP datasets. It also covers emerging categories such as multimodal LLM datasets and retrieval-augmented generation (RAG) datasets. The collection references a comprehensive survey paper covering 444 datasets across 8 languages and 32 domains.
