Zjh-819/LLMDataHub
A curated hub of high-quality datasets for LLM instruction finetuning and training.

Velocity (7d): +2.9 ★/day · Trend: steady
LLMDataHub aggregates open-source training corpora for large language models, covering alignment datasets, domain-specific datasets, pretraining corpora, and multimodal datasets. It provides links, size, language, usage guidance, and descriptions for each dataset to help researchers and developers train LLMs like Alpaca, Vicuna, and ChatGLM.