OpenDataBox/awesome-data-llm

Academic survey paper and curated collection of research on data-centric techniques for training and preparing LLMs.

This repository hosts the official materials for a comprehensive survey on LLMs and data-centric methods. It collects and categorizes papers across topics including data acquisition, deduplication, filtering, synthesis, and selection for LLM training. The collection also covers related work on vision-language models and data analytics with LLMs.
