huggingface/datasets

The plumbing behind every Hugging Face model

A library that turns 'download this 50GB dataset' into a one-liner with streaming, caching, and zero-copy memory mapping.

21.6k stars · Python · Data Tooling · Learning
Velocity (7d): +9.5 ★/day · Trend: steady

What it does

🤗 Datasets is the data loading and preprocessing layer that sits between raw data and model training. It fetches public datasets from the Hugging Face Hub with load_dataset("name"), handles local files in a dozen formats (CSV, Parquet, JSON, HDF5, NIfTI, PDF, etc.), and provides a uniform API for transforming, batching, and exporting to NumPy, Pandas, PyTorch, TensorFlow, JAX, or Spark.

The interesting bit

The library is built on Apache Arrow, so datasets are memory-mapped from disk rather than loaded into RAM: you can work with terabyte-scale data on a laptop because the operating system pages in only the parts you touch. Add automatic caching of map() transformations and a streaming mode that iterates over data on the fly instead of downloading it first, and you have infrastructure that mostly disappears until you need it.
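
To make the memory-mapping idea concrete, here is a minimal sketch using only Python's standard library. It is not how Arrow or 🤗 Datasets is implemented: the file layout (fixed-width little-endian 64-bit integers) and the helper names write_column and read_value are assumptions made for illustration. The point is that one value can be read from a file without loading the rest of it.

```python
import mmap
import os
import struct
import tempfile

# Assumed layout: one little-endian signed 64-bit integer per record.
ITEM = struct.Struct("<q")


def write_column(path, values):
    """Write integers back to back as fixed-width binary records."""
    with open(path, "wb") as f:
        for v in values:
            f.write(ITEM.pack(v))


def read_value(path, index):
    """Read a single record through a read-only memory map.

    Only the pages that are actually touched get paged in by the OS,
    so the read does not require loading the whole file into RAM.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return ITEM.unpack_from(mm, index * ITEM.size)[0]


path = os.path.join(tempfile.mkdtemp(), "column.bin")
write_column(path, range(100_000))  # 800,000 bytes on disk
value = read_value(path, 12_345)
print(value)
```

Arrow's on-disk format is columnar and far richer than this, but the access pattern is the same in spirit: compute an offset, then read bytes straight from the mapped file.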

Key highlights

  • One-line loaders for thousands of public datasets (text, audio, image, video, 3D medical, agent traces)
  • Memory-mapped Arrow backend with zero-copy reads
  • Streaming mode with claimed 100× speedup via Xet backend
  • Smart caching: processed datasets are reused automatically
  • Built-in FAISS and Elasticsearch indexing for similarity search
  • Multi-processing map() for parallel transforms
  • Native export to Polars, PyTorch, TensorFlow, JAX, and Spark
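
One way to picture the "smart caching" highlight is as a lookup keyed on a fingerprint of the input data plus the transform. The sketch below is a standard-library illustration of that idea under stated assumptions, not the library's implementation: cached_map, fingerprint, the in-memory _cache dict, and the choice to hash the function's bytecode and constants are all hypothetical.

```python
import hashlib
import json

_cache = {}            # fingerprint -> processed rows (hypothetical in-memory store)
calls = {"count": 0}   # counts how often the transform really runs


def fingerprint(rows, fn):
    """Hash the input rows together with the transform's bytecode and constants."""
    h = hashlib.sha256()
    h.update(json.dumps(rows, sort_keys=True).encode("utf-8"))
    h.update(fn.__code__.co_code)
    h.update(repr(fn.__code__.co_consts).encode("utf-8"))
    return h.hexdigest()


def cached_map(rows, fn):
    """Apply fn to every row, reusing a stored result when the fingerprint matches."""
    key = fingerprint(rows, fn)
    if key not in _cache:
        _cache[key] = [fn(row) for row in rows]
    return _cache[key]


def add_length(row):
    calls["count"] += 1
    return {**row, "n_chars": len(row["text"])}


rows = [{"text": "hello"}, {"text": "arrow"}, {"text": "hub"}]
first = cached_map(rows, add_length)
second = cached_map(rows, add_length)  # same data, same transform: served from the cache
print(calls["count"], first is second)
```

Changing either the rows or the transform produces a different fingerprint, so the work is redone only when something that affects the output has changed.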

Caveats

  • The “100× faster” streaming claim appears in the README without stating the baseline it is measured against
  • Optional dependencies are fragmented: audio, vision, PDFs, and framework integrations each need separate extras installs
  • Dataset revisions can change over time; the README explicitly warns users to pin a revision for reproducibility

Verdict

Essential if you train or evaluate models on Hugging Face Hub datasets. Overkill if your data fits in a single Parquet file and you already have a Pandas pipeline that works.
