
uber/petastorm

A library enabling distributed training of deep learning models from Apache Parquet datasets with native TensorFlow, PyTorch, and PySpark support.

1.9k stars · Python · Data Tooling
Velocity (7d): +0.6 ★/day · Trend: steady

Petastorm is a data access library developed at Uber ATG that enables single-machine or distributed training and evaluation of deep learning models from datasets stored in Apache Parquet format. It integrates natively with TensorFlow, PyTorch, and PySpark, so data can be loaded directly into training pipelines. The library handles dataset generation, schema management, and data sharding for distributed training workflows.
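To make the sharding idea concrete, here is a minimal pure-Python sketch of how a dataset's row groups could be split across workers so that each worker reads a disjoint subset. It is an illustration only and not Petastorm's implementation: the function `shard`, its parameters `cur_shard` and `shard_count`, and the `row_group_N` labels are all assumptions made for this example, and round-robin assignment is just one simple strategy.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard(items: Sequence[T], cur_shard: int, shard_count: int) -> List[T]:
    """Return the items assigned to one worker by round-robin position.

    Hypothetical helper for illustration; not part of any library.
    """
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    if not 0 <= cur_shard < shard_count:
        raise ValueError("cur_shard must be in [0, shard_count)")
    # Item i goes to worker (i mod shard_count).
    return [item for i, item in enumerate(items) if i % shard_count == cur_shard]


# Seven made-up row-group labels split across three workers.
row_groups = [f"row_group_{i}" for i in range(7)]
shards = [shard(row_groups, worker, 3) for worker in range(3)]

for worker, assigned in enumerate(shards):
    print(worker, assigned)
```

Every row group lands on exactly one worker, so the union of the shards covers the dataset with no overlap. With a count that does not divide evenly, some workers receive one more item than others.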
