uber/petastorm
A library enabling distributed training of deep learning models from Apache Parquet datasets with native TensorFlow, PyTorch, and PySpark support.

Petastorm is a data access library developed at Uber ATG that enables single-machine or distributed training and evaluation of deep learning models from datasets stored in Apache Parquet format. It integrates natively with TensorFlow, PyTorch, and PySpark, so data can be loaded directly into training pipelines. The library handles dataset generation, schema management, and efficient data sharding for distributed training workflows.
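
To make the sharding idea concrete, here is a minimal sketch in plain Python of how Parquet row groups could be split across distributed workers in round-robin fashion. It is an illustration only, not Petastorm's actual implementation: the helper `shard_row_groups` is hypothetical, and the parameter names `cur_shard` and `shard_count` are borrowed from the arguments accepted by Petastorm's `make_reader`.

```python
from typing import List, Sequence


def shard_row_groups(row_groups: Sequence[int], cur_shard: int, shard_count: int) -> List[int]:
    """Return the row groups one worker would read under round-robin sharding.

    Hypothetical helper for illustration; not part of the Petastorm API.
    """
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    if not 0 <= cur_shard < shard_count:
        raise ValueError("cur_shard must be in the range [0, shard_count)")
    # Assign the row group at each position to worker (position mod shard_count).
    return [
        row_group
        for position, row_group in enumerate(row_groups)
        if position % shard_count == cur_shard
    ]


if __name__ == "__main__":
    # Assumed example: a dataset with 10 row groups read by 3 workers.
    all_row_groups = list(range(10))
    for worker in range(3):
        print(worker, shard_row_groups(all_row_groups, worker, 3))
```

Each worker receives a disjoint subset and together the subsets cover the whole dataset, which is the property a distributed training job needs so that no sample is read twice within an epoch.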