
meta-pytorch/torchft

A PyTorch library providing per-step fault tolerance primitives for distributed training jobs.

509 stars · Python · ML Frameworks · LLMOps · Eval
Velocity (7d): +0.9 ★ / day · Trend: steady

torchft implements fault tolerance techniques for large-scale PyTorch training, including Fault Tolerant DDP, HSDP, LocalSGD, and DiLoCo. It provides coordination primitives, fault-tolerant ProcessGroup implementations, and checkpoint transports that let a training job recover from errors without restarting the whole job. The library targets long-running distributed training runs that node failures would otherwise disrupt.
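To make the idea of per-step recovery concrete, here is a minimal pure-Python sketch. It does not use torchft or PyTorch, and every name in it (Replica, train_step, recover, min_replicas) is hypothetical and chosen for illustration only. It assumes a toy model with one scalar weight per replica: each step averages gradients over the replicas that are currently healthy, and a failed replica rejoins by copying the weight from a healthy peer, which loosely stands in for a checkpoint transport.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Toy replica with one scalar weight (hypothetical, not a torchft class)."""
    weight: float = 0.0
    healthy: bool = True


def train_step(replicas, grads, lr=0.1, min_replicas=1):
    """Run one step on the healthy replicas only; return True if it committed."""
    live = [i for i, r in enumerate(replicas) if r.healthy]
    if len(live) < min_replicas:
        # Too few participants: skip the step and leave all weights unchanged.
        return False
    # Average gradients over the replicas that are alive for this step.
    avg_grad = sum(grads[i] for i in live) / len(live)
    for i in live:
        replicas[i].weight -= lr * avg_grad
    return True


def recover(replicas, index):
    """Bring a failed replica back by copying the weight of a healthy peer."""
    donor = next(r for r in replicas if r.healthy)
    replicas[index].weight = donor.weight
    replicas[index].healthy = True


if __name__ == "__main__":
    replicas = [Replica(), Replica(), Replica()]
    train_step(replicas, [1.0, 2.0, 3.0])  # all three replicas participate
    replicas[2].healthy = False            # simulate a node failure
    train_step(replicas, [1.0, 3.0, 0.0])  # the step proceeds with two replicas
    recover(replicas, 2)                   # replica 2 rejoins with current weights
    print([r.weight for r in replicas])
```

In this sketch the surviving replicas keep training while one is down, and the returning replica resumes from a peer's current state instead of forcing a full restart. A real system also has to agree on membership across processes and handle failures in the middle of a collective operation, which this toy version leaves out.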
