meta-pytorch/torchft
A PyTorch library providing per-step fault tolerance primitives for distributed training jobs.

Velocity (7d): +0.9 ★/day · Trend: steady
torchft implements fault tolerance techniques for large-scale PyTorch training, including Fault Tolerant DDP, HSDP, LocalSGD, and DiLoCo. It provides coordination primitives, fault-tolerant ProcessGroup implementations, and checkpoint transports that let a training job recover from errors without restarting the whole job. The library targets long-running distributed training runs where node failures would otherwise disrupt progress.
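To make the idea of per-step fault tolerance concrete, here is a minimal toy simulation in plain Python. It does not use torchft or reflect its API: the function `simulate`, its parameters, and the scalar "model" are all hypothetical and chosen only for illustration. The sketch assumes that surviving replicas average gradients among themselves at each step, and that a failed replica rejoins by copying weights from a healthy peer, which is a rough stand-in for what a checkpoint transport does.

```python
import random


def simulate(num_replicas=4, steps=20, fail_prob=0.3, lr=0.1, seed=0):
    """Toy model of per-step fault tolerance (illustrative only, not torchft's API).

    Each replica holds one scalar weight and minimizes (w - 3)^2.
    """
    rng = random.Random(seed)
    weights = [0.0] * num_replicas
    healthy = [True] * num_replicas

    for _ in range(steps):
        # Recovery: a replica that failed earlier rejoins by copying weights
        # from a healthy peer instead of the whole job restarting.
        live = [i for i in range(num_replicas) if healthy[i]]
        for i in range(num_replicas):
            if not healthy[i]:
                weights[i] = weights[live[0]]
                healthy[i] = True

        # Failure injection: replicas drop out at random. As a simplifying
        # assumption, at least one replica always survives.
        for i in range(num_replicas):
            if rng.random() < fail_prob and sum(healthy) > 1:
                healthy[i] = False

        # The step proceeds with whichever replicas are still alive:
        # they average their gradients and apply the same update.
        live = [i for i in range(num_replicas) if healthy[i]]
        grads = [2.0 * (weights[i] - 3.0) for i in live]
        avg_grad = sum(grads) / len(grads)
        for i in live:
            weights[i] -= lr * avg_grad

    return weights, healthy


if __name__ == "__main__":
    final_weights, final_healthy = simulate()
    print(final_weights, final_healthy)
```

In this sketch every step makes progress as long as one replica survives, and healthy replicas stay in sync because they apply identical updates. A real system also has to agree on which replicas participate in each step, which is the role the coordination primitives play.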