uccl-project/uccl
A high-performance GPU communication library providing collectives, P2P transfers, and expert-parallel (EP) primitives, optimized for distributed ML training and LLM inference.

UCCL provides efficient GPU-to-GPU communication primitives: collectives such as all-reduce, P2P transfers for KV cache and RL weight synchronization, and expert-parallel (EP) operations. It operates as a drop-in replacement for NCCL/RCCL that requires no application code changes, and the project reports lower latency and higher throughput than both. The library emphasizes flexibility for fast-evolving ML workloads and portability across heterogeneous GPU environments, including NVIDIA and AMD hardware.
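
To make the "no application code changes" point concrete, here is a minimal sketch of how a drop-in swap is typically wired up at launch time rather than in application code. It assumes UCCL is loaded through NCCL's NCCL_NET_PLUGIN environment variable. The plugin path and the helper name env_with_net_plugin are hypothetical and used only for illustration, so check the project's own documentation for the actual library name and location.

```python
import os


def env_with_net_plugin(plugin_path, base_env=None):
    """Return a copy of the environment with NCCL_NET_PLUGIN set.

    The training script itself is untouched; only the launch
    environment changes. The input mapping is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_NET_PLUGIN"] = plugin_path
    return env


# Hypothetical install location; an assumption, not a path from the project.
PLUGIN_PATH = "/opt/uccl/lib/libnccl-net-uccl.so"

base = {"PATH": "/usr/bin"}
launch_env = env_with_net_plugin(PLUGIN_PATH, base_env=base)
```

The resulting mapping can be passed as the env argument of subprocess.run when starting an existing, unmodified training command, which is what keeps the switch outside the application code.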