uccl-project/uccl

A drop-in NCCL replacement that sprays packets across up to 256 paths

UCCL exists because NCCL and RCCL weren’t built for the heterogeneity of modern ML clusters or the congestion patterns of multi-tenant clouds.

★1.5k stars C++ Other AI

What it does

UCCL is a GPU communication library with three modules. UCCL-collective replaces NCCL/RCCL without application changes, aiming for higher throughput and lower latency on everything from H100 RoCE clusters to single-T4 AWS instances. UCCL-P2P handles initiator-target transfers like KV-cache shipping, tuned for 800 Gbps NICs. UCCL-EP ports DeepEP-style expert-parallel communication to heterogeneous hardware—AMD and NVIDIA GPUs across RDMA, EFA, and Broadcom NICs.

The interesting bit

Instead of trusting a single network path and hoping the datacenter isn’t congested, UCCL-collective sprays packets across up to 256 paths in software, runs latency-based and receiver-driven congestion control, and recovers from loss with selective repeat. That design lets it outperform NCCL by up to 2.5× on high-end HGX boxes and 3.7× on lowly g4dn instances with 50 Gbps ENA NICs.
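The combination of multipath spraying and selective repeat can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not UCCL's implementation: the path choice uses a "power of two choices" rule on a per-path load counter (an assumption standing in for UCCL's latency-based policy), the loss model is a hypothetical set of sequence numbers whose first transmission is dropped, and `spray_and_recover` is an illustrative name.

```python
import random


def spray_and_recover(num_packets, drop_first_attempt, num_paths=256, seed=0):
    """Toy model of packet spraying with selective-repeat recovery.

    drop_first_attempt: sequence numbers whose first transmission is lost
    (hypothetical loss model, for illustration only).
    Returns (total_transmissions, per_path_load).
    """
    rng = random.Random(seed)
    load = [0] * num_paths           # packets sent on each path so far
    received = set()                 # sequence numbers the receiver holds
    transmissions = 0
    pending = list(range(num_packets))
    first_round = True

    while pending:
        for seq in pending:
            # Assumed policy: sample two paths, use the less loaded one.
            a, b = rng.randrange(num_paths), rng.randrange(num_paths)
            path = a if load[a] <= load[b] else b
            load[path] += 1
            transmissions += 1
            if first_round and seq in drop_first_attempt:
                continue             # packet lost in the network
            received.add(seq)
        # Selective repeat: resend only the missing sequence numbers,
        # not everything after the first gap.
        pending = [s for s in pending if s not in received]
        first_round = False

    return transmissions, load


if __name__ == "__main__":
    total, load = spray_and_recover(1000, {3, 500, 999})
    print(total, max(load))
```

With 1,000 packets and three first-attempt losses, the sender makes 1,003 transmissions in total, because only the three missing packets are resent. A scheme that resends everything after the first gap would retransmit far more.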

Key highlights

Drop-in API compatibility: set a plugin environment variable and existing PyTorch jobs run unchanged.
Packet spraying across up to 256 paths, plus latency-based, receiver-driven congestion control and selective-repeat loss recovery, all in software.
Portable across NVIDIA, AMD, and Broadcom hardware; adopted by AMD TheRock, NVIDIA NeMo, and NVIDIA NIXL.
Three specialized modules: collectives (UCCL-collective, also referred to as UCCL-Tran), peer-to-peer transfers (UCCL-P2P), and expert-parallel MoE communication (UCCL-EP).
Benchmark results are reported in USENIX OSDI 2026 papers; ships with nanobind Python bindings and stable-ABI wheels for Python 3.12+.
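The drop-in claim in the first highlight can be sketched as a launcher helper. This is a sketch under assumptions: it supposes the plugin is selected through NCCL's `NCCL_NET_PLUGIN` environment variable, and the library path is a hypothetical placeholder. Check the UCCL README for the exact variable and library name for your build.

```python
import os

# Hypothetical placeholder; the real library name and location depend on
# how UCCL was built and installed.
UCCL_PLUGIN_PATH = "/path/to/uccl/plugin.so"


def env_with_uccl_plugin(plugin_path=UCCL_PLUGIN_PATH, base_env=None):
    """Return a copy of the environment with the NCCL net plugin set.

    Assumption: the plugin is picked up via NCCL_NET_PLUGIN. The training
    script itself is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_NET_PLUGIN"] = plugin_path
    return env


if __name__ == "__main__":
    env = env_with_uccl_plugin()
    print(env["NCCL_NET_PLUGIN"])
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching an existing job (for example through `torchrun`), so the change stays outside the application code.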

Caveats

EFA collective support is currently limited to p4d.24xlarge; the authors note that AWS’s official plugin already performs well on newer p5/p5en instances.
Submodule hardware coverage varies: RDMA collectives currently target NVIDIA and Broadcom NICs, while the AF_XDP fallback handles AWS ENA and IBM VirtIO.

Verdict

Worth a look if you’re running distributed training or inference on mixed-vendor GPU clusters and need more throughput than NCCL gives you out of the box. Less relevant if you’re on a homogeneous, high-end NVIDIA fabric where NCCL is already saturating the links.

Frequently asked

What is uccl-project/uccl? UCCL is a GPU communication library and drop-in NCCL/RCCL replacement, built because NCCL and RCCL weren’t designed for the heterogeneity of modern ML clusters or the congestion patterns of multi-tenant clouds.
Is uccl open source? Yes, uccl-project/uccl is open source, released under the Apache-2.0 license.
What language is uccl written in? uccl-project/uccl is primarily written in C++.
How popular is uccl? uccl-project/uccl has 1.5k stars on GitHub.
Where can I find uccl? uccl-project/uccl is on GitHub at https://github.com/uccl-project/uccl.