NVIDIA/nccl

The glue that keeps GPU clusters from choking on their own data

NCCL is NVIDIA's answer to the question "how do we make 256 GPUs talk to each other without the network becoming the bottleneck?"

4.8k stars · C++ · Other · AIML Frameworks
Velocity (7d): +1.2 ★/day · Trend: steady

What it does

NCCL (pronounced “Nickel”) provides standard collective communication routines (all-reduce, broadcast, all-gather, and so on) for multi-GPU setups. It abstracts away whether your GPUs are chatting over PCIe, NVLink, NVSwitch, or InfiniBand, and it works within a single node or across a cluster. If you’ve trained a large model on multiple GPUs, you’ve almost certainly used it, probably without knowing.
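
To make the collective names concrete, here is a minimal pure-Python sketch of what three of them compute, with each "rank" modeled as an ordinary list in host memory. It illustrates the semantics only and is not NCCL's API: the helper names `all_reduce`, `broadcast`, and `all_gather` are hypothetical, sum is assumed as the reduction operator, and no GPUs or interconnects are involved.

```python
from typing import List

Buffer = List[float]


def all_reduce(buffers: List[Buffer]) -> List[Buffer]:
    """Every rank ends up with the element-wise sum of all ranks' buffers."""
    total = [sum(column) for column in zip(*buffers)]
    return [list(total) for _ in buffers]


def broadcast(buffers: List[Buffer], root: int) -> List[Buffer]:
    """Every rank ends up with a copy of the root rank's buffer."""
    return [list(buffers[root]) for _ in buffers]


def all_gather(buffers: List[Buffer]) -> List[Buffer]:
    """Every rank ends up with all buffers concatenated in rank order."""
    gathered = [value for buf in buffers for value in buf]
    return [list(gathered) for _ in buffers]


if __name__ == "__main__":
    # Three hypothetical ranks, two elements each.
    ranks = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
    print(all_reduce(ranks)[0])
    print(broadcast(ranks, root=1)[0])
    print(all_gather(ranks)[0])
```

In a training job, the all-reduce case is the one that matters most: each rank contributes its local gradients and every rank receives the same summed result.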

The interesting bit

The README is almost aggressively understated for a library that sits at the center of modern distributed deep learning. The real work isn’t the API—it’s the topology-aware routing and bandwidth optimization hiding behind those innocent-looking all_reduce calls. NVIDIA keeps the tests in a separate repo, which either shows admirable separation of concerns or a quiet admission that verifying correctness across arbitrary GPU topologies is its own beast.
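
One well-known way to get bandwidth-efficient all-reduce is the ring algorithm: split each buffer into as many chunks as there are ranks, pass partial sums around a ring (reduce-scatter), then pass the finished chunks around again (all-gather). The sketch below is a toy simulation in plain Python under simplifying assumptions (equal-length buffers, length divisible by the rank count, sum as the operator). The name `ring_all_reduce` is hypothetical, and this is not NCCL's implementation, whose topology detection and algorithm selection are far more involved.

```python
from typing import List


def ring_all_reduce(buffers: List[List[float]]) -> int:
    """Sum-reduce `buffers` in place across ranks arranged in a ring.

    Returns the number of elements each rank sent. Assumes every buffer has
    the same length and that the length is divisible by the rank count.
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "toy version: length must divide evenly into chunks"
    chunk = size // n
    sent_per_rank = 0

    def span(index: int) -> slice:
        """Slice covering chunk `index`, wrapping around the ring."""
        index %= n
        return slice(index * chunk, (index + 1) * chunk)

    # Phase 1, reduce-scatter: after n-1 steps, rank r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks first so all sends in a step are simultaneous.
        outgoing = [list(buffers[r][span(r - step)]) for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            s = span(r - step)
            for offset, value in enumerate(outgoing[r]):
                buffers[dst][s.start + offset] += value
        sent_per_rank += chunk

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        outgoing = [list(buffers[r][span(r + 1 - step)]) for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            buffers[dst][span(r + 1 - step)] = outgoing[r]
        sent_per_rank += chunk

    return sent_per_rank


if __name__ == "__main__":
    bufs = [[float(rank * 10 + i) for i in range(8)] for rank in range(4)]
    print(ring_all_reduce(bufs), bufs[0])
```

With 4 ranks and 8 elements per buffer, each rank sends 12 elements in total, that is 2(N-1)/N of its buffer, compared with 24 if every rank shipped its whole buffer to every other rank. The per-rank traffic stays close to twice the buffer size no matter how many ranks join the ring, which is why the shape of the ring relative to the physical links matters so much.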

Key highlights

  • Implements all-reduce, all-gather, reduce, broadcast, reduce-scatter, plus arbitrary send/receive patterns
  • Optimized for PCIe, NVLink, NVSwitch, InfiniBand Verbs, and TCP/IP sockets
  • Supports arbitrary GPU counts across single or multiple nodes
  • Works with single-process or multi-process (MPI) applications
  • Official prebuilt binaries available; source build uses standard make with architecture-specific compilation flags
  • Packaging support for Debian, RedHat/CentOS, and generic tarballs

Caveats

  • Tests live in a separate repository (nccl-tests), so verifying your build means cloning and building a second repo
  • README copyright notice stops at 2020, which may or may not reflect actual maintenance cadence
  • Source builds default to all supported CUDA architectures; you’ll want to set NVCC_GENCODE to just your target architecture unless you enjoy long compile times and bloated binaries
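
As a rough illustration of the NVCC_GENCODE override, the Python helper below assembles a build command for a chosen set of compute capabilities. The flag format mirrors the example in the NCCL README (`-gencode=arch=compute_70,code=sm_70`). The helper names `nvcc_gencode` and `make_command`, and the default job count, are hypothetical, and the sketch only handles plain numeric capabilities such as "70" or "8.0", so check the README for your NCCL version before relying on it.

```python
from typing import Iterable, List


def nvcc_gencode(compute_capabilities: Iterable[str]) -> str:
    """Build an NVCC_GENCODE value for capabilities written as "70" or "8.0"."""
    flags: List[str] = []
    for cc in compute_capabilities:
        digits = cc.replace(".", "")
        if not digits.isdigit():
            raise ValueError(f"not a compute capability: {cc!r}")
        flags.append(f"-gencode=arch=compute_{digits},code=sm_{digits}")
    return " ".join(flags)


def make_command(compute_capabilities: Iterable[str], jobs: int = 8) -> str:
    """Return a shell command line that builds NCCL for only those architectures."""
    gencode = nvcc_gencode(compute_capabilities)
    return f'make -j{jobs} src.build NVCC_GENCODE="{gencode}"'


if __name__ == "__main__":
    # Prints the command rather than running it; paste it into a shell
    # from the root of an NCCL checkout.
    print(make_command(["7.0"]))
```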

Verdict

Essential if you’re building or debugging distributed GPU workloads; invisible if you’re using PyTorch or TensorFlow, which bundle it. Worth understanding when your multi-node training hangs mysteriously at 87% GPU utilization.
