Training DeepSeek V3 on 24× A100s — Part 4: NCCL, Networking, and Rank Stability
How I stabilized multi-node rendezvous and NCCL collectives: fixed GPU rank mapping, chose a reliable port, and tamed TCP-only runs without InfiniBand.
Even with drivers fixed, my early runs still stalled at the first gradient step or hung during rendezvous. This post catalogs the networking/NCCL settings and rank mapping fixes that made the cluster reliable.
The duplicate GPU mapping bug
I caught a subtle multi-node rank conflict by correlating errors with lspci output:
Error: rank 27 and rank 19 both on CUDA device 53000 (PCI 53:00.0)
Error: rank 29 and rank 21 both on CUDA device 91000 (PCI 91:00.0)
Root cause: inconsistent device ordering across nodes. Fix:
export CUDA_DEVICE_ORDER=PCI_BUS_ID
That forced consistent GPU enumeration and eliminated cross-node device aliasing. Afterward, the expected mapping held:
- Node 0: ranks 0–7 → local GPUs 0–7
- Node 1: ranks 8–15 → local GPUs 0–7
- Node 2: ranks 16–23 → local GPUs 0–7
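To confirm the enumeration really lined up, a cross-node comparison helps. A minimal sketch, assuming passwordless SSH and placeholder hostnames (node0–node2): the GPU index to PCI bus ID listing should be identical on every node.
# Placeholder hostnames; identical output on all three nodes means PCI-order enumeration matches.
for h in node0 node1 node2; do
  echo "== $h =="
  ssh "$h" nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader
done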
Rendezvous and a single blessed port
I standardized on MASTER_PORT=39500. It consistently worked across nodes and containers. When in doubt, I verified TCP reachability and freed it if stuck:
# From workers to head, quick TCP sanity
timeout 3 bash -lc 'exec 3<>/dev/tcp/10.18.122.130/39500; echo ok >&3' || true
# Free the port after a crash
sudo fuser -k 39500/tcp 2>/dev/null || true
I avoided Docker port publishing entirely by running containers with --network host.
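Putting the pieces together, a launch along these lines illustrates the idea; the image name, mounts, and script are placeholders rather than my exact command:
# Host networking means nothing to publish; the rendezvous port is reachable as-is.
docker run --rm --gpus all --network host --ipc=host \
  -v /data:/data train-image:latest \
  torchrun --nnodes=3 --nproc_per_node=8 --node_rank="$NODE_RANK" \
    --master_addr=10.18.122.130 --master_port=39500 \
    --max_restarts=0 train.py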
NCCL over TCP (deliberately no IB at first)
I disabled P2P and InfiniBand for a clean smoke test. The aim was a reproducible baseline before re-enabling RDMA.
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export NCCL_SHM_DISABLE=1
export TORCH_NCCL_TIMEOUT_MS=3600000
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_IGNORE_CPU_AFFINITY=1
export OMP_NUM_THREADS=1
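Before pointing the real training job at this environment, a tiny all-reduce run confirms that TCP-only collectives actually complete end to end. A throwaway sketch (the script name is arbitrary), launched with the same torchrun rendezvous flags as above:
# With 3 nodes × 8 GPUs, every rank should print 24.0.
cat > allreduce_smoke.py <<'PY'
import os
import torch
import torch.distributed as dist

dist.init_process_group("nccl")   # MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE come from the launcher
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

x = torch.ones(1, device="cuda")
dist.all_reduce(x)                # default op is SUM across all ranks
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: {x.item()}")
dist.destroy_process_group()
PY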
Only after I had multiple stable steps did I consider flipping NCCL_IB_DISABLE=0 and removing NCCL_P2P_DISABLE=1.
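When that time comes, the change itself is small. A sketch; the NCCL_IB_HCA line is illustrative and depends on the fabric, not something the TCP baseline used:
# Re-enable RDMA paths once the TCP-only baseline is trusted.
export NCCL_IB_DISABLE=0
unset NCCL_P2P_DISABLE
export NCCL_IB_HCA=mlx5   # illustrative: select Mellanox HCAs by name prefix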
First-step hangs and gradient sharing
Two patterns showed up:
- Hang at the first gradient accumulation step
- Hang when determining rank at rendezvous
The combination of explicit WORLD_SIZE, consistent device order, a single stable port, and a longer TORCH_NCCL_TIMEOUT_MS moved me past both failure modes. I also kept --max_restarts=0 to prevent confusing reassignments after partial failures.
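For reference, the rendezvous values I kept identical everywhere looked roughly like this; NODE_RANK is the only per-node difference:
# Same values on every node; only NODE_RANK changes.
export MASTER_ADDR=10.18.122.130   # head node
export MASTER_PORT=39500           # the one blessed port
export WORLD_SIZE=24               # 3 nodes × 8 GPUs
export NODE_RANK=1                 # 0, 1, or 2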
Logs that mattered
- TORCH_DISTRIBUTED_DEBUG=INFO to confirm rendezvous and rank assignment
- NCCL_DEBUG=INFO (or WARN once stable) to see collective failures and timeouts
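In export form, a diagnostic run looks like this; the NCCL_DEBUG_SUBSYS line is an optional extra for narrowing output, not part of the baseline:
# Debug-level settings for a diagnostic run; dial NCCL back to WARN once stable.
export TORCH_DISTRIBUTED_DEBUG=INFO
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET   # optional: limit NCCL output to init and network events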
From one successful DeepSpeed init, the memory/bucket lines confirmed my ZeRO settings took effect:
[Rank 0] Reduce bucket size 500000000
[Rank 0] Prefetch bucket size 500000000
DeepSpeedZeRoOffload initialize [end] ... MA 39.45 GB, CA ~71 GB
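Those bucket numbers trace back to the ZeRO config. A fragment consistent with the log above; the file name is a placeholder and the real config carries many more keys (offload targets, optimizer, and so on):
# Only the bucket-related ZeRO-3 keys are shown here.
cat > ds_zero3_fragment.json <<'JSON'
{
  "zero_optimization": {
    "stage": 3,
    "reduce_bucket_size": 500000000,
    "stage3_prefetch_bucket_size": 500000000
  }
}
JSON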
Practical networking checklist
- Containers run with --network host
- One MASTER_ADDR and one MASTER_PORT used everywhere
- TCP verified out-of-band before a retry
- CUDA_DEVICE_ORDER=PCI_BUS_ID set on all nodes
- NCCL_* environment variables explicit and consistent
With rendezvous and collectives stable, I could finally address the last persistent pain: checkpointing and LoRA saves under ZeRO-3. That’s Part 5.