Training DeepSeek V3 on 24× A100s — Part 4: NCCL, Networking, and Rank Stability

How I stabilized multi-node rendezvous and NCCL collectives: fixed GPU rank mapping, chose a reliable port, and tamed TCP-only runs without InfiniBand.

Even with drivers fixed, my early runs still stalled at the first gradient step or hung during rendezvous. This post catalogs the networking/NCCL settings and rank mapping fixes that made the cluster reliable.

The duplicate GPU mapping bug

I caught a subtle multi-node rank conflict by correlating errors with lspci output:

Error: rank 27 and rank 19 both on CUDA device 53000 (PCI 53:00.0)
Error: rank 29 and rank 21 both on CUDA device 91000 (PCI 91:00.0)

Root cause: inconsistent device ordering across nodes. Fix:

export CUDA_DEVICE_ORDER=PCI_BUS_ID

By default CUDA enumerates devices fastest-first, which can differ from node to node; PCI_BUS_ID pins enumeration to the bus address, so every node sees the same ordering and the cross-node device aliasing disappeared. Afterward, the expected mapping held:

  • Node 0: ranks 0–7 → local GPUs 0–7
  • Node 1: ranks 8–15 → local GPUs 0–7
  • Node 2: ranks 16–23 → local GPUs 0–7
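
To sanity-check the fix, it helps to compare nvidia-smi's index-to-bus mapping against raw lspci on each node. A minimal check (both tools are standard; 10de is NVIDIA's PCI vendor ID):

# GPU index vs PCI bus ID as the driver reports it; should be identical on every node
nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader

# Raw PCI view of NVIDIA devices, for cross-checking against the error messages
lspci -d 10de: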

Rendezvous and a single blessed port

I standardized on MASTER_PORT=39500; it worked consistently across nodes and containers. When in doubt, I verified TCP reachability from the workers and freed the port when a crashed run had left it bound:

# From workers to head, quick TCP sanity
timeout 3 bash -c 'exec 3<>/dev/tcp/10.18.122.130/39500' && echo "39500 reachable" || echo "39500 unreachable"

# Free the port after a crash
sudo fuser -k 39500/tcp 2>/dev/null || true
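
To see what is actually holding the port before (or instead of) killing it, ss from iproute2 works well:

# List any listener still bound to the rendezvous port
ss -ltnp 'sport = :39500'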

I avoided Docker port publishing entirely by running containers with --network host.
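
For completeness, the container launch pattern was along these lines. The image name and mount are placeholders, not my exact setup; the flag this post cares about is --network host, with --gpus all and --ipc=host as the usual companions for PyTorch containers:

docker run --rm -it \
  --gpus all \
  --network host \
  --ipc=host \
  -v /data:/data \
  training-image:latest \
  bash

With host networking there is no -p publishing and no NAT layer, so MASTER_ADDR and MASTER_PORT mean the same thing inside and outside the container.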

NCCL over TCP (deliberately no IB at first)

I disabled P2P, shared-memory transport, and InfiniBand for a clean TCP-only smoke test. The aim was a reproducible baseline before re-enabling RDMA.

# Pin NCCL and Gloo to the interface the nodes actually share
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
# Force plain TCP: no GPU peer-to-peer, no InfiniBand, no shared-memory transport
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export NCCL_SHM_DISABLE=1
# Generous collective timeout; surface NCCL errors instead of hanging silently
export TORCH_NCCL_TIMEOUT_MS=3600000
export NCCL_ASYNC_ERROR_HANDLING=1
# Ignore the job's CPU affinity mask (NCCL falls back to GPU/NUMA affinity); keep CPU threading minimal
export NCCL_IGNORE_CPU_AFFINITY=1
export OMP_NUM_THREADS=1

Only after I had multiple stable steps did I consider flipping NCCL_IB_DISABLE=0 and removing NCCL_P2P_DISABLE=1.
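
Before flipping those flags back, it is worth confirming what the hardware actually offers. Two standard checks (nvidia-smi ships with the driver, ibstat comes from infiniband-diags):

# GPU-to-GPU topology: NV# means NVLink, PIX/PXB/PHB/SYS are PCIe paths of increasing distance
nvidia-smi topo -m

# Is an InfiniBand HCA present, and is the link active?
ibstat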

First-step hangs and gradient sharing

Two patterns showed up:

  • Hang at the first gradient accumulation step
  • Hang when determining rank at rendezvous

The combination of an explicit WORLD_SIZE, consistent device order, a single stable port, and a longer TORCH_NCCL_TIMEOUT_MS moved me past both failure modes. I also kept --max_restarts=0 so a partial failure could not trigger an elastic restart that reshuffles ranks and muddies debugging.
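
Put together, the launch looked roughly like the sketch below. The script name and config path are placeholders; NODE_RANK is 0, 1, or 2 depending on the machine, and the master address is the head node from earlier:

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export WORLD_SIZE=24   # 3 nodes x 8 GPUs, stated explicitly

torchrun \
  --nnodes=3 \
  --node_rank="$NODE_RANK" \
  --nproc_per_node=8 \
  --master_addr=10.18.122.130 \
  --master_port=39500 \
  --max_restarts=0 \
  train.py --deepspeed ds_config.json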

Logs that mattered

  • TORCH_DISTRIBUTED_DEBUG=INFO to confirm rendezvous and rank assignment
  • NCCL_DEBUG=INFO (or WARN once stable) to see collective failures and timeouts
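
Both are plain environment variables; for example:

export TORCH_DISTRIBUTED_DEBUG=INFO   # rendezvous and rank-assignment logging
export NCCL_DEBUG=INFO                # drop to WARN once things are stable
export NCCL_DEBUG_SUBSYS=INIT,NET     # optional: limit NCCL output to init and transport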

From one successful DeepSpeed init, the memory/bucket lines confirmed my ZeRO settings took effect:

[Rank 0] Reduce bucket size 500000000
[Rank 0] Prefetch bucket size 500000000
DeepSpeedZeRoOffload initialize [end] ... MA 39.45 GB, CA ~71 GB
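
Those bucket numbers trace back to the ZeRO section of the DeepSpeed config; the relevant fragment looks roughly like this (only the keys that produce those log lines, not a full config):

{
  "zero_optimization": {
    "stage": 3,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e8
  }
}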

Practical networking checklist

  • Containers run with --network host
  • One MASTER_ADDR and one MASTER_PORT used everywhere
  • TCP verified out-of-band before a retry
  • CUDA_DEVICE_ORDER=PCI_BUS_ID set on all nodes
  • NCCL_* environment variables explicit and consistent

With rendezvous and collectives stable, I could finally address the last persistent pain: checkpointing and LoRA saves under ZeRO-3. That’s Part 5.