Training DeepSeek V3 on 24× A100s — Part 3: CUDA, Drivers, and Fabric Manager (802)
Diagnosing cudaGetDeviceCount -> error 802 on NVSwitch systems: aligning kernel, driver, and Fabric Manager branches across nodes without bricking boxes.
This post documents the most time-consuming failure mode I hit: nodes where `nvidia-smi` showed GPUs, but PyTorch returned `cudaGetDeviceCount -> error 802 (system not yet initialized)`. On HGX/A100 NVSwitch systems, a down or mismatched Fabric Manager can poison CUDA init even when the GPUs look fine.
Symptoms I observed
- `nvidia-smi` lists all 8 GPUs
- `torch.cuda.is_available()` → False
- Python warns: "Unexpected error from cudaGetDeviceCount()... Error 802: system not yet initialized"
- Training stalls at DDP init on those nodes
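A quick way to confirm this combination (the driver sees the GPUs while CUDA init fails) is to compare the driver's view with PyTorch's. This is a minimal check; it assumes you run it inside the same Python environment or container used for training:

```bash
# Driver view: lists the GPUs even when CUDA init is poisoned
nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader

# CUDA runtime view through PyTorch: on an affected node this prints
# "available: False" (and emits the error-802 warning) even though
# device_count may still report 8
python3 -c "import torch; print('available:', torch.cuda.is_available()); print('device_count:', torch.cuda.device_count())"
```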
From my notes:
cudaGetDeviceCount -> 802 (system not yet initialized) with device_count: 8 but is_available: False is classic “driver/runtime initialization is poisoned” — frequently happens on NVSwitch nodes when FM is down or mismatched.
Root cause: driver/FM branch mismatch
I had mixed branches across nodes and even within a single node’s history:
- Host driver (kernel module) at `550.163.01`
- Fabric Manager package pulled in at `570.158.01` (due to Jammy packaging oddities)
- DKMS conflicts across two Oracle kernels (`5.15.0-1040-oracle` and `5.15.0-1047-oracle`)
Fabric Manager must match the driver branch on NVSwitch systems. If FM 570 tries to start against a 550 driver, it refuses to run; CUDA initialization breaks with 802 despite GPUs being listed.
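Before touching any packages, it is worth confirming that the two branches really disagree on a given node. A minimal sketch, assuming Ubuntu/apt and the usual `nvidia-fabricmanager-<branch>` package naming:

```bash
# Branch of the loaded kernel driver (e.g. 550 from 550.163.01)
DRIVER_BRANCH=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1 | cut -d. -f1)

# Branch of the installed Fabric Manager package (e.g. 570 from 570.158.01)
FM_BRANCH=$(dpkg -l 'nvidia-fabricmanager-*' | awk '/^ii/ {print $3}' | head -1 | cut -d. -f1)

echo "driver branch: ${DRIVER_BRANCH}, fabric manager branch: ${FM_BRANCH}"
if [ "${DRIVER_BRANCH}" != "${FM_BRANCH}" ]; then
  echo "MISMATCH: expect error 802 until both sit on the same branch" >&2
fi
```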
What actually fixed it (cleanly)
- I aligned the entire node to a single branch. Given the repos at the time, moving to 570 across driver + FM was the cleanest path.
- I avoided `--force` installs that create mixed objects and persistent NVML mismatches.
Checklist I followed:
```bash
# Check versions
nvidia-smi --query-gpu=driver_version --format=csv,noheader
systemctl is-active nvidia-fabricmanager || \
  systemctl status nvidia-fabricmanager --no-pager -l
uname -r

# Ensure FM matches the driver branch
sudo apt-get update
sudo apt-get install -y nvidia-fabricmanager-570
sudo systemctl enable nvidia-fabricmanager
sudo systemctl restart nvidia-fabricmanager

# If DKMS remnants exist for old kernels, purge stale entries, then
# reinstall driver/FM on the active kernel only
```
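The "purge stale DKMS entries" step looked roughly like this on my nodes; the version strings below are examples from my setup, so substitute whatever `dkms status` actually reports:

```bash
# See which NVIDIA module versions DKMS knows about, and for which kernels
dkms status | grep -i nvidia

# Example: drop a stale 550 tree that only exists for an old kernel
sudo dkms remove nvidia/550.163.01 --all

# Rebuild the 570 modules against the running kernel only
# (assumes the 570 driver source is already registered with DKMS)
sudo dkms install nvidia/570.158.01 -k "$(uname -r)"
sudo update-initramfs -u
```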
I also learned that apt on Jammy could pull FM-570 even when requesting `-550`. The reliable approach was to choose one branch and make both the driver and FM match it.
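To see what apt will actually resolve before committing, `apt-cache policy` and a simulated install are enough; nothing here modifies the system:

```bash
# Which candidate versions apt sees for each FM branch
apt-cache policy nvidia-fabricmanager-550 nvidia-fabricmanager-570

# What a 550 install would actually pull in, without installing anything
apt-get install --simulate nvidia-fabricmanager-550 | grep -i fabricmanager
```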
Kernel flavor and DKMS pitfalls
I experimented with kernel `6.5.0-oracle` and `5.15.0-oracle`. The issue was not the base kernel itself but mixing Oracle-flavor kernels with DKMS NVIDIA drivers and prebuilt kernel modules. When the system already shipped 570 modules for one kernel, attempting to install 550 via DKMS led to unconfigured packages and NVML mismatches.
From my notes:
Your kernel already has NVIDIA 570.181 modules baked for 5.15.0-1040-oracle, while userspace is 550.163. DKMS refuses to install 550.
Do not --force — you’ll end up with mixed objects and persistent “NVML mismatch”.
Actionable lesson: pick a branch (e.g., 570), ensure the active kernel has matching modules, and avoid cross-branch blends.
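To verify that the active kernel's modules and userspace agree on a branch, these standard checks are enough (all of them exist on a stock driver install):

```bash
# Module currently loaded (requires the nvidia module to be loaded)
cat /proc/driver/nvidia/version

# Module on disk for the running kernel (works even if not loaded)
modinfo -F version nvidia

# Version userspace/NVML reports
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1

# All of these should agree on the branch; a 550-vs-570 split here is the
# "NVML mismatch" state described above
```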
Fabric Manager on NVSwitch boxes isn’t optional
On HGX/NVSwitch systems, FM brokers switch initialization. If FM is down or mismatched, CUDA can report 802. After aligning branches, I made sure FM was healthy:
```bash
systemctl status nvidia-fabricmanager --no-pager -l
journalctl -u nvidia-fabricmanager -n 200 --no-pager
```
Only after FM was green did `torch.cuda.is_available()` turn True on those nodes.
Two-node temptation, three-node reality
I briefly considered dropping to 2 nodes (16 GPUs). The math from my run logs made it clear that was not safe with ZeRO-3 on a 671B base:
- 24 ranks: ~50 GB/rank observed
- 16 ranks: 1.5× shard size → ~75 GB/rank before activations/comm buffers
That leaves little headroom on 80 GB cards, and OOMs are likely. I kept 3 nodes as the floor.
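The scaling itself is just the ZeRO-3 shard fraction. A back-of-envelope check, assuming the per-rank footprint scales roughly as 1/ranks and ignoring fixed per-rank overheads:

```bash
# Scale the observed ~50 GB/rank at 24 ranks to other world sizes
OBSERVED_GB=50
BASE_RANKS=24
for RANKS in 24 16; do
  awk -v obs="$OBSERVED_GB" -v base="$BASE_RANKS" -v r="$RANKS" \
    'BEGIN { printf "%2d ranks: ~%.0f GB/rank (before activations/comm buffers)\n", r, obs * base / r }'
done
```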
Final quick checks
- All nodes: same driver/FM branch, same kernel flavor
- FM active and healthy before launching containers
- `torch.cuda.is_available()` returns True; no 802
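I ran these checks as one loop from the head node before every launch. The hostnames are placeholders; substitute your own node list and Python environment:

```bash
# Placeholder hostnames; replace with your actual node list
NODES="node1 node2 node3"

for h in $NODES; do
  echo "=== $h ==="
  ssh "$h" '
    uname -r
    nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1
    systemctl is-active nvidia-fabricmanager
    python3 -c "import torch; print(\"cuda available:\", torch.cuda.is_available())"
  '
done
```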
With the driver stack aligned, the remaining stability issues were networking and rendezvous related, which I cover in Part 4.