Debugging CUDA Driver Chaos: A Deep Dive into Kernel Module Version Mismatches
A detailed walkthrough of debugging CUDA driver and kernel module version mismatches across multiple nodes in a distributed DeepSeek V3 training cluster
When training large language models like DeepSeek V3 across multiple nodes, the last thing you want is CUDA failing on random nodes in your cluster. Yet that's exactly what happened during a recent distributed training run. What started as a simple "CUDA not available" error turned into a multi-day debugging odyssey through kernel versions, DKMS build failures, and driver-library mismatches.
This post documents the technical journey of debugging and fixing CUDA issues on two nodes (Node 4 and Node 5) in our training cluster, with lessons that could save you hours of troubleshooting.
The Setup
We were running a distributed training setup for DeepSeek V3 LoRA fine-tuning across multiple nodes, each equipped with 8× NVIDIA A100-80GB GPUs. The environment consisted of Ubuntu machines running Oracle kernels, with a mix of kernel versions that would soon become the source of our problems.
The Symptoms
The issues manifested differently on each node:
Node 4:
>>> import torch
>>> torch.cuda.is_available()
False
Node 5:
>>> torch.cuda.is_available()
/home/ubuntu/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
False
Running nvidia-smi revealed the dreaded message:
Failed to initialize NVML: Driver/library version mismatch
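Before touching anything, it helps to pin down which side of the mismatch is which. Here is a minimal read-only check, assuming only the standard NVIDIA driver paths and nvidia-smi query flags:
# Version of the kernel module that is currently loaded (if any)
cat /proc/driver/nvidia/version 2>/dev/null || cat /sys/module/nvidia/version
# Version of the userspace driver stack
nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>&1 | head -n1
# On a broken node the second command fails or reports a different number than the first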
The Root Causes
Oracle Kernel Complexity
Both nodes were running Oracle-optimized kernels, which added layers of complexity:
- Prebuilt modules vs DKMS: Oracle kernels often ship with prebuilt NVIDIA kernel modules, but package managers don't always recognize this, leading to DKMS attempting to compile modules that already exist (a quick check for this follows the list).
- Version misalignment: We had a mix of kernel versions across nodes:
  - 6.5.0-1026-oracle
  - 5.15.0-1040-oracle
  - 5.15.0-1047-oracle
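If you are not sure whether prebuilt modules exist for the kernel you are actually running, a quick apt query settles it. This is a sketch that assumes Ubuntu's standard linux-modules-nvidia-* package naming:
# Which kernel is running, and which kernels are installed?
uname -r
dpkg -l | grep linux-image | awk '{print $2}'
# Are prebuilt NVIDIA modules published for this exact kernel?
apt-cache search "linux-modules-nvidia-.*-$(uname -r)"
# An empty result means apt will fall back to DKMS compilation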
Node 4: The DKMS Nightmare
Node 4's issues stemmed from attempting to install NVIDIA drivers on the 6.5 Oracle kernel. The key problem was that apt wasn't pulling the matching linux-modules-nvidia-*-6.5.0-1027-oracle package, so it fell back to DKMS compilation.
The DKMS build failed due to kernel configuration mismatches, particularly with CONFIG_X86_KERNEL_IBT. The logs showed compilation errors that pointed to fundamental incompatibilities between the driver source and the Oracle kernel configuration.
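When DKMS fails like this, the useful compiler errors are not in the apt output; they land in the DKMS build tree. The commands below are a sketch, and the 550.163.01 in the path is only an example, so substitute whichever driver version DKMS actually tried to build:
# What DKMS thinks it has built, and for which kernels
dkms status
# The real compiler output for a failed build
sudo less /var/lib/dkms/nvidia/550.163.01/build/make.log
# Check the kernel option the build tripped over
grep CONFIG_X86_KERNEL_IBT /boot/config-$(uname -r)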
Node 5: The Version Mismatch Maze
Node 5 presented a different challenge - a classic driver/library version mismatch:
NVRM: API mismatch: the client has the version 550.163.01, but this kernel module has the version 570.158.01
The complexity was compounded by:
- Multiple kernel versions installed (5.15.0-1040 and 5.15.0-1047)
- DKMS trying to build for both kernels
- Existing NVIDIA 570 modules conflicting with attempts to install 550
- Half-configured dpkg states blocking clean installations
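A few read-only checks made the picture concrete before we changed anything. This is a minimal sketch using standard tooling:
# The NVRM API-mismatch message comes from the kernel log
sudo dmesg | grep -i nvrm | tail -n 20
# Which module version is loaded right now?
cat /sys/module/nvidia/version
# Which driver packages are installed, and are any half-configured?
dpkg -l | grep -E 'nvidia|libnvidia' | awk '{print $1, $2, $3}'
# Anything other than "ii" in the first column indicates a broken package state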
The Solutions
Node 4: Embracing Prebuilt Modules
The solution for Node 4 was to work with the Oracle kernel ecosystem rather than against it:
# Boot into the 6.5 Oracle kernel
sudo grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 6.5.0-1026-oracle"
sudo reboot
# Remove conflicting repositories and DKMS packages
sudo add-apt-repository -y -r ppa:graphics-drivers/ppa || true
sudo rm -f /etc/apt/sources.list.d/cuda-ubuntu2204* /etc/apt/sources.list.d/nvidia* || true
# Install the server driver with prebuilt modules
sudo apt-get install -y "linux-modules-nvidia-550-server-$(uname -r)" \
nvidia-driver-550-server nvidia-fabricmanager-550 nvidia-modprobe
# Enable Fabric Manager for multi-GPU systems
sudo systemctl enable --now nvidia-fabricmanager
The key insight: use nvidia-driver-550-server (not the standard driver) on datacenter GPUs with Oracle kernels, as it properly pairs with the prebuilt kernel modules.
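After the reboot and reinstall, a quick verification pass confirms that the kernel module and userspace agree. This is a sketch of the kind of checks that are useful here; output will obviously differ per node:
# Kernel module and userspace should now report the same 550-series version
cat /proc/driver/nvidia/version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Fabric Manager must be active for the multi-GPU topology to initialize
systemctl is-active nvidia-fabricmanager
# And the original symptom should be gone
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"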
Node 5: Module Unload and Reload
Node 5 required a more surgical approach to swap out the loaded kernel modules:
#!/usr/bin/env bash
set -euxo pipefail
# Stop services holding GPU resources
sudo systemctl stop nvidia-fabricmanager nvidia-persistenced docker containerd
# Force unload old modules (order matters!)
for m in nvidia_drm nvidia_modeset nvidia_uvm nvidia_peermem nvidia; do
sudo modprobe -r "$m" 2>/dev/null || sudo rmmod "$m" 2>/dev/null || true
done
# Clean up stale kernel trees to prevent DKMS conflicts
for K in /lib/modules/*; do
[ -d "$K" ] || continue
if [[ "$(basename "$K")" != "$(uname -r)" ]]; then
sudo rm -rf "$K"
fi
done
# Reinstall matching driver version
sudo apt-get -y purge 'nvidia-*' 'libnvidia-*'
sudo apt-get -y install nvidia-driver-550 nvidia-utils-550 libnvidia-compute-550
# Load fresh modules
sudo modprobe nvidia nvidia_uvm nvidia_modeset nvidia_drm
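One optional follow-up, not part of the fix itself: pinning the driver packages prevents a routine apt upgrade from silently pulling in a newer branch and recreating the 550/570 split. A sketch using apt-mark:
# Hold the 550 driver stack so upgrades can't move it without an explicit decision
sudo apt-mark hold nvidia-driver-550 nvidia-utils-550 libnvidia-compute-550
# Confirm the hold took effect
apt-mark showhold | grep nvidia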
Docker Runtime Configuration
After fixing the driver issues, Docker containers still couldn't access GPUs due to missing NVIDIA runtime configuration:
# Install NVIDIA Container Toolkit
sudo apt-get install -y nvidia-container-toolkit
# Configure Docker to use NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify with a test container
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
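For reference, the nvidia-ctk command edits /etc/docker/daemon.json. The snippet below shows roughly what the relevant section looks like after a default configure run; treat it as illustrative, since your file may carry additional settings:
# Inspect what nvidia-ctk wrote to Docker's config
cat /etc/docker/daemon.json
# Typical result (illustrative):
# {
#   "runtimes": {
#     "nvidia": {
#       "args": [],
#       "path": "nvidia-container-runtime"
#     }
#   }
# }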
Key Lessons Learned
- Oracle kernels require special handling: Always prefer prebuilt NVIDIA modules over DKMS compilation when using Oracle kernels. The -server variant drivers are designed for this use case.
- Version consistency is critical: In distributed training, every node should run identical driver versions, and within a node the userspace libraries and kernel module must match exactly. Our 550.163.01 vs 570.158.01 split was enough to take a node out of the run.
- DKMS can be your enemy: While DKMS promises automatic module compilation, it often fails with specialized kernels. When possible, use prebuilt modules.
- Clean state matters: Half-configured packages and stale kernel module trees can block successful installations. Always clean thoroughly before reinstalling.
- Module unloading order matters: When manually unloading NVIDIA modules, unload in dependency order: nvidia_drm → nvidia_modeset → nvidia_uvm → nvidia_peermem → nvidia.
- Fabric Manager is essential: For multi-GPU systems, especially with NVSwitch, Fabric Manager must be running. Don't forget to enable it after driver installation.
Debugging Checklist
When facing CUDA availability issues, work through this checklist:
- Check kernel version: uname -r
- Verify loaded modules: lsmod | grep nvidia
- Check module version: cat /sys/module/nvidia/version
- Verify userspace version: nvidia-smi (look for the driver version)
- Check DKMS status: dkms status
- Verify package states: dpkg -l | grep nvidia
- Check for crash logs: ls /var/crash/nvidia-*.crash
- Verify Docker runtime: docker info | grep -i runtime
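If you are doing this across many nodes, the checklist collapses nicely into a single script you can run over ssh before every training launch. This is a convenience sketch, not hardened tooling, so adjust to taste:
#!/usr/bin/env bash
# gpu-health.sh: quick per-node CUDA driver sanity report (illustrative sketch)
echo "kernel:         $(uname -r)"
echo "nvidia module:  $(cat /sys/module/nvidia/version 2>/dev/null || echo 'not loaded')"
echo "userspace:      $(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>&1 | head -n1)"
echo "dkms:           $(dkms status 2>/dev/null | tr '\n' ' ')"
echo "non-ii pkgs:    $(dpkg -l | grep -Ei 'nvidia|libnvidia' | grep -vc '^ii')"
echo "fabricmanager:  $(systemctl is-active nvidia-fabricmanager 2>/dev/null)"
echo "docker runtime: $(docker info 2>/dev/null | grep -i 'default runtime')"
Thirty seconds per node with something like this would have caught both of our problems before the training job ever launched.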
Conclusion
What started as a simple CUDA initialization error evolved into a masterclass in Linux kernel module management. The complexities of running NVIDIA drivers on Oracle kernels, combined with version mismatches and DKMS quirks, created a perfect storm of debugging challenges.
The experience reinforced that in distributed deep learning infrastructure, standardization is key. Every node should run identical kernel versions, driver versions, and configurations. When that's not possible, understanding the intricate relationships between kernel modules, userspace libraries, and package management becomes essential.
For those running similar setups, I hope this post saves you from the hours of debugging we experienced. Remember: when in doubt, clean everything and start fresh with prebuilt modules designed for your specific kernel variant.
Have you encountered similar CUDA driver issues in your distributed training setups? I'd love to hear about your experiences and solutions in the comments below.
This post is part of my ongoing series about distributed deep learning infrastructure at thomaskalnik.com. Follow for more insights from our DeepSeek V3 training journey.