Training DeepSeek V3 on 24× A100s — Part 1: Infrastructure, Containers, and Reproducibility
How I stood up a 3-node A100 cluster, containerized LLaMA-Factory, and tamed the orchestration risks that derail multi-node training before it even starts.
I trained DeepSeek V3 (671B MoE) with LoRA adapters across three nodes (24× A100 80GB) and learned more about infrastructure than I bargained for. This post covers the practical side: the cluster layout, Docker-based reproducibility, and the small-but-critical choices (like picking the right rendezvous port) that let the rest of the system work.
Hardware and topology
- 3 nodes, each with 8× NVIDIA A100-SXM4-80GB
- Oracle Linux–flavored kernels and NVSwitch backplanes
- Shared NFS available at /nfs for models and datasets, but I avoided writing checkpoints there during training (details in Part 5)
Sanity checks I ran on each node:
nvidia-smi -L
lspci | grep -i nvidia
I kept GPU enumeration consistent with:
export CUDA_DEVICE_ORDER=PCI_BUS_ID
Without that, multi-node rank mapping can drift and overlap on the same PCI devices. More on the root cause and fix in Part 4.
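To catch drift early, I liked comparing the bus-ID-ordered GPU list across nodes before every run. A minimal sketch of that check (assuming passwordless SSH and a NODES variable holding the three hostnames; not the exact script I used):
# Cross-node sanity check (sketch): with PCI_BUS_ID ordering, the index/bus-ID
# listing should look identical on every node.
for host in $NODES; do
  echo "== $host =="
  ssh ubuntu@"$host" "nvidia-smi --query-gpu=index,pci.bus_id,name --format=csv,noheader"
done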
Why I containerized everything
I started bare-metal and hit dependency drift immediately: NCCL, CUDA, and PyTorch versions varied by node, and TensorBoard missing on one node could crash training. Containerizing LLaMA-Factory eliminated the drift and made bring-up repeatable per-node.
I used the upstream image:
docker run -d \
--name llamafactory \
--network host \
--gpus all \
--ipc host \
--shm-size=16g \
-v /nfs:/nfs \
-v /home/ubuntu/LLaMA-Factory:/app \
-v /home/ubuntu:/host_home \
--workdir /app \
hiyouga/llamafactory:latest \
sleep infinity
Mounts I cared about:
- /app: my LLaMA-Factory repo
- /nfs: big, shared, read-mostly (models, datasets)
- /host_home: node-local fast storage for logs and checkpoints
I validated the container could read and write where I needed it to:
docker exec -i llamafactory bash -lc 'test -d /app/src && echo app:OK'
docker exec -i llamafactory bash -lc 'echo ok > /host_home/_container_write_test && ls -l /host_home/_container_write_test'
The rendezvous port that actually worked
Torch distributed needs a rendezvous TCP port reachable from every node. I tried a few and consistently ended up back at 39500. I opened it in the instance firewalls and ran the container with --network host, so it shares the host's network namespace and avoids NAT/port-publishing complexity. If the port was still held after a crash, I cleared it before relaunching:
# On each node
sudo netstat -tulnp | grep :39500 || true
sudo fuser -k 39500/tcp 2>/dev/null || true
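Once the head rank was up and listening, a quick probe from each worker confirmed the port was actually reachable before I launched the rest. A minimal sketch using bash's built-in /dev/tcp (the head IP matches the launch snippet below):
# From a worker: can we open a TCP connection to the head's rendezvous port?
timeout 3 bash -c 'cat < /dev/null > /dev/tcp/10.18.122.130/39500' \
  && echo "39500 reachable" \
  || echo "39500 NOT reachable"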
Head and workers: a disciplined launch sequence
I automated node prep and launches over SSH. A representative sequence looked like this (trimmed for clarity):
# Per node prep (create container, mount volumes, basic health checks)
ssh ubuntu@$HOST <<'REMOTE'
set -e
mkdir -p /home/ubuntu/deepseek_v3_lora_prod_run_1
docker rm -f llamafactory 2>/dev/null || true
# container run (see above)
# health checks
REMOTE
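The health checks themselves were simple: confirm the container sees all eight GPUs and that the PyTorch/CUDA/NCCL combination matches across nodes. An illustrative sketch (not the exact script):
# Inside the container: GPU count and library versions (compare output across nodes)
docker exec -i llamafactory bash -lc 'nvidia-smi -L | wc -l'   # expect 8
docker exec -i llamafactory bash -lc \
  'python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.nccl.version())"'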
# Launch workers detached so SSH returns quickly
ssh ubuntu@$WORKER NODE_RANK=$NODE_RANK 'bash -s' <<'REMOTE'
# docker exec -d detaches immediately; NODE_RANK is passed into the container
docker exec -d -e NODE_RANK="$NODE_RANK" llamafactory bash -lc '
  export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
  export CUDA_DEVICE_ORDER=PCI_BUS_ID
  export LOCAL_WORLD_SIZE=8
  export WORLD_SIZE=24
  export MASTER_ADDR=10.18.122.130
  export MASTER_PORT=39500
  export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,garbage_collection_threshold:0.8
  export NCCL_DEBUG=WARN
  export NCCL_SOCKET_IFNAME=eth0
  export GLOO_SOCKET_IFNAME=eth0
  export NCCL_P2P_DISABLE=1
  export NCCL_IB_DISABLE=1
  export OMP_NUM_THREADS=1
  export HF_ENABLE_PARALLEL_LOADING=true
  mkdir -p /host_home/deepseek_v3_lora_prod_run_1/logs
  torchrun \
    --nnodes=3 \
    --nproc_per_node=8 \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    --max_restarts=0 \
    src/train.py [args] \
    2>&1 | tee -a /host_home/deepseek_v3_lora_prod_run_1/logs/node${NODE_RANK}.log
'
REMOTE
Where [args] includes the LoRA, dataset, and DeepSpeed flags. The full invocation and DeepSpeed config are in Part 2.
Node-local output to stay alive
I quickly learned not to write checkpoints directly to NFS during training: DeepSpeed’s ZeRO shards and optimizer state can explode I/O and inode churn. I wrote to node-local storage (/host_home/...) during training, then moved artifacts after the run. I also ran a tiny janitor to aggressively delete the large non-LoRA state files DeepSpeed keeps trying to save (details and script in Part 5).
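Promoting the final artifacts from node-local disk to NFS afterwards was a single copy from inside the container; a sketch (the /nfs destination path here is illustrative, not a path from the run):
# Post-run: copy final artifacts from node-local storage to shared NFS
docker exec -i llamafactory bash -lc '
  mkdir -p /nfs/checkpoints/deepseek_v3_lora_prod_run_1 &&
  cp -a /host_home/deepseek_v3_lora_prod_run_1/. /nfs/checkpoints/deepseek_v3_lora_prod_run_1/
'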
Cleaning state between runs
Lingering processes, NCCL shared memory, and half-dead containers will poison the next run. My reset procedure (excerpt):
# Kill dangling processes
sudo pkill -9 -f torchrun || true
sudo pkill -9 -f train.py || true
# Reset GPUs and clear shm
sudo nvidia-smi --gpu-reset || true
sudo rm -rf /tmp/.torch_distributed_* /dev/shm/torch_* /tmp/cuda_* ~/.nv/
# Restart running containers
for c in $(docker ps -q); do docker restart "$c"; done
# Free the rendezvous port
sudo fuser -k 39500/tcp 2>/dev/null || true
I wrapped this in cleanup_post_training_state.sh to make “known-good” a one-liner across nodes.
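Fanning the reset out to all three nodes was then just a loop; a sketch, assuming the script sits at the same path in each node's home directory:
# Reset every node in parallel (script path is an assumption)
for host in $NODES; do
  ssh ubuntu@"$host" 'bash ~/cleanup_post_training_state.sh' &
done
wait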
A checklist that actually prevented fires
- CUDA_DEVICE_ORDER=PCI_BUS_ID set on every node
- OMP_NUM_THREADS=1 to avoid noisy CPU interference
- --network host and a single blessed MASTER_PORT=39500
- Checkpointing to node-local storage; NFS only for datasets/models
- TensorBoard installed in the container before training
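Most of that list is mechanically checkable, so a small preflight sketch run on each node can catch problems before torchrun does (my approximation, not the script from the run):
# Preflight sketch: GPU count and TensorBoard inside the container, rendezvous port free on the host
docker exec -i llamafactory bash -lc '
  [ "$(nvidia-smi -L | wc -l)" -eq 8 ]        || echo "WARN: expected 8 GPUs"
  python -c "import tensorboard" 2>/dev/null  || echo "WARN: tensorboard missing"
'
sudo netstat -tulnp | grep -q ":39500 " && echo "WARN: 39500 already in use" || echo "39500 free"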
With the physical and container layer stable, I could focus on Torch distributed and DeepSpeed, which I cover in Part 2.