Training DeepSeek V3 on 24× A100s — Part 1: Infrastructure, Containers, and Reproducibility
How I stood up a 3-node A100 cluster, containerized LLaMA-Factory, and tamed the orchestration risks that derail multi-node training before it even starts.
I trained DeepSeek V3 (671B MoE) with LoRA adapters across three nodes (24× A100 80GB) and learned more about infrastructure than I bargained for. This post covers the practical side: the cluster layout, Docker-based reproducibility, and the small-but-critical choices (like picking the right rendezvous port) that let the rest of the system work.
Hardware and topology
- 3 nodes, each with 8× NVIDIA A100-SXM4-80GB
- Oracle Linux–flavored kernels and NVSwitch backplanes
- Shared NFS available at /nfs for models and datasets, but I avoided writing checkpoints there during training (details in Part 5)
Sanity checks I ran on each node:
nvidia-smi -L
lspci | grep -i nvidia
I kept GPU enumeration consistent with:
export CUDA_DEVICE_ORDER=PCI_BUS_ID
Without that, multi-node rank mapping can drift and overlap on the same PCI devices. More on the root cause and fix in Part 4.
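To catch drift early, I liked comparing the bus-ID-ordered GPU list across nodes before every run. A minimal sketch of that check (assuming passwordless SSH and a NODES variable holding the three hostnames; not the exact script I used):
# Cross-node sanity check (sketch): with PCI_BUS_ID ordering, the index/bus-ID
# listing should look identical on every node.
for host in $NODES; do
  echo "== $host =="
  ssh ubuntu@"$host" "nvidia-smi --query-gpu=index,pci.bus_id,name --format=csv,noheader"
done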
Why I containerized everything
I started bare-metal and hit dependency drift immediately: NCCL, CUDA, and PyTorch versions varied by node, and TensorBoard missing on one node could crash training. Containerizing LLaMA-Factory eliminated the drift and made bring-up repeatable per-node.
I used the upstream image:
docker run -d \
--name llamafactory \
--network host \
--gpus all \
--ipc host \
--shm-size=16g \
-v /nfs:/nfs \
-v /home/ubuntu/LLaMA-Factory:/app \
-v /home/ubuntu:/host_home \
--workdir /app \
hiyouga/llamafactory:latest \
sleep infinity
Mounts I cared about:
- /app: my LLaMA-Factory repo
- /nfs: big, shared, read-mostly (models, datasets)
- /host_home: node-local fast storage for logs and checkpoints
I validated the container could read and write where I needed it to:
docker exec -i llamafactory bash -lc 'test -d /app/src && echo app:OK'
docker exec -i llamafactory bash -lc 'echo ok > /host_home/_container_write_test && ls -l /host_home/_container_write_test'
The rendezvous port that actually worked
Torch distributed needs a rendezvous TCP port reachable from every node. I tried a few and consistently ended up back at 39500. I opened it in the instance firewalls and ran the container with --network host, so it shares the host's network namespace and avoids NAT/port-publishing complexity. If the port was still held after a crash, I cleared it before relaunching:
# On each node
sudo netstat -tulnp | grep :39500 || true
sudo fuser -k 39500/tcp 2>/dev/null || true
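Once the head rank was up and listening, a quick probe from each worker confirmed the port was actually reachable before I launched the rest. A minimal sketch using bash's built-in /dev/tcp (the head IP matches the launch snippet below):
# From a worker: can we open a TCP connection to the head's rendezvous port?
timeout 3 bash -c 'cat < /dev/null > /dev/tcp/10.18.122.130/39500' \
  && echo "39500 reachable" \
  || echo "39500 NOT reachable"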
Head and workers: a disciplined launch sequence
I automated node prep and launches over SSH. A representative sequence looked like this (trimmed for clarity):
# Per node prep (create container, mount volumes, basic health checks)
ssh ubuntu@$HOST <<'REMOTE'
set -e
mkdir -p /home/ubuntu/deepseek_v3_lora_prod_run_1
docker rm -f llamafactory 2>/dev/null || true
# container run (see above)
# health checks
REMOTE
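The health checks themselves were simple: confirm the container sees all eight GPUs and that the PyTorch/CUDA/NCCL combination matches across nodes. An illustrative sketch (not the exact script):
# Inside the container: GPU count and library versions (compare output across nodes)
docker exec -i llamafactory bash -lc 'nvidia-smi -L | wc -l'   # expect 8
docker exec -i llamafactory bash -lc \
  'python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.nccl.version())"'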
# Launch workers detached so SSH returns quickly
ssh ubuntu@$WORKER NODE_RANK=$NODE_RANK 'bash -s' <<'REMOTE'
# docker exec -d detaches immediately; NODE_RANK is passed into the container
docker exec -d -e NODE_RANK="$NODE_RANK" llamafactory bash -lc '
  export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
  export CUDA_DEVICE_ORDER=PCI_BUS_ID
  export LOCAL_WORLD_SIZE=8
  export WORLD_SIZE=24
  export MASTER_ADDR=10.18.122.130
  export MASTER_PORT=39500
  export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,garbage_collection_threshold:0.8
  export NCCL_DEBUG=WARN
  export NCCL_SOCKET_IFNAME=eth0
  export GLOO_SOCKET_IFNAME=eth0
  export NCCL_P2P_DISABLE=1
  export NCCL_IB_DISABLE=1
  export OMP_NUM_THREADS=1
  export HF_ENABLE_PARALLEL_LOADING=true
  mkdir -p /host_home/deepseek_v3_lora_prod_run_1/logs
  torchrun \
    --nnodes=3 \
    --nproc_per_node=8 \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    --max_restarts=0 \
    src/train.py [args] \
    2>&1 | tee -a /host_home/deepseek_v3_lora_prod_run_1/logs/node${NODE_RANK}.log
'
REMOTE
Where [args] includes the LoRA, dataset, and DeepSpeed flags. The full invocation and DeepSpeed config are in Part 2.
Node-local output to stay alive
I quickly learned not to write checkpoints directly to NFS during training: DeepSpeed’s ZeRO shards and optimizer state can explode I/O and inode churn. I wrote to node-local storage (/host_home/...) during training, then moved artifacts after the run. I also ran a tiny janitor to aggressively delete the large non-LoRA state files DeepSpeed keeps trying to save (details and script in Part 5).
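Promoting the final artifacts from node-local disk to NFS afterwards was a single copy from inside the container; a sketch (the /nfs destination path here is illustrative, not a path from the run):
# Post-run: copy final artifacts from node-local storage to shared NFS
docker exec -i llamafactory bash -lc '
  mkdir -p /nfs/checkpoints/deepseek_v3_lora_prod_run_1 &&
  cp -a /host_home/deepseek_v3_lora_prod_run_1/. /nfs/checkpoints/deepseek_v3_lora_prod_run_1/
'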
Cleaning state between runs
Lingering processes, NCCL shared memory, and half-dead containers will poison the next run. My reset procedure (excerpt):
# Kill dangling processes
sudo pkill -9 -f torchrun || true
sudo pkill -9 -f train.py || true
# Reset GPUs and clear shm
sudo nvidia-smi --gpu-reset || true
sudo rm -rf /tmp/.torch_distributed_* /dev/shm/torch_* /tmp/cuda_* ~/.nv/
# Restart running containers
for c in $(docker ps -q); do docker restart "$c"; done
# Free the rendezvous port
sudo fuser -k 39500/tcp 2>/dev/null || true
I wrapped this in cleanup_post_training_state.sh to make “known-good” a one-liner across nodes.
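Fanning the reset out to all three nodes was then just a loop; a sketch, assuming the script sits at the same path in each node's home directory:
# Reset every node in parallel (script path is an assumption)
for host in $NODES; do
  ssh ubuntu@"$host" 'bash ~/cleanup_post_training_state.sh' &
done
wait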
A checklist that actually prevented fires
- CUDA_DEVICE_ORDER=PCI_BUS_ID set on every node
- OMP_NUM_THREADS=1 to avoid noisy CPU interference
- --network host and a single blessed MASTER_PORT=39500
- Checkpointing to node-local storage; NFS only for datasets/models
- TensorBoard installed in the container before training
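Most of that list is mechanically checkable, so a small preflight sketch run on each node can catch problems before torchrun does (my approximation, not the script from the run):
# Preflight sketch: GPU count and TensorBoard inside the container, rendezvous port free on the host
docker exec -i llamafactory bash -lc '
  [ "$(nvidia-smi -L | wc -l)" -eq 8 ]        || echo "WARN: expected 8 GPUs"
  python -c "import tensorboard" 2>/dev/null  || echo "WARN: tensorboard missing"
'
sudo netstat -tulnp | grep -q ":39500 " && echo "WARN: 39500 already in use" || echo "39500 free"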
With the physical and container layer stable, I could focus on Torch distributed and DeepSpeed, which I cover in Part 2.