DevOps Deep Dive: Docker for Multi-Node LLM Training
Container recipe and health checks used to keep multi-node LLaMA-Factory runs consistent.
This post collects the exact Docker commands and checks I used while orchestrating multi-node training. All snippets come from my notes repo.
Base container run (host networking, GPUs, large shm)
docker run -d \
--name llamafactory \
--network host \
--gpus all \
--ipc host \
--shm-size=16g \
-v /nfs:/nfs \
-v /home/ubuntu/LLaMA-Factory:/app \
--workdir /app \
hiyouga/llamafactory:latest \
sleep infinity
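The shared-memory settings deserve a sanity check: PyTorch dataloader workers pass tensors through /dev/shm, and Docker's default of 64 MB produces opaque "Bus error" crashes. (With --ipc host the container shares the host's /dev/shm, so the effective size is the host's, as far as I can tell.) A minimal in-container check; the 8 GiB floor is my own illustrative threshold, not anything LLaMA-Factory requires:

```shell
#!/usr/bin/env bash
# Verify /dev/shm is actually large enough; run this inside the container.
# The 8 GiB floor below is an illustrative threshold, not a hard requirement.
set -euo pipefail
shm_kb=$(df -k /dev/shm | awk 'NR==2 {print $2}')
min_kb=$((8 * 1024 * 1024))
if [ "$shm_kb" -ge "$min_kb" ]; then
  echo "shm OK: $((shm_kb / 1024)) MiB"
else
  echo "shm too small: $((shm_kb / 1024)) MiB (dataloader workers may hit 'Bus error')"
fi
```

Running it once per node right after bring-up catches a missing --shm-size before the first checkpoint is ever written.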
Enter the container:
docker exec -it llamafactory /bin/bash
Graceful stop: kill in-container training processes
docker exec -it llamafactory bash -lc '
echo "Killing training...";
# bracket trick: the pattern text itself no longer matches this shell
pkill -f "[t]orchrun|[d]eepspeed|[a]ccelerate|src/[t]rain\.py" || true;
sleep 1;
pgrep -fal "[t]orchrun|[d]eepspeed|[a]ccelerate|src/[t]rain\.py" || true
'
The [t]orchrun-style bracketing matters: without it, pkill -f also matches the bash -lc wrapper itself, whose command line contains the literal text "torchrun", and terminates the session mid-script.
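The alternation passed to pkill -f is plain extended regex, so the whole kill-and-verify pattern can be rehearsed against a throwaway process before pointing it at real trainers. A self-contained sketch using a dummy sleep:

```shell
#!/usr/bin/env bash
# Rehearse the pkill -f / pgrep -f pattern on a harmless stand-in process.
set -u
sleep 300 &                       # stand-in for a training process
# "[s]leep 300" keeps the pattern from matching this script itself,
# since the script command line contains brackets, not "sleep 300".
pkill -f "[s]leep 300" || true
sleep 1
if pgrep -fa "[s]leep 300"; then
  echo "still running"
else
  echo "all matching processes gone"
fi
```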
Permissions and quick NFS sanity
# blunt, but sidesteps UID mismatches across nodes sharing the export
sudo chmod -R 777 /nfs/deepseek_v3_lora_prod_run_5
# NFS visibility
dmesg | tail -n 200 | sed -n '/nfs\|rpc\|lockd\|sunrpc/Ip'
sudo journalctl -k --no-pager -n 200 | sed -n '/nfs\|rpc\|lockd\|sunrpc/Ip'
showmount -e 10.18.126.17 2>&1 | sed -n '1,120p'
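showmount only proves the server is exporting; it says nothing about whether the client-side mount is actually writable by the training user. A small probe covers that gap (the function name is my own, and the /nfs line is the real check this setup would want):

```shell
#!/usr/bin/env bash
# Probe whether a mount point is writable by the current user,
# the failure mode showmount cannot see. Function name is hypothetical.
probe_writable() {
  local dir="$1" f
  f="$dir/.write_probe.$$"
  if touch "$f" 2>/dev/null; then
    rm -f "$f"
    echo "writable: $dir"
  else
    echo "NOT writable: $dir" >&2
    return 1
  fi
}

probe_writable /tmp     # local sanity check; should always pass
# probe_writable /nfs   # the real check, run on each training node
```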
Find largest files when space gets tight
sudo find / -xdev -type f -printf '%s\t%p\n' 2>/dev/null \
| sort -nr | head -n 50 \
| numfmt --to=iec --suffix=B --field=1 --padding=7
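The same pipeline works on any subtree, which is handy for sizing up a checkpoint directory before a run. A self-contained rehearsal on a throwaway directory (the file names and sizes are arbitrary):

```shell
#!/usr/bin/env bash
# Rehearse the largest-files pipeline on a temp dir with known sizes.
set -euo pipefail
d=$(mktemp -d)
truncate -s 5M "$d/ckpt.bin"      # arbitrary demo sizes
truncate -s 1M "$d/train.log"
find "$d" -xdev -type f -printf '%s\t%p\n' 2>/dev/null \
  | sort -nr | head -n 50 \
  | numfmt --to=iec --suffix=B --field=1 --padding=7
rm -rf "$d"
```

ckpt.bin should come out on top, formatted as a human-readable size by numfmt just as in the full-disk version above.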
These basics made container bring-up reproducible across nodes and kept the runtime surface small when debugging training issues.