DevOps Deep Dive: Docker for Multi-Node LLM Training

The container recipe and health checks I used to keep multi-node LLaMA-Factory runs consistent.

This post collects the exact Docker commands and checks I used while orchestrating multi-node training. All snippets come from my notes repo.

Base container run (host networking, GPUs, large shm)

docker run -d \
  --name llamafactory \
  --network host \
  --gpus all \
  --ipc host \
  --shm-size=16g \
  -v /nfs:/nfs \
  -v /home/ubuntu/LLaMA-Factory:/app \
  --workdir /app \
  hiyouga/llamafactory:latest \
  sleep infinity  # keep the container alive so we can exec into it

Enter the container:

docker exec -it llamafactory /bin/bash
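A quick sanity pass worth doing inside the container before launching anything: confirm the GPUs are visible and that /dev/shm actually got the size requested via --shm-size (undersized shared memory makes dataloader workers fail in cryptic ways). A minimal sketch:

```shell
# Confirm GPU visibility (skipped quietly where the NVIDIA tooling
# isn't on PATH) and the /dev/shm size granted by --shm-size.
if command -v nvidia-smi >/dev/null; then nvidia-smi -L; fi
df -h /dev/shm | tail -n 1
```

If the df line shows the tmpfs default (often 64M) instead of 16G, the --shm-size flag didn't take and the run will fall over once workers start sharing tensors.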

Graceful stop: kill in-container training processes

docker exec -it llamafactory bash -lc '
  echo "Killing training...";
  # pkill exits non-zero when nothing matches; || true keeps the exec green
  pkill -f "torchrun|deepspeed|accelerate|src/train.py" || true;
  sleep 1;
  # confirm nothing is left running
  pgrep -fal "torchrun|deepspeed|accelerate|src/train.py" || true
'
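pkill sends SIGTERM by default, which gives the trainers a chance to flush checkpoints; only escalate to SIGKILL when processes linger. A sketch of that escalation as a reusable helper (the function name and grace period are illustrative, not from the original notes):

```shell
# Send SIGTERM to everything matching $1, wait up to $2 seconds
# (default 10), then SIGKILL whatever is still alive.
soft_kill() {
  pattern=$1; grace=${2:-10}
  pkill -f "$pattern" || return 0            # nothing matched: done
  for _ in $(seq "$grace"); do
    pgrep -f "$pattern" >/dev/null || return 0   # all exited cleanly
    sleep 1
  done
  pkill -9 -f "$pattern" || true             # hard-kill the stragglers
}
```

Inside the container this would be called as `soft_kill 'torchrun|deepspeed|accelerate|src/train.py' 15`.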

Permissions and quick NFS sanity

# blunt, but unblocks writes from every node/UID into the shared run dir
sudo chmod -R 777 /nfs/deepseek_v3_lora_prod_run_5

# NFS visibility
dmesg | tail -n 200 | sed -n '/nfs\|rpc\|lockd\|sunrpc/Ip'
sudo journalctl -k --no-pager -n 200 | sed -n '/nfs\|rpc\|lockd\|sunrpc/Ip'
showmount -e 10.18.126.17 2>&1 | sed -n '1,120p'
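The log greps tell you the mount negotiated; a write/read round-trip tells you training can actually checkpoint there. A sketch with the mount point parameterized (NFS_DIR and the probe filename are my assumptions; I mounted at /nfs, adjust to your layout):

```shell
# Round-trip a marker file through the shared mount. NFS_DIR and the
# probe name are illustrative -- point NFS_DIR at your real mount.
NFS_DIR=${NFS_DIR:-/nfs}
[ -d "$NFS_DIR" ] || NFS_DIR=$(mktemp -d)   # fallback for off-cluster dry runs
probe="$NFS_DIR/.nfs_probe_$(hostname)_$$"
echo "ok-$(date +%s)" > "$probe"
sync
roundtrip=$(cat "$probe")   # on a second node, cat the same path to confirm visibility
echo "$roundtrip"
rm -f "$probe"
```

Reading the same path back from a second node is the part that matters: it proves attribute caching and export options aren't hiding fresh writes from peers.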

Find largest files when space gets tight

sudo find / -xdev -type f -printf '%s\t%p\n' 2>/dev/null \
| sort -nr | head -n 50 \
| numfmt --to=iec --suffix=B --field=1 --padding=7
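Individual large files are half the story; on Docker hosts most of the space tends to hide in image layers and per-directory sprawl. Two complements I'd add (run the du with sudo on a real host for full coverage; the docker line assumes the CLI is on the host):

```shell
# Largest directories, same one-filesystem idea as -xdev above.
du -xh --max-depth=3 / 2>/dev/null | sort -rh | head -n 25

# What Docker itself holds: image layers, containers, volumes, build cache.
if command -v docker >/dev/null; then docker system df; fi
```

When `docker system df` shows reclaimable space, `docker system prune` (and `docker image prune -a` for unused images) gets it back.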

These basics made container bring-up reproducible across nodes and kept the runtime surface small when debugging training issues.