DevOps Deep Dive: torchrun for Stable Multi-Node SFT

Exact environment and launch patterns from production torchrun scripts, including DDP probe and ZeRO-3 settings.

This collects the stable torchrun environment and launch patterns I used across three- and four-node runs. The snippets are taken from the production scripts in my notes repo.

Environment that eliminated surprises

# GPU visibility and world-size bookkeeping
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export LOCAL_WORLD_SIZE=${NPROC_PER_NODE}
export WORLD_SIZE=$(( ${NNODES} * ${NPROC_PER_NODE} ))

# Rendezvous endpoint and socket-only transport
# (IB, P2P, and SHM disabled until the smoke tests pass)
export MASTER_ADDR=10.18.122.130
export MASTER_PORT=39500
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
export NCCL_SOCKET_FAMILY=AF_INET
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export NCCL_SHM_DISABLE=1

# Verbose diagnostics and generous collective timeouts (5400000 ms = 90 min)
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export NCCL_TIMEOUT_MS=5400000
export TORCH_NCCL_TIMEOUT_MS=5400000

# Allocator behavior, CPU threading, and faster HF weight loading
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True,garbage_collection_threshold:0.8"
export OMP_NUM_THREADS=1
export HF_ENABLE_PARALLEL_LOADING=true
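The block above assumes NNODES, NPROC_PER_NODE, and NODE_RANK are already defined, typically by the scheduler or a per-node wrapper. A minimal sketch of that prelude (the values here are illustrative, not from the production scripts) might look like:

```shell
#!/usr/bin/env bash
# Hypothetical prelude: in production these come from the scheduler
# or a per-node launch wrapper, not hard-coded values.
NNODES=4                   # total machines in the job
NPROC_PER_NODE=8           # one process per GPU (matches CUDA_VISIBLE_DEVICES=0..7)
NODE_RANK=${NODE_RANK:-0}  # 0 on the master node, 1..NNODES-1 elsewhere

export NNODES NPROC_PER_NODE NODE_RANK
export LOCAL_WORLD_SIZE=${NPROC_PER_NODE}
export WORLD_SIZE=$(( NNODES * NPROC_PER_NODE ))

echo "node ${NODE_RANK}/${NNODES}: WORLD_SIZE=${WORLD_SIZE}"
```

Getting these three values consistent on every node is most of the battle; a mismatched WORLD_SIZE is the usual cause of a silent hang at rendezvous.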

One-proc DDP/NCCL probe

torchrun \
  --nnodes="$NNODES" \
  --nproc_per_node=1 \
  --node_rank="$NODE_RANK" \
  --master_addr="$MASTER_ADDR" \
  --master_port="$MASTER_PORT" \
  --max_restarts=0 \
  /tmp/ddp_nccl_probe.py
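The probe script itself is not reproduced above. A minimal stand-in (hypothetical — the real /tmp/ddp_nccl_probe.py may do more, e.g. bandwidth checks or per-rank logs) just initializes the process group and runs one all-reduce, written out via heredoc so each node can generate it locally:

```shell
# Hypothetical generator for a minimal DDP/NCCL probe.
cat > /tmp/ddp_nccl_probe.py <<'PY'
import os
import torch
import torch.distributed as dist

# torchrun sets RANK / WORLD_SIZE / LOCAL_RANK in the environment.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# One all-reduce: if this returns, rendezvous and the NCCL transport both work.
t = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(t)
print(f"rank {dist.get_rank()}/{dist.get_world_size()} sum={t.item()}")
dist.destroy_process_group()
PY
```

If the probe hangs instead of printing, the NCCL_DEBUG=INFO output from the environment block usually points at the failing transport or interface.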

Training launch (LoRA SFT)

torchrun \
  --nnodes="$NNODES" \
  --nproc_per_node="$NPROC_PER_NODE" \
  --node_rank="$NODE_RANK" \
  --master_addr="$MASTER_ADDR" \
  --master_port="$MASTER_PORT" \
  --max_restarts=0 \
  src/train.py \
  --stage sft \
  --do_train \
  --model_name_or_path /nfs/DeepSeek-V3-bf16 \
  --dataset all_creator_training \
  --template default \
  --finetuning_type lora \
  --lora_target self_attn.q_a_proj,self_attn.q_b_proj,self_attn.kv_a_proj_with_mqa,self_attn.kv_b_proj,self_attn.o_proj \
  --lora_rank 16 \
  --lora_alpha 32 \
  --output_dir "$OUTPUT_DIR" \
  --overwrite_output_dir \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-5 \
  --adam_beta2 0.98 \
  --weight_decay 0.01 \
  --warmup_steps 100 \
  --bf16 \
  --deepspeed "$DS_CONFIG" \
  --logging_steps 1 \
  --save_strategy steps \
  --save_steps 50 \
  --save_on_each_node false \
  --save_safetensors true \
  --save_only_model true \
  --max_steps 2000 \
  --report_to tensorboard \
  --logging_dir "$OUTPUT_DIR/logs"
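The launch references $DS_CONFIG but the file is not shown. A ZeRO-3 fragment consistent with the flags above (bf16, batch sizes deferred to the trainer via "auto") might look like this — treat every value as an illustrative default, not the production file:

```shell
# Hypothetical ZeRO-3 config; values are illustrative, not the production file.
cat > /tmp/ds_zero3.json <<'JSON'
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto"
}
JSON
export DS_CONFIG=/tmp/ds_zero3.json
```

The "auto" entries let the HF Trainer fill in the values from its own CLI flags, which avoids the classic mismatch error between the DeepSpeed JSON and the trainer arguments.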

These settings tracked my most stable runs; InfiniBand can be enabled later by flipping NCCL_IB_DISABLE=0 and NCCL_P2P_DISABLE=0 once the socket-only smoke tests come back clean.
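Concretely, flipping back the socket-only overrides from the environment block looks like this (NCCL_IB_HCA is a hypothetical, cluster-specific addition — omit it to let NCCL autodetect the adapters):

```shell
# Re-enable InfiniBand, P2P, and shared-memory transports once the
# socket-only probe passes; re-run the probe after each change.
export NCCL_IB_DISABLE=0
export NCCL_P2P_DISABLE=0
export NCCL_SHM_DISABLE=0
# Optional and cluster-specific; omit to let NCCL autodetect the HCAs.
# export NCCL_IB_HCA=mlx5_0,mlx5_1
```

Re-running the one-proc probe between each flip isolates which transport, if any, is the unstable one.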