DevOps Deep Dive: torchrun for Stable Multi-Node SFT
Exact environment and launch patterns from production torchrun scripts, including DDP probe and ZeRO-3 settings.
This post collects the stable torchrun environment variables and launch patterns I used across three- and four-node clusters. Snippets are taken from the production scripts in my notes repo.
Environment that eliminated surprises
# All eight GPUs per node, ordered by PCI bus ID so device indices match nvidia-smi
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export CUDA_DEVICE_ORDER=PCI_BUS_ID
# Topology; torchrun derives these itself, but exporting keeps helper scripts consistent
export LOCAL_WORLD_SIZE=${NPROC_PER_NODE}
export WORLD_SIZE=$(( ${NNODES} * ${NPROC_PER_NODE} ))
export MASTER_ADDR=10.18.122.130
export MASTER_PORT=39500
# Pin rendezvous and collective traffic to the Ethernet interface, IPv4 only
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
export NCCL_SOCKET_FAMILY=AF_INET
# Start conservative: plain TCP sockets, no P2P/InfiniBand/shared-memory transports
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export NCCL_SHM_DISABLE=1
# Verbose diagnostics while bringing the cluster up
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
# 90-minute collective timeout so slow checkpoint saves don't abort the job
export NCCL_TIMEOUT_MS=5400000
export TORCH_NCCL_TIMEOUT_MS=5400000
# Allocator tuning against fragmentation; one OpenMP thread per rank
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True,garbage_collection_threshold:0.8"
export OMP_NUM_THREADS=1
# Parallel shard loading in transformers
export HF_ENABLE_PARALLEL_LOADING=true
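Before launching anything, a small preflight check on each node catches the most common env mistakes: WORLD_SIZE arithmetic that disagrees with NNODES and NPROC_PER_NODE, or a master endpoint the worker can't reach. This is a minimal sketch assuming the exports above are in place; the helper names are mine, not from the production scripts.

```python
import os
import socket


def check_topology(env: dict) -> int:
    """Return the expected world size, raising if the exports disagree."""
    nnodes = int(env["NNODES"])
    nproc = int(env["NPROC_PER_NODE"])
    world = int(env["WORLD_SIZE"])
    if world != nnodes * nproc:
        raise ValueError(f"WORLD_SIZE={world}, expected {nnodes * nproc}")
    port = int(env["MASTER_PORT"])
    if not 1024 <= port <= 65535:
        raise ValueError(f"MASTER_PORT={port} out of range")
    return world


def master_reachable(addr: str, port: int, timeout: float = 3.0) -> bool:
    """True if a plain TCP connection to the rendezvous endpoint succeeds."""
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        return False


# Guarded so the module is importable without the cluster exports set
if __name__ == "__main__" and "NNODES" in os.environ:
    world = check_topology(dict(os.environ))
    print(f"topology OK, world size {world}")
    # Worker nodes only: rank 0 has nothing listening until torchrun starts
    if os.environ.get("NODE_RANK", "0") != "0":
        ok = master_reachable(os.environ["MASTER_ADDR"], int(os.environ["MASTER_PORT"]))
        print(f"master reachable: {ok}")
```

Run it on every node before the probe; a reachability failure here is almost always a firewall or wrong-interface problem, not an NCCL one.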
One-proc DDP/NCCL probe
torchrun \
--nnodes="$NNODES" \
--nproc_per_node=1 \
--node_rank="$NODE_RANK" \
--master_addr="$MASTER_ADDR" \
--master_port="$MASTER_PORT" \
--max_restarts=0 \
/tmp/ddp_nccl_probe.py
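My notes don't reproduce /tmp/ddp_nccl_probe.py verbatim, so the sketch below shows the shape of a minimal all-reduce check rather than the exact production file. One process per node is enough to exercise the cross-node NCCL path; every rank all-reduces its own rank ID and asserts the sum matches.

```python
import os


def expected_rank_sum(world_size: int) -> int:
    """Sum of ranks 0..world_size-1: what an all-reduce over rank IDs must yield."""
    return world_size * (world_size - 1) // 2


def main() -> None:
    # torch imported lazily so the module stays importable where torch is absent
    import torch
    import torch.distributed as dist

    # Reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE from the env torchrun sets
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    # All-reduce each rank's ID; every rank should land on the same sum
    t = torch.tensor([float(rank)], device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    assert int(t.item()) == expected_rank_sum(world), f"rank {rank}: got {t.item()}"
    print(f"rank {rank}/{world}: all_reduce OK ({int(t.item())})")
    dist.destroy_process_group()


# RANK is set by torchrun, so this only fires under a real launch
if __name__ == "__main__" and "RANK" in os.environ:
    main()
```

If this hangs instead of failing fast, check NCCL_DEBUG=INFO output for the interface each rank bound to; a rank that picked the wrong NIC will stall in init_process_group.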
Training launch (LoRA SFT)
torchrun \
--nnodes="$NNODES" \
--nproc_per_node="$NPROC_PER_NODE" \
--node_rank="$NODE_RANK" \
--master_addr="$MASTER_ADDR" \
--master_port="$MASTER_PORT" \
--max_restarts=0 \
src/train.py \
--stage sft \
--do_train \
--model_name_or_path /nfs/DeepSeek-V3-bf16 \
--dataset all_creator_training \
--template default \
--finetuning_type lora \
--lora_target self_attn.q_a_proj,self_attn.q_b_proj,self_attn.kv_a_proj_with_mqa,self_attn.kv_b_proj,self_attn.o_proj \
--lora_rank 16 \
--lora_alpha 32 \
--output_dir "$OUTPUT_DIR" \
--overwrite_output_dir \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--learning_rate 1e-5 \
--adam_beta2 0.98 \
--weight_decay 0.01 \
--warmup_steps 100 \
--bf16 \
--deepspeed "$DS_CONFIG" \
--logging_steps 1 \
--save_strategy steps \
--save_steps 50 \
--save_on_each_node false \
--save_safetensors true \
--save_only_model true \
--max_steps 2000 \
--report_to tensorboard \
--logging_dir "$OUTPUT_DIR/logs"
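The ZeRO-3 settings ride in via $DS_CONFIG. The exact production JSON isn't reproduced in my notes, so below is a hedged sketch generated from Python; the bf16 flag and the "auto" batch fields line up with the trainer flags above (the HF Trainer fills "auto" in from its own arguments), while the stage-3 bucket and threshold values are starting points rather than tuned numbers.

```python
import json

# Sketch of a ZeRO-3 config consistent with the launch flags above.
# "auto" fields are resolved by the HF Trainer's DeepSpeed integration.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        # Gather full bf16 weights on save so checkpoints are usable standalone
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

if __name__ == "__main__":
    with open("ds_z3_config.json", "w") as f:
        json.dump(ds_config, f, indent=2)
    print("wrote ds_z3_config.json")
```

Point DS_CONFIG at the emitted file; keeping the config generated rather than hand-edited makes it easy to diff runs.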
These settings tracked my most stable runs. InfiniBand and P2P can be re-enabled later by setting NCCL_IB_DISABLE=0 and NCCL_P2P_DISABLE=0 once the TCP-only smoke tests are clean.