Training DeepSeek V3 on 24× A100s — Part 8: Adapting the Run to DeepSeek‑V3.1

Switching the run to V3.1 mainly requires pointing at the new weights; below are the concrete config changes and the architectural details captured from the model config.

DeepSeek‑V3.1 drops into the same training stack with minimal changes. This post highlights what changes (essentially just the model path) and what I verified from the model configuration you’ll actually run.

What changes for V3.1

  • Model path changes from /nfs/DeepSeek-V3-bf16 to /nfs/DeepSeek-V3.1-bf16
  • The Ray orchestration, NCCL setup, DeepSpeed ZeRO‑3, and LoRA save patch remain identical
  • LoRA targets continue to use the DeepSeek attention module names (e.g., self_attn.q_a_proj, self_attn.kv_a_proj_with_mqa, …) as in the Ray script; a quick way to confirm they exist in V3.1 is sketched below
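
Because this check only needs module names, it can run without loading any weights. A minimal sketch on the meta device, assuming a transformers build that recognizes model_type deepseek_v3 (trust_remote_code falls back to the repo’s modeling code otherwise; if meta-device init trips on a custom architecture, accelerate’s init_empty_weights is the alternative):

import torch
from transformers import AutoConfig, AutoModelForCausalLM

cfg = AutoConfig.from_pretrained("/nfs/DeepSeek-V3.1-bf16", trust_remote_code=True)

# Instantiate the architecture on the meta device: no memory is allocated,
# but the module tree (and hence every module name) is fully materialized.
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(cfg, trust_remote_code=True)

targets = {"q_a_proj", "q_b_proj", "kv_a_proj_with_mqa", "kv_b_proj", "o_proj"}
present = {name.rsplit(".", 1)[-1] for name, _ in model.named_modules()}
print("missing LoRA targets:", (targets - present) or "none")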

A representative torchrun launch (same as Part 7 with only the model path updated):

torchrun \
  --nnodes="$NNODES" \
  --nproc_per_node="$NPROC_PER_NODE" \
  --node_rank="$NODE_RANK" \
  --master_addr="$MASTER_ADDR" \
  --master_port="$MASTER_PORT" \
  --max_restarts=0 \
  src/train.py \
  --stage sft \
  --do_train \
  --model_name_or_path /nfs/DeepSeek-V3.1-bf16 \
  --dataset all_creator_training \
  --template default \
  --finetuning_type lora \
  --preprocessing_num_workers 4 \
  --overwrite_cache false \
  --lora_target self_attn.q_a_proj,self_attn.q_b_proj,self_attn.kv_a_proj_with_mqa,self_attn.kv_b_proj,self_attn.o_proj \
  --lora_rank 16 \
  --lora_alpha 32 \
  --output_dir "$OUTPUT_DIR" \
  --overwrite_output_dir \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-5 \
  --adam_beta2 0.98 \
  --weight_decay 0.01 \
  --warmup_steps 100 \
  --bf16 \
  --deepspeed "$DS_CONFIG" \
  --logging_steps 1 \
  --save_strategy steps \
  --save_steps 50 \
  --save_on_each_node false \
  --save_safetensors true \
  --save_only_model true \
  --max_steps 2000 \
  --report_to tensorboard \
  --logging_dir "$OUTPUT_DIR/logs"

Architecture notes from the V3.1 model config

From the provided config.json for the DeepSeek‑V3 family (a subset of fields):

{
  "architectures": ["DeepseekV3ForCausalLM"],
  "model_type": "deepseek_v3",
  "num_hidden_layers": 61,
  "hidden_size": 7168,
  "num_attention_heads": 128,
  "num_key_value_heads": 128,
  "n_routed_experts": 256,
  "n_shared_experts": 1,
  "num_experts_per_tok": 8,
  "topk_method": "noaux_tc",
  "topk_group": 4,
  "routed_scaling_factor": 2.5,
  "moe_layer_freq": 1,
  "rope_scaling": { "type": "yarn", "factor": 40, "beta_fast": 32, "beta_slow": 1 },
  "torch_dtype": "bfloat16"
}

Implications for training:

  • MoE remains: 256 routed experts + 1 shared expert, with 8 experts active per token
  • The routing method is indicated as noaux_tc (auxiliary‑loss‑free top‑k with group routing)
  • RoPE uses YaRN scaling parameters; context scaling is built into the config
  • BF16 is the default dtype; matches our run settings

These fields confirm that the V3.1 drop‑in continues to use the same major architectural assumptions as V3 for our LoRA SFT runs. The LoRA target modules and ZeRO‑3 settings used in Parts 2 and 7 remain compatible.
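
Before launching, I like to re-read these fields straight from the weights directory rather than trusting a pasted excerpt. A minimal sanity-check sketch (stdlib only; the expected values mirror the config excerpt above):

import json
from pathlib import Path

cfg = json.loads(Path("/nfs/DeepSeek-V3.1-bf16/config.json").read_text())

# Values from the excerpt above; a mismatch means the weights on disk
# are not what the run assumes.
expected = {
    "model_type": "deepseek_v3",
    "num_hidden_layers": 61,
    "hidden_size": 7168,
    "n_routed_experts": 256,
    "num_experts_per_tok": 8,
    "topk_method": "noaux_tc",
    "torch_dtype": "bfloat16",
}

for key, want in expected.items():
    got = cfg.get(key)
    status = "OK      " if got == want else "MISMATCH"
    print(f"{status} {key}: expected {want!r}, got {got!r}")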

Filesystem expectations

Ensure the weights are available at:

/nfs/DeepSeek-V3.1-bf16
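
A quick pre-flight check is cheaper than a failed 24-GPU launch. A sketch assuming the usual sharded safetensors layout (config.json, *.safetensors shards, and an index file):

from pathlib import Path

model_dir = Path("/nfs/DeepSeek-V3.1-bf16")

# Fail fast before a multi-node launch wastes a scheduling slot.
assert model_dir.is_dir(), f"model directory missing: {model_dir}"
assert (model_dir / "config.json").is_file(), "config.json not found"

shards = sorted(model_dir.glob("*.safetensors"))
total_gb = sum(p.stat().st_size for p in shards) / 1e9
print(f"{len(shards)} safetensors shards, {total_gb:.1f} GB total")

# The shard index should accompany a sharded checkpoint; its absence
# usually indicates an incomplete copy.
if not (model_dir / "model.safetensors.index.json").is_file():
    print("warning: model.safetensors.index.json missing")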

The output directory stays on shared or node‑local storage, whichever you chose earlier (I keep logs under $OUTPUT_DIR/logs and verify adapters post‑run).

Post‑run verification

The Ray driver’s verification step remains useful: it scans checkpoint-* directories, validates that adapter_model.safetensors exists and contains non‑empty tensors, and writes adapter_debug_info.json for quick triage.
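
For reference, a standalone approximation of that check (not the Ray driver’s exact code; OUTPUT_DIR below is a hypothetical path, substitute your own):

import json
from pathlib import Path

from safetensors import safe_open  # pip install safetensors

OUTPUT_DIR = Path("/nfs/outputs/v31-lora")  # hypothetical; use your $OUTPUT_DIR

report = {}
for ckpt in sorted(OUTPUT_DIR.glob("checkpoint-*")):
    adapter = ckpt / "adapter_model.safetensors"
    if not adapter.is_file():
        report[ckpt.name] = {"ok": False, "reason": "adapter_model.safetensors missing"}
        continue
    with safe_open(str(adapter), framework="np") as f:
        keys = list(f.keys())
        # Zero-element tensors usually mean a failed ZeRO-3 gather at save time.
        empty = [k for k in keys if f.get_tensor(k).size == 0]
    report[ckpt.name] = {
        "ok": bool(keys) and not empty,
        "num_tensors": len(keys),
        "empty_tensors": empty,
    }

# Mirror of the driver's adapter_debug_info.json for quick triage.
(OUTPUT_DIR / "adapter_debug_info.json").write_text(json.dumps(report, indent=2))
print(json.dumps(report, indent=2))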