DeepSeek-V3 training posts


DeepSeek-V3: A Technical Deep Dive into the Innovations Reshaping Open-Source LLMs
DeepSeek-V3 represents a monumental achievement in open-source language models, delivering GPT-4-level performance at a fraction of the usual training cost. This 671B-parameter Mixture-of-Experts (MoE) model, which activates only 37B parameters per token, introduces several groundbreaking innovations that deserve careful technical examination.
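To make the "37B of 671B activated" point concrete, here is a minimal, illustrative top-k routing sketch in PyTorch. It is not DeepSeek-V3's actual router (V3 uses sigmoid affinities plus a bias term for auxiliary-loss-free load balancing, with 256 routed experts and top-8 selection); the dimensions, expert count, and top_k below are toy placeholders.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Illustrative top-k MoE layer: each token runs through only k experts,
    so compute per token scales with k, not with the total expert count."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x)                # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)      # normalize over the chosen k only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):         # dense loops for clarity, not speed
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e       # tokens that routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

x = torch.randn(16, 64)
print(TinyMoE()(x).shape)  # torch.Size([16, 64])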
Training DeepSeek V3 on 24× A100s — Part 1: Infrastructure, Containers, and Reproducibility
How I stood up a 3-node A100 cluster, containerized LLaMA-Factory, and tamed the orchestration risks that derail multi-node training before it even starts.
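A taste of the kind of pre-flight check Part 1 argues for: a hedged Python sketch that SSHes into each node and compares the bits that most often drift (driver version, GPU count, PyTorch/CUDA versions). The hostnames are placeholders, and passwordless ssh between nodes is assumed.

```python
import subprocess

NODES = ["node-01", "node-02", "node-03"]  # placeholder hostnames

# One-liner executed on every node; any mismatch in its output across
# nodes is a red flag before multi-node training even starts.
PROBE = (
    "python -c \"import torch; "
    "print(torch.__version__, torch.version.cuda, torch.cuda.device_count())\"; "
    "nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1"
)

reports = {}
for node in NODES:
    out = subprocess.run(["ssh", node, PROBE], capture_output=True, text=True, timeout=60)
    reports[node] = out.stdout.strip()

baseline = reports[NODES[0]]
for node, report in reports.items():
    status = "OK" if report == baseline else "MISMATCH"
    print(f"{node}: {status}\n{report}\n")
```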
Training DeepSeek V3 on 24× A100s — Part 2: torchrun and DeepSpeed ZeRO-3
Exact launch commands, DeepSpeed configs, and how ZeRO-3 + MoE make it possible to fine-tune a 671B model stably across 3 nodes.
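A condensed sketch of the kind of ZeRO-3 config Part 2 walks through, written as a Python dict and dumped to JSON for the launcher to pick up. The exact values (batch sizes, offload choices) are placeholders, not the ones from the post.

```python
import json

# Minimal ZeRO-3 sketch: stage-3 partitioning shards params, grads, and
# optimizer state across all GPUs; "auto" lets the trainer fill in values.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "offload_optimizer": {"device": "cpu"},   # placeholder choice
        "offload_param": {"device": "cpu"},       # placeholder choice
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

with open("ds_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# Launched on each node with something like (node_rank varies per host):
#   torchrun --nnodes 3 --nproc_per_node 8 --node_rank 0 \
#            --master_addr node-01 --master_port 29500 \
#            src/train.py --deepspeed ds_zero3.json ...
```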
Training DeepSeek V3 on 24× A100s — Part 3: CUDA, Drivers, and Fabric Manager (802)
Diagnosing cudaGetDeviceCount -> error 802 on NVSwitch systems: aligning kernel, driver, and Fabric Manager branches across nodes without bricking boxes.
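The shape of Part 3's triage as a small Python sketch: CUDA error 802 ("system not yet initialized") on NVSwitch boxes usually means nvidia-fabricmanager is stopped, or its package comes from a different driver branch than the kernel module. The commands below are the standard ones; run this on each node.

```python
import subprocess

def sh(cmd):
    r = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return r.stdout.strip() or r.stderr.strip()

# 1. Can CUDA even initialize? Error 802 surfaces right here.
try:
    import torch
    torch.zeros(1, device="cuda")   # forces CUDA init
    print("CUDA init OK,", torch.cuda.device_count(), "GPUs visible")
except Exception as e:
    print("CUDA init failed:", e)

# 2. Driver vs. Fabric Manager: the versions must come from the same
#    driver branch, and the service must actually be running.
print("driver :", sh("nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1"))
print("fm pkg :", sh("dpkg -l | grep -i fabricmanager || rpm -qa | grep -i fabric"))
print("fm svc :", sh("systemctl is-active nvidia-fabricmanager"))
```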
Training DeepSeek V3 on 24× A100s — Part 4: NCCL, Networking, and Rank Stability
How I stabilized multi-node rendezvous and NCCL collectives: fixed GPU rank mapping, chose a reliable port, and tamed TCP-only runs without InfiniBand.
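The gist of Part 4's fixes as a hedged sketch: pin the rendezvous endpoint, force NCCL onto a known-good interface, and disable the InfiniBand paths it would otherwise probe on a TCP-only fabric. Interface name, head-node hostname, and port are placeholders.

```python
import os
import torch.distributed as dist

# TCP-only cluster: stop NCCL from probing InfiniBand it does not have,
# and pin it to the NIC that actually routes between nodes.
os.environ["NCCL_IB_DISABLE"] = "1"            # no InfiniBand on this fabric
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"      # placeholder: your inter-node NIC
os.environ["NCCL_DEBUG"] = "INFO"              # verbose logs while stabilizing

# Fixed, firewall-cleared rendezvous endpoint instead of a random port.
os.environ.setdefault("MASTER_ADDR", "node-01")   # placeholder head node
os.environ.setdefault("MASTER_PORT", "29500")     # placeholder fixed port

# RANK / WORLD_SIZE / LOCAL_RANK are injected by torchrun on each process.
dist.init_process_group(backend="nccl")
print("rank", dist.get_rank(), "of", dist.get_world_size(), "up")
dist.destroy_process_group()
```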
Training DeepSeek V3 on 24× A100s — Part 5: Checkpointing, LoRA Saves, and the Janitor
How I avoided 400+ GB checkpoint explosions, fixed empty LoRA saves, and kept NFS from freezing the cluster.
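Two of Part 5's ideas in sketch form: save only the LoRA adapter (megabytes instead of a 400+ GB full state), and a "janitor" that keeps just the newest N checkpoint directories on NFS. Paths, layout, and the keep-count are placeholders; under ZeRO-3 the adapter weights must be gathered at save time or the files come out empty.

```python
import os
import shutil

def janitor(ckpt_root="output/checkpoints", keep=2):
    """Delete all but the newest `keep` checkpoint-* dirs (placeholder layout)."""
    if not os.path.isdir(ckpt_root):
        return
    ckpts = sorted(
        (d for d in os.listdir(ckpt_root) if d.startswith("checkpoint-")),
        key=lambda d: int(d.split("-")[-1]),
    )
    for stale in ckpts[:-keep]:
        shutil.rmtree(os.path.join(ckpt_root, stale))
        print("janitor removed", stale)

# LoRA-only save: with PEFT this writes just the adapter. Under ZeRO-3,
# enable stage3_gather_16bit_weights_on_model_save (see the Part 2 sketch)
# so the sharded adapter tensors are gathered; otherwise the save is empty.
# model.save_pretrained("output/adapter")   # PEFT model: adapter only

janitor()
```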
Training DeepSeek V3 on 24× A100s — Part 6: Prometheus + Grafana Monitoring
Enable Ray metrics, wire up Prometheus, and import the official Grafana dashboard for real-time visibility during DeepSeek training.
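A minimal version of Part 6's wiring, sketched in Python: start Ray on each node with a fixed metrics port (`ray start ... --metrics-export-port=8080`), then write a Prometheus scrape config pointing at each node's exporter. Hostnames and the port are placeholders; the post itself uses Ray's generated config plus the official Grafana dashboard.

```python
import textwrap

NODES = ["node-01", "node-02", "node-03"]  # placeholder hostnames
METRICS_PORT = 8080                        # fixed via --metrics-export-port

# Flow-style YAML list of scrape targets, one per node.
targets = ", ".join(f'"{n}:{METRICS_PORT}"' for n in NODES)

prometheus_yml = textwrap.dedent(f"""\
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: ray
        static_configs:
          - targets: [{targets}]
    """)

with open("prometheus.yml", "w") as f:
    f.write(prometheus_yml)
print(prometheus_yml)
```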
Training DeepSeek V3 on 24× A100s — Part 7: Ray-Orchestrated Training (torchrun under Ray)
Use Ray to prep containers on each node, validate networking, then launch torchrun with DeepSpeed ZeRO-3 and a robust PEFT save patch.
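The core of Part 7 in sketch form: one Ray task per node, pinned with a node-affinity scheduling strategy, each execing torchrun with its own node_rank. The script path, head address, and GPU counts are placeholders, and a running Ray cluster is assumed.

```python
import subprocess
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init(address="auto")  # attach to the already-running Ray cluster

@ray.remote
def launch(node_rank: int) -> int:
    # torchrun does the real process management; Ray just places one
    # launcher on each node and reports its exit code back.
    cmd = [
        "torchrun", "--nnodes", "3", "--nproc_per_node", "8",
        "--node_rank", str(node_rank),
        "--master_addr", "node-01", "--master_port", "29500",  # placeholders;
        # master_addr must resolve to the host that gets node_rank 0
        "train.py", "--deepspeed", "ds_zero3.json",            # placeholder script
    ]
    return subprocess.run(cmd).returncode

node_ids = [n["NodeID"] for n in ray.nodes() if n["Alive"]]
futures = [
    launch.options(
        scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=nid, soft=False)
    ).remote(rank)
    for rank, nid in enumerate(sorted(node_ids))
]
print("exit codes:", ray.get(futures))
```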
Training DeepSeek V3 on 24× A100s — Part 8: Adapting the Run to DeepSeek-V3.1
Switching the run to V3.1 mainly requires pointing to the new weights; here are the concrete config and architectural differences captured from the model config.
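For the "differences captured from the model config" part, a hedged sketch of how to capture them yourself: load both configs from the Hub and diff the keys. The repo IDs are the public ones at the time of writing; trust_remote_code is needed for DeepSeek's custom config class.

```python
from transformers import AutoConfig

v3 = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V3", trust_remote_code=True)
v31 = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V3.1", trust_remote_code=True)

# Print every config key whose value changed between the two releases.
a, b = v3.to_dict(), v31.to_dict()
for key in sorted(a.keys() | b.keys()):
    if a.get(key) != b.get(key):
        print(f"{key}: {a.get(key)!r} -> {b.get(key)!r}")
```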