DeepSeek-V3 training posts

[Images: Ray Cluster Dashboard for the DeepSeek-V3 run; training run overview; TensorBoard charts for the DeepSeek-V3 SFT run (183 steps)]

DeepSeek-V3: A Technical Deep Dive into the Innovations Reshaping Open-Source LLMs

DeepSeek-V3 represents a monumental achievement in open-source language models, delivering GPT-4 level performance at a fraction of the training cost. This 671B parameter Mixture-of-Experts (MoE) model, with only 37B parameters activated per token, introduces several groundbreaking innovations that deserve careful technical examination.
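To make the sparse-activation arithmetic concrete, here is a minimal top-k routing sketch of the kind MoE layers use: each token's gate scores every expert, but only the k highest-scoring experts actually run, so only a fraction of the total parameters is active per token. The hidden size, expert count, and k below are illustrative placeholders, not DeepSeek-V3's actual gating configuration.

```python
# Illustrative top-k expert routing: only k of n_experts run per token,
# which is how a huge total parameter count yields a small active count.
import torch
import torch.nn as nn

class TopKRouter(nn.Module):
    def __init__(self, hidden_size: int, n_experts: int, k: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, n_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (tokens, hidden) -> affinity score for every expert
        scores = self.gate(x).softmax(dim=-1)
        # Each token keeps only its k best experts; only those experts'
        # parameters participate in the forward pass for this token.
        weights, expert_ids = scores.topk(self.k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        return weights, expert_ids

router = TopKRouter(hidden_size=1024, n_experts=64, k=8)
w, ids = router(torch.randn(4, 1024))  # 4 tokens, 8 experts each
```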

Training DeepSeek V3 on 24× A100s — Part 1: Infrastructure, Containers, and Reproducibility

How I stood up a 3-node A100 cluster, containerized LLaMA-Factory, and tamed the orchestration risks that derail multi-node training before it even starts.
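As a flavor of the preflight checks that post walks through, here is a hedged sketch that verifies every node agrees on driver version and container image digest before anything launches; mismatches here are exactly the orchestration risks that derail multi-node runs. The hostnames and image tag are placeholders.

```python
# Preflight sketch: confirm all nodes run identical drivers and the
# same container image digest. Hostnames/image are placeholders.
import subprocess

NODES = ["node-0", "node-1", "node-2"]   # hypothetical hostnames
IMAGE = "llamafactory:latest"            # hypothetical image tag

def run_on(host: str, cmd: str) -> str:
    out = subprocess.run(["ssh", host, cmd],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def check_uniform(label: str, cmd: str) -> None:
    values = {host: run_on(host, cmd) for host in NODES}
    if len(set(values.values())) != 1:
        raise SystemExit(f"{label} differs across nodes: {values}")
    print(f"{label}: OK ({next(iter(values.values()))})")

check_uniform(
    "NVIDIA driver",
    "nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1")
check_uniform(
    "container image digest",
    f"docker images --digests --format '{{{{.Digest}}}}' {IMAGE}")
```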

Training DeepSeek V3 on 24× A100s — Part 2: torchrun and DeepSpeed ZeRO-3

Exact launch commands, DeepSpeed configs, and how ZeRO-3 + MoE let a 671B model fine-tune stably across 3 nodes.
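For readers who want the shape of the config before opening the post, below is a minimal ZeRO-3 sketch emitted as JSON for the trainer to consume. The values are placeholders rather than the post's exact settings, but the keys are standard DeepSpeed options.

```python
# Minimal ZeRO-3 config sketch (placeholder values, standard keys).
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                    # shard params, grads, optimizer state
        "overlap_comm": True,          # overlap all-gathers with compute
        "contiguous_gradients": True,
        # Consolidate full 16-bit weights on rank 0 at save time:
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

with open("ds_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```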

Training DeepSeek V3 on 24× A100s — Part 3: CUDA, Drivers, and Fabric Manager (802)

Diagnosing cudaGetDeviceCount error 802 on NVSwitch systems: aligning kernel, driver, and Fabric Manager branches across nodes without bricking boxes.
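As a taste of the diagnosis, here is a hedged sketch of the checks that matter: on NVSwitch machines, error 802 ("system not yet initialized") typically means the nvidia-fabricmanager service is stopped or on a different release branch than the driver. The dpkg query assumes a Debian/Ubuntu host, and the branch comparison is a heuristic.

```python
# Sketch: check that Fabric Manager is running and on the same release
# branch as the driver, the usual culprits behind CUDA error 802.
import subprocess

def sh(cmd: list[str]) -> str:
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

driver = sh(["nvidia-smi", "--query-gpu=driver_version",
             "--format=csv,noheader"]).splitlines()[0]
fm_state = sh(["systemctl", "is-active", "nvidia-fabricmanager"])
fm_pkg = sh(["bash", "-c",
             "dpkg -l | grep nvidia-fabricmanager || true"])  # Debian/Ubuntu

print(f"driver version : {driver}")
print(f"fabricmanager  : {fm_state}")
print(fm_pkg)
if fm_state != "active":
    print("Fabric Manager is not running; GPUs will report error 802.")
elif driver.split(".")[0] not in fm_pkg:
    print("Driver and Fabric Manager branches differ; align them first.")
```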

Training DeepSeek V3 on 24× A100s — Part 4: NCCL, Networking, and Rank Stability

How I stabilized multi-node rendezvous and NCCL collectives: fixed GPU rank mapping, chose a reliable port, and tamed TCP-only runs without InfiniBand.
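A condensed sketch of that environment follows; the interface name, head-node address, and port are placeholders for whatever is stable on your network.

```python
# Rendezvous/NCCL environment sketch for a TCP-only cluster (no IB).
import os
import torch.distributed as dist

os.environ["NCCL_IB_DISABLE"] = "1"         # no InfiniBand: force TCP sockets
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"   # pin NCCL to one known-good NIC
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # stable GPU-to-rank mapping
os.environ["MASTER_ADDR"] = "10.0.0.1"      # head-node IP (placeholder)
os.environ["MASTER_PORT"] = "29871"         # fixed, otherwise-unused port

# RANK and WORLD_SIZE are normally injected by torchrun.
dist.init_process_group(backend="nccl",
                        rank=int(os.environ["RANK"]),
                        world_size=int(os.environ["WORLD_SIZE"]))
```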

Training DeepSeek V3 on 24× A100s — Part 5: Checkpointing, LoRA Saves, and the Janitor

How I avoided 400+ GB checkpoint explosions, fixed empty LoRA saves, and kept NFS from freezing the cluster.
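In the spirit of that post, here is a minimal "janitor" sketch: keep the newest N checkpoints and delete the rest so full saves never fill the shared volume. The path, the checkpoint-NNN naming pattern, and the retention count are placeholders.

```python
# Checkpoint janitor sketch: retain the newest KEEP_LAST checkpoints.
import shutil
from pathlib import Path

CKPT_DIR = Path("/mnt/nfs/outputs")   # hypothetical checkpoint root
KEEP_LAST = 2

def prune_checkpoints(root: Path, keep: int) -> None:
    # Assumes HF-Trainer-style "checkpoint-<step>" directories.
    ckpts = sorted(root.glob("checkpoint-*"),
                   key=lambda p: int(p.name.split("-")[-1]))
    for stale in ckpts[:-keep]:
        print(f"removing {stale}")
        shutil.rmtree(stale, ignore_errors=True)

prune_checkpoints(CKPT_DIR, KEEP_LAST)
```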

Training DeepSeek V3 on 24× A100s — Part 6: Prometheus + Grafana Monitoring

Enable Ray metrics, wire up Prometheus, and import the official Grafana dashboard for real-time visibility during DeepSeek training.
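A small sketch of the wiring: Ray exports Prometheus metrics on a configurable port and, in recent versions, publishes a file-based service-discovery manifest that Prometheus can scrape via file_sd_configs. The path below assumes Ray's defaults.

```python
# Check for Ray's Prometheus service-discovery file (default location).
import json
from pathlib import Path

sd_file = Path("/tmp/ray/prom_metrics_service_discovery.json")
if sd_file.exists():
    entries = json.loads(sd_file.read_text())
    print("Prometheus scrape targets published by Ray:")
    for entry in entries:
        print(" ", entry.get("targets"))
else:
    print("No discovery file; start Ray with metrics enabled, e.g.")
    print("  ray start --head --metrics-export-port=8080")
```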

Training DeepSeek V3 on 24× A100s — Part 7: Ray-Orchestrated Training (torchrun under Ray)

Use Ray to prep containers on each node, validate networking, then launch torchrun with DeepSpeed ZeRO-3 and a robust PEFT save patch.
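Here is a condensed sketch of that pattern: pin one Ray task to each node, then exec torchrun from inside it, with rank 0 deliberately placed on the node whose address serves as the rendezvous master. The script name, config path, and port are placeholders.

```python
# One Ray task per node, each exec'ing torchrun (sketch, not the post's
# exact launcher).
import subprocess
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init(address="auto")
nodes = sorted((n for n in ray.nodes() if n["Alive"]),
               key=lambda n: n["NodeManagerAddress"])
master = nodes[0]["NodeManagerAddress"]   # rank 0 lands on this host

@ray.remote(num_gpus=8)                   # claim a whole 8-GPU node
def launch(node_rank: int, nnodes: int, master_addr: str) -> int:
    cmd = [
        "torchrun", f"--nnodes={nnodes}", f"--node_rank={node_rank}",
        "--nproc_per_node=8",
        f"--master_addr={master_addr}", "--master_port=29871",
        "train.py", "--deepspeed", "ds_zero3.json",  # placeholder script/config
    ]
    return subprocess.run(cmd).returncode

refs = [
    launch.options(
        scheduling_strategy=NodeAffinitySchedulingStrategy(
            node_id=node["NodeID"], soft=False)       # pin task to this node
    ).remote(rank, len(nodes), master)
    for rank, node in enumerate(nodes)
]
codes = ray.get(refs)
assert all(c == 0 for c in codes), f"a node failed: {codes}"
```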

Training DeepSeek V3 on 24× A100s — Part 8: Adapting the Run to DeepSeek-V3.1

Switching the run to V3.1 mainly requires pointing to the new weights; here are the concrete config and architectural differences captured from the model config.
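One hedged way to capture those differences yourself is to diff the two published config.json files key by key; the repo IDs below are the public Hugging Face names, so adjust them if you mirror the weights locally.

```python
# Diff the model configs of V3 and V3.1 key by key.
import json
from huggingface_hub import hf_hub_download

def load_config(repo_id: str) -> dict:
    path = hf_hub_download(repo_id, "config.json")
    with open(path) as f:
        return json.load(f)

v3 = load_config("deepseek-ai/DeepSeek-V3")
v31 = load_config("deepseek-ai/DeepSeek-V3.1")

for key in sorted(set(v3) | set(v31)):
    if v3.get(key) != v31.get(key):
        print(f"{key}: {v3.get(key)!r} -> {v31.get(key)!r}")
```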