DeepSeek-V3 training posts

[Image: DeepSeek-V3 Ray Cluster dashboard]
[Image: DeepSeek-V3 training run overview]
[Image: TensorBoard charts for DeepSeek-V3 SFT (183 steps)]

DeepSeek-V3: A Technical Deep Dive into the Innovations Reshaping Open-Source LLMs

DeepSeek-V3 represents a monumental achievement in open-source language models, delivering GPT-4 level performance at a fraction of the training cost. This 671B parameter Mixture-of-Experts (MoE) model, with only 37B parameters activated per token, introduces several groundbreaking innovations that deserve careful technical examination.
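To make that sparsity concrete: a learned router sends each token to only a few expert MLPs, so a small fraction of the total parameters does work per token. Below is a minimal top-k routing sketch in PyTorch with toy sizes and a naive dispatch loop; it illustrates the idea only and is nothing like DeepSeek-V3's actual router.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal top-k MoE layer: each token runs through only k experts,
    so the active parameter count is a small slice of the total."""
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.router(x)                  # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):               # naive dispatch: loop experts per slot
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

x = torch.randn(16, 64)
print(TinyMoE()(x).shape)  # torch.Size([16, 64])
```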

Training DeepSeek V3 on 24× A100s — Part 1: Infrastructure, Containers, and Reproducibility

How I stood up a 3-node A100 cluster, containerized LLaMA-Factory, and tamed the orchestration risks that derail multi-node training before it even starts.
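One habit that pays off before any multi-node launch, sketched here as a minimal fingerprint script (my own illustration, not a LLaMA-Factory feature): run it inside every node's container and diff the outputs before starting a run.

```python
import socket
import torch

def env_fingerprint() -> dict:
    """Version info that should be identical in every node's container
    before a multi-node launch is attempted."""
    return {
        "host": socket.gethostname(),
        "torch": torch.__version__,
        "cuda_runtime": torch.version.cuda,
        "nccl": str(torch.cuda.nccl.version()),
        "visible_gpus": torch.cuda.device_count(),
    }

if __name__ == "__main__":
    print(env_fingerprint())
```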

Training DeepSeek V3 on 24× A100s — Part 2: torchrun and DeepSpeed ZeRO-3

Exact launch commands, DeepSpeed configs, and how ZeRO-3 + MoE let a 671B model fine-tune stably across 3 nodes.
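For orientation, here is a stripped-down version of the two pieces involved: a ZeRO-3 config and the torchrun launch line. The values, script path, and port below are illustrative placeholders; the post walks through the exact commands and configs.

```python
import json

# Minimal DeepSpeed ZeRO-3 config: shard optimizer state, gradients, and
# parameters across all ranks (illustrative values, not the exact config).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

with open("ds_z3_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# Launched on each of the 3 nodes (8 GPUs apiece) with something like:
#   torchrun --nnodes=3 --nproc-per-node=8 --node-rank=$NODE_RANK \
#            --rdzv-backend=c10d --rdzv-endpoint=$HEAD_IP:29500 \
#            src/train.py --deepspeed ds_z3_config.json ...
```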

Training DeepSeek V3 on 24× A100s — Part 3: CUDA, Drivers, and Fabric Manager (802)

Diagnosing cudaGetDeviceCount returning error 802 ("system not yet initialized") on NVSwitch systems: aligning kernel, driver, and Fabric Manager branches across nodes without bricking boxes.
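A quick per-node sanity check in that spirit (a sketch assuming systemd manages nvidia-fabricmanager, which is the usual setup on NVSwitch boxes):

```python
import subprocess
import torch

def check_node():
    """Surface the usual suspects behind CUDA error 802 on NVSwitch machines:
    per-GPU driver version and the nvidia-fabricmanager service state."""
    drivers = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True).stdout.strip().splitlines()
    fm_state = subprocess.run(
        ["systemctl", "is-active", "nvidia-fabricmanager"],
        capture_output=True, text=True).stdout.strip()
    print(f"driver versions:      {set(drivers)}")
    print(f"nvidia-fabricmanager: {fm_state}")  # must be 'active' and on the same branch as the driver
    try:
        print(f"visible GPUs:         {torch.cuda.device_count()}")
        torch.zeros(1, device="cuda")           # forces CUDA init; error 802 surfaces here if FM is broken
        print("CUDA init OK")
    except RuntimeError as e:
        print(f"CUDA init failed: {e}")

if __name__ == "__main__":
    check_node()
```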

Training DeepSeek V3 on 24× A100s — Part 4: NCCL, Networking, and Rank Stability

How I stabilized multi-node rendezvous and NCCL collectives: fixed GPU rank mapping, chose a reliable port, and tamed TCP-only runs without InfiniBand.
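The gist, as a sketch (the NIC name, debug level, and the assumption that torchrun provides the rendezvous variables are placeholders for whatever is stable on your cluster):

```python
import os
import torch
import torch.distributed as dist

# TCP-only NCCL settings for nodes without InfiniBand (illustrative values).
os.environ.setdefault("NCCL_IB_DISABLE", "1")        # skip InfiniBand transports entirely
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # pin NCCL to one known-good NIC
os.environ.setdefault("NCCL_DEBUG", "WARN")          # raise to INFO while debugging rendezvous

def init_distributed() -> tuple[int, int]:
    """Join the env:// rendezvous that torchrun populates
    (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, LOCAL_RANK)."""
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # one process per GPU keeps the rank -> device mapping fixed
    return dist.get_rank(), dist.get_world_size()
```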

Training DeepSeek V3 on 24× A100s — Part 5: Checkpointing, LoRA Saves, and the Janitor

How I avoided 400+ GB checkpoint explosions, fixed empty LoRA saves, and kept NFS from freezing the cluster.
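The "janitor" is essentially a retention loop over checkpoint directories. A minimal sketch, assuming Hugging Face Trainer-style checkpoint-* folders and a hypothetical NFS path:

```python
import re
import shutil
from pathlib import Path

def prune_checkpoints(output_dir: str, keep_last: int = 2) -> None:
    """Keep only the newest `keep_last` checkpoint-* directories so sharded
    saves cannot silently fill the NFS volume (path and policy are illustrative)."""
    ckpts = sorted(
        (p for p in Path(output_dir).glob("checkpoint-*") if p.is_dir()),
        key=lambda p: int(re.search(r"\d+", p.name).group()),
    )
    for old in ckpts[:-keep_last]:
        print(f"removing {old}")
        shutil.rmtree(old)

if __name__ == "__main__":
    prune_checkpoints("/mnt/nfs/deepseek-v3-sft", keep_last=2)
```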