DeepSeek-V3 training posts


DeepSeek-V3: A Technical Deep Dive into the Innovations Reshaping Open-Source LLMs
DeepSeek-V3 represents a monumental achievement in open-source language models, delivering GPT-4-level performance at a fraction of the training cost. This 671B-parameter Mixture-of-Experts (MoE) model, which activates only 37B parameters per token, introduces several innovations that deserve careful technical examination.
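The 671B-total / 37B-active split is a consequence of MoE routing: each token is dispatched to only a handful of experts. The sketch below shows generic top-k routing in PyTorch, not DeepSeek-V3's actual gating code; the hidden size, expert count, and k are illustrative assumptions.

```python
# Generic top-k MoE routing sketch (illustrative dimensions, not DeepSeek-V3's).
import torch
import torch.nn.functional as F

def route_tokens(hidden, gate_weight, k=2):
    """Pick the top-k experts per token and return normalized routing weights."""
    scores = hidden @ gate_weight                 # [tokens, n_experts] affinity scores
    topk_scores, topk_idx = scores.topk(k, dim=-1)
    weights = F.softmax(topk_scores, dim=-1)      # only the selected experts get weight
    return topk_idx, weights                      # each token runs through k experts only

hidden = torch.randn(4, 1024)                     # 4 tokens, hidden size 1024 (illustrative)
gate_weight = torch.randn(1024, 64)               # 64 routed experts (illustrative)
idx, w = route_tokens(hidden, gate_weight, k=2)
print(idx.shape, w.shape)                         # torch.Size([4, 2]) torch.Size([4, 2])
```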
Training DeepSeek V3 on 24× A100s — Part 1: Infrastructure, Containers, and Reproducibility
How I stood up a 3-node A100 cluster, containerized LLaMA-Factory, and tamed the orchestration risks that derail multi-node training before it even starts.
Training DeepSeek V3 on 24× A100s — Part 2: torchrun and DeepSpeed ZeRO-3
Exact launch commands, DeepSpeed configs, and how ZeRO-3 + MoE let a 671B model fine-tune stably across 3 nodes.
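For orientation, here is a minimal sketch of a DeepSpeed ZeRO-3 config of the kind this part walks through, written out from Python; the file name and every value are illustrative assumptions, not the exact settings used on the cluster.

```python
# Illustrative ZeRO-3 config sketch; values are assumptions, not the series' settings.
import json

ds_zero3_config = {
    "train_micro_batch_size_per_gpu": 1,       # keep per-GPU memory pressure low
    "gradient_accumulation_steps": 8,          # recover effective batch size
    "bf16": {"enabled": True},                 # A100s support bf16 natively
    "zero_optimization": {
        "stage": 3,                            # ZeRO-3: shard params, grads, optimizer state
        "overlap_comm": True,                  # overlap all-gather/reduce-scatter with compute
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,  # gather shards when saving
    },
    "gradient_clipping": 1.0,
}

with open("ds_zero3.json", "w") as f:
    json.dump(ds_zero3_config, f, indent=2)
```

With 24 GPUs across 3 nodes, each node would then launch 8 ranks with torchrun (--nnodes=3 --nproc_per_node=8), all pointing at the same rendezvous host and port.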
Training DeepSeek V3 on 24× A100s — Part 3: CUDA, Drivers, and Fabric Manager (802)
Diagnosing cudaGetDeviceCount -> error 802 on NVSwitch systems: aligning kernel, driver, and Fabric Manager branches across nodes without bricking boxes.
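As a rough illustration, the sketch below runs the kind of sanity checks that diagnosis starts from: can CUDA initialize, which driver branch is loaded, and is the Fabric Manager service running. The exact commands and handling are assumptions, not the procedure from the post.

```python
# Quick per-node sanity pass for NVSwitch boxes (illustrative checks).
import subprocess
import torch

try:
    # With Fabric Manager down or on a mismatched driver branch, this often shows
    # 0 GPUs, or the first real CUDA call fails with error 802 ("system not yet initialized").
    print("CUDA devices visible to torch:", torch.cuda.device_count())
    torch.zeros(1).cuda()  # force CUDA context creation
except RuntimeError as err:
    print("CUDA init failed:", err)

# Driver version reported by the kernel module (must match across nodes).
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True).stdout.strip())

# Fabric Manager service must be active and on the same branch as the driver.
print(subprocess.run(
    ["systemctl", "is-active", "nvidia-fabricmanager"],
    capture_output=True, text=True).stdout.strip())
```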
Training DeepSeek V3 on 24× A100s — Part 4: NCCL, Networking, and Rank Stability
How I stabilized multi-node rendezvous and NCCL collectives: fixed GPU rank mapping, chose a reliable port, and tamed TCP-only runs without InfiniBand.
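A minimal sketch of the environment such a TCP-only run would export on every node before launching; the interface name, master address, and port are illustrative assumptions.

```python
# Illustrative NCCL / rendezvous environment for a TCP-only multi-node run.
import os

os.environ["NCCL_IB_DISABLE"] = "1"             # no InfiniBand: force TCP sockets
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"       # pin NCCL to the NIC that routes between nodes
os.environ["NCCL_DEBUG"] = "INFO"               # log ring/tree setup to spot interface mismatches
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"  # stable local-rank -> GPU mapping
os.environ["MASTER_ADDR"] = "10.0.0.1"          # rank-0 host (placeholder)
os.environ["MASTER_PORT"] = "29501"             # a port known to be free on every node
```

Pinning NCCL_SOCKET_IFNAME to the NIC that actually routes between nodes, and settling on one known-free MASTER_PORT, removes two of the most common causes of hung rendezvous.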
Training DeepSeek V3 on 24× A100s — Part 5: Checkpointing, LoRA Saves, and the Janitor
How I avoided 400+ GB checkpoint explosions, fixed empty LoRA saves, and kept NFS from freezing the cluster.
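A minimal sketch of the adapter-only save that keeps checkpoints in the megabyte range, assuming a PEFT-wrapped model; the base model id, target modules, and output path are placeholders.

```python
# Adapter-only save sketch (placeholder model id, modules, and paths).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("some/base-model")  # placeholder
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora_cfg)

# ... training ...

# Saves only the adapter weights plus adapter_config.json (MBs, not hundreds of GB).
# Under ZeRO-3 the sharded adapter weights must be gathered to rank 0 before saving,
# otherwise the saved adapter can come out empty -- the failure mode this part describes.
model.save_pretrained("outputs/lora-adapter")
```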