Training DeepSeek V3 on 24× A100s — Part 5: Checkpointing, LoRA Saves, and the Janitor
How I avoided 400+ GB checkpoint explosions, fixed empty LoRA saves, and kept NFS from freezing the cluster.
Saving models sounds trivial until ZeRO-3 shards and optimizer states enter the chat. I learned to be opinionated about what gets saved, where it gets saved, and what gets deleted immediately after.
The problem: DeepSpeed wants to save everything
My early runs wrote gargantuan checkpoint trees inside Docker’s overlay filesystem:
/var/lib/docker/overlay2/.../diff/tmp/deepseek_v3_lora_smoketest/checkpoint-2/global_step2/
zero_pp_rank_0_mp_rank_00_model_states.pt (~53 GB)
zero_pp_rank_1_mp_rank_00_model_states.pt (~53 GB)
...
That directory ballooned past 400 GB on a single host. Two compounding issues:
- Saving optimizer state and ZeRO shards I didn’t need for a LoRA-only run
- Writing into container overlay instead of fast node-local storage
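If you want to catch this early, the bloat is visible from the host well before it becomes an outage. A rough check, assuming the default overlay2 storage root (adjust to whatever docker info reports):

# Which containers are eating disk, and is the checkpoint tree the culprit?
docker system df -v | head -n 40
sudo du -sh /var/lib/docker/overlay2/*/diff/tmp/deepseek_v3_lora_smoketest 2>/dev/null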
What I saved (and what I didn’t)
I made saves strictly adapter-centric:
--save_only_model true
--save_safetensors true
--save_on_each_node false
--output_dir /host_home/deepseek_v3_lora_prod_run_1
And in DeepSpeed, I ensured the 16-bit model weights are gathered on save; without this (or with a LoRA target list that doesn’t match the model’s module names, more on that below), PEFT can write an essentially empty adapter:
"stage3_gather_16bit_weights_on_model_save": true
I also fixed my LoRA target list to match DeepSeek’s module names:
--lora_target q_proj,v_proj,k_proj,o_proj
This combination solved the “saved nothing” issue and kept artifacts small.
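A quick way to confirm the fix stuck: PEFT should leave an adapter_config.json and an adapter_model.safetensors in each checkpoint directory, and the safetensors file should have real size. Paths below assume the output_dir shown above:

CKPT_DIR=/host_home/deepseek_v3_lora_prod_run_1   # matches --output_dir above
ls -lh "$CKPT_DIR"/checkpoint-*/adapter_config.json \
       "$CKPT_DIR"/checkpoint-*/adapter_model.safetensors
# A healthy LoRA save is typically tens to hundreds of MB, depending on rank
# and target modules; a few-KB adapter_model.safetensors usually means
# nothing was gathered.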
Don’t write checkpoints to NFS during training
NFS worked great for models and datasets. It did not enjoy checkpoint storms. I pointed --output_dir to node-local storage (mounted in the container at /host_home/...) and synced artifacts later.
When I accidentally saved to NFS, the head node stalled so badly that even SSH banner exchange timed out while TCP connects succeeded — classic IO saturation.
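The “sync later” part is one burst after the run (or between save points) instead of a live write path. A minimal sketch; the NFS destination is illustrative, and the exclude just keeps any stray ZeRO debris from making the trip:

# One-shot copy of adapters from node-local disk to shared storage.
rsync -a --exclude='global_step*' \
  /host_home/deepseek_v3_lora_prod_run_1/ \
  /mnt/nfs/checkpoints/deepseek_v3_lora_prod_run_1/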
The janitor: deleting the right files constantly
Even with --save_only_model, libraries can still emit ZeRO/optimizer artifacts at times. I ran a tiny background script inside each container to nuke anything I didn’t want, every few seconds.
#!/usr/bin/env bash
set -euo pipefail
DIR="${OUTPUT_DIR:-/app/local_checkpoints/deepseek_v3_lora_smoketest}"
mkdir -p "$DIR"   # the trainer may not have created it yet; a missing dir would trip set -e
while true; do
  # Sweep everything that is not an adapter: ZeRO shards, optimizer, scheduler, RNG state.
  find "$DIR" -type f -name "*_optim_states.pt" -delete 2>/dev/null || true
  find "$DIR" -type f -name "*_model_states.pt" -delete 2>/dev/null || true
  find "$DIR" -type f -name "optimizer.pt" -delete 2>/dev/null || true
  find "$DIR" -type f -name "scheduler.pt" -delete 2>/dev/null || true
  find "$DIR" -type f -name "rng_state.pth" -delete 2>/dev/null || true
  find "$DIR" -type f -name "zero_pp_rank_*_model_states.pt" -delete 2>/dev/null || true
  sleep 3
done
I launched it alongside training:
nohup /tmp/ckpt_janitor.sh >/tmp/ckpt_janitor.log 2>&1 &
Yes, it’s blunt. It prevented both disk exhaustion and IO-related stalls.
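One caveat: nothing stops the janitor for you. Something along these lines at the end of a run (or folded into the cleanup script below) keeps stray copies from piling up:

pkill -f ckpt_janitor.sh || true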
Clean restarts without ghosts
After failures, I used a consistent cleanup sequence across nodes to remove distributed state and free the rendezvous port:
sudo pkill -9 -f torchrun || true      # kill any lingering launchers
sudo pkill -9 -f train.py || true      # ...and trainer processes
sudo nvidia-smi --gpu-reset || true    # best-effort GPU reset
sudo rm -rf /tmp/.torch_distributed_* /dev/shm/torch_* /tmp/cuda_* ~/.nv/   # stale distributed/shm/CUDA temp state and the NVIDIA cache
sudo fuser -k 39500/tcp 2>/dev/null || true   # free the rendezvous port
I wrapped this in cleanup_post_training_state.sh and kept a separate kill_docker_deepseek_training.sh to kill processes inside containers across nodes in parallel.
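The parallel part is nothing fancy; it takes roughly this shape, with the node list and script path as placeholders for your own inventory:

# Hypothetical hostnames and path; substitute whatever your cluster uses.
NODES="node01 node02 node03"
for n in $NODES; do
  ssh "$n" 'sudo bash /home/ubuntu/cleanup_post_training_state.sh' &
done
wait   # block until every node has finished cleaning up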
What finally worked for me
- Save only adapters, in safetensors format
- Gather 16-bit weights on save for ZeRO-3
- Point output to node-local storage; never NFS during training
- Run the janitor to keep disk usage bounded
- Clean distributed state and free ports before relaunches
With checkpointing tamed, my runs were no longer IO-bound and I could iterate safely on hyperparameters and data.