Training DeepSeek V3 on 24× A100s — Part 5: Checkpointing, LoRA Saves, and the Janitor

How I avoided 400+ GB checkpoint explosions, fixed empty LoRA saves, and kept NFS from freezing the cluster.

Saving models sounds trivial until ZeRO-3 shards and optimizer states enter the chat. I learned to be opinionated about what gets saved, where it gets saved, and what gets deleted immediately after.

The problem: DeepSpeed wants to save everything

My early runs wrote gargantuan checkpoint trees inside Docker’s overlay filesystem:

/var/lib/docker/overlay2/.../diff/tmp/deepseek_v3_lora_smoketest/checkpoint-2/global_step2/
  zero_pp_rank_0_mp_rank_00_model_states.pt  (~53 GB)
  zero_pp_rank_1_mp_rank_00_model_states.pt  (~53 GB)
  ...

That directory ballooned past 400 GB on a single host. Two compounding issues:

  • Saving optimizer state and ZeRO shards I didn’t need for a LoRA-only run
  • Writing into the container's overlay filesystem instead of fast node-local storage
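
If you want to see where the space is going before it becomes an outage, two blunt checks are enough (the overlay2 path assumes Docker's default data root):

# Per-image/container/volume usage as Docker sees it
docker system df
# The actual overlay layers, largest last
sudo du -sh /var/lib/docker/overlay2/* 2>/dev/null | sort -h | tail -n 5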

What I saved (and what I didn’t)

I made saves strictly adapter-centric:

--save_only_model true
--save_safetensors true
--save_on_each_node false
--output_dir /host_home/deepseek_v3_lora_prod_run_1
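
With those flags, a checkpoint directory holds little more than the adapter itself. Roughly (the step number is made up, and the exact file set varies a bit with library versions):

ls /host_home/deepseek_v3_lora_prod_run_1/checkpoint-200/
# adapter_config.json  adapter_model.safetensors  plus small tokenizer/trainer metadata files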

On the DeepSpeed side, I made sure the 16-bit model weights get gathered on save. Without that (or with LoRA target modules that don't match anything in the model), PEFT can write an adapter that is little more than an empty header:

"stage3_gather_16bit_weights_on_model_save": true

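Since forgetting that one line costs a whole run, a pre-flight check in the launch wrapper is cheap insurance (ds_config_zero3.json is just whatever your DeepSpeed config file happens to be called):

grep -q '"stage3_gather_16bit_weights_on_model_save": *true' ds_config_zero3.json \
  || { echo "ZeRO-3 gather-on-save flag missing; adapter saves may come out empty" >&2; exit 1; }
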
I also fixed my LoRA target list to match DeepSeek’s module names:

--lora_target q_proj,v_proj,k_proj,o_proj

This combination solved the “saved nothing” issue and kept artifacts small.
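
A quick post-save sanity check catches both failure modes at once; the output path is the one from the flags above, and the checkpoint step is just an example:

# The adapter should be a real file measured in MB, not a ~1 KB empty header
ls -lh /host_home/deepseek_v3_lora_prod_run_1/checkpoint-200/adapter_model.safetensors
# And the saved config should list the projection modules you intended to target
grep -A 6 '"target_modules"' /host_home/deepseek_v3_lora_prod_run_1/checkpoint-200/adapter_config.json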

Don’t write checkpoints to NFS during training

NFS worked great for models and datasets. It did not enjoy checkpoint storms. I pointed --output_dir to node-local storage (mounted in the container at /host_home/...) and synced artifacts later.
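
Concretely, that means bind-mounting a node-local path into the container and syncing the now-small artifacts afterwards; the image name, host path, and NFS destination below are placeholders rather than my exact values:

# Node-local NVMe mounted into the container as /host_home
docker run --gpus all --network host \
  -v /mnt/nvme/ckpts:/host_home \
  deepseek-train:latest ...

# After training, copy the small adapter checkpoints to shared storage at leisure
rsync -a /mnt/nvme/ckpts/deepseek_v3_lora_prod_run_1/ /nfs/checkpoints/deepseek_v3_lora_prod_run_1/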

When I accidentally saved to NFS, the head node stalled so badly that even SSH banner exchange timed out while TCP connects succeeded — classic IO saturation.
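
If you hit the same thing, the symptom is easy to confirm from another machine (headnode below is a placeholder):

# The port accepts the TCP connection immediately...
nc -vz headnode 22
# ...but ssh -v stalls right after "Connection established", waiting for the server's banner
ssh -v headnode true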

The janitor: deleting the right files constantly

Even with --save_only_model, libraries can still emit ZeRO/optimizer artifacts at times, so I ran a tiny background script inside each container that nuked anything I didn't want every few seconds.

#!/usr/bin/env bash
# Blunt checkpoint janitor: repeatedly deletes optimizer and ZeRO shard artifacts.
set -euo pipefail
DIR="${OUTPUT_DIR:-/app/local_checkpoints/deepseek_v3_lora_smoketest}"
while true; do
  # find can fail (e.g., before the output dir exists); never let that kill the loop under set -e
  find "$DIR" -type f -name "*_optim_states.pt" -delete 2>/dev/null || true
  find "$DIR" -type f -name "*_model_states.pt" -delete 2>/dev/null || true
  find "$DIR" -type f -name "optimizer.pt" -delete 2>/dev/null || true
  find "$DIR" -type f -name "scheduler.pt" -delete 2>/dev/null || true
  find "$DIR" -type f -name "rng_state.pth" -delete 2>/dev/null || true
  find "$DIR" -type f -name "zero_pp_rank_*_model_states.pt" -delete 2>/dev/null || true
  sleep 3
done

I saved it as /tmp/ckpt_janitor.sh inside each container and launched it alongside training:

nohup /tmp/ckpt_janitor.sh >/tmp/ckpt_janitor.log 2>&1 &
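
Before walking away, check that the loop survived the detach, and remember to stop it once the run is done:

# Still running?
pgrep -fa ckpt_janitor.sh
# Stop it after training finishes
pkill -f ckpt_janitor.sh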

Yes, it’s blunt. It prevented both disk exhaustion and IO-related stalls.

Clean restarts without ghosts

After failures, I used a consistent cleanup sequence across nodes to remove distributed state and free the rendezvous port:

# Kill any lingering launchers and trainers from the failed run
sudo pkill -9 -f torchrun || true
sudo pkill -9 -f train.py || true
# Best-effort GPU reset (|| true swallows the failure if the GPUs are still held)
sudo nvidia-smi --gpu-reset || true
# Remove torch-distributed and CUDA scratch files plus the per-user CUDA cache
sudo rm -rf /tmp/.torch_distributed_* /dev/shm/torch_* /tmp/cuda_* ~/.nv/
# Free the rendezvous port so the next torchrun launch can bind it
sudo fuser -k 39500/tcp 2>/dev/null || true

I wrapped this in cleanup_post_training_state.sh and kept a separate kill_docker_deepseek_training.sh to kill processes inside containers across nodes in parallel.
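
A minimal sketch of that parallel-kill pattern, with placeholder hostnames and container name:

#!/usr/bin/env bash
# Kill training processes inside the container on every node, in parallel.
NODES=(node01 node02 node03)   # placeholder hostnames
CONTAINER=deepseek_train       # placeholder container name

for node in "${NODES[@]}"; do
  ssh "$node" "docker exec $CONTAINER pkill -9 -f train.py; docker exec $CONTAINER pkill -9 -f torchrun" &
done
wait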

What finally worked for me

  • Save only adapters and in safetensors
  • Gather 16-bit weights on save for ZeRO-3
  • Point output to node-local storage; never NFS during training
  • Run the janitor to keep disk usage bounded
  • Clean distributed state and free ports before relaunches

With checkpointing tamed, my runs were no longer IO-bound and I could iterate safely on hyperparameters and data.