DevOps posts
DevOps Deep Dive: Docker for Multi-Node LLM Training
Container recipe and health checks used to keep multi-node LLaMA-Factory runs consistent.
DevOps Deep Dive: Grafana + Prometheus for Ray Training
Exact steps to expose Ray metrics and import the official Grafana dashboard.
DevOps Deep Dive: Ray for Orchestrating Multi-Node Training
Using Ray to prepare containers, validate rendezvous and NIC configuration, and launch torchrun consistently.
DevOps Deep Dive: torchrun for Stable Multi-Node SFT
Exact environment and launch patterns from production torchrun scripts, including a DDP probe and ZeRO-3 settings.