DevOps posts

DevOps Deep Dive: Docker for Multi-Node LLM Training

Container recipe and health checks used to keep multi-node LLaMA-Factory runs consistent.

DevOps Deep Dive: Grafana + Prometheus for Ray Training

Exact steps to expose Ray metrics and import the official Grafana dashboard.

DevOps Deep Dive: Ray for Orchestrating Multi-Node Training

Using Ray to prepare containers, validate rendezvous and NIC settings, and launch torchrun consistently across nodes.

DevOps Deep Dive: torchrun for Stable Multi-Node SFT

Exact environment and launch patterns from production torchrun scripts, including a DDP probe and ZeRO-3 settings.