DevOps Deep Dive: Ray for Orchestrating Multi-Node Training
Using Ray to prep containers, validate rendezvous/NIC, and launch torchrun consistently.
This post summarizes how I used Ray to orchestrate the same torchrun launch over multiple nodes. Commands and structure mirror the scripts in my notes repo.
Prometheus and Dashboard (Ray metrics)
ray stop
ray metrics shutdown-prometheus
ray start --head --metrics-export-port=8080
ray metrics launch-prometheus
Import the official Ray Grafana dashboard (ID 14708) via the Grafana UI, or programmatically using the JSON Ray writes under /tmp/ray/session_latest/metrics/grafana/dashboards/.
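For the programmatic route, here is a minimal sketch that posts each generated JSON file to Grafana's `POST /api/dashboards/db` endpoint. The function names, the Grafana URL, and the API token are placeholders for illustration and are not taken from my scripts. Depending on your Grafana setup you may also need to adjust the datasource referenced inside the JSON.

```python
import json
import urllib.request
from pathlib import Path

# Directory named in the text above; Ray writes its generated dashboards here.
DASHBOARD_DIR = Path("/tmp/ray/session_latest/metrics/grafana/dashboards")


def build_import_payload(dashboard: dict) -> dict:
    """Wrap a dashboard JSON in the body expected by POST /api/dashboards/db."""
    body = dict(dashboard)
    body["id"] = None  # let Grafana assign the numeric id on first import
    return {"dashboard": body, "overwrite": True}


def import_dashboards(grafana_url: str, api_token: str,
                      directory: Path = DASHBOARD_DIR) -> list:
    """POST every *.json dashboard in `directory`; returns the imported file names.

    `grafana_url` and `api_token` are hypothetical inputs you must supply.
    """
    imported = []
    for path in sorted(directory.glob("*.json")):
        payload = build_import_payload(json.loads(path.read_text()))
        request = urllib.request.Request(
            f"{grafana_url.rstrip('/')}/api/dashboards/db",
            data=json.dumps(payload).encode("utf-8"),
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {api_token}",
            },
            method="POST",
        )
        # urlopen raises HTTPError on a non-2xx response.
        with urllib.request.urlopen(request) as response:
            response.read()
        imported.append(path.name)
    return imported
```

Calling `import_dashboards("http://localhost:3000", token)` on the head node would push every dashboard Ray generated for the current session, assuming Grafana listens on its default port.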
Why Ray here
- Prepare containers on each node reproducibly
- Pin tasks to specific nodes
- Validate networking with a 1-proc DDP/NCCL probe before full launch
- Keep multi-node DDP under torchrun while centralizing orchestration
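The 1-proc probe from the list above can be sketched roughly as follows. This is an illustration under assumptions, not the probe from my scripts: `probe_env` and `run_probe` are hypothetical names, and the address, port, and NIC are whatever your cluster uses. It needs a CUDA GPU and a PyTorch build with NCCL on the node where it runs.

```python
import os


def probe_env(master_addr: str, master_port: int, nic: str) -> dict:
    """Environment for a single-process rendezvous pinned to one NIC."""
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "RANK": "0",
        "WORLD_SIZE": "1",
        "LOCAL_RANK": "0",
        "NCCL_SOCKET_IFNAME": nic,  # restrict NCCL socket traffic to this interface
    }


def run_probe(master_addr: str, master_port: int, nic: str) -> float:
    """Init a 1-process NCCL group, all-reduce one tensor, return the result."""
    # Imported lazily so probe_env stays usable on machines without torch.
    import torch
    import torch.distributed as dist

    os.environ.update(probe_env(master_addr, master_port, nic))
    torch.cuda.set_device(0)
    dist.init_process_group(backend="nccl", init_method="env://")
    try:
        value = torch.ones(1, device="cuda")
        dist.all_reduce(value)  # sum over a single rank leaves the value at 1.0
        return value.item()
    finally:
        dist.destroy_process_group()
```

A single-process group does not send traffic between nodes. It only checks that the rendezvous store comes up on the chosen port and that NCCL initializes with the chosen interface, which is cheap to verify before committing every GPU to the full launch.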
For details, see the Ray driver layout in my training scripts: container prep step (deps, Accelerate config, LoRA save patch), DDP probe, then the full torchrun launch with ZeRO‑3 and BF16.
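As a rough sketch of the last step, the driver can build one torchrun command per node and pin each to its node with Ray's `NodeAffinitySchedulingStrategy`. The helper names `torchrun_cmd` and `launch_on_nodes`, the default port, and the choice of the first alive node as the rendezvous master are assumptions for illustration. ZeRO‑3 and BF16 settings would live in the training script's own arguments or config, passed through `script_args`.

```python
import subprocess


def torchrun_cmd(nnodes, node_rank, nproc_per_node, master_addr, master_port,
                 script, script_args=()):
    """Build the torchrun argv for one node of a static multi-node launch."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--nproc_per_node={nproc_per_node}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
        *script_args,
    ]


def launch_on_nodes(script, nproc_per_node, master_port=29500, script_args=()):
    """Run the same torchrun launch on every alive Ray node; returns exit codes."""
    # Imported lazily so torchrun_cmd can be used without Ray installed.
    import ray
    from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

    ray.init(address="auto")  # attach to the cluster started with `ray start`
    nodes = [n for n in ray.nodes() if n["Alive"]]
    master_addr = nodes[0]["NodeManagerAddress"]  # assumption: first node hosts rendezvous

    @ray.remote(num_cpus=0)
    def run(cmd):
        return subprocess.run(cmd, check=False).returncode

    refs = []
    for rank, node in enumerate(nodes):
        cmd = torchrun_cmd(len(nodes), rank, nproc_per_node,
                           master_addr, master_port, script, script_args)
        # soft=False makes the task fail rather than run on a different node.
        strategy = NodeAffinitySchedulingStrategy(node_id=node["NodeID"], soft=False)
        refs.append(run.options(scheduling_strategy=strategy).remote(cmd))
    return ray.get(refs)
```

Keeping the command builder separate from the Ray call makes it easy to print and inspect the exact torchrun line for each node before anything is launched.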