DevOps Deep Dive: Ray for Orchestrating Multi-Node Training

Using Ray to prep containers, validate rendezvous/NIC, and launch torchrun consistently.

This post summarizes how I used Ray to orchestrate the same torchrun launch over multiple nodes. Commands and structure mirror the scripts in my notes repo.

Prometheus and Dashboard (Ray metrics)

ray stop
ray metrics shutdown-prometheus
ray start --head --metrics-export-port=8080
ray metrics launch-prometheus

Import the official Ray Grafana dashboard (ID 14708) via the Grafana UI, or programmatically using the JSON Ray writes under /tmp/ray/session_latest/metrics/grafana/dashboards/.

Why Ray here

  • Prepare containers on each node reproducibly
  • Pin tasks to specific nodes
  • Validate networking with a 1-proc DDP/NCCL probe before full launch
  • Keep multi-node DDP under torchrun while centralizing orchestration

For details, see the Ray driver layout in my training scripts: container prep step (deps, Accelerate config, LoRA save patch), DDP probe, then the full torchrun launch with ZeRO‑3 and BF16.