Thomas Kalnik's Blog

This post summarizes how I used Ray to orchestrate the same torchrun launch over multiple nodes. Commands and structure mirror the scripts in my notes repo.

Prometheus and Dashboard (Ray metrics)

ray stop
ray metrics shutdown-prometheus
ray start --head --metrics-export-port=8080
ray metrics launch-prometheus

Import the official Ray Grafana dashboard (ID 14708) via the Grafana UI, or programmatically using the JSON Ray writes under /tmp/ray/session_latest/metrics/grafana/dashboards/.

Why Ray here

Prepare containers on each node reproducibly
Pin tasks to specific nodes
Validate networking with a 1-proc DDP/NCCL probe before full launch
Keep multi-node DDP under torchrun while centralizing orchestration

For details, see the Ray driver layout in my training scripts: container prep step (deps, Accelerate config, LoRA save patch), DDP probe, then the full torchrun launch with ZeRO‑3 and BF16.