DevOps Deep Dive: Grafana + Prometheus for Ray Training

Exact steps to expose Ray metrics and import the official Grafana dashboard.

These come from my own notes on enabling Prometheus metrics and Grafana dashboards for multi-node training.

Start Ray with metrics export enabled and launch Prometheus

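# Stop any running Ray processes and a previously launched Prometheus
ray stop
ray metrics shutdown-prometheus
# Start the head node and export metrics on port 8080
ray start --head --metrics-export-port=8080
# Download and start a Prometheus instance preconfigured to scrape Ray
ray metrics launch-prometheus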
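
Before moving on to Grafana, it can help to confirm that the export port is actually serving metrics. The following is a minimal sketch, not part of my original notes: the function names are hypothetical, and the URL http://localhost:8080 is an assumption based on the --metrics-export-port value above.

```python
import urllib.request


def fetch_metrics_text(url, timeout=5.0):
    """Download the raw Prometheus exposition text from a metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_names(exposition_text):
    """Return the set of metric names found in Prometheus exposition text."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like: name{label="v"} value  or  name value
        head = line.split("{", 1)[0].split()
        if head:
            names.add(head[0])
    return names
```

On the head node, passing the result of fetch_metrics_text("http://localhost:8080") to metric_names should return a non-empty set once Ray is up. Filtering for names that start with "ray_" is a quick way to see only the Ray metrics.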

Import the official Ray dashboard in Grafana

UI path:

  1. Open http://<your-ip>:3000
  2. "+" → "Import"
  3. Enter dashboard ID 14708
  4. Select Prometheus data source → Import
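
Step 4 only works if Grafana already has a Prometheus data source configured. The sketch below is one way to check that from a script using Grafana's /api/datasources endpoint. The helper names are hypothetical, and the base URL and admin:admin credentials are assumptions for a local default install.

```python
import base64
import json
import urllib.request


def list_datasources(base_url, user, password, timeout=5.0):
    """Fetch all data sources from Grafana using HTTP basic auth."""
    req = urllib.request.Request(f"{base_url}/api/datasources")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def prometheus_datasource_names(datasources):
    """Return the names of data sources whose type is 'prometheus'."""
    return [ds["name"] for ds in datasources if ds.get("type") == "prometheus"]
```

For example, prometheus_datasource_names(list_datasources("http://localhost:3000", "admin", "admin")) returning an empty list means the data source still needs to be added before importing the dashboard.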

Or import the dashboard JSON that Ray generates for the current session:

cp /tmp/ray/session_latest/metrics/grafana/dashboards/default_grafana_dashboard.json ~/

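python3 - <<'PY'
import json
import requests

# Dashboard JSON that Ray generates for the current session
path = '/tmp/ray/session_latest/metrics/grafana/dashboards/default_grafana_dashboard.json'
with open(path) as f:
    dashboard = json.load(f)
dashboard["id"] = None  # let Grafana assign its own internal ID

payload = {
    "dashboard": dashboard,
    "overwrite": True,
    # "inputs" is honored by the import endpoint, not by /api/dashboards/db
    "inputs": [{
        "name": "DS_PROMETHEUS",
        "type": "datasource",
        "pluginId": "prometheus",
        "value": "Prometheus"
    }]
}
# Default admin:admin credentials; change these for any non-local Grafana
r = requests.post('http://localhost:3000/api/dashboards/import', json=payload,
                  auth=('admin', 'admin'))
print(r.status_code, r.text)
PY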

With this, Prometheus scrapes Ray metrics and Grafana visualizes cluster/node/actor/task metrics during training runs.
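
If the Grafana panels stay empty, a useful first check is whether Prometheus considers its scrape targets healthy. Here is a rough sketch that queries the built-in "up" metric through the Prometheus HTTP API. The helper names are hypothetical, and http://localhost:9090 is an assumption (the Prometheus default port); adjust it if your instance listens elsewhere.

```python
import json
import urllib.parse
import urllib.request


def query_prometheus(base_url, promql, timeout=5.0):
    """Run an instant PromQL query and return the decoded JSON response."""
    params = urllib.parse.urlencode({"query": promql})
    url = f"{base_url}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def down_targets(response):
    """Return the 'instance' labels whose 'up' sample is not 1."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response}")
    down = []
    for sample in response["data"]["result"]:
        # Each instant-vector sample carries a [timestamp, "value"] pair.
        _, value = sample["value"]
        if float(value) != 1.0:
            down.append(sample["metric"].get("instance", "<unknown>"))
    return down
```

Calling down_targets(query_prometheus("http://localhost:9090", "up")) should return an empty list when every target is being scraped. Any instance it lists is a node whose metrics port Prometheus cannot reach.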