DevOps Deep Dive: Grafana + Prometheus for Ray Training
Exact steps, from my notes, to expose Ray's Prometheus metrics and import the official Grafana dashboard for multi-node training.
Start Ray with metrics export enabled and launch Prometheus
ray stop
ray metrics shutdown-prometheus
ray start --head --metrics-export-port=8080
ray metrics launch-prometheus
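Before moving on to Grafana, it can help to confirm that the head node is actually exporting metrics. The sketch below is one possible check, not part of Ray itself: it assumes the export port from the command above (8080) serves Prometheus text format at the `/metrics` path on localhost, and the helper names `metric_names` and `fetch_metric_names` are hypothetical.

```python
import urllib.request
from typing import Set


def metric_names(exposition_text: str) -> Set[str]:
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like `name{label="v"} 1.0` or `name 1.0`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


def fetch_metric_names(url: str = "http://localhost:8080/metrics",
                       timeout: float = 5.0) -> Set[str]:
    """Fetch the endpoint (URL is an assumption) and return the metric names."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return metric_names(resp.read().decode("utf-8"))
```

Calling `fetch_metric_names()` on the head node and looking for names that start with `ray_` is a quick sanity check; an empty set or a connection error suggests the export port is not what you expected.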
Import the official Ray dashboard in Grafana
UI path:
- Open http://<your-ip>:3000
- "+" → "Import"
- Enter dashboard ID 14708
- Select the Prometheus data source → Import
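The last step in the list above only works if Grafana already has a Prometheus data source configured. A minimal sketch for checking that from a script follows; it assumes Grafana's `/api/datasources` endpoint is reachable on localhost:3000 with the default admin:admin credentials, and the function names are hypothetical.

```python
import base64
import json
import urllib.request


def prometheus_datasource_names(datasources: list) -> list:
    """Return the names of entries whose type is "prometheus"."""
    return [ds["name"] for ds in datasources if ds.get("type") == "prometheus"]


def list_datasources(base_url: str = "http://localhost:3000",
                     user: str = "admin", password: str = "admin",
                     timeout: float = 5.0) -> list:
    """GET /api/datasources with basic auth (URL and credentials are assumptions)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        f"{base_url}/api/datasources",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

If `prometheus_datasource_names(list_datasources())` comes back empty, add the data source in Grafana first; the name it returns is what you pick in the import dialog.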
Or import the JSON Ray writes:
cp /tmp/ray/session_latest/metrics/grafana/dashboards/default_grafana_dashboard.json ~/
python3 - <<'PY'
import json, requests

# Load the dashboard JSON that Ray generated for the current session.
with open('/tmp/ray/session_latest/metrics/grafana/dashboards/default_grafana_dashboard.json', 'r') as f:
    dashboard = json.load(f)

payload = {
    "dashboard": dashboard,
    "overwrite": True,  # replace an existing dashboard with the same UID or title
    "inputs": [{
        "name": "DS_PROMETHEUS",
        "type": "datasource",
        "pluginId": "prometheus",
        "value": "Prometheus"
    }]
}

# Uses the default admin:admin credentials; change them if yours differ.
r = requests.post('http://admin:admin@localhost:3000/api/dashboards/db', json=payload,
                  headers={'Content-Type': 'application/json'})
print(r.status_code, r.text)
PY
With this, Prometheus scrapes Ray metrics and Grafana visualizes cluster/node/actor/task metrics during training runs.
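To confirm the whole chain without opening Grafana, you can ask Prometheus directly through its HTTP query API. This is a rough sketch under assumptions: Prometheus is taken to listen on its default port 9090 on localhost, the helper names are hypothetical, and the metric you query (for example `ray_node_cpu_utilization`) should be one you have seen on your own metrics endpoint.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(promql: str, base_url: str = "http://localhost:9090") -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def extract_samples(response: dict) -> list:
    """Flatten an instant-vector response into (labels, value) pairs."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response}")
    # Each result item carries its labels and a [timestamp, "value"] pair.
    return [(item["metric"], float(item["value"][1]))
            for item in response["data"]["result"]]


def query(promql: str, base_url: str = "http://localhost:9090",
          timeout: float = 5.0) -> list:
    """Run an instant query against Prometheus (base URL is an assumption)."""
    with urllib.request.urlopen(build_query_url(promql, base_url), timeout=timeout) as resp:
        return extract_samples(json.load(resp))
```

An empty list from `query(...)` during a training run usually means Prometheus is up but not scraping the Ray nodes, which narrows the problem to the scrape configuration rather than Grafana.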