NeMo Curator integrates with Prometheus and Grafana, a widely used open-source monitoring stack.
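Prometheus is a time-series database and monitoring system that scrapes metrics from configured targets over HTTP, stores them as time series, and exposes them for querying with PromQL.
Grafana is a visualization platform that queries data sources such as Prometheus and renders the results as dashboards.
How They Work Together:
Ray exposes its metrics in a format Prometheus can scrape, Prometheus collects and stores them, and Grafana queries Prometheus to display them in dashboards.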
When running production pipelines, track these critical metrics:
NeMo Curator provides a built-in script to download, configure, and start Prometheus and Grafana. When RayClient starts, it automatically registers Ray’s metrics with the running Prometheus instance.
On the head node, run the setup script before starting your pipeline:
This downloads Prometheus and Grafana binaries, writes configuration files, starts both services, and auto-generates Ray dashboards (default, data, serve, and serve-deployment).
By default, the script stores metrics data in a per-user directory (/tmp/nemo_curator_metrics_{uid}). To specify a custom location, use --metrics_dir:
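For reference, the default location described above can be computed from Python, for example to inspect what the script wrote there. This is a minimal sketch that assumes `{uid}` is the numeric POSIX user ID returned by `os.getuid()`; the helper name `default_metrics_dir` is hypothetical and not part of NeMo Curator.

```python
import os
from pathlib import Path


def default_metrics_dir(uid=None):
    """Build the per-user default path /tmp/nemo_curator_metrics_{uid}.

    Assumption: {uid} is the numeric POSIX user ID of the current user.
    """
    if uid is None:
        uid = os.getuid()  # POSIX only; not available on Windows
    return Path(f"/tmp/nemo_curator_metrics_{uid}")


if __name__ == "__main__":
    print(default_metrics_dir())
```

Because the path includes the user ID, two users on the same node get different default directories without passing any extra options.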
You can also customize ports:
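Before choosing custom ports on a shared node, it can help to confirm that nothing is already listening on them. The check below is a generic sketch using only the Python standard library; `is_port_free` is a hypothetical helper, and the ports in the usage line (9090 and 3000, the conventional upstream defaults for Prometheus and Grafana) are only examples, not confirmed defaults of the setup script.

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": another process holds the port.
            return False
    return True


if __name__ == "__main__":
    for candidate in (9090, 3000):
        print(candidate, "free" if is_port_free(candidate) else "in use")
```

The result is only a snapshot: another process can claim the port between the check and the moment the service starts.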
Pass the same metrics_dir to RayClient so it can locate the running Prometheus instance and register Ray’s service discovery:
When RayClient.start() runs, it checks whether Prometheus and Grafana are running (through PID files in metrics_dir) and, if so, adds Ray’s metrics service discovery to the Prometheus configuration. When RayClient.stop() runs, it removes the service discovery entry automatically.
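The PID-file check can be pictured roughly as follows. This is an illustrative sketch of the general technique, not NeMo Curator's actual implementation: `pid_file_is_live` is a hypothetical helper, and the PID file is assumed to contain a single integer process ID.

```python
import os
from pathlib import Path


def pid_file_is_live(pid_file):
    """Return True if pid_file exists and names a process that is still running."""
    try:
        pid = int(Path(pid_file).read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # no PID file, or its contents are not an integer
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one process
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without signalling
    except ProcessLookupError:
        return False  # stale PID file: the process has exited
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

A stale PID file (left behind after a crash or reboot) returns `False` here, which is why checking the process and not just the file matters.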
SLURM clusters: The metrics_dir passed to RayClient must match the --metrics_dir used when starting Prometheus and Grafana. Similarly, the ray_temp_dir must match the temp_dir used when starting the Ray cluster.
On shared clusters, each user can run an isolated monitoring stack by specifying a unique metrics_dir. Each instance tracks its own Prometheus and Grafana processes through PID files, so multiple users can monitor their pipelines independently on the same node.
To stop Prometheus and Grafana, use the PID files written to your metrics directory:
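As a rough Python sketch of this step: the function below reads every PID file in a directory and sends `SIGTERM` to the process it names. The `*.pid` file pattern is an assumption, so check your metrics directory for the actual file names, and `stop_from_pid_files` is a hypothetical helper, not a NeMo Curator API.

```python
import os
import signal
from pathlib import Path


def stop_from_pid_files(metrics_dir, pattern="*.pid"):
    """Send SIGTERM to each process named by a PID file; return the PIDs signalled.

    Assumption: PID files match `pattern` and each holds one integer PID.
    """
    signalled = []
    for pid_file in sorted(Path(metrics_dir).glob(pattern)):
        try:
            pid = int(pid_file.read_text().strip())
        except ValueError:
            continue  # skip files that do not contain a PID
        if pid <= 0:
            continue  # never signal process groups by accident
        try:
            os.kill(pid, signal.SIGTERM)  # request a graceful shutdown
            signalled.append(pid)
        except ProcessLookupError:
            pass  # the process already exited; nothing to stop
    return signalled
```

`SIGTERM` gives each service a chance to shut down cleanly and flush data to disk. Pass the same directory you used for `--metrics_dir` so that only your own instance is stopped.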