Monitoring
NeMo Curator supports integration with Prometheus and Grafana, the industry-standard open-source monitoring stack.
Prometheus is a time-series database and monitoring system that:
- Collects metrics from your pipeline at regular intervals (for example, every 15 seconds).
- Stores metrics like CPU usage, GPU memory, worker counts, and task throughput.
- Provides a query language (PromQL) to aggregate and analyze metrics.
- Runs as a standalone service that scrapes metrics exposed by Curator workers.
Grafana is a visualization platform that:
- Connects to Prometheus as a data source.
- Displays metrics in customizable dashboards with graphs, gauges, and alerts.
- Provides real-time views of your pipeline’s health and performance.
- Allows you to set up alerts (for example, notify when GPU memory exceeds 90%).
How They Work Together:
- Curator workers expose metrics in a format Prometheus understands.
- Prometheus periodically scrapes these metrics and stores them.
- Grafana queries Prometheus and displays the data in dashboards.
- You view the dashboards to monitor your pipeline in real time and historically.
Key Metrics to Monitor
When running production pipelines, track these critical metrics (example queries follow the list):
- CPU Memory Usage: Total RAM consumption across workers to prevent out-of-memory errors.
- GPU Memory Usage: VRAM consumption per GPU for model-based stages (classifiers, embedders).
- Worker Count: Number of active workers per stage to verify proper scaling.
- Task Throughput: Documents or batches processed per second to measure pipeline efficiency.
- Stage Latency: Time spent in each pipeline stage to identify bottlenecks.
- Error Rates: Failed tasks or worker crashes to detect stability issues.
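For example, if your deployment exports Ray's standard node metrics, PromQL queries along these lines cover the first two items. The metric names below are typical Ray defaults, not something this guide defines, so verify them against what your Prometheus instance actually scrapes:

```promql
# Metric names are illustrative Ray defaults -- confirm them in your
# Prometheus instance before building dashboards or alerts on them.
sum(ray_node_mem_used)        # total RAM in use across all nodes
ray_node_gram_used            # GPU memory in use, per node
ray_node_cpu_utilization      # per-node CPU utilization
```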
Setting Up Monitoring
NeMo Curator provides a built-in script to download, configure, and start Prometheus and Grafana. When RayClient starts, it automatically registers Ray’s metrics with the running Prometheus instance.
1. Start Prometheus and Grafana
On the head node, run the setup script before starting your pipeline:
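The exact entry point varies by release; the module path below is an assumption, so substitute the script name from your NeMo Curator installation:

```bash
# Hypothetical entry point -- check your NeMo Curator release for the
# exact script name. Run on the head node before starting the pipeline.
python -m nemo_curator.metrics.setup
```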
This downloads Prometheus and Grafana binaries, writes configuration files, starts both services, and auto-generates Ray dashboards (default, data, serve, and serve-deployment).
By default, the script stores metrics data in a per-user directory (/tmp/nemo_curator_metrics_{uid}). To specify a custom location, use --metrics_dir:
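For example (same assumed entry point as above; `--metrics_dir` is the documented flag):

```bash
# Store Prometheus/Grafana state in a shared, per-user location.
python -m nemo_curator.metrics.setup --metrics_dir /shared/metrics/alice
```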
You can also customize ports:
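The port flag names below are assumptions; check the script's `--help` output for the exact spelling:

```bash
# Run Prometheus/Grafana on non-default ports (flag names assumed).
python -m nemo_curator.metrics.setup \
    --metrics_dir /shared/metrics/alice \
    --prometheus_port 9091 \
    --grafana_port 3001
```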
2. Configure RayClient
Pass the same metrics_dir to RayClient so it can locate the running Prometheus instance and register Ray’s service discovery:
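A minimal sketch, assuming `RayClient` is importable as shown (the import path may differ in your version):

```python
# Import path is an assumption -- adjust to match your NeMo Curator version.
from nemo_curator.core.client import RayClient

# metrics_dir must match the --metrics_dir used when starting
# Prometheus and Grafana above.
client = RayClient(metrics_dir="/shared/metrics/alice")

client.start()  # registers Ray's metrics service discovery with Prometheus
try:
    ...         # build and run your pipeline here
finally:
    client.stop()  # removes the service-discovery entry again
```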
When RayClient.start() runs, it checks whether Prometheus and Grafana are running (through PID files in metrics_dir) and, if so, adds Ray’s metrics service discovery to the Prometheus configuration. When RayClient.stop() runs, it removes the service discovery entry automatically.
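Under the hood, this typically takes the form of a file-based service-discovery entry in the Prometheus configuration. The sketch below is illustrative only; the discovery-file path depends on your Ray temp directory, and the setup script and RayClient manage this entry for you:

```yaml
# Illustrative Prometheus scrape config -- managed automatically.
scrape_configs:
  - job_name: ray
    file_sd_configs:
      - files:
          - /tmp/ray/prom_metrics_service_discovery.json
```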
SLURM clusters: The metrics_dir passed to RayClient must match the --metrics_dir used when starting Prometheus and Grafana. Similarly, the ray_temp_dir must match the temp_dir used when starting the Ray cluster.
Multi-User Clusters
On shared clusters, each user can run an isolated monitoring stack by specifying a unique metrics_dir. Each instance tracks its own Prometheus and Grafana processes through PID files, so multiple users can monitor their pipelines independently on the same node.
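For instance, two users might run isolated stacks on one node (entry-point and port flag names as assumed above; the second stack picks alternate ports to avoid conflicts):

```bash
# User alice, on the default ports.
python -m nemo_curator.metrics.setup --metrics_dir /tmp/metrics_alice

# User bob, isolated from alice's stack.
python -m nemo_curator.metrics.setup --metrics_dir /tmp/metrics_bob \
    --prometheus_port 9091 --grafana_port 3001
```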
Stopping Monitoring Services
To stop Prometheus and Grafana, use the PID files written to your metrics directory:
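A sketch using the default per-user directory from above; the PID file names are assumptions, so list your metrics directory to confirm them:

```bash
# Default location is /tmp/nemo_curator_metrics_{uid}; adjust if you
# passed a custom --metrics_dir. PID file names are assumed.
METRICS_DIR=/tmp/nemo_curator_metrics_$(id -u)
kill "$(cat "$METRICS_DIR/prometheus.pid")"
kill "$(cat "$METRICS_DIR/grafana.pid")"
```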