Observability
The Cosmos NIM provides observability features that allow you to monitor system performance and resource usage. This section covers how to access metrics and set up monitoring tools.
Metrics Endpoint
NIM exposes Prometheus metrics for request statistics and system performance. These metrics can be used to create dashboards in monitoring tools like Grafana.
By default, metrics are available at the following endpoint:
curl -X 'GET' 'http://0.0.0.0:8000/v1/metrics'
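The same request can be made programmatically. The following is a minimal sketch, not part of the NIM tooling, that fetches the endpoint and parses the Prometheus text exposition format using only the Python standard library. The function names `parse_metrics` and `fetch_metrics` are hypothetical, and the parser is deliberately simplified: it skips comment lines and ignores optional timestamps.

```python
import urllib.request

# Default endpoint taken from the curl command above; adjust host and port as needed.
METRICS_URL = "http://0.0.0.0:8000/v1/metrics"


def parse_metrics(text):
    """Parse Prometheus text format into {"name{labels}": float_value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Label values may contain spaces, so split after the closing brace.
            name, _, rest = line.rpartition("}")
            name += "}"
        else:
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        try:
            # The first field is the sample value; a trailing timestamp is ignored.
            samples[name] = float(fields[0])
        except ValueError:
            continue
    return samples


def fetch_metrics(url=METRICS_URL, timeout=5.0):
    """Fetch and parse the metrics endpoint (requires a running NIM)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Calling `fetch_metrics()` against a running deployment returns a dictionary keyed by metric name, including any label set the server attaches to each sample.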
Available Metrics
The following table describes the system metrics available through the metrics endpoint:
Category | Metric | Metric Name | Description |
---|---|---|---|
Python | GC Objects Collected | python_gc_objects_collected | Number of objects collected during garbage collection |
Python | GC Objects Uncollectable | python_gc_objects_uncollectable | Number of objects that could not be collected during garbage collection |
Python | GC Collections | python_gc_collections_total | Number of garbage collection runs |
Process | Virtual Memory | process_virtual_memory_bytes | Virtual memory size used by the process |
Process | Resident Memory | process_resident_memory_bytes | Physical memory size used by the process |
Process | CPU Time | process_cpu_seconds_total | Total CPU time used by the process |
GPU | Power Usage | gpu_power_usage_watts | Current power consumption of the GPU |
GPU | Power Limit | gpu_power_limit_watts | Maximum power limit configured for the GPU |
GPU | Energy Consumption | gpu_total_energy_consumption | Total energy consumed by the GPU |
GPU | GPU Utilization | gpu_utilization | GPU compute utilization percentage |
GPU | Memory Total | gpu_memory_total_bytes | Total memory available on the GPU |
GPU | Memory Used | gpu_memory_used_bytes | Memory currently in use on the GPU |
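Several of these metrics are most useful in combination. The sketch below shows one way to derive ratios from raw values, assuming you already hold the samples in a plain dictionary keyed by the metric names in the table. The helper names `gpu_summary` and `cpu_cores_used` are hypothetical. Note the distinction it relies on: the GPU values are point-in-time gauges, while `process_cpu_seconds_total` is a cumulative counter that only becomes meaningful as a difference between two readings.

```python
def gpu_summary(samples):
    """Derive ratios from gauge values given as {metric_name: float}."""
    used = samples["gpu_memory_used_bytes"]
    total = samples["gpu_memory_total_bytes"]
    power = samples["gpu_power_usage_watts"]
    limit = samples["gpu_power_limit_watts"]
    return {
        # Fraction of GPU memory in use (0.0 if the total is reported as zero).
        "memory_used_fraction": used / total if total else 0.0,
        # Remaining power budget before the configured limit is reached.
        "power_headroom_watts": limit - power,
    }


def cpu_cores_used(prev_seconds, curr_seconds, interval_seconds):
    """Average CPU cores used between two readings of process_cpu_seconds_total."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # A counter that went down indicates a process restart; count from zero.
    delta = curr_seconds - prev_seconds if curr_seconds >= prev_seconds else curr_seconds
    return delta / interval_seconds
```

For example, two readings of 100.0 and 130.0 CPU seconds taken 15 seconds apart correspond to an average of 2.0 cores in use over that interval.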
Note
For more detailed inference-level metrics, you can access the Triton metrics endpoint at http://0.0.0.0:8002/metrics. For more information on these metrics, refer to the Triton Metrics documentation.
Setting Up Monitoring
This section provides instructions for setting up Prometheus and Grafana to monitor your Cosmos NIM deployment. Follow these steps to install and configure Prometheus for scraping metrics from NIM:
Download Prometheus for your system. The following commands fetch v2.52.0 for 64-bit Linux:
wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz
tar -xvzf prometheus-2.52.0.linux-amd64.tar.gz
cd prometheus-2.52.0.linux-amd64/
Configure Prometheus to scrape metrics from the NIM endpoint by editing the prometheus.yml file:
# A scrape configuration containing exactly one endpoint to scrape
scrape_configs:
  - job_name: "nim-metrics"
    # Prometheus scrapes /metrics by default; NIM serves its metrics at /v1/metrics
    metrics_path: "/v1/metrics"
    static_configs:
      - targets: ["localhost:8000"]
Start the Prometheus server:
./prometheus --config.file=./prometheus.yml
Verify the setup by opening a web browser and navigating to http://localhost:9090/targets. You should see the NIM target listed with a status of “UP”.
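The same check can be scripted against the Prometheus HTTP API, which reports target state at `/api/v1/targets`. The following is a minimal sketch, assuming Prometheus runs on its default port and the job is named `nim-metrics` as in the configuration above. The helper names `target_health` and `fetch_targets` are hypothetical.

```python
import json
import urllib.request

# Assumed default Prometheus address; adjust for your environment.
PROMETHEUS_URL = "http://localhost:9090"


def target_health(targets_response, job="nim-metrics"):
    """Return [(scrape_url, health)] for active targets belonging to `job`."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (target.get("scrapeUrl"), target.get("health"))
        for target in active
        if target.get("labels", {}).get("job") == job
    ]


def fetch_targets(base_url=PROMETHEUS_URL, timeout=5.0):
    """Fetch the targets listing (requires a running Prometheus server)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)
```

Running `target_health(fetch_targets())` should list the NIM scrape URL with a health of `"up"`. The reported scrape URL is also a quick way to confirm which path Prometheus is actually requesting.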
Grafana Setup
Follow these steps to set up Grafana for visualizing NIM metrics:
Download and install Grafana for your system. The following commands fetch v11.0.0 for 64-bit Linux:
wget https://dl.grafana.com/oss/release/grafana-11.0.0.linux-amd64.tar.gz
tar -zxvf grafana-11.0.0.linux-amd64.tar.gz
cd grafana-v11.0.0/
Start the Grafana server:
./bin/grafana-server
Access the Grafana web interface by opening a browser and navigating to http://localhost:3000. Log in using the default credentials:
Username: admin
Password: admin
Configure Prometheus as a data source:
Navigate to Connections > Data sources in the Grafana sidebar.
Click Add data source and select “Prometheus”.
Set the URL to http://localhost:9090.
Click Save & test to verify the connection.
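If you prefer to script this step, Grafana also accepts data source definitions through its HTTP API at `/api/datasources`. The sketch below builds such a request under stated assumptions: Grafana is reachable at its default address, the default admin credentials are still in place, and the data source name `Prometheus` is an arbitrary choice. The helper names are hypothetical.

```python
import base64
import json
import urllib.request

# Assumed default Grafana address; adjust for your environment.
GRAFANA_URL = "http://localhost:3000"


def prometheus_datasource_payload(name="Prometheus", url="http://localhost:9090"):
    """Build the JSON body describing a Prometheus data source."""
    return {
        "name": name,
        "type": "prometheus",
        "url": url,
        # "proxy" means the Grafana server, not the browser, queries Prometheus.
        "access": "proxy",
        "isDefault": True,
    }


def build_request(payload, base_url=GRAFANA_URL, user="admin", password="admin"):
    """Build (but do not send) a POST request using HTTP basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{base_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

Passing the result of `build_request(prometheus_datasource_payload())` to `urllib.request.urlopen` sends the request. Change the default password before exposing Grafana beyond a local test machine.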
Creating Dashboards
Once you have Prometheus and Grafana set up, you can create dashboards to visualize NIM metrics:
In Grafana, click “Dashboards” in the sidebar, then select New > New Dashboard.
Click Add visualization.
Select your Prometheus data source.
Use the query builder to select metrics such as gpu_utilization, gpu_memory_used_bytes, or process_cpu_seconds_total.
Configure the visualization settings and add the panel to your dashboard.
Repeat the above steps for additional metrics you want to monitor.
For more detailed instructions on building Grafana dashboards, refer to the Grafana Fundamentals tutorial.
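Before building a panel, it can help to try an expression directly against Prometheus. The sketch below lists a few PromQL expressions that could back the panels described above and builds the corresponding instant-query URLs for the Prometheus HTTP API. The panel titles and the helper name `instant_query_url` are hypothetical, and the memory ratio assumes both GPU memory series carry identical labels.

```python
import urllib.parse

# Assumed default Prometheus address; adjust for your environment.
PROMETHEUS_URL = "http://localhost:9090"

# Illustrative panel title -> PromQL expression.
PANEL_QUERIES = {
    # Gauge: plot the raw value.
    "GPU utilization": "gpu_utilization",
    # Ratio of two gauges; assumes matching label sets on both series.
    "GPU memory used (fraction)": "gpu_memory_used_bytes / gpu_memory_total_bytes",
    # Counter: rate() turns cumulative CPU seconds into average cores used.
    "CPU cores used (5m average)": "rate(process_cpu_seconds_total[5m])",
}


def instant_query_url(expr, base_url=PROMETHEUS_URL):
    """Build a URL for the Prometheus instant-query endpoint /api/v1/query."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
```

Opening one of these URLs in a browser, or requesting it with curl, returns the current value as JSON. The same expressions can be pasted into the Grafana query editor in code mode.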
Tip
Refer to the troubleshooting page if you encounter issues with metrics collection or visualization.