Observability

The Cosmos NIM provides observability features that allow you to monitor system performance and resource usage. This section covers how to access metrics and set up monitoring tools.

Metrics Endpoint

NIM exposes Prometheus metrics for request statistics and system performance. These metrics can be used to create dashboards in monitoring tools like Grafana.

By default, metrics are available at the following endpoint:

curl -X 'GET' 'http://0.0.0.0:8000/v1/metrics'
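The endpoint returns plain text in the Prometheus exposition format, so it can also be read from a script. The following is a minimal sketch rather than an official client: the function names `parse_metrics` and `fetch_metrics` are hypothetical, the URL is the default shown above, and the parser handles only simple `name{labels} value` lines.

```python
import re
from urllib.request import urlopen

# Default endpoint from this page; adjust host and port for your deployment.
METRICS_URL = "http://0.0.0.0:8000/v1/metrics"

# Matches "name value" and "name{labels} value" sample lines.
_SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)")


def parse_metrics(text):
    """Parse Prometheus text exposition into {(name, labels): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        name, labels, raw_value = match.group(1), match.group(2) or "", match.group(3)
        try:
            samples[(name, labels)] = float(raw_value)
        except ValueError:
            continue  # Ignore values this simple parser cannot read.
    return samples


def fetch_metrics(url=METRICS_URL, timeout=5.0):
    """Fetch the metrics endpoint and return the parsed samples."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Calling `fetch_metrics()` against a running NIM returns a dictionary keyed by metric name and label string. The labels attached to each metric depend on the deployment, so inspect the raw output before relying on specific keys.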

Available Metrics

The following table describes the system metrics available through the metrics endpoint:

| Category | Metric | Metric Name | Description |
|----------|--------|-------------|-------------|
| Python | GC Objects Collected | python_gc_objects_collected | Number of objects collected during garbage collection |
| Python | GC Objects Uncollectable | python_gc_objects_uncollectable | Number of objects that could not be collected during garbage collection |
| Python | GC Collections | python_gc_collections_total | Number of times the garbage collector has run |
| Process | Virtual Memory | process_virtual_memory_bytes | Virtual memory size used by the process, in bytes |
| Process | Resident Memory | process_resident_memory_bytes | Physical (resident) memory size used by the process, in bytes |
| Process | CPU Time | process_cpu_seconds_total | Total CPU time used by the process, in seconds |
| GPU | Power Usage | gpu_power_usage_watts | Current power consumption of the GPU, in watts |
| GPU | Power Limit | gpu_power_limit_watts | Maximum power limit configured for the GPU, in watts |
| GPU | Energy Consumption | gpu_total_energy_consumption | Total energy consumed by the GPU |
| GPU | GPU Utilization | gpu_utilization | GPU compute utilization percentage |
| GPU | Memory Total | gpu_memory_total_bytes | Total memory available on the GPU, in bytes |
| GPU | Memory Used | gpu_memory_used_bytes | Memory currently in use on the GPU, in bytes |
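Several of these metrics are easier to interpret as ratios. The sketch below is illustrative only: `gpu_summary` is a hypothetical helper that takes already-collected values for a single GPU, keyed by the metric names in the table, and derives the fraction of memory in use and the fraction of the power limit being drawn.

```python
def gpu_summary(values):
    """Derive usage ratios from raw GPU metric values for one GPU.

    `values` maps metric names from the table above to floats.
    A ratio is None when its denominator is missing or zero.
    """

    def ratio(numerator_key, denominator_key):
        denominator = values.get(denominator_key, 0.0)
        if not denominator:
            return None
        return values.get(numerator_key, 0.0) / denominator

    return {
        # Bytes in use divided by total bytes on the device.
        "memory_used_fraction": ratio("gpu_memory_used_bytes", "gpu_memory_total_bytes"),
        # Current draw divided by the configured power limit.
        "power_used_fraction": ratio("gpu_power_usage_watts", "gpu_power_limit_watts"),
    }
```

For example, 20 GiB used out of 80 GiB total gives a `memory_used_fraction` of 0.25. A value that stays close to 1.0 suggests the GPU has little memory headroom left.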

Note

For more detailed inference-level metrics, you can access the Triton metrics endpoint at http://0.0.0.0:8002/metrics. For more information on these metrics, refer to the Triton Metrics documentation.

Setting Up Monitoring

This section provides instructions for setting up Prometheus and Grafana to monitor your Cosmos NIM deployment. Follow these steps to install and configure Prometheus for scraping metrics from NIM:

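  1. Download a Prometheus release for your system. The commands below use v2.52.0 for Linux on amd64; substitute a newer release or a different platform as needed: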

    wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz
    tar -xvzf prometheus-2.52.0.linux-amd64.tar.gz
    cd prometheus-2.52.0.linux-amd64/
    
  2. Configure Prometheus to scrape metrics from the NIM endpoint by editing the prometheus.yml file:

    # A scrape configuration containing exactly one endpoint to scrape
    scrape_configs:
      - job_name: "nim-metrics"
        # NIM serves metrics at /v1/metrics; Prometheus defaults to /metrics
        metrics_path: "/v1/metrics"
        static_configs:
          - targets: ["localhost:8000"]
    
  3. Start the Prometheus server:

    ./prometheus --config.file=./prometheus.yml
    
  4. Verify the setup by opening a web browser and navigating to http://localhost:9090/targets. You should see the NIM target listed with a status of “UP”.
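The same check can be scripted against the Prometheus HTTP API, which lists scrape targets at `/api/v1/targets`. This is a minimal sketch under a few assumptions: Prometheus runs at `http://localhost:9090`, the job is named `nim-metrics` as in the configuration above, and the helper names `fetch_targets` and `summarize_targets` are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumed Prometheus address; matches the URL used in the step above.
PROMETHEUS_URL = "http://localhost:9090"


def summarize_targets(payload, job_name):
    """Return health details for active targets that belong to `job_name`."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        {
            "scrape_url": target.get("scrapeUrl"),
            "health": target.get("health"),
            "last_error": target.get("lastError", ""),
        }
        for target in targets
        if target.get("labels", {}).get("job") == job_name
    ]


def fetch_targets(base_url=PROMETHEUS_URL, timeout=5.0):
    """Query the Prometheus targets API and return the decoded JSON."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)
```

Running `summarize_targets(fetch_targets(), "nim-metrics")` should report a `health` of `up` once scraping works. If the target is down, `last_error` usually indicates the cause, such as a refused connection or an unexpected response from the metrics path.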

Grafana Setup

Follow these steps to set up Grafana for visualizing NIM metrics:

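  1. Download and extract a Grafana release for your system. The commands below use v11.0.0 for Linux on amd64; substitute a newer release or a different platform as needed: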

    wget https://dl.grafana.com/oss/release/grafana-11.0.0.linux-amd64.tar.gz
    tar -zxvf grafana-11.0.0.linux-amd64.tar.gz
    cd grafana-v11.0.0/
    
  2. Start the Grafana server:

    ./bin/grafana-server
    
  3. Access the Grafana web interface by opening a browser and navigating to http://localhost:3000. Log in using the default credentials:

    Username: admin
    Password: admin
    
  4. Configure Prometheus as a data source:

    1. Navigate to Connections > Data sources in the Grafana sidebar.

    2. Click Add data source and select “Prometheus”.

    3. Set the URL to http://localhost:9090.

    4. Click Save & test to verify the connection.
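For repeatable setups, the data source can also be created through the Grafana HTTP API instead of the web interface. The sketch below only builds the request and assumes the default URLs and the default `admin` credentials shown above. `build_datasource_request` is a hypothetical helper, and the data source name `Prometheus` is an arbitrary choice.

```python
import base64
import json
from urllib.request import Request


def build_datasource_request(
    grafana_url="http://localhost:3000",
    prometheus_url="http://localhost:9090",
    user="admin",
    password="admin",
):
    """Build a POST request that registers Prometheus as a Grafana data source."""
    body = {
        "name": "Prometheus",      # Display name in Grafana (arbitrary).
        "type": "prometheus",      # Built-in Prometheus data source plugin.
        "url": prometheus_url,     # Address Grafana uses to reach Prometheus.
        "access": "proxy",         # Grafana's server makes the queries.
    }
    # HTTP basic authentication with the Grafana login.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

Pass the returned object to `urllib.request.urlopen` to send it. If you changed the admin password after the first login, supply the new credentials instead of the defaults.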

Creating Dashboards

Once you have Prometheus and Grafana set up, you can create dashboards to visualize NIM metrics:

  1. In Grafana, click Dashboards in the sidebar, then select New > New Dashboard.

  2. Click Add visualization.

  3. Select your Prometheus data source.

  4. Use the query builder to select metrics such as gpu_utilization, gpu_memory_used_bytes, or process_cpu_seconds_total.

  5. Configure the visualization settings and add the panel to your dashboard.

  6. Repeat the above steps for additional metrics you want to monitor.
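When choosing metrics for a panel, note that process_cpu_seconds_total is a cumulative counter, so plotting it directly produces an ever-rising line. A per-second rate is usually more informative; in a Prometheus query this is typically written as `rate(process_cpu_seconds_total[5m])`. The sketch below illustrates the underlying arithmetic in simplified form. `per_second_rate` is a hypothetical function and omits the extrapolation that Prometheus applies at the window boundaries.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter over a window.

    `samples` is a list of (timestamp_seconds, counter_value) pairs
    ordered by time. Returns None if a rate cannot be computed.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # The counter dropped, so the process restarted and the
            # counter began again from zero.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed
```

For a CPU time counter, a result of 0.5 means the process used half of one CPU core on average over the window.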

For more detailed instructions on building Grafana dashboards, refer to the Grafana Fundamentals tutorial.

Tip

Refer to the troubleshooting page if you encounter issues with metrics collection or visualization.