Large Language Models (RC15)

NIM exposes Prometheus metrics that report request statistics. You can use these metrics to build Grafana dashboards. By default, the metrics are available at http://localhost:8000/metrics.

The following table describes the available metrics.



| Category | Metric | Metric Name | Description | Granularity | Frequency |
|---|---|---|---|---|---|
| KV Cache | GPU Cache Usage | gpu_cache_usage_perc | GPU KV-cache usage. 1 means 100 percent usage | Per model | Per iteration |
| Count | Running Count | num_requests_running | Number of requests currently running on GPU | Per model | Per iteration |
| Count | Waiting Count | num_requests_waiting | Number of requests waiting to be processed | Per model | Per iteration |
| Count | Max Request Count | num_request_max | Max number of concurrently running requests | Per model | Per iteration |
| Count | Total Prompt Token Count | prompt_tokens_total | Number of prefill tokens processed | Per model | Per iteration |
| Count | Total Generation Token Count | generation_tokens_total | Number of generation tokens processed | Per model | Per iteration |
| Latency | Time to First Token | time_to_first_token_seconds | Histogram of time to first token in seconds | Per model | Per request |
| Latency | Time per Output Token | time_per_output_token_seconds | Histogram of time per output token in seconds | Per model | Per request |
| Latency | End to End | e2e_request_latency_seconds | Histogram of end to end request latency in seconds | Per model | Per request |
| Count | Prompt Token Count | request_prompt_tokens | Histogram of number of prefill tokens processed | Per model | Per request |
| Count | Generation Token Count | request_generation_tokens | Histogram of number of generation tokens processed | Per model | Per request |
| Count | Finished Request Count | request_success_total | Number of finished requests, with label indicating finish reason | Per model | Per request |
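The metrics endpoint returns plain text in the Prometheus exposition format, so the values can also be read without a Prometheus server. The following is a minimal sketch, not part of NIM: it parses exposition text and derives a mean time to first token from the histogram's `_sum` and `_count` series. The sample text, the `model_name` label, and the helper names are assumptions made for illustration; the names a real deployment exposes may differ (for example, they may carry a prefix), and the parser assumes lines carry no trailing timestamp.

```python
import urllib.request


def parse_samples(text):
    """Return (name, labels, value) tuples from Prometheus exposition text.

    Assumes each sample line ends with the value (no trailing timestamp).
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        series, _, value = line.rpartition(" ")
        name, _, labels = series.partition("{")
        samples.append((name, labels.rstrip("}"), float(value)))
    return samples


def histogram_mean(samples, base_name):
    """Mean of a histogram: <base>_sum / <base>_count, summed over all label sets."""
    total = sum(v for n, _, v in samples if n == base_name + "_sum")
    count = sum(v for n, _, v in samples if n == base_name + "_count")
    return total / count if count else None


def fetch_metrics(url="http://localhost:8000/metrics"):
    """Download the raw metrics text from a running NIM instance."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


# Hypothetical sample, shaped like the metrics in the table above.
SAMPLE = """\
# TYPE gpu_cache_usage_perc gauge
gpu_cache_usage_perc{model_name="example-model"} 0.25
# TYPE time_to_first_token_seconds histogram
time_to_first_token_seconds_bucket{model_name="example-model",le="0.1"} 3
time_to_first_token_seconds_bucket{model_name="example-model",le="+Inf"} 4
time_to_first_token_seconds_sum{model_name="example-model"} 0.5
time_to_first_token_seconds_count{model_name="example-model"} 4
"""

if __name__ == "__main__":
    # Replace SAMPLE with fetch_metrics() to read from a live endpoint.
    samples = parse_samples(SAMPLE)
    print(histogram_mean(samples, "time_to_first_token_seconds"))
```

With a running NIM instance, passing the result of `fetch_metrics()` to `parse_samples` in place of `SAMPLE` should work the same way, provided the exposed names match the ones used here.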

To install Prometheus for scraping metrics from NIM, download the latest Prometheus version appropriate for your system.


    wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz
    tar -xvzf prometheus-2.52.0.linux-amd64.tar.gz
    cd prometheus-2.52.0.linux-amd64/

Edit the Prometheus configuration file to scrape from the NIM endpoint. Make sure the targets field points to localhost:8000.

vi prometheus.yml


    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: "prometheus"

        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.

        static_configs:
          - targets: ["localhost:8000"]

Next, run the Prometheus server:

    ./prometheus --config.file=./prometheus.yml

Use a browser to check that the Prometheus server detected the NIM target at http://localhost:9090/targets?search=. You can also click the NIM target URL to explore the generated metrics.
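The same check can be scripted against the Prometheus HTTP API, which lists scrape targets at `/api/v1/targets`. The sketch below is an illustration under stated assumptions: the helper names and the sample payload are hypothetical, and only the `scrapeUrl` and `health` fields of each active target are used.

```python
import json
import urllib.request


def target_health(payload, host_port):
    """Return the health of the first active target whose scrape URL contains host_port.

    Returns None when no active target matches.
    """
    for target in payload.get("data", {}).get("activeTargets", []):
        if host_port in target.get("scrapeUrl", ""):
            return target.get("health")
    return None


def fetch_targets(prometheus="http://localhost:9090"):
    """Fetch the target list from a running Prometheus server."""
    with urllib.request.urlopen(prometheus + "/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


# Hypothetical, trimmed-down response used to exercise the helper offline.
SAMPLE_TARGETS = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://localhost:8000/metrics", "health": "up"}
        ]
    },
}

if __name__ == "__main__":
    # Replace SAMPLE_TARGETS with fetch_targets() once Prometheus is running.
    print(target_health(SAMPLE_TARGETS, "localhost:8000"))
```

A health value other than `up` usually means that Prometheus cannot reach the NIM endpoint listed in the targets field.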

We can use Grafana to build dashboards from NIM metrics. Install the latest Grafana version appropriate for your system.


    wget https://dl.grafana.com/oss/release/grafana-11.0.0.linux-amd64.tar.gz
    tar -zxvf grafana-11.0.0.linux-amd64.tar.gz

Run the Grafana server:

    cd grafana-v11.0.0/
    ./bin/grafana-server

To access the Grafana dashboard, point your browser to http://localhost:3000. Log in using the default credentials:

    username: admin
    password: admin

The first step is to configure the data source that Grafana reads metrics from. Click the “Data Source” button, select Prometheus, and specify the Prometheus URL localhost:9090. After you save the configuration, you should see a success message. You are then ready to create a dashboard with metrics from NIM, or you can try this example dashboard.
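When building a panel, a common starting point for the histogram metrics is a percentile query. The sketch below is one possible approach, not a prescribed NIM dashboard: it builds a PromQL expression for the 95th percentile of a histogram and the URL that runs it through the Prometheus `/api/v1/query` endpoint. The helper names, the 5-minute window, and the choice of percentile are assumptions.

```python
import json
import urllib.parse
import urllib.request


def p95_query(histogram, window="5m"):
    """PromQL for the 95th percentile of a histogram over a rate window."""
    return (
        "histogram_quantile(0.95, "
        f"sum by (le) (rate({histogram}_bucket[{window}])))"
    )


def query_url(expr, prometheus="http://localhost:9090"):
    """URL for an instant query against the Prometheus HTTP API."""
    return prometheus + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def run_query(expr, prometheus="http://localhost:9090"):
    """Run the query against a live Prometheus server and return the result list."""
    with urllib.request.urlopen(query_url(expr, prometheus), timeout=5) as resp:
        return json.load(resp)["data"]["result"]


if __name__ == "__main__":
    expr = p95_query("time_to_first_token_seconds")
    print(expr)
    print(query_url(expr))
```

The expression printed by `p95_query` can also be entered in the query editor of a Grafana panel that uses the Prometheus data source configured above.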


© Copyright © 2024, NVIDIA Corporation. Last updated on May 30, 2024.