Observability

NIM exposes Prometheus metrics that report request statistics. You can use these metrics to build Grafana dashboards. By default, the metrics are available at http://localhost:8000/v1/metrics.
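As a quick check that the endpoint is serving data, the metrics page can be read and parsed with a few lines of Python. This is a minimal sketch, not part of NIM: it assumes the default URL given above, and the sample payload (including the `model_name` label and its value) is hypothetical; real output may use different labels or a metric-name prefix.

```python
import re
import urllib.request

# Assumed from the text above; adjust to your deployment.
METRICS_URL = "http://localhost:8000/v1/metrics"

# A sample line is: name, optional {labels}, then the value.
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank and HELP/TYPE lines
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url=METRICS_URL, timeout=5.0):
    """Download the metrics page and parse it (requires a running NIM)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Hypothetical payload for illustration only.
SAMPLE = """\
# HELP num_requests_running Number of requests currently running on GPU
# TYPE num_requests_running gauge
num_requests_running{model_name="example-model"} 2.0
gpu_cache_usage_perc{model_name="example-model"} 0.25
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

Against a live deployment, call `fetch_metrics()` instead of parsing `SAMPLE`.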

The following table describes the available metrics.

| Category | Metric | Metric Name | Description | Granularity | Frequency |
| --- | --- | --- | --- | --- | --- |
| KV Cache | GPU Cache Usage | gpu_cache_usage_perc | GPU KV-cache usage. 1 means 100 percent usage | Per model | Per iteration |
| Count | Running Count | num_requests_running | Number of requests currently running on GPU | Per model | Per iteration |
| Count | Waiting Count | num_requests_waiting | Number of requests waiting to be processed | Per model | Per iteration |
| Count | Max Request Count | num_request_max | Max number of concurrently running requests | Per model | Per iteration |
| Count | Total Prompt Token Count | prompt_tokens_total | Number of prefill tokens processed | Per model | Per iteration |
| Count | Total Generation Token Count | generation_tokens_total | Number of generation tokens processed | Per model | Per iteration |
| Latency | Time to First Token | time_to_first_token_seconds | Histogram of time to first token in seconds | Per model | Per request |
| Latency | Time per Output Token | time_per_output_token_seconds | Histogram of time per output token in seconds | Per model | Per request |
| Latency | End to End | e2e_request_latency_seconds | Histogram of end to end request latency in seconds | Per model | Per request |
| Count | Prompt Token Count | request_prompt_tokens | Histogram of number of prefill tokens processed | Per model | Per request |
| Count | Generation Token Count | request_generation_tokens | Histogram of number of generation tokens processed | Per model | Per request |
| Count | Finished Request Count | request_finish_total | Number of finished requests, with label indicating finish reason | Per model | Per request |
| Count | Success Request Count | request_success_total | Number of successful requests; requests with finish reason “stop” or “length” are counted | Per model | Per request |
| Count | Failure Request Count | request_failure_total | Number of failed requests; requests with other finish reason are counted | Per model | Per request |
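The latency metrics in the table are histograms. Prometheus histograms are exposed as cumulative bucket counts keyed by an upper bound (the `le` label), so a percentile has to be estimated from those buckets. The sketch below shows one way to do that with linear interpolation inside the matching bucket, in the spirit of PromQL's `histogram_quantile`; it is an illustration rather than the exact Prometheus implementation, and the bucket bounds and counts are made-up values.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    `buckets` mirrors the `le` buckets of a Prometheus histogram: each count
    includes all observations less than or equal to its upper bound.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Interpolate linearly between the bucket's lower and upper bounds.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical time_to_first_token_seconds buckets: (le, cumulative count).
EXAMPLE_BUCKETS = [(0.1, 10), (0.5, 40), (1.0, 50), (math.inf, 50)]

if __name__ == "__main__":
    print("estimated p50:", histogram_quantile(0.50, EXAMPLE_BUCKETS))
    print("estimated p95:", histogram_quantile(0.95, EXAMPLE_BUCKETS))
```

The accuracy of the estimate depends on how finely the buckets are spaced around the percentile of interest.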

Prometheus

To install Prometheus for scraping metrics from NIM, download a Prometheus release appropriate for your system. The following commands use v2.52.0 for Linux on amd64:

wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz
tar -xvzf prometheus-2.52.0.linux-amd64.tar.gz
cd prometheus-2.52.0.linux-amd64/

Edit the Prometheus configuration file to scrape the NIM endpoint. Make sure the targets field points to localhost:8000.

vi prometheus.yml

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's the NIM metrics endpoint.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'.
    # Previous versions use '/v1/metrics'; if your NIM endpoint serves metrics
    # there, set metrics_path: '/v1/metrics' explicitly.
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:8000"]

Next, run the Prometheus server:

./prometheus --config.file=./prometheus.yml

Open http://localhost:9090/targets?search= in a browser to check that the Prometheus server detected the NIM target. You can also click the NIM target URL to explore the generated metrics.
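The same target check can be done without a browser by asking the Prometheus HTTP API for its targets at `/api/v1/targets`. The following is a minimal sketch, assuming Prometheus is listening on its default address localhost:9090; the helper names are illustrative.

```python
import json
import urllib.request

# Default Prometheus address used in this section; adjust if yours differs.
PROMETHEUS_URL = "http://localhost:9090"


def target_health(targets_payload):
    """Map each active target's scrape URL to its health ("up", "down", ...)."""
    active = targets_payload.get("data", {}).get("activeTargets", [])
    return {target["scrapeUrl"]: target["health"] for target in active}


def fetch_target_health(base_url=PROMETHEUS_URL, timeout=5.0):
    """Query a running Prometheus server and summarize target health."""
    url = f"{base_url}/api/v1/targets"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return target_health(json.load(response))


if __name__ == "__main__":
    for scrape_url, health in fetch_target_health().items():
        print(f"{scrape_url}: {health}")
```

A NIM target reported as "up" means Prometheus scraped the endpoint successfully on its last attempt.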

Grafana

Use Grafana to build dashboards from NIM metrics. Install a Grafana release appropriate for your system. The following commands use v11.0.0 for Linux on amd64:

wget https://dl.grafana.com/oss/release/grafana-11.0.0.linux-amd64.tar.gz
tar -zxvf grafana-11.0.0.linux-amd64.tar.gz

Run the Grafana server:

cd grafana-v11.0.0/
./bin/grafana-server

To access Grafana, point your browser to http://localhost:3000 and log in with the default credentials:

username: admin
password: admin

First, configure the data source that Grafana queries for metrics. Click the “Data Source” button, select Prometheus, and specify the Prometheus URL http://localhost:9090. After you save the configuration, you should see a success message. You are then ready to create a dashboard with metrics from NIM, or you can try this example dashboard.

NIM Dashboard Example
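The data source setup described above can also be scripted through Grafana's HTTP API by sending a POST request to `/api/datasources`. The sketch below only builds the request, so it can be inspected before anything is sent. It assumes the default addresses and admin credentials from this section, and the data source name "NIM Prometheus" is a hypothetical choice.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url="http://localhost:3000",
                             prometheus_url="http://localhost:9090",
                             user="admin", password="admin",
                             name="NIM Prometheus"):
    """Build a POST request that registers Prometheus as a Grafana data source."""
    payload = {
        "name": name,            # hypothetical display name
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",       # Grafana's backend performs the queries
    }
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_datasource_request()
    print(request.get_method(), request.full_url)
    # To actually create the data source on a running Grafana server:
    # with urllib.request.urlopen(request, timeout=5.0) as response:
    #     print(response.status)
```

If you have changed the admin password, pass the new credentials to `build_datasource_request`.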