Observability#
NV-CLIP NIM supports exporting metrics and traces in an OpenTelemetry-compatible format.
Additionally, the underlying Triton service exposes its own metrics through a Prometheus endpoint.
To collect these metrics and traces, export them to a running OpenTelemetry Collector instance, which can then export them to any OTLP-compatible backend.
Metrics#
You can collect metrics from both the NVIDIA NIM for NV-CLIP container and underlying Triton instance.
Triton Metrics#
Triton exposes its metrics on port 8002 in Prometheus format. To collect these metrics, use a Prometheus receiver to scrape the Triton endpoint and export the results in an OpenTelemetry-compatible format. See the following example for details.
The following table describes the available metrics at http://localhost:8002/metrics.
| Category | Metric | Metric Name | Description |
| --- | --- | --- | --- |
| Count | Success Request Count | nv_inference_request_success | Number of successful inference requests |
| Count | Failure Request Count | nv_inference_request_failure | Number of failed inference requests |
| Count | Total Request Count | nv_inference_count | Number of inferences performed |
| Count | Request Duration | nv_inference_request_duration_us | Cumulative inference request duration in microseconds |
| Count | Queue Duration | nv_inference_queue_duration_us | Cumulative inference queuing duration in microseconds |
| Count | Inference Duration | nv_inference_compute_infer_duration_us | Cumulative compute inference duration in microseconds |
| Gauge | GPU Utilization | nv_gpu_utilization | GPU utilization rate [0.0 - 1.0] |
| Gauge | Total GPU Memory | nv_gpu_memory_total_bytes | Total GPU memory, in bytes |
| Gauge | Used GPU Memory | nv_gpu_memory_used_bytes | Used GPU memory, in bytes |
| Gauge | CPU Utilization | nv_cpu_utilization | CPU utilization rate [0.0 - 1.0] |
| Gauge | Total CPU Memory | nv_cpu_memory_total_bytes | Total CPU memory (RAM), in bytes |
| Gauge | Used CPU Memory | nv_cpu_memory_used_bytes | Used CPU memory (RAM), in bytes |
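The cumulative counters above are useful in combination: for example, dividing nv_inference_queue_duration_us by nv_inference_count gives the average queue time per inference. A minimal sketch of that calculation, using illustrative sample values rather than a live endpoint (against a running NIM, the same lines would come from curl http://localhost:8002/metrics):

```shell
# Sample lines in Prometheus exposition format; the label set and values
# are illustrative, not output captured from a real deployment
cat <<'EOF' > sample_metrics.txt
nv_inference_count{model="nvclip",version="1"} 120
nv_inference_queue_duration_us{model="nvclip",version="1"} 60000
EOF

# Average queue time per request (microseconds) = cumulative duration / request count
awk '/^nv_inference_queue_duration_us/ {d=$2} /^nv_inference_count/ {c=$2} END {print d/c}' sample_metrics.txt
```

With the sample values above, this prints 500, i.e. an average of 500 microseconds spent in the queue per request.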
Prometheus#
To install Prometheus for scraping metrics from NIM, download the latest Prometheus version appropriate for your system.
wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz
tar -xvzf prometheus-2.52.0.linux-amd64.tar.gz
cd prometheus-2.52.0.linux-amd64/
Edit the Prometheus configuration file (prometheus.yml) to scrape the NIM endpoint. Make sure the targets field points to localhost:8002:
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:8002"]
Next, run the Prometheus server:
./prometheus --config.file=./prometheus.yml
Use a browser to check that the Prometheus server detected the NIM target at http://localhost:9090/targets?search=. You can also click the NIM target URL link to explore the generated metrics.
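Once the target is up, you can also query metrics programmatically through the Prometheus HTTP API. A quick check, assuming the Prometheus server from the previous step is running on localhost:9090 and the NIM has served at least one request:

```shell
# Query the cumulative inference count through the Prometheus HTTP API;
# the response is JSON containing the current value per label set
curl -s 'http://localhost:9090/api/v1/query?query=nv_inference_count'
```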
Grafana#
You can use Grafana to build dashboards from NIM metrics. Install the latest Grafana version appropriate for your system.
wget https://dl.grafana.com/oss/release/grafana-11.0.0.linux-amd64.tar.gz
tar -zxvf grafana-11.0.0.linux-amd64.tar.gz
Run the Grafana server
cd grafana-v11.0.0/
./bin/grafana-server
To access the Grafana dashboard, point your browser to http://localhost:3000 and log in with the default credentials:
username: admin
password: admin
The first step is to configure the data source that Grafana queries metrics from. Click the Data Source button, select Prometheus, and specify the Prometheus URL localhost:9090. After saving the configuration, you should see a success message; you are now ready to create a dashboard with metrics from NIM.
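As an alternative to clicking through the UI, Grafana can provision the data source from a file at startup. A minimal sketch, assuming a default Grafana install (provisioning files are read from conf/provisioning/datasources/; the file name here is arbitrary) and the Prometheus server from the previous section:

```yaml
# conf/provisioning/datasources/nim-prometheus.yaml
apiVersion: 1
datasources:
  - name: NIM Prometheus       # display name in Grafana; any name works
    type: prometheus
    access: proxy              # Grafana backend proxies queries to Prometheus
    url: http://localhost:9090
    isDefault: true
```

Restart the Grafana server to pick up the provisioned data source.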
Service Metrics#
To enable exporting metrics from the NIM web service, set the NIM_OTEL_SERVICE_NAME, NIM_OTEL_METRICS_EXPORTER, and NIM_OTEL_EXPORTER_OTLP_ENDPOINT environment variables when launching the NV-CLIP NIM container.
Traces#
To enable exporting traces from the NIM web service, set the NIM_OTEL_SERVICE_NAME, NIM_OTEL_TRACES_EXPORTER, and NIM_OTEL_EXPORTER_OTLP_ENDPOINT environment variables when launching the NV-CLIP NIM container.
Example#
The following example requires an instance of the OpenTelemetry Collector running at <opentelemetry-collector-endpoint> on port <opentelemetry-collector-port>.
Launching the NIM Container with OpenTelemetry Enabled#
# Choose a container name for bookkeeping
export NIM_MODEL_NAME=nvidia/nvclip
export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)
# Choose a NIM Image from NGC
export IMG_NAME="nvcr.io/nim/$NIM_MODEL_NAME:2.0.0"
# Set the OTEL environment variables to enable metrics exporting
export NIM_OTEL_SERVICE_NAME=$CONTAINER_NAME
export NIM_OTEL_METRICS_EXPORTER=otlp
export NIM_OTEL_TRACES_EXPORTER=otlp
export NIM_OTEL_EXPORTER_OTLP_ENDPOINT="http://<opentelemetry-collector-endpoint>:<opentelemetry-collector-port>"
docker run -it --rm --name=$CONTAINER_NAME \
... \
-e NIM_OTEL_SERVICE_NAME \
-e NIM_OTEL_METRICS_EXPORTER \
-e NIM_OTEL_TRACES_EXPORTER \
-e NIM_OTEL_EXPORTER_OTLP_ENDPOINT \
... \
$IMG_NAME
Receiving and Exporting Telemetry Data with the OpenTelemetry Collector#
The following OpenTelemetry Collector configuration enables both metrics and tracing exports.
Two receivers are defined:
The OTLP receiver is capable of receiving both metrics and trace data from the NIM.
A Prometheus receiver is used for scraping Triton’s own metrics.
Two exporters are described:
An OTLP exporter forwards data to a downstream collector or backend, such as Datadog.
A debug exporter prints received data to the console, which is useful for testing and development.
For simplicity, the configuration below wires only the debug exporter into the pipelines: traces are received by the OTLP receiver, metrics are received by both the OTLP and Prometheus receivers, and both pipelines export through the debug exporter.
receivers:
  otlp:
    protocols:
      grpc:
      http:
        cors:
          allowed_origins:
            - "*"
  prometheus:
    config:
      scrape_configs:
        - job_name: nim-triton-metrics
          scrape_interval: 10s
          static_configs:
            - targets: ["<nim-endpoint>:8002"]
exporters:
  debug:
    verbosity: detailed
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
    metrics:
      receivers: [otlp, prometheus]
      exporters: [debug]
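One way to run the Collector with this configuration is the contrib distribution's Docker image (the Prometheus receiver ships in the contrib build, not the core one). A sketch, assuming the configuration above is saved locally; the file name is arbitrary and the image tag should be pinned to a version appropriate for your deployment:

```shell
# Run the contrib Collector with the configuration above, exposing the
# OTLP gRPC (4317) and OTLP HTTP (4318) receiver ports to the NIM container
docker run --rm \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml" \
  -p 4317:4317 -p 4318:4318 \
  otel/opentelemetry-collector-contrib:latest
```

With the debug exporter's verbosity set to detailed, received metrics and spans are printed to the Collector's console as they arrive.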