Observability for NVIDIA NIM for Object Detection#

Use this documentation to learn about observability for NVIDIA NIM for Object Detection.

About Observability#

NVIDIA NIM for Object Detection supports exporting metrics and traces in an OpenTelemetry-compatible format. Additionally, the microservice and its underlying NVIDIA Triton Inference Server expose metrics through Prometheus endpoints.

To collect these metrics and traces, export them to a running OpenTelemetry Collector instance, which can then export them to any OTLP-compatible backend.

Metrics and Traces#

You can collect metrics from both the NIM microservice and the Triton Inference Server instance.

The following environment variables are related to exporting OpenTelemetry metrics and traces from the NIM microservice.

  • OTEL_SERVICE_NAME: Specifies the service name to use in the exported metrics and traces.

  • OTEL_EXPORTER_OTLP_ENDPOINT: Specifies the endpoint of an OTLP gRPC receiver, such as an OpenTelemetry Collector.

  • OTEL_METRICS_EXPORTER: Set to "otlp" to export metrics to the specified OTEL_EXPORTER_OTLP_ENDPOINT in OTLP format. By default, metrics are printed to the container log.

  • OTEL_TRACES_EXPORTER: Set to "otlp" to export traces to the specified OTEL_EXPORTER_OTLP_ENDPOINT in OTLP format. By default, traces are printed to the container log.

The NIM microservice and Triton Inference Server also expose metrics in Prometheus format. The NIM microservice serves them at <nim-host>:8000/v1/metrics, and the Triton metrics endpoint serves them at <nim-host>:8002/metrics.
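For a quick check of both endpoints, assuming the microservice is running on the local host with the default ports, you can fetch the metrics with curl.

# Fetch NIM microservice metrics in Prometheus format
curl http://localhost:8000/v1/metrics

# Fetch Triton Inference Server metrics
curl http://localhost:8002/metrics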

Enabling OpenTelemetry#

The following example requires that an OpenTelemetry Collector gRPC receiver is running at <opentelemetry-collector-host> on port <opentelemetry-collector-grpc-port>.

export IMG_NAME=nvcr.io/nim/nvidia/nemoretriever-page-elements-v2
export IMG_TAG=1.2.0

# Choose a container name for bookkeeping
export CONTAINER_NAME=$(basename $IMG_NAME)

# Set the OTEL environment variables to enable metrics exporting
export OTEL_SERVICE_NAME=$CONTAINER_NAME
export OTEL_METRICS_EXPORTER=otlp
export OTEL_TRACES_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT="http://<opentelemetry-collector-host>:<opentelemetry-collector-grpc-port>"

docker run --runtime=nvidia -it --rm --name=$CONTAINER_NAME \
  ... \
  -e OTEL_SERVICE_NAME \
  -e OTEL_METRICS_EXPORTER \
  -e OTEL_TRACES_EXPORTER \
  -e OTEL_EXPORTER_OTLP_ENDPOINT \
  ... \
  $IMG_NAME:$IMG_TAG
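Before starting the container, you can optionally confirm that the collector endpoint is reachable from the Docker host. This is a quick sketch that assumes the netcat (nc) utility is installed and that you substitute your actual collector host and port.

# Check that the collector's OTLP gRPC port accepts connections
nc -zv <opentelemetry-collector-host> <opentelemetry-collector-grpc-port>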

Receiving and Exporting Telemetry Data#

The following OpenTelemetry Collector configuration enables both metrics and tracing exports.

Two receivers are defined:

  • An OTLP receiver that receives both metrics and trace data from the NIM microservice.

  • A Prometheus receiver that scrapes Triton Inference Server metrics.

Three exporters are defined:

  • A Zipkin exporter that exports to a running Zipkin instance.

  • An OTLP gRPC exporter that exports to a downstream collector or backend, such as Datadog.

  • A debug exporter that prints received data to the console. This exporter is helpful for testing and development purposes.

Traces are received exclusively by the OTLP receiver and exported by both the Zipkin and debug exporters. Metrics are received by both the OTLP and Prometheus receivers and exported by the OTLP and debug exporters.

receivers:
  otlp:
    protocols:
      grpc:
      http:
        cors:
          allowed_origins:
            - "*"
  prometheus:
    config:
      scrape_configs:
        - job_name: nim-triton-metrics
          scrape_interval: 10s
          static_configs:
            - targets: ["<nim-endpoint>:8002"]
exporters:
  zipkin:
    endpoint: "<zipkin-endpoint>:<zipkin-port>/api/v2/spans"
  otlp:
    endpoint: "<otlp-metrics-endpoint>:<otlp-metrics-port>"
    tls:
      insecure: true
  # NOTE: In Collector versions prior to v0.86.0, use `logging` instead of `debug`.
  debug:
    verbosity: detailed
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug, zipkin]
    metrics:
      receivers: [otlp, prometheus]
      exporters: [debug, otlp]
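One way to run a collector with this configuration is through the opentelemetry-collector-contrib container image, which bundles the Prometheus receiver and Zipkin exporter used above. The following is a minimal sketch that assumes you saved the configuration as otel-collector-config.yaml in the current directory; 4317 and 4318 are the default OTLP gRPC and HTTP ports.

# Mount the configuration over the image's default config path and
# publish the default OTLP gRPC (4317) and HTTP (4318) ports
docker run --rm \
  -p 4317:4317 \
  -p 4318:4318 \
  -v $(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml \
  otel/opentelemetry-collector-contrib:latest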

Prometheus#

To install Prometheus for scraping metrics from NIM, download and extract a Prometheus release appropriate for your system. For example, on Linux (x86_64):

wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz
tar -xvzf prometheus-2.52.0.linux-amd64.tar.gz
cd prometheus-2.52.0.linux-amd64/

Edit the Prometheus configuration file to scrape from the NIM endpoint. Make sure the targets field points to localhost:8002.

vim prometheus.yml

# A scrape configuration containing exactly one endpoint to scrape:
# the NIM Triton metrics endpoint on port 8002.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "nim-triton-metrics"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:8002"]
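Before starting the server, you can optionally validate the edited configuration with the promtool binary that ships in the Prometheus tarball.

# Report syntax errors in prometheus.yml before starting the server
./promtool check config prometheus.yml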

Next, run the Prometheus server:

./prometheus --config.file=./prometheus.yml

Use a browser to check that the Prometheus server detected the NIM target at http://localhost:9090/targets?search=. You can also click the NIM target URL to explore the generated metrics.
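You can also query individual metrics through the Prometheus HTTP API. As a quick sketch, assuming the default Prometheus port, the following query reads nv_inference_request_success, a Triton counter of successful inference requests scraped from port 8002.

# Query the current value of a Triton inference counter
curl 'http://localhost:9090/api/v1/query?query=nv_inference_request_success'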

Grafana#

You can use Grafana to create dashboards for NIM metrics. Download and extract a Grafana release appropriate for your system. For example, on Linux (x86_64):

wget https://dl.grafana.com/oss/release/grafana-11.0.0.linux-amd64.tar.gz
tar -zxvf grafana-11.0.0.linux-amd64.tar.gz

Run the Grafana server:

cd grafana-v11.0.0/
./bin/grafana-server

To access the Grafana dashboard, point your browser to http://localhost:3000 and log in with the following default credentials.

username: admin 
password: admin

The first step is to configure a data source for Grafana. Click Data Sources, select Prometheus, and then specify the Prometheus URL http://localhost:9090. Save the configuration, and you should see a success message. You are now ready to create a dashboard with metrics from NIM.
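If you prefer to script this step instead of using the UI, Grafana also exposes data source management through an HTTP API. The following sketch assumes a local Grafana instance that still uses the default admin:admin credentials.

# Create a Prometheus data source through Grafana's HTTP API
curl -X POST http://admin:admin@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{"name": "Prometheus", "type": "prometheus", "url": "http://localhost:9090", "access": "proxy"}'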