Observability for OpenFold3 NIM#

Use this documentation to learn about observability for OpenFold3 NIM.

About Observability#

OpenFold3 NIM exposes health check endpoints and metrics, enabling monitoring of service health, model performance, resource utilization, and request patterns.

Health Check Endpoints#

OpenFold3 NIM provides health check endpoints that can be used for Kubernetes liveness and readiness probes.

Readiness Check#

Endpoint path: /v1/health/ready

Purpose: Indicates when the container is ready to accept traffic.

Response: Returns HTTP 200 when the service is ready.

Example:

curl http://localhost:8000/v1/health/ready

Liveness Check#

Endpoint path: /v1/health/live

Purpose: Indicates whether the container is still running correctly. A failing check signals that the container should be restarted.

Response: Returns HTTP 200 when the service is live.

Example:

curl http://localhost:8000/v1/health/live

Because both endpoints signal success with HTTP 200, they can be used directly as the targets of Kubernetes readiness and liveness probes.
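Outside Kubernetes, the same readiness endpoint can gate a script that should not send requests until the service is up. The following is a minimal sketch using only the Python standard library. The function names, the 600-second limit, and the 5-second polling interval are illustrative assumptions, not values defined by OpenFold3 NIM.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_status(url, timeout_s=5.0):
    """Return the HTTP status code of a GET request, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return None  # connection refused, timeout, DNS failure, and similar


def wait_until_ready(base_url, max_wait_s=600.0, interval_s=5.0,
                     get_status=http_status, sleep=time.sleep):
    """Poll /v1/health/ready until it returns HTTP 200 or max_wait_s has elapsed.

    get_status and sleep are injectable so the loop can be exercised without a
    running service. The limits are assumed values; adjust them for your deployment.
    """
    url = base_url.rstrip("/") + "/v1/health/ready"
    waited = 0.0
    while True:
        if get_status(url) == 200:
            return True
        if waited >= max_wait_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Example call against a locally running NIM (not executed here):
# wait_until_ready("http://localhost:8000")
```

The function returns True as soon as the readiness check succeeds and False if the time limit passes first, so a caller can decide whether to proceed or abort.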

Metrics#

OpenFold3 NIM exposes metrics at the /v1/metrics endpoint. You can view them directly:

curl http://localhost:8000/v1/metrics
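Because the endpoint is scraped with a Prometheus receiver later in this document, its output is assumed here to be in the Prometheus text exposition format. Under that assumption, the sketch below fetches the text and splits each sample line into a name, a raw label string, and a numeric value. The helper names are illustrative, and the parser is deliberately minimal: it skips comment lines and ignores optional timestamps.

```python
import urllib.request


def fetch_metrics_text(base_url="http://localhost:8000", timeout_s=5.0):
    """Download the raw text served at /v1/metrics."""
    url = base_url.rstrip("/") + "/v1/metrics"
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")


def parse_samples(text):
    """Parse Prometheus-style text into (name, labels, value) tuples.

    labels is the raw text between the braces, or "" if the sample has none.
    """
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a "# HELP" / "# TYPE" comment
        if "}" in line:
            head, _, rest = line.rpartition("}")
            name, _, labels = head.partition("{")
        else:
            name, _, rest = line.partition(" ")
            labels = ""
        value = float(rest.split()[0])  # a trailing timestamp, if any, is ignored
        samples.append((name, labels, value))
    return samples


# Example call against a locally running NIM (not executed here):
# for name, labels, value in parse_samples(fetch_metrics_text()):
#     print(name, labels, value)
```

For production use, a maintained Prometheus client library or the collector described below is a more robust choice than hand parsing.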

Available Metrics Summary#

GPU Metrics:

  • GPU utilization rate (0.0 - 1.0)

  • GPU memory usage and total memory (bytes)

  • GPU power consumption (watts) and total energy (joules)

Request Metrics:

  • Total request count

  • Request latency distribution

Process Metrics:

  • CPU time, memory usage (resident and virtual)

  • Open file descriptors

Python Runtime:

  • Garbage collection statistics

  • Python version information
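Raw values such as a total request count or memory in bytes are usually easier to read once they are turned into rates and ratios. The sketch below shows two such derivations. It assumes that the total request count only increases while the process runs (a counter), and the function names are illustrative; the actual metric names must be read from the /v1/metrics output.

```python
def counter_rate(prev_value, curr_value, elapsed_s):
    """Average per-second increase of a counter between two scrapes.

    If the current value is lower than the previous one, the counter is assumed
    to have reset (for example after a restart) and to have counted up from zero.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    if curr_value >= prev_value:
        increase = curr_value - prev_value
    else:
        increase = curr_value
    return increase / elapsed_s


def memory_fraction(used_bytes, total_bytes):
    """Fraction of GPU memory in use, given used and total memory in bytes."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes
```

For example, a request count that grows from 100 to 130 over a 10-second scrape interval corresponds to 3 requests per second, and 20 GiB used out of 80 GiB gives a memory fraction of 0.25.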

Collecting Metrics with OpenTelemetry Collector#

OpenTelemetry is a vendor-neutral observability framework that provides a standardized way to collect, process, and export telemetry data (metrics, logs, and traces).

The OpenTelemetry Collector acts as an intermediary service that:

  • Receives metrics from OpenFold3 NIM’s /v1/metrics endpoint using a Prometheus receiver

  • Processes the data (batching, filtering, enrichment)

  • Exports to metrics backends (Prometheus, Datadog, New Relic, etc.)

Prerequisites#

You need an OpenTelemetry Collector that can reach the OpenFold3 NIM /v1/metrics endpoint over the network. The following sections show how to configure and run one.

OpenTelemetry Collector Configuration#

Create a configuration file otel-collector-config.yaml:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: nim-metrics
          scrape_interval: 10s
          metrics_path: "/v1/metrics"
          static_configs:
            - targets: ["localhost:8000"]
              labels:
                service: "openfold3-nim"

processors:
  batch:
    timeout: 10s

exporters:
  # Debug exporter - prints to console
  debug:
    verbosity: detailed
  
  # Prometheus exporter - exposes metrics on port 8889
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: nim

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [debug, prometheus]

Running the Collector#

Start the OpenTelemetry Collector with the configuration:

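# Host networking lets the collector reach the NIM at localhost:8000
# and makes the exporter port 8889 available on the host.
docker run -d --name otel-collector \
  --network host \
  -v "$(pwd)/otel-collector-config.yaml":/etc/otelcol/config.yaml \
  otel/opentelemetry-collector:latest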

The collector will:

  • Scrape metrics from localhost:8000/v1/metrics every 10 seconds

  • Expose the collected metrics at http://localhost:8889/metrics, with the nim_ prefix added by the exporter's namespace setting
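To confirm that the pipeline works end to end, you can read the collector's Prometheus endpoint and list the metric names it exports. The sketch below assumes that the exporter prepends the configured namespace and an underscore to each metric name, so names are expected to start with nim_. The helper names are illustrative.

```python
import urllib.request


def fetch_text(url="http://localhost:8889/metrics", timeout_s=5.0):
    """Download the text served by the collector's Prometheus exporter."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")


def exported_metric_names(text, namespace="nim"):
    """Return the sorted, de-duplicated metric names that carry the namespace prefix."""
    prefix = namespace + "_"
    names = set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# HELP" / "# TYPE" comments
        # The name ends at the first "{" (labels) or the first space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return sorted(names)


# Example call against a locally running collector (not executed here):
# print(exported_metric_names(fetch_text()))
```

An empty result after a few scrape intervals suggests that the collector cannot reach the NIM or that the pipeline is misconfigured. Because the container was started in detached mode, the debug exporter output is available through docker logs otel-collector.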

To send the collected metrics to another monitoring backend, such as Datadog or New Relic, add the corresponding exporter to the exporters section and include it in the metrics pipeline.