Observability for OpenFold3 NIM#

Use this documentation to learn about observability for OpenFold3 NIM.

About Observability#

OpenFold3 NIM exposes health check endpoints and metrics, enabling monitoring of service health, model performance, resource utilization, and request patterns.

Health Check Endpoints#

OpenFold3 NIM provides health check endpoints that can be used for Kubernetes liveness and readiness probes.

Readiness Check#

Endpoint path: /v1/health/ready

Purpose: Indicates when the container is ready to accept traffic.

Response: Returns HTTP 200 when the service is ready.

Example:

curl http://localhost:8000/v1/health/ready
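When scripting a deployment, it can help to block until the readiness endpoint returns HTTP 200 before sending requests. The following is a minimal sketch using only the Python standard library. The function names, the 300-second deadline, and the 5-second polling interval are illustrative assumptions, not values documented for OpenFold3 NIM.

```python
import time
import urllib.request


def probe(url, timeout_s=2.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses, so this covers
        # refused connections, timeouts, and non-2xx responses.
        return False


def wait_until_ready(base_url="http://localhost:8000",
                     deadline_s=300.0,   # assumed upper bound on startup time
                     interval_s=5.0,     # assumed polling interval
                     check=probe):
    """Poll /v1/health/ready until it succeeds or the deadline passes."""
    end = time.monotonic() + deadline_s
    while True:
        if check(f"{base_url}/v1/health/ready"):
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(interval_s)
```

Calling wait_until_ready() with no arguments polls http://localhost:8000/v1/health/ready and returns True as soon as the service reports ready, or False if the deadline passes first. The check parameter exists only so that the polling logic can be exercised without a running service.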

Liveness Check#

Endpoint path: /v1/health/live

Purpose: Indicates whether the container is still functioning. A failing check signals that the container should be restarted.

Response: Returns HTTP 200 when the service is live.

Example:

curl http://localhost:8000/v1/health/live

Metrics#

OpenFold3 NIM exposes metrics in Prometheus format at the /v1/metrics endpoint. You can view them directly:

curl http://localhost:8000/v1/metrics
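The endpoint is scraped by a Prometheus receiver later in this guide, so the sketch below assumes that the response uses the Prometheus text exposition format. It parses that text into a dictionary of series and values using only the standard library. The sample metric names are the ones the Python prometheus_client library emits by default and are used purely for illustration; the names and values your deployment exposes may differ.

```python
import urllib.request


def parse_samples(text):
    """Parse Prometheus text format into {series: value}.

    `series` is the metric name including its label set, for example
    'python_info{major="3"}'. Comment lines (# HELP, # TYPE) are skipped.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            # Label values may contain spaces, so split after the last '}'.
            close = line.rfind("}")
            series, rest = line[:close + 1], line[close + 1:].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        if rest:
            # rest[0] is the value; an optional timestamp may follow.
            samples[series] = float(rest[0])
    return samples


def fetch_samples(url="http://localhost:8000/v1/metrics", timeout_s=5.0):
    """Fetch the metrics endpoint and parse the response body."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return parse_samples(resp.read().decode("utf-8"))


# Illustrative input, not captured from a real OpenFold3 NIM instance.
SAMPLE = """\
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 12.0
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10"} 1.0
"""
```

With a running service, fetch_samples() returns the same structure for the live endpoint, which makes it easy to look up a single series by name.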

Available Metrics Summary#

GPU Metrics:

  • GPU utilization rate (0.0 - 1.0)

  • GPU memory usage and total memory (bytes)

  • GPU power consumption (watts) and total energy (joules)

Request Metrics:

  • Total request count

  • Request latency distribution

Process Metrics:

  • CPU time, memory usage (resident and virtual)

  • Open file descriptors

Python Runtime:

  • Garbage collection statistics

  • Python version information
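The summary above groups metrics by category rather than listing their exact names. One way to discover the names is to read the # TYPE comment lines in the /v1/metrics output, as in the sketch below. It assumes the Prometheus text exposition format. The sample text is illustrative, and request_latency_seconds in particular is a hypothetical name, not one confirmed for OpenFold3 NIM.

```python
from collections import defaultdict


def families_by_type(text):
    """Group metric family names by their declared Prometheus type."""
    groups = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        # A type declaration has the form: "# TYPE <family> <type>"
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            groups[parts[3]].append(parts[2])
    return {kind: sorted(names) for kind, names in groups.items()}


# Illustrative input; request_latency_seconds is a hypothetical name.
SAMPLE = """\
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 4.2
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1048576.0
# TYPE process_open_fds gauge
process_open_fds 12.0
# TYPE request_latency_seconds histogram
request_latency_seconds_count 3.0
"""

for kind, names in families_by_type(SAMPLE).items():
    print(kind, names)
```

Families reported as histogram are the ones that carry a distribution, which is where a request latency metric would be expected to appear.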

Collecting Metrics with OpenTelemetry Collector#

OpenTelemetry is a vendor-neutral observability framework that provides a standardized way to collect, process, and export telemetry data (metrics, logs, and traces).

The OpenTelemetry Collector acts as an intermediary service that:

  • Receives metrics from OpenFold3 NIM’s /v1/metrics endpoint using a Prometheus receiver

  • Processes the data (batching, filtering, enrichment)

  • Exports to various backends (Prometheus, Datadog, New Relic, Zipkin, etc.)

Prerequisites#

An OpenTelemetry Collector must be running and configured to scrape metrics.

OpenTelemetry Collector Configuration#

Create a configuration file named otel-collector-config.yaml:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: nim-metrics
          scrape_interval: 10s
          metrics_path: "/v1/metrics"
          static_configs:
            - targets: ["localhost:8000"]
              labels:
                service: "openfold3-nim"

processors:
  batch:
    timeout: 10s

exporters:
  # Debug exporter - prints to console
  debug:
    verbosity: detailed
  
  # Prometheus exporter - exposes metrics on port 8889
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: nim

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [debug, prometheus]

Checking for Port Conflicts#

By default, the OpenTelemetry Collector serves its own internal telemetry metrics on port 8888. Before running the Collector, check whether that port is already in use:

# On Linux/macOS
lsof -i :8888

# Or using netstat
netstat -tuln | grep 8888
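The port check can also be scripted. The sketch below tests whether something is already listening on a TCP port by attempting a local connection. The function name and the 0.5-second timeout are illustrative choices.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout_s=0.5):
    """Return True if a TCP listener accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout_s)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for port in (8887, 8888, 8889):
        state = "in use" if port_in_use(port) else "free"
        print(f"port {port}: {state}")
```

This only detects listeners reachable on the given host address, so a result of "free" is a hint rather than a guarantee that the Collector will be able to bind the port.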

If port 8888 is occupied, use the alternative configuration below, which binds to different ports.

Alternative Configuration for Port Conflicts#

If port 8888 is already in use, use this configuration (otel-collector-config.yaml), which binds to ports 8887 and 8889 instead:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "nim-app"
          metrics_path: "/v1/metrics"
          static_configs:
            - targets: ["localhost:8000"]

processors:
  batch: {}

exporters:
  debug:
    verbosity: detailed

  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: nim

service:
  telemetry:
    metrics:
      readers:
        - pull:
            exporter:
              prometheus:
                host: "0.0.0.0"
                port: 8887

  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [debug, prometheus]

This configuration:

  • Uses port 8887 for the collector’s internal telemetry metrics

  • Uses port 8889 for the Prometheus exporter endpoint

  • Avoids conflicts with services running on port 8888

Running the Collector#

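Start the OpenTelemetry Collector with the configuration file mounted into the container. The --network host option lets the Collector reach OpenFold3 NIM at localhost:8000: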

docker run -d --name otel-collector \
  --network host \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otelcol/config.yaml" \
  otel/opentelemetry-collector:latest

The collector will:

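  • Scrape metrics from localhost:8000/v1/metrics every 10 seconds (the alternative configuration does not set scrape_interval, so the Prometheus default of one minute applies)

  • Expose the collected metrics at http://localhost:8889/metrics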

  • Expose collector’s own telemetry at http://localhost:8887/metrics (if using alternative configuration)
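Because the Prometheus exporter is configured with namespace: nim, the series it serves are expected to carry a nim_ prefix. The sketch below filters exporter output for that prefix as a quick check that scraped metrics are flowing through the pipeline. The sample text is illustrative rather than captured from a real Collector, the function names are hypothetical, and the default URL assumes the exporter listens on port 8889.

```python
import urllib.request


def namespaced_lines(text, namespace="nim"):
    """Return the sample lines whose metric name starts with '<namespace>_'."""
    prefix = namespace + "_"
    return [line for line in text.splitlines() if line.startswith(prefix)]


def exporter_has_metrics(url="http://localhost:8889/metrics",
                         namespace="nim", timeout_s=5.0):
    """Fetch the exporter endpoint and report whether any namespaced series exist."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        body = resp.read().decode("utf-8")
    return len(namespaced_lines(body, namespace)) > 0


# Illustrative exporter output, assuming process_open_fds is scraped upstream.
SAMPLE = """\
# TYPE nim_process_open_fds gauge
nim_process_open_fds{job="nim-metrics",service="openfold3-nim"} 12
other_metric 1
"""

print(namespaced_lines(SAMPLE))
```

If exporter_has_metrics() returns False shortly after startup, the first scrape may simply not have completed yet, so it is worth waiting at least one scrape interval before investigating further.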

To send the collected data to monitoring backends such as Datadog, New Relic, or Zipkin, configure additional exporters and add them to the exporters list of the pipeline.