Observability for OpenFold3 NIM#
Use this documentation to learn about observability for OpenFold3 NIM.
About Observability#
OpenFold3 NIM exposes health check endpoints and metrics, enabling monitoring of service health, model performance, resource utilization, and request patterns.
Health Check Endpoints#
OpenFold3 NIM provides health check endpoints that can be used for Kubernetes liveness and readiness probes.
Readiness Check#
Endpoint path: /v1/health/ready
Purpose: Indicates when the container is ready to accept traffic.
Response: Returns HTTP 200 when the service is ready.
Example:
curl http://localhost:8000/v1/health/ready
Liveness Check#
Endpoint path: /v1/health/live
Purpose: Indicates whether the container is still running. A failed check signals that the container should be restarted.
Response: Returns HTTP 200 when the service is live.
Example:
curl http://localhost:8000/v1/health/live
Each endpoint returns HTTP 200 when the service is ready or live, respectively, so both can be used for Kubernetes readiness and liveness probes.
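Outside Kubernetes, the same endpoints can gate a script that should not send requests until the service is up. The following is a minimal sketch, not part of the NIM itself: the helper names `is_healthy` and `wait_until_ready`, the base URL (taken from the curl examples above), and the timeout values are all assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# Assumed base URL, matching the curl examples above.
BASE_URL = "http://localhost:8000"


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not healthy.
        return False


def wait_until_ready(
    check: Callable[[], bool],
    max_wait: float = 600.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` every `interval` seconds until it succeeds or `max_wait` elapses."""
    waited = 0.0
    while True:
        if check():
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval


# Example (requires a running NIM):
#   wait_until_ready(lambda: is_healthy(f"{BASE_URL}/v1/health/ready"))
```

The check is passed in as a callable so that the same loop can poll either `/v1/health/ready` or `/v1/health/live`, and so that it can be exercised without a running service.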
Metrics#
OpenFold3 NIM exposes metrics in Prometheus format at the /v1/metrics endpoint. You can view them directly:
curl http://localhost:8000/v1/metrics
Available Metrics Summary#
GPU Metrics:
GPU utilization rate (0.0 - 1.0)
GPU memory usage and total memory (bytes)
GPU power consumption (watts) and total energy (joules)
Request Metrics:
Total request count
Request latency distribution
Process Metrics:
CPU time, memory usage (resident and virtual)
Open file descriptors
Python Runtime:
Garbage collection statistics
Python version information
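To pick individual values out of the endpoint's output, the text can be parsed line by line. The sketch below assumes the endpoint returns the Prometheus text exposition format and handles only its common cases (no escaped braces inside label values). The metric names in `SAMPLE` are the defaults of the Python `prometheus_client` library and are an assumption about what the NIM emits; the values are invented. Check the real output of `/v1/metrics` for the actual names.

```python
from typing import List, NamedTuple

# Hypothetical excerpt of /v1/metrics output; names and values are assumed.
SAMPLE = """\
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 42.0
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 120.0
python_gc_collections_total{generation="1"} 10.0
"""


class Sample(NamedTuple):
    name: str     # metric name
    labels: str   # raw label block without braces, "" if none
    value: float


def parse_metrics(text: str) -> List[Sample]:
    """Parse Prometheus text exposition format into samples (simplified)."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name, rest = line.split("{", 1)
            labels, _, tail = rest.rpartition("}")
        else:
            name, tail = line.split(None, 1)
            labels = ""
        # The first field is the value; an optional timestamp may follow.
        value = float(tail.split()[0])
        samples.append(Sample(name.strip(), labels, value))
    return samples


def by_prefix(samples: List[Sample], prefix: str) -> List[Sample]:
    """Keep only samples whose metric name starts with `prefix`."""
    return [s for s in samples if s.name.startswith(prefix)]
```

For anything beyond quick inspection, a maintained parser or the collector pipeline described in the next section is the more robust choice.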
Collecting Metrics with OpenTelemetry Collector#
OpenTelemetry is a vendor-neutral observability framework that provides a standardized way to collect, process, and export telemetry data (metrics, logs, and traces).
The OpenTelemetry Collector acts as an intermediary service that:
Receives metrics from OpenFold3 NIM’s /v1/metrics endpoint using a Prometheus receiver
Processes the data (batching, filtering, enrichment)
Exports to various backends (Prometheus, Datadog, New Relic, Zipkin, etc.)
Prerequisites#
An OpenTelemetry Collector must be running and configured to scrape metrics.
OpenTelemetry Collector Configuration#
Create a configuration file otel-collector-config.yaml:
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: nim-metrics
          scrape_interval: 10s
          metrics_path: "/v1/metrics"
          static_configs:
            - targets: ["localhost:8000"]
              labels:
                service: "openfold3-nim"
processors:
  batch:
    timeout: 10s
exporters:
  # Debug exporter - prints to console
  debug:
    verbosity: detailed
  
  # Prometheus exporter - exposes metrics on port 8889
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: nim
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [debug, prometheus]
Running the Collector#
Start the OpenTelemetry Collector with the configuration:
docker run -d --name otel-collector \
  --network host \
  -v "$(pwd)/otel-collector-config.yaml":/etc/otelcol/config.yaml \
  otel/opentelemetry-collector:latest
The collector will:
Scrape metrics from localhost:8000/v1/metrics every 10 seconds
Expose aggregated metrics at http://localhost:8889/metrics
To send the collected metrics to other monitoring backends, such as Datadog or New Relic, configure additional exporters and add them to the metrics pipeline.
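One way to confirm that the pipeline works end to end is to read the collector's Prometheus endpoint and look for series carrying the configured namespace. With `namespace: nim`, the Prometheus exporter is expected to prefix exported metric names with `nim_`. The sketch below rests on that expectation; the function name `namespaced_families` and the sample text (including its metric names and values) are hypothetical.

```python
from typing import List

# Hypothetical excerpt of http://localhost:8889/metrics; names are assumed.
SAMPLE_EXPORT = """\
# HELP nim_process_open_fds Number of open file descriptors.
# TYPE nim_process_open_fds gauge
nim_process_open_fds{service="openfold3-nim"} 42
# TYPE nim_python_info gauge
nim_python_info{service="openfold3-nim",version="3.12.0"} 1
# TYPE unrelated_metric gauge
unrelated_metric 1
"""


def namespaced_families(text: str, namespace: str = "nim") -> List[str]:
    """Return sorted metric family names from '# TYPE' lines that carry the namespace prefix."""
    prefix = namespace + "_"
    names = set()
    for line in text.splitlines():
        parts = line.split()
        # A TYPE line has the form: "# TYPE <name> <type>"
        if len(parts) >= 4 and parts[0] == "#" and parts[1] == "TYPE":
            if parts[2].startswith(prefix):
                names.add(parts[2])
    return sorted(names)


# Example (requires the collector to be running):
#   import urllib.request
#   body = urllib.request.urlopen("http://localhost:8889/metrics").read().decode()
#   print(namespaced_families(body))
```

An empty result after a few scrape intervals suggests that the collector cannot reach `localhost:8000` or that the scrape path is misconfigured.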