Observability for OpenFold3 NIM#
Use this documentation to learn about observability for OpenFold3 NIM.
About Observability#
OpenFold3 NIM exposes health check endpoints and metrics, enabling monitoring of service health, model performance, resource utilization, and request patterns.
Health Check Endpoints#
OpenFold3 NIM provides health check endpoints that can be used for Kubernetes liveness and readiness probes.
Readiness Check#
Endpoint path: /v1/health/ready
Purpose: Indicates when the container is ready to accept traffic.
Response: Returns HTTP 200 when the service is ready.
Example:
curl http://localhost:8000/v1/health/ready
Liveness Check#
Endpoint path: /v1/health/live
Purpose: Indicates whether the container is still running correctly. A failing check signals that the container should be restarted.
Response: Returns HTTP 200 when the service is live.
Example:
curl http://localhost:8000/v1/health/live
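A script that depends on the service can poll the readiness endpoint before sending requests. The following is a minimal sketch, not an official client: it assumes the service is reachable at http://localhost:8000, as in the curl examples above. The function names, attempt count, and delay are illustrative choices.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url, timeout=5.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status:
        # treat all of these as "not healthy yet".
        return False


def wait_until_ready(base_url="http://localhost:8000", attempts=60,
                     delay=5.0, check=is_healthy, sleep=time.sleep):
    """Poll /v1/health/ready until it returns HTTP 200 or attempts run out.

    `attempts` and `delay` are illustrative defaults, not values taken
    from the NIM documentation. `check` and `sleep` are injectable so the
    loop can be exercised without a running service.
    """
    ready_url = f"{base_url}/v1/health/ready"
    for attempt in range(attempts):
        if check(ready_url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_ready()` blocks until the readiness check succeeds and returns `False` if it never does. The same `is_healthy` helper can be pointed at `/v1/health/live` for a liveness check.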
Metrics#
OpenFold3 NIM exposes metrics at the /v1/metrics endpoint. You can view them directly:
curl http://localhost:8000/v1/metrics
Available Metrics Summary#
GPU Metrics:
GPU utilization rate (0.0 - 1.0)
GPU memory usage and total memory (bytes)
GPU power consumption (watts) and total energy (joules)
Request Metrics:
Total request count
Request latency distribution
Process Metrics:
CPU time, memory usage (resident and virtual)
Open file descriptors
Python Runtime:
Garbage collection statistics
Python version information
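To inspect these metrics programmatically, the response from the metrics endpoint can be parsed into name, label, and value tuples. The sketch below assumes the endpoint returns the Prometheus text exposition format, which is consistent with the Prometheus receiver used later on this page. It is a simplified parser for illustration and not a complete implementation of the format. The exact metric names depend on the deployment, so list what the service actually reports instead of relying on any name used in an example.

```python
import re
import urllib.request
from collections import defaultdict

# One sample line: metric name, optional {labels}, value, optional timestamp.
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
# One label pair inside the braces: key="value" (value may contain escapes).
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Map each metric name to a list of (labels, value) samples."""
    samples = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # Ignore anything this simplified parser cannot read.
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples[name].append((labels, float(value)))
    return dict(samples)


def fetch_metrics(url="http://localhost:8000/v1/metrics", timeout=5.0):
    """Download and parse the metrics endpoint (URL assumed from this page)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

With the service running, `sorted(fetch_metrics())` lists every metric name the endpoint currently exposes, which is a quick way to discover the GPU, request, process, and Python runtime metrics summarized above.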
Collecting Metrics with OpenTelemetry Collector#
OpenTelemetry is a vendor-neutral observability framework that provides a standardized way to collect, process, and export telemetry data (metrics, logs, and traces).
The OpenTelemetry Collector acts as an intermediary service that:
Receives metrics from OpenFold3 NIM’s /v1/metrics endpoint using a Prometheus receiver
Processes the data (batching, filtering, enrichment)
Exports to various backends (Prometheus, Datadog, New Relic, Zipkin, etc.)
Prerequisites#
An OpenTelemetry Collector must be running and configured to scrape metrics.
OpenTelemetry Collector Configuration#
Create a configuration file otel-collector-config.yaml:
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: nim-metrics
          scrape_interval: 10s
          metrics_path: "/v1/metrics"
          static_configs:
            - targets: ["localhost:8000"]
              labels:
                service: "openfold3-nim"

processors:
  batch:
    timeout: 10s

exporters:
  # Debug exporter - prints to console
  debug:
    verbosity: detailed
  # Prometheus exporter - exposes metrics on port 8889
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: nim

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [debug, prometheus]
Running the Collector#
Start the OpenTelemetry Collector with the configuration:
docker run -d --name otel-collector \
  --network host \
  -v "$(pwd)/otel-collector-config.yaml":/etc/otelcol/config.yaml \
  otel/opentelemetry-collector:latest
The collector will:
Scrape metrics from localhost:8000/v1/metrics every 10 seconds
Expose aggregated metrics at http://localhost:8889/metrics
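Once the collector is running, a short check can confirm that scraped metrics are reaching the Prometheus exporter. The sketch below assumes the configuration shown earlier. With `namespace: nim` set on the exporter, exported metric names are expected to carry a `nim_` prefix. The helper names are illustrative.

```python
import urllib.request


def metric_names(text, prefix=""):
    """Return the sorted, de-duplicated metric names that start with `prefix`."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or whitespace (value).
        name = line.split("{", 1)[0].split()[0]
        if name.startswith(prefix):
            names.add(name)
    return sorted(names)


def fetch_text(url, timeout=5.0):
    """Download the raw text body of a metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, `metric_names(fetch_text("http://localhost:8889/metrics"), prefix="nim_")` should return a non-empty list after the first scrape interval has elapsed. An empty list suggests that the collector cannot reach localhost:8000 or that the pipeline is misconfigured.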
With the metrics collected, you can forward them to monitoring backends such as Datadog or New Relic by adding the corresponding exporters to the collector configuration and listing them in the metrics pipeline.