Observability for OpenFold3 NIM#
Use this documentation to learn about observability for OpenFold3 NIM.
About Observability#
OpenFold3 NIM exposes health check endpoints and metrics, enabling monitoring of service health, model performance, resource utilization, and request patterns.
Health Check Endpoints#
OpenFold3 NIM provides health check endpoints that can be used for Kubernetes liveness and readiness probes.
Readiness Check#
Endpoint path: /v1/health/ready
Purpose: Indicates when the container is ready to accept traffic.
Response: Returns HTTP 200 when the service is ready.
Example:
curl http://localhost:8000/v1/health/ready
Liveness Check#
Endpoint path: /v1/health/live
Purpose: Indicates whether the container is still running. A failing check signals that the container should be restarted.
Response: Returns HTTP 200 when the service is live.
Example:
curl http://localhost:8000/v1/health/live
Both endpoints return HTTP 200 on success, so they can back Kubernetes readiness and liveness probes directly.
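Outside Kubernetes, a deployment script can apply the same readiness check before sending traffic. The following is a minimal sketch, not part of the NIM itself: the helper names `http_status` and `wait_until_ready`, the retry count, and the delay are illustrative assumptions. Only the endpoint path and the HTTP 200 contract come from the text above.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code of a GET request, or None if it fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a non-2xx status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure, ...


def wait_until_ready(url, attempts=30, delay=2.0,
                     get_status=http_status, sleep=time.sleep):
    """Poll `url` until it returns HTTP 200.

    Returns True as soon as a 200 is seen, or False after `attempts` tries.
    `get_status` and `sleep` are injectable so the loop can be tested
    without a running service.
    """
    for attempt in range(attempts):
        if get_status(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)  # back off before the next probe
    return False
```

Calling `wait_until_ready("http://localhost:8000/v1/health/ready")` blocks until the service reports ready or the attempts run out. The same function works for `/v1/health/live`.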
Metrics#
OpenFold3 NIM exposes metrics at the /v1/metrics endpoint. You can view them directly:
curl http://localhost:8000/v1/metrics
Available Metrics Summary#
GPU Metrics:
GPU utilization rate (0.0 - 1.0)
GPU memory usage and total memory (bytes)
GPU power consumption (watts) and total energy (joules)
Request Metrics:
Total request count
Request latency distribution
Process Metrics:
CPU time, memory usage (resident and virtual)
Open file descriptors
Python Runtime:
Garbage collection statistics
Python version information
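To inspect these metrics programmatically, the response can be parsed line by line. The sketch below assumes the endpoint returns the Prometheus text exposition format, which the Prometheus receiver used later on this page suggests. The metric names in `SAMPLE` are made up for illustration and are not actual OpenFold3 NIM metric names.

```python
# Hypothetical sample in Prometheus text exposition format.
# These names are illustrative only, not real NIM metrics.
SAMPLE = """\
# HELP example_requests_total Hypothetical counter for illustration.
# TYPE example_requests_total counter
example_requests_total{route="/demo"} 3
example_open_fds 12
"""


def parse_metrics(text):
    """Parse exposition text into a list of (name, raw_labels, value) tuples.

    Labels are kept as the raw string between the braces; an optional
    trailing timestamp is ignored.
    """
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name, rest = line.split("{", 1)
            labels, tail = rest.rsplit("}", 1)
        else:
            name, _, tail = line.partition(" ")
            labels = ""
        value = float(tail.split()[0])  # also accepts NaN, +Inf, -Inf
        samples.append((name.strip(), labels, value))
    return samples
```

In practice, the text passed to `parse_metrics` would be the body returned by `http://localhost:8000/v1/metrics`.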
Collecting Metrics with OpenTelemetry Collector#
OpenTelemetry is a vendor-neutral observability framework that provides a standardized way to collect, process, and export telemetry data (metrics, logs, and traces).
The OpenTelemetry Collector acts as an intermediary service that:
Receives metrics from OpenFold3 NIM’s /v1/metrics endpoint using a Prometheus receiver
Processes the data (batching, filtering, enrichment)
Exports to various backends (Prometheus, Datadog, New Relic, Zipkin, etc.)
Prerequisites#
An OpenTelemetry Collector must be running and configured to scrape metrics.
OpenTelemetry Collector Configuration#
Create a configuration file otel-collector-config.yaml:
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: nim-metrics
          scrape_interval: 10s
          metrics_path: "/v1/metrics"
          static_configs:
            - targets: ["localhost:8000"]
              labels:
                service: "openfold3-nim"
processors:
  batch:
    timeout: 10s
exporters:
  # Debug exporter - prints to console
  debug:
    verbosity: detailed
  # Prometheus exporter - exposes metrics on port 8889
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: nim
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [debug, prometheus]
Checking for Port Conflicts#
Before running the OpenTelemetry Collector, check whether port 8888 is already in use; the collector serves its own internal telemetry on this port by default:
# On Linux/macOS
lsof -i :8888
# Or using netstat
netstat -tuln | grep 8888
If port 8888 is occupied, you’ll need to use the alternative configuration below that uses different ports.
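Where `lsof` or `netstat` is not available, the same check can be approximated by attempting a TCP connection. This is a minimal sketch; the function name `port_in_use` and its defaults are illustrative assumptions. It only detects a listener reachable on the given host, so a service bound to a different interface could go unnoticed.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error number otherwise.
        return sock.connect_ex((host, port)) == 0
```

For example, `port_in_use(8888)` returning `True` suggests switching to the alternative configuration.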
Alternative Configuration for Port Conflicts#
If port 8888 is already in use, use this configuration (otel-collector-config.yaml), which binds to ports 8887 and 8889 instead:
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "nim-app"
          metrics_path: "/v1/metrics"
          static_configs:
            - targets: ["localhost:8000"]
processors:
  batch: {}
exporters:
  debug:
    verbosity: detailed
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: nim
service:
  telemetry:
    metrics:
      readers:
        - pull:
            exporter:
              prometheus:
                host: "0.0.0.0"
                port: 8887
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [debug, prometheus]
This configuration:
Uses port 8887 for the collector’s internal telemetry metrics
Uses port 8889 for the Prometheus exporter endpoint
Avoids conflicts with services running on port 8888
Running the Collector#
Start the OpenTelemetry Collector with the configuration:
docker run -d --name otel-collector \
  --network host \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otelcol/config.yaml" \
  otel/opentelemetry-collector:latest
The collector will:
Scrape metrics from localhost:8000/v1/metrics every 10 seconds
Expose aggregated metrics at http://localhost:8889/metrics
Expose the collector’s own telemetry at http://localhost:8887/metrics (if using the alternative configuration)
With the collected data, you can export to monitoring backends like Datadog, New Relic, or Zipkin by configuring additional exporters.
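Before adding further exporters, it can help to confirm that scraped metrics actually reach the collector's Prometheus endpoint. The sketch below assumes that the exporter's `namespace: nim` setting prefixes exported metric names with `nim_`. The helper names `fetch_text` and `names_with_prefix` are illustrative.

```python
import urllib.request


def fetch_text(url, timeout=5.0):
    """Download a metrics page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def names_with_prefix(metrics_text, prefix):
    """Return the sorted, de-duplicated metric names starting with `prefix`."""
    names = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return sorted(names)
```

With the collector running, `names_with_prefix(fetch_text("http://localhost:8889/metrics"), "nim_")` should return a non-empty list once at least one scrape has completed, if the prefix assumption holds.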