For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
    • Overview
    • Quickstart
  • Before You Deploy
    • Infrastructure Sizing
    • Manifest
  • Deployment
    • Installation Overview
    • Image Mirroring
    • Helmfile Installation
  • GPU Cluster Setup
    • GPU Cluster Setup
    • Self-Managed Clusters
  • Configuration
    • Optional Enhancements
    • LLM Function Enablement
    • Gateway Routing
    • Third-Party Registries
    • Registry Allowlist
    • Cluster Configuration
    • KAI Scheduler
  • Using Cloud Functions
    • API
    • Service Keys
    • Function Creation
    • LLM Gateway
    • Generic HTTP Function Invocation
    • gRPC Function Invocation
    • Container Functions
    • Helm Functions
    • Streaming Functions
    • Configure Autoscaling
    • CLI
  • Function Autoscaling
    • Function Autoscaling Overview
    • Architecture
    • Operations
    • Observability
  • Observability
    • Observability
    • Example Dashboards
  • Operations
    • Control Plane Operations
    • Cluster Monitoring
    • Troubleshooting
  • Runbooks
    • Runbooks
    • Key Rotation
  • Reference
    • Cluster Reference
    • gRPC Load Testing
    • gRPC Load Test SLI Guide
    • HTTP Load Testing
    • HTTP Load Test SLI Guide
    • HTTP Soak Testing
  • Development
    • Architecture Overview
    • Local Development
    • Fake GPU Operator
    • Release Process
  • Managed (Legacy)
    • Function Lifecycle
    • Observability
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoCloud Functions
On this page
  • Health endpoints
  • Common operational issues
  • Cassandra connection failures
  • Timeseries database query failures
  • NVCF API errors
  • Discovery is stalled
  • See also
Function Autoscaling

Function Autoscaler Operations

||View as Markdown|
Previous

Architecture

Next

Observability

This page covers operating the function autoscaler after deployment, including health probes, common operational issues, and pointers to the Helm chart values. For log filter syntax, metrics, and traces, see Function Autoscaler Observability.

Health endpoints

The function autoscaler exposes three HTTP health endpoints. Their exact paths differ from the rest of the NVCF control plane: liveness and readiness are namespaced under /admin/health/.

EndpointPurposeUse as
GET /admin/health/livenessAlways returns 200. Indicates the process is alive.Kubernetes liveness probe.
GET /admin/health/readinessReturns 200 when all components are healthy, 503 otherwise.Kubernetes readiness probe.
GET /healthReturns per-component health for cassandra_client and timeseries_db_client.Operator-facing detail and dashboards.

The liveness probe deliberately does not check Cassandra or the timeseries database. Restarting the pod when those are unreachable does not help, so the function autoscaler stays running and lets readiness flip instead.

Common operational issues

Cassandra connection failures

Symptoms: readiness flips to 503, /health reports the cassandra_client component as unhealthy, log lines from rs_autoscaler::cassandra show connection errors.

Checks:

  • SSL certificates are mounted at the path expected by cassandra.ssl. The function autoscaler container expects the cert directory to exist; create /etc/app/config if it is missing.
  • Credentials in the secrets file are valid for the configured keyspace.
  • The contact points resolve from the pod’s network namespace.

Timeseries database query failures

Symptoms: nvcf_autoscaler.timeseries_db.requests_total shows a rising error count, auth_failure_total or server_side_failure_total is non-zero, log lines from rs_autoscaler::timeseries_db show 4xx or 5xx responses.

Checks:

  • timeseries_db.timeseries_db_url is reachable from the pod.
  • The bearer token in the secrets file is current. Token rotation is the most common cause of auth_failure_total spikes.
  • Query time ranges fit the retention window of the backing store.

NVCF API errors

Symptoms: nvcf_autoscaler.nvcf_api.request_duration_milliseconds shows a sustained rise in 4xx or 5xx, scaling decisions stop applying.

Checks:

  • The OAuth2 token endpoint is reachable and the client credentials in the secrets file are valid.
  • The functions being scaled are still in a deployable status. Functions in unexpected states are skipped, not retried.
  • nvcf_api.disable_auth is set as intended for the deployment. Leave it false whenever the NVCF API enforces authentication.

Discovery is stalled

Symptoms: the active function set in Cassandra stops growing despite traffic to new functions, nvcf_autoscaler.distributed_lock.acquisition_failures_total is rising across all replicas.

Checks:

  • Inspect the locks table for the discovery lock row and its TTL. If the row never expires, the previous leader may have stopped refreshing without releasing it.
  • Confirm at least one replica’s nvcf_autoscaler.distributed_lock gauge reports the leader state.
  • Restart the holding replica if the cluster is otherwise healthy. The lock expires within discovery_lock_duration_seconds.

See Architecture for the lock state machine.

See also

  • Function Autoscaler Observability for the metrics and traces referenced in the symptoms above.
  • Configure Autoscaling for setting per-function scaling bounds and policy via the NVCF API.
  • Architecture for the component layout these symptoms map to.