Overview#
The NVIDIA Cluster Agent (NVCA) connects GPU clusters to the NVCF control plane, enabling them to act as deployment targets for Cloud Functions. NVCA is a function deployment orchestrator that registers a cluster’s GPU resources, communicates with the control plane, and manages the lifecycle of function deployments on GPU nodes.
After installing NVCA on a cluster:
- The registered cluster will show as a deployment option in the `GET /v2/nvcf/clusterGroups` API response and in the Cloud Functions deployment menu.
- Any functions under the cluster's authorized NCA IDs can now deploy on the cluster.
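To confirm registration, you can query the cluster groups endpoint directly. A minimal sketch, assuming the standard NGC API host (`api.ngc.nvidia.com`) and bearer-token header; set `NGC_API_KEY` to a real NGC Personal API Key before running:

```shell
# Hypothetical key: export NGC_API_KEY before running, or edit the placeholder.
NGC_API_KEY="${NGC_API_KEY:-nvapi-REPLACE_ME}"

# List cluster groups (registered deployment targets) visible to your org.
RESP=$(curl -s -H "Authorization: Bearer ${NGC_API_KEY}" \
  "https://api.ngc.nvidia.com/v2/nvcf/clusterGroups" || echo "request failed")
echo "$RESP"
```

A newly registered cluster should appear in the returned list of cluster groups.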
Management Modes#
NVCA supports multiple management modes depending on how your NVCF environment is deployed:
- NGC-Managed — Cluster is registered and configured through the NGC UI. Configuration changes are applied via the web interface. See NGC-Managed Clusters.
- Helm-Managed — Cluster configuration is defined in Helm values and applied through `helm upgrade` commands, enabling GitOps workflows. The NGC UI becomes read-only. See Helm-Managed Clusters.
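In a Helm-managed GitOps workflow, each configuration change is a reviewable `helm upgrade`. A sketch that assembles and prints the command for review before it is applied; the release name, chart reference, and values file here are illustrative assumptions, not the chart's real identifiers:

```shell
# Assumed names: release, chart reference, and values file are placeholders;
# consult the NVCA chart's documentation for the real ones.
RELEASE="nvca-operator"
CHART="<nvca-chart-reference>"
VALUES_FILE="nvca-values.yaml"

# Build the upgrade command and print it for review (e.g., in a CI step)
# before a pipeline or operator actually runs it against the cluster.
CMD="helm upgrade --install ${RELEASE} ${CHART} --namespace nvca -f ${VALUES_FILE}"
echo "$CMD"
```

Keeping the values file in version control and applying it only via `helm upgrade` is what makes the NGC UI read-only in this mode.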
Authentication & Keys#
Different key types are used depending on your deployment mode:
| Key Type | Description | Used In |
|---|---|---|
| NGC Personal API Key | User-scoped key from ngc.nvidia.com. Used for registry authentication and API access. | NGC authentication, pulling images and charts from NGC |
| NGC Cluster Key | Cluster-scoped key generated during cluster registration via the NGC UI. Used by the NVCA Operator to authenticate with the NGC control plane. Expires after 90 days and must be rotated. | NGC-managed and Helm-managed clusters |
| NVCF API Key (NAK) | Also called Service Account Key (SAK). Used by NVCA to authenticate with a self-hosted control plane. Managed outside of NGC. | Self-hosted NVCF |
Prerequisites#
- Access to a Kubernetes cluster that includes GPU-enabled nodes (a "GPU cluster").
- The cluster must run a compatible version of Kubernetes.
- The cluster must have the NVIDIA GPU Operator installed.
  - If your cloud provider does not support the NVIDIA GPU Operator, Manual Instance Configuration is possible, but not recommended because it is difficult to maintain.
- To get the most out of clusters with multi-node NVLink (MNNVL) GPUs such as GB200, the NVIDIA GPU DRA driver must be installed. See NVLink-optimized Clusters for details.
- Registering the cluster requires `kubectl` and `helm` installed.
- The user registering the cluster must have `cluster-admin` role privileges to install the NVIDIA Cluster Agent Operator (`nvca-operator`).
- The user registering the cluster must have the Cloud Functions Admin role within their NGC organization.
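The tooling prerequisites above can be checked locally before starting registration. A minimal sketch that only probes for the two required CLIs:

```shell
# Check that the CLIs required for cluster registration are on PATH.
STATUS=""
for tool in kubectl helm; do
  if command -v "$tool" >/dev/null 2>&1; then
    STATUS="${STATUS}${tool}=found "
  else
    STATUS="${STATUS}${tool}=missing "
  fi
done
echo "$STATUS"
```

Role-based prerequisites (`cluster-admin`, Cloud Functions Admin) cannot be verified this way and must be confirmed with your cluster and NGC org administrators.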
Supported Kubernetes Versions#
- Minimum supported Kubernetes version: v1.25.0
- Maximum supported Kubernetes version: v1.32.x
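A quick way to check a cluster's server version against this range. The version string below is an example; in practice, take it from `kubectl version` output:

```shell
# Example version string; in practice read it from the cluster's API server.
SERVER_VERSION="v1.29.4"

# Parse "vMAJOR.MINOR.PATCH" and test against the supported range
# (v1.25.0 minimum, v1.32.x maximum).
ver="${SERVER_VERSION#v}"
major="${ver%%.*}"
minor_rest="${ver#*.}"
minor="${minor_rest%%.*}"

if [ "$major" -eq 1 ] && [ "$minor" -ge 25 ] && [ "$minor" -le 32 ]; then
  RESULT="supported"
else
  RESULT="unsupported"
fi
echo "$SERVER_VERSION: $RESULT"
```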
Considerations#
- The NVIDIA Cluster Agent currently supports caching only if the cluster has `StorageClass` configurations enabled. If the "Caching Support" capability is enabled, the agent makes a best effort to detect storage during deployments and falls back to non-cached workflows when none is found.
- All NVIDIA-managed clusters fully support autoscaling for all heuristics. However, clusters registered to NVCF via the agent support autoscaling only via the function queue depth heuristic.
- Each function and task requires several infrastructure containers to be deployed alongside workload containers. These infrastructure containers collectively need 6 CPU cores and 8 Gi of system memory. Each GPU node must have at least these resources available, and ideally significantly more to accommodate workload resource usage.
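When sizing GPU nodes, subtract this fixed overhead to see what remains for workload containers. A sketch with illustrative node sizes (the 32-core / 256 Gi node is an example, not a requirement):

```shell
# Illustrative node capacity; substitute your GPU node's actual resources.
NODE_CPU=32
NODE_MEM_GI=256

# Fixed per-function infrastructure overhead from the documentation.
INFRA_CPU=6
INFRA_MEM_GI=8

WORKLOAD_CPU=$((NODE_CPU - INFRA_CPU))
WORKLOAD_MEM_GI=$((NODE_MEM_GI - INFRA_MEM_GI))
echo "Available for workload containers: ${WORKLOAD_CPU} CPU cores, ${WORKLOAD_MEM_GI} Gi memory"
```

A node whose remaining capacity after this subtraction is smaller than the workload's own requests cannot host the function.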