Cloud Functions admins can install the NVIDIA Cluster Agent to enable existing GPU Clusters to act as deployment targets for NVCF functions. The NVIDIA Cluster Agent is a function deployment orchestrator that communicates with the NVCF control plane. This page describes how to do the following:
Register a cluster with NVCF using the NVIDIA Cluster Agent.
Configure the cluster by defining GPU instance types, configurations, region, and authorized NCA (NVIDIA Cloud Account) IDs.
Verify the cluster setup was successful.
After installing the NVIDIA Cluster Agent on a cluster:
The registered cluster will show as a deployment option in the GET /v2/nvcf/clusterGroups API response and in the Cloud Functions deployment menu.
Any functions under the cluster’s authorized NCA IDs can now deploy on the cluster.
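As a rough illustration of checking the endpoint named above, the following Python sketch builds (but does not send) the request. The base URL and the bearer-token header are assumptions, not details confirmed by this page; check them against the API reference for your account.

```python
import urllib.request

# Assumption: the NVCF API is served from this host. Confirm before use.
ASSUMED_BASE_URL = "https://api.ngc.nvidia.com"


def build_cluster_groups_request(
    token: str, base_url: str = ASSUMED_BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) a GET request for the cluster groups endpoint."""
    request = urllib.request.Request(
        url=base_url.rstrip("/") + "/v2/nvcf/clusterGroups",
        method="GET",
    )
    # Assumption: the API accepts a bearer token in the Authorization header.
    request.add_header("Authorization", f"Bearer {token}")
    request.add_header("Accept", "application/json")
    return request


# To send it (requires network access and a valid key):
#   with urllib.request.urlopen(build_cluster_groups_request(token)) as response:
#       print(response.read().decode("utf-8"))
```

The response schema is not described here, so the sketch stops at printing the raw body rather than guessing field names.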
Access to a Kubernetes cluster including GPU-enabled nodes (“GPU cluster”)
The cluster must have a compatible version of Kubernetes.
The cluster must have the NVIDIA GPU Operator installed.
If your cloud provider does not support the NVIDIA GPU Operator, Manual Instance Configuration is possible, but not recommended due to lack of maintainability.
Registering the cluster requires kubectl and helm installed.
The user registering the cluster must have the cluster-admin role privileges in order to install the NVIDIA Cluster Agent Operator (nvca-operator).
The user registering the cluster must have the Cloud Functions Admin role within their NGC organization.
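The tooling prerequisite can be checked before starting registration. The following is a minimal sketch that only verifies kubectl and helm are on the PATH. It does not check cluster roles, NGC roles, or the GPU Operator. The function name is illustrative.

```python
import shutil
from typing import Callable, Iterable, List, Optional

REQUIRED_TOOLS = ("kubectl", "helm")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the required command-line tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


# Example use:
#   missing = missing_tools()
#   if missing:
#       raise SystemExit("Install before registering: " + ", ".join(missing))
```

The `which` parameter exists only so the lookup can be replaced in tests; the default uses the standard library's `shutil.which`.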
Supported Kubernetes Versions
See Kubernetes Release for all supported versions of Kubernetes.
Considerations
The NVIDIA Cluster Agent currently only supports caching if the cluster is enabled with StorageClass configurations. If the “Caching Support” capability is enabled, the agent will make a best effort to detect storage during deployments, and will fall back on non-cached workflows otherwise.
All NVIDIA managed clusters fully support autoscaling for all heuristics. However, clusters registered to NVCF via the agent only support autoscaling via the function queue depth heuristic.
Reach the cluster registration page by navigating to Cloud Functions in the NGC product dropdown, and choosing “Settings” on the left hand menu. You must be a Cloud Functions Admin in order to see this page. Choose “Register Cluster” to begin the registration process.
![cluster-setup-list.png](https://docscontent.nvidia.com/dims4/default/a4defca/2147483647/strip/true/crop/2848x916+0+0/resize/1440x463!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018f-f319-d1d5-a3bf-f3b9b89f0000%2Fcloud-functions%2Fuser-guide%2Flatest%2F_images%2Fcluster-setup-list.png)
Configuration
![cluster-setup-page.png](https://docscontent.nvidia.com/dims4/default/26228a1/2147483647/strip/true/crop/3112x1696+0+0/resize/1440x785!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018f-f319-d1d5-a3bf-f3b9b89f0000%2Fcloud-functions%2Fuser-guide%2Flatest%2F_images%2Fcluster-setup-page.png)
See below for descriptions of all cluster configuration options.
Field | Description |
---|---|
Cluster Name | The name for the cluster. This field is not changeable once configured. |
Cluster Group | The name for the cluster group. This is usually identical to the cluster name, except in cases when there are multiple clusters you’d like to group. This would be done to enable a function to deploy on any of the clusters when the group is selected (for example, due to identical hardware support). |
Compute Platform | The cloud platform the cluster is deployed on. This field is a standard part of the node name label format that the cluster agent uses: |
Region | The region the cluster is deployed in. This field is required for enabling future optimization and configuration when deploying functions. |
Cluster Description | Optional description for the cluster. This provides additional context about the cluster and is returned in the cluster list under the Settings page and in the /listClusters API response. |
Other Attributes | Tag your cluster with additional properties. CacheOptimized: enables rapid instance spin-up; requires extra storage configuration and the Caching Support capability in the advanced cluster setup (see Advanced Settings). KataRunTimeIsolation: the cluster is equipped with an enhanced setup that provides stronger workload isolation using Kata Containers. |
By default, the cluster will be authorized to the NCA ID of the current NGC organization being used during cluster configuration. If you choose to share the cluster with other NGC organizations, you will need to retrieve their corresponding NCA IDs. Sharing the cluster will allow other NVCF accounts to deploy cloud functions on it, with no limitations on how many GPUs within the cluster they deploy on.
NVCF “accounts” are directly tied to, and defined by, NCA IDs (“NVIDIA Cloud Account”). Each NGC organization, with access to the Cloud Functions UI, has a corresponding NGC Organization Name and NCA ID. Please see the NGC Organization Profile Page to find these details.
Once functions from other NGC organizations have been deployed on the cluster, removing them from the authorized NCA IDs list, or removing sharing completely from the cluster, can cause disruption of service. Ideally, any functions tied to other NCA IDs should be undeployed before the NCA ID is removed from the authorized NCA IDs list.
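Before removing an NCA ID from the authorized list, it can help to list which deployed functions would be affected. The sketch below assumes you maintain your own inventory that maps function names to owning NCA IDs. That inventory, and every name in the example, is hypothetical and not something this page says NVCF provides in this shape.

```python
from typing import Dict, Iterable, List


def functions_at_risk(
    deployed: Dict[str, str],
    nca_ids_to_remove: Iterable[str],
) -> List[str]:
    """Return names of deployed functions owned by NCA IDs about to be removed.

    `deployed` maps a function name to the NCA ID that owns it (hypothetical
    inventory maintained by the cluster admin).
    """
    removing = set(nca_ids_to_remove)
    return sorted(name for name, nca_id in deployed.items() if nca_id in removing)


# Example with made-up values:
#   functions_at_risk({"fn-a": "nca-1", "fn-b": "nca-2"}, ["nca-2"])
#   lists the functions that should be undeployed first.
```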
Advanced Settings
![cluster-setup-advanced-settings.png](https://docscontent.nvidia.com/dims4/default/6da18a3/2147483647/strip/true/crop/2416x328+0+0/resize/1440x195!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018f-f319-d1d5-a3bf-f3b9b89f0000%2Fcloud-functions%2Fuser-guide%2Flatest%2F_images%2Fcluster-setup-advanced-settings.png)
See below for descriptions of all capability options in the “Advanced Settings” section of the cluster configuration. Note that for customer managed clusters (registered via the Cluster Agent) Dynamic GPU Discovery is enabled by default. For NVIDIA internal clusters, Collect Function Logs is also enabled by default.
Capability | Description |
---|---|
Dynamic GPU Discovery | Enables automatic detection and management of allocatable GPU capacity within the cluster via the NVIDIA GPU Operator. This capability is strongly recommended and should only be disabled in cases where Manual Instance Configuration is required. |
Collect Function Logs | This capability enables emission of comprehensive Cluster Agent logs, which are then forwarded to the NVIDIA internal team, aiding in diagnosing and resolving issues effectively. When enabled these will not be visible in the UI, but are always available by running commands to retrieve logs directly on the cluster. |
Caching Support | Enhances application performance by storing frequently accessed data (models, resources and containers) in a cache. See Caching Support. |
Removing the Dynamic GPU Discovery capability requires manual instance configuration. See Manual Instance Configuration.
Caching Support
Enabling caching for models, resources and containers is recommended for optimal performance. You must create StorageClass configurations for caching within your cluster to fully enable “Caching Support” with the Cluster Agent. See examples below:
StorageClass Configurations in GCP
nvcf-sc.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nvcf-sc
provisioner: pd.csi.storage.gke.io
allowVolumeExpansion: true
volumeBindingMode: Immediate
reclaimPolicy: Retain
parameters:
  type: pd-ssd
  csi.storage.k8s.io/fstype: xfs
nvcf-cc-sc.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nvcf-cc-sc
provisioner: pd.csi.storage.gke.io
allowVolumeExpansion: true
volumeBindingMode: Immediate
reclaimPolicy: Retain
parameters:
  type: pd-ssd
  csi.storage.k8s.io/fstype: xfs
GCP currently allows only 10 VMs to mount a Persistent Volume in Read-Only mode.
StorageClass Configurations in Azure
nvcf-sc.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nvcf-sc
provisioner: file.csi.azure.com
allowVolumeExpansion: true
volumeBindingMode: Immediate
reclaimPolicy: Retain
parameters:
  skuName: Standard_LRS
  csi.storage.k8s.io/fstype: xfs
nvcf-cc-sc.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nvcf-cc-sc
provisioner: file.csi.azure.com
allowVolumeExpansion: true
volumeBindingMode: Immediate
reclaimPolicy: Retain
parameters:
  skuName: Standard_LRS
  csi.storage.k8s.io/fstype: xfs
Apply the StorageClass Configurations
Save the StorageClass templates to the files nvcf-sc.yaml and nvcf-cc-sc.yaml, then apply them as follows:
kubectl create -f nvcf-sc.yaml
kubectl create -f nvcf-cc-sc.yaml
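After applying the manifests, you may want to confirm both StorageClasses exist. The sketch below takes the text printed by `kubectl get storageclass -o json` and reports which of the two expected names are absent. The helper name is illustrative, and the check covers only presence, not provisioner or parameters.

```python
import json
from typing import Iterable, List

EXPECTED_STORAGE_CLASSES = ("nvcf-sc", "nvcf-cc-sc")


def missing_storage_classes(
    kubectl_json: str,
    expected: Iterable[str] = EXPECTED_STORAGE_CLASSES,
) -> List[str]:
    """List expected StorageClass names absent from `kubectl get storageclass -o json` output."""
    items = json.loads(kubectl_json).get("items", [])
    present = {item["metadata"]["name"] for item in items}
    return [name for name in expected if name not in present]


# Example use:
#   import subprocess
#   out = subprocess.run(
#       ["kubectl", "get", "storageclass", "-o", "json"],
#       check=True, capture_output=True, text=True,
#   ).stdout
#   print(missing_storage_classes(out))
```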
Install the Cluster Agent
![cluster-setup-install-operator.png](https://docscontent.nvidia.com/dims4/default/d9da535/2147483647/strip/true/crop/2678x1084+0+0/resize/1440x583!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018f-f319-d1d5-a3bf-f3b9b89f0000%2Fcloud-functions%2Fuser-guide%2Flatest%2F_images%2Fcluster-setup-install-operator.png)
After configuring the cluster, an NGC Cluster Key will be generated for authenticating to NGC, and you will be presented with a command snippet for installing the NVIDIA Cluster Agent Operator. Please refer to this command snippet for the most up-to-date installation instructions.
The NGC Cluster Key has a default expiration of 90 days. Either on a regular cadence or when nearing expiration, you must rotate your NGC Cluster Key.
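To make the 90-day default concrete, the following sketch computes a key's expiry date and flags when rotation is due. The 14-day warning window is an arbitrary assumption chosen for illustration, not a recommendation from this page.

```python
from datetime import date, timedelta

KEY_LIFETIME_DAYS = 90  # default expiration stated above


def key_expiry(created_on: date, lifetime_days: int = KEY_LIFETIME_DAYS) -> date:
    """Date on which a key created on `created_on` expires."""
    return created_on + timedelta(days=lifetime_days)


def should_rotate(created_on: date, today: date, warn_days: int = 14) -> bool:
    """True once `today` is within `warn_days` of expiry, or past it.

    `warn_days` is a hypothetical threshold; pick one that fits your cadence.
    """
    return today >= key_expiry(created_on) - timedelta(days=warn_days)
```

For example, a key generated on 2024-01-01 expires on 2024-03-31, and with the assumed window the check starts returning `True` on 2024-03-17.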
Once the Cluster Agent Operator installation is complete, the operator will automatically install the desired NVIDIA Cluster Agent version and the Status of the cluster in the Cluster Page will become “Ready”.
Afterwards, you will be able to modify the configuration at any time. Cluster name and SSA client ID (only available for NVIDIA internal clusters) are not reconfigurable. Please refer to any additional installation instructions for reconfiguration in the UI. Once the configuration is updated, the Cluster Agent Operator, which polls for changes every 15 minutes, will apply the new configuration.
Verify Cluster Agent Installation via UI
At any time, you can view the clusters you have begun registering, or registered, along with their status, in the Settings page.
![cluster-setup-resume-registration.png](https://docscontent.nvidia.com/dims4/default/4bc8748/2147483647/strip/true/crop/2728x908+0+0/resize/1440x479!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018f-f319-d1d5-a3bf-f3b9b89f0000%2Fcloud-functions%2Fuser-guide%2Flatest%2F_images%2Fcluster-setup-resume-registration.png)
![cluster-setup-resume-registration-2.png](https://docscontent.nvidia.com/dims4/default/7af9bb6/2147483647/strip/true/crop/2696x1098+0+0/resize/1440x586!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018f-f319-d1d5-a3bf-f3b9b89f0000%2Fcloud-functions%2Fuser-guide%2Flatest%2F_images%2Fcluster-setup-resume-registration-2.png)
A status of Ready indicates the Cluster Agent has registered the cluster with NVCF successfully.
A status of Not Ready indicates the registration command has either just been applied and is in progress, or that registration is failing.
In cases when registration is failing, please use the following command for retrieving additional details:
kubectl get nvcfbackend -n nvca-operator
When a cluster is Not Ready, you can resume registration at any time to finish installation.
The “GPU Utilization” column is based on the number of GPUs occupied over the number of GPUs available within the cluster. The “Last Connected” column indicates when the last status update was received from the Cluster Agent to the NVCF control plane.
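The ratio described for the “GPU Utilization” column can be written out as a short calculation. This sketch expresses it as a percentage. Returning 0 when no GPUs are available is an assumption made here to avoid a division by zero, not documented UI behavior.

```python
def gpu_utilization_percent(occupied: int, available: int) -> float:
    """Occupied GPUs as a percentage of the GPUs available in the cluster."""
    if available <= 0:
        # Assumption: report 0% rather than fail when the cluster has no GPUs.
        return 0.0
    return 100.0 * occupied / available
```

For instance, 6 occupied GPUs out of 8 available gives 75.0.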
Verify Cluster Agent Installation via Terminal
Verify that the installation was successful with the following command. You should see a “healthy” response, as in this example:
> kubectl get nvcfbackend -n nvca-operator
NAME                    AGE     VERSION   HEALTH
nvcf-trt-mgpu-cluster   3d16h   2.30.4    healthy
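For scripted checks, the table printed by that command can be parsed. The following is a minimal sketch that assumes the whitespace-separated column layout shown in the example output, with NAME and HEALTH headers. The function names are illustrative.

```python
from typing import Dict


def backend_health(kubectl_output: str) -> Dict[str, str]:
    """Map each NVCFBackend name to the value in its HEALTH column."""
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    if "NAME" not in header or "HEALTH" not in header:
        return {}  # e.g. a "No resources found" message instead of a table
    name_idx, health_idx = header.index("NAME"), header.index("HEALTH")
    statuses = {}
    for line in lines[1:]:
        columns = line.split()
        if len(columns) > max(name_idx, health_idx):
            statuses[columns[name_idx]] = columns[health_idx]
    return statuses


def all_healthy(kubectl_output: str) -> bool:
    """True only if at least one backend is listed and every one is healthy."""
    statuses = backend_health(kubectl_output)
    return bool(statuses) and all(s == "healthy" for s in statuses.values())
```

Feeding it the captured output of `kubectl get nvcfbackend -n nvca-operator` gives a single boolean that a script can act on.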
Monitoring Data
Metrics
The Cluster Agent and Cluster Agent Operator emit Prometheus-style metrics. The following metrics and labels are available by default.
Metric Name | Metric Description |
---|---|
nvca_event_queue_length | The length of a named event queue |
nvca_event_process_latency | The amount of time for processing an event in NVCA |
Metric Label | Metric Label Description |
---|---|
nvca_event_name | The name of the event |
nvca_nca_id | The NCA ID of this NVCA instance |
nvca_cluster_name | The NVCA cluster name |
nvca_cluster_group | The NVCA cluster group |
nvca_version | The NVCA version |
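To show how these metrics and labels fit together, the sketch below extracts the samples for one metric from Prometheus text-format output. It is a simplified parser for illustration, not a full implementation of the exposition format. The label values in the usage example are made up.

```python
import re
from typing import Dict, List, Tuple

# name, optional {labels}, then the sample value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str, metric: str) -> List[Tuple[Dict[str, str], float]]:
    """Return (labels, value) pairs for every sample line of `metric`."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match and match.group("name") == metric:
            labels = dict(LABEL_RE.findall(match.group("labels") or ""))
            samples.append((labels, float(match.group("value"))))
    return samples


# Example with hypothetical label values:
#   text = 'nvca_event_queue_length{nvca_event_name="example",nvca_cluster_name="my-cluster"} 3'
#   parse_samples(text, "nvca_event_queue_length")
```

In practice a Prometheus server does this parsing for you; the sketch is mainly useful for spot-checking a `/metrics` response by hand.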
Cluster maintainers can scrape the available metrics. The following sample PodMonitor (for the NVCA Operator) and ServiceMonitor (for NVCA) are provided for reference:
Sample NVCA Operator PodMonitor
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: prometheus-agent
    app.kubernetes.io/name: metrics-nvca-operator
    jobLabel: metrics-nvca-operator
    release: prometheus-agent
    prometheus.agent/podmonitor-discover: "true"
  name: metrics-nvca-operator
  namespace: monitoring
spec:
  podMetricsEndpoints:
  - port: http
    scheme: http
    path: /metrics
  jobLabel: jobLabel
  selector:
    matchLabels:
      app.kubernetes.io/name: nvca-operator
  namespaceSelector:
    matchNames:
    - nvca-operator
Sample NVCA ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: prometheus-agent
    app.kubernetes.io/name: metrics-nvca
    jobLabel: metrics-nvca
    release: prometheus-agent
    prometheus.agent/servicemonitor-discover: "true"
  name: prometheus-agent-nvca
  namespace: monitoring
spec:
  endpoints:
  - port: nvca
  jobLabel: jobLabel
  selector:
    matchLabels:
      app.kubernetes.io/name: nvca
  namespaceSelector:
    matchNames:
    - nvca-system
Logs
Both the Cluster Agent and Cluster Agent Operator emit logs locally by default.
Local logs for the NVIDIA Cluster Agent Operator can be obtained via kubectl:
kubectl logs -l app.kubernetes.io/instance=nvca-operator -n nvca-operator --tail 20
Similarly, NVIDIA Cluster Agent logs can be obtained with the following command via kubectl:
kubectl logs -l app.kubernetes.io/instance=nvca -n nvca-system --tail 20
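If you collect these logs from a script, the two commands can be generated from one helper. This sketch only assembles the argument lists shown above. The component keys "operator" and "agent" are names invented for this example.

```python
from typing import List

# component key -> (app.kubernetes.io/instance label value, namespace),
# taken from the two kubectl commands above
COMPONENTS = {
    "operator": ("nvca-operator", "nvca-operator"),
    "agent": ("nvca", "nvca-system"),
}


def logs_command(component: str, tail: int = 20) -> List[str]:
    """Build the kubectl argument list for fetching a component's recent logs."""
    instance, namespace = COMPONENTS[component]
    return [
        "kubectl", "logs",
        "-l", f"app.kubernetes.io/instance={instance}",
        "-n", namespace,
        "--tail", str(tail),
    ]


# Example use:
#   import subprocess
#   subprocess.run(logs_command("agent", tail=100), check=True)
```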
Function-level inference container logs are currently not supported for functions deployed on non-NVIDIA managed clusters. Customers are encouraged to emit logs directly from the inference containers running on their own clusters to any third-party tool; there are no public egress limitations for containers.
Tracing
The NVIDIA Cluster Agent provides OpenTelemetry integration for exporting traces and events to compatible collectors. As of agent version 2.0, the only supported collector is Lightstep. See Advanced: NVCA Operator Configuration Options.
Cluster Key Rotation
To regenerate or rotate a cluster’s key, choose the “Regenerate Key” option from the Clusters table in the Settings page. Please refer to this command snippet for the most up-to-date upgrade instructions.
Updating your Service Key may interrupt any in-progress updates or deployments to existing functions, so it is important to pause deployments before upgrading.
![cluster-setup-regenerate-key.png](https://docscontent.nvidia.com/dims4/default/c7a9fa1/2147483647/strip/true/crop/2690x1110+0+0/resize/1440x594!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018f-f319-d1d5-a3bf-f3b9b89f0000%2Fcloud-functions%2Fuser-guide%2Flatest%2F_images%2Fcluster-setup-regenerate-key.png)
The following additional configuration options are provided for reference.
NVCA Operator Parameters
Name | Description | Value |
---|---|---|
image.repository | NVCA Operator container registry path, without tag | nvcr.io/nvidia/nvcf-byoc/nvca-operator |
image.tag | NVCA Operator container image tag. This defaults to the chart version | “” |
image.pullPolicy | K8s ImagePullPolicy | IfNotPresent |
nvcaImage.repositoryOverride | (Optional) Full NVCA container registry path, without tag. Only set this if the default needs to be overridden, for example “nvcr.io/nvidia/nvcf-byoc/nvca”. The tag is set in the cluster config | “” |
nvcaImage.pullPolicy | K8s ImagePullPolicy | IfNotPresent |
replicaCount | Replica count for the operator deployment | 1 |
systemNamespace | Namespace in which NVCFBackend objects are created. | nvca-operator |
logLevel | Logging level for the module | info |
ncaID | NVIDIA Cloud Account ID of the Primary Account | “” |
clusterID | ID of the Cluster for this NVCA instance to manage | “” |
clusterName | For metrics & telemetry | “” |
k8sVersionOverride | Override the K8s version that NVCA registers with | “” |
priorityClassName | K8s PriorityClassName for pod preference during evictions | “” |
skipFluxInit | Skip Flux install if admin already has one installed | false |
NGC Configuration
Name | Description | Value |
---|---|---|
ngcConfig.username | Username for the registry authentication | $oauthtoken |
ngcConfig.serviceKey | ServiceKey (password) for authentication | “” |
ngcConfig.apiURL | NGC API URL for requesting auth tokens | https://api.ngc.nvidia.com |
Node Selector Configuration
Name | Description | Value |
---|---|---|
nodeSelector.key | Node-selector Label key | node.kubernetes.io/instance-type |
nodeSelector.value | Node-selector Label value | “” |
OpenTelemetry Configuration
Name | Description | Value |
---|---|---|
otel.enabled | Enable OpenTelemetry. | false |
otel.lightstep.serviceName | The name of the Lightstep service to push telemetry data to | “” |
otel.lightstep.accessToken | The access token for accessing the Lightstep API | “” |
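The parameter names in these tables are dotted paths, which is the form Helm's `--set` flag accepts. As a hedged illustration, the sketch below turns a nested dictionary of overrides into `--set` arguments. The chart name and the rest of the install command are deliberately omitted, since the command snippet in the UI is the authoritative source for those.

```python
from typing import Any, Dict, List


def flatten(values: Dict[str, Any], prefix: str = "") -> Dict[str, str]:
    """Flatten nested overrides into dotted keys such as 'otel.enabled'."""
    flat: Dict[str, str] = {}
    for key, value in values.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        elif isinstance(value, bool):
            flat[path] = "true" if value else "false"  # lowercase booleans
        else:
            flat[path] = str(value)
    return flat


def helm_set_args(values: Dict[str, Any]) -> List[str]:
    """Build a list of '--set key=value' arguments from nested overrides."""
    args: List[str] = []
    for path, value in flatten(values).items():
        args += ["--set", f"{path}={value}"]
    return args


# Example: helm_set_args({"logLevel": "debug", "otel": {"enabled": True}})
```

Values containing commas or other characters that Helm treats specially would need escaping, which this sketch does not handle.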
It is highly recommended to rely on Dynamic GPU Discovery, and therefore the NVIDIA GPU Operator, as manual instance configuration is error-prone.
This type of configuration is only necessary when the cluster Cloud Provider does not support the NVIDIA GPU Operator.
In order to enable manual instance configuration, remove the “Dynamic GPU Discovery” capability.
![cluster-setup-advanced-configuration.png](https://docscontent.nvidia.com/dims4/default/1226d77/2147483647/strip/true/crop/2688x1610+0+0/resize/1440x863!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018f-f319-d1d5-a3bf-f3b9b89f0000%2Fcloud-functions%2Fuser-guide%2Flatest%2F_images%2Fcluster-setup-advanced-configuration.png)
All fields in the generated example configuration in the UI are required. Start by choosing “Apply Example” to copy over the example configuration, and then modify it to your cluster’s instance specifications.