Cluster Setup & Management


Cloud Functions admins can install the NVIDIA Cluster Agent to enable existing GPU Clusters to act as deployment targets for NVCF functions. The NVIDIA Cluster Agent is a function deployment orchestrator that communicates with the NVCF control plane. This page describes how to do the following:

  • Register a cluster with NVCF using the NVIDIA Cluster Agent.

  • Configure the cluster by defining GPU instance types, configurations, region, and authorized NCA (NVIDIA Cloud Account) IDs.

  • Verify the cluster setup was successful.

After installing the NVIDIA Cluster Agent on a cluster:

  • The registered cluster will show as a deployment option in the GET /v2/nvcf/clusterGroups API response (see the example after this list) and in the Cloud Functions deployment menu.

  • Any functions under the cluster’s authorized NCA IDs can now deploy on the cluster.
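
For example, a minimal sketch of checking that the cluster appears in the clusterGroups response. The endpoint path is documented above; the base URL and Bearer-token header format are assumptions, and NGC_API_KEY is a hypothetical environment variable holding your key.

curl -s -H "Authorization: Bearer $NGC_API_KEY" \
  "https://api.ngc.nvidia.com/v2/nvcf/clusterGroups"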

Prerequisites

  • Access to a Kubernetes cluster that includes GPU-enabled nodes (a “GPU cluster”).

  • Registering the cluster requires that kubectl and helm are installed.

  • The user registering the cluster must have cluster-admin privileges in order to install the NVIDIA Cluster Agent Operator (nvca-operator); see the verification sketch after this list.

  • The user registering the cluster must have the Cloud Functions Admin role within their NGC organization.
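
A quick sketch for confirming these prerequisites from a terminal, using standard kubectl and helm commands; the cluster-admin check uses kubectl’s built-in authorization query.

kubectl version        # client and cluster Kubernetes versions
helm version           # confirm helm is installed
kubectl auth can-i '*' '*' --all-namespaces   # "yes" indicates cluster-admin privileges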

Supported Kubernetes Versions

See Kubernetes Release for all supported versions of Kubernetes.

Considerations

  • The NVIDIA Cluster Agent currently supports caching only if the cluster has StorageClass configurations in place. If the “Caching Support” capability is enabled, the agent will make a best effort to detect storage during deployments and fall back to non-cached workflows if none is found (see the check after this list).

  • All NVIDIA-managed clusters fully support autoscaling for all heuristics. Clusters registered to NVCF via the agent, however, support autoscaling only via the function queue depth heuristic.
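
To see whether your cluster already has StorageClass configurations that the agent could detect, a standard kubectl query is sufficient:

kubectl get storageclass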

Registration

Reach the cluster registration page by navigating to Cloud Functions in the NGC product dropdown and choosing “Settings” in the left-hand menu. You must be a Cloud Functions Admin to see this page. Choose “Register Cluster” to begin the registration process.

cluster-setup-list.png

Configuration

cluster-setup-page.png

See below for descriptions of all cluster configuration options.

Cluster Name: The name for the cluster. This field cannot be changed once configured.

Cluster Group: The name for the cluster group. This is usually identical to the cluster name, except when there are multiple clusters you’d like to group. Grouping enables a function to deploy on any of the clusters when the group is selected (for example, because they offer identical hardware support).

Compute Platform: The cloud platform the cluster is deployed on. This field is a standard part of the node name label format that the cluster agent uses: .GPU.

Region: The region the cluster is deployed in. This field is required to enable future optimization and configuration when deploying functions.

Cluster Description: Optional description for the cluster. This provides additional context about the cluster and is returned in the cluster list under the Settings page and in the /listClusters API response.

Other Attributes: Tag your cluster with additional properties. CacheOptimized: enables rapid instance spin-up; requires extra storage configuration and the caching support attribute in Advanced Cluster Setup (see Advanced Settings). KataRunTimeIsolation: the cluster is equipped with enhanced setup to ensure superior workload isolation using Kata Containers.


Authorized NCA IDs

By default, the cluster is authorized to the NCA ID of the NGC organization used during cluster configuration. If you choose to share the cluster with other NGC organizations, you will need to retrieve their corresponding NCA IDs. Sharing the cluster allows other NVCF accounts to deploy cloud functions on it, with no limit on how many GPUs within the cluster they deploy on.

Note

NVCF “accounts” are directly tied to, and defined by, NCA (“NVIDIA Cloud Account”) IDs. Each NGC organization with access to the Cloud Functions UI has a corresponding NGC Organization Name and NCA ID. See the NGC Organization Profile Page to find these details.

Warning

Once functions from other NGC organizations have been deployed on the cluster, removing those organizations from the authorized NCA IDs list, or disabling sharing for the cluster entirely, can disrupt service. Ideally, any functions tied to other NCA IDs should be undeployed before the NCA ID is removed from the authorized NCA IDs list.

Advanced Settings

cluster-setup-advanced-settings.png

See below for descriptions of all capability options in the “Advanced Settings” section of the cluster configuration. Note that for customer-managed clusters (registered via the Cluster Agent), Dynamic GPU Discovery is enabled by default. For NVIDIA internal clusters, Collect Function Logs is also enabled by default.

Dynamic GPU Discovery: Enables automatic detection and management of allocatable GPU capacity within the cluster via the NVIDIA GPU Operator. This capability is strongly recommended and should be disabled only when Manual Instance Configuration is required.

Collect Function Logs: Enables emission of comprehensive Cluster Agent logs, which are forwarded to the NVIDIA internal team to aid in diagnosing and resolving issues. When enabled, these logs are not visible in the UI, but they are always available by running commands to retrieve logs directly on the cluster.

Caching Support: Enhances application performance by storing frequently accessed data (models, resources, and containers) in a cache. See Caching Support.
Note

Disabling Dynamic GPU Discovery requires manual instance configuration. See Manual Instance Configuration.

Caching Support

Enabling caching for models, resources and containers is recommended for optimal performance. You must create StorageClass configurations for caching within your cluster to fully enable “Caching Support” with the Cluster Agent. See examples below:

StorageClass Configurations in GCP

nvcf-sc.yaml


kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nvcf-sc
provisioner: pd.csi.storage.gke.io
allowVolumeExpansion: true
volumeBindingMode: Immediate
reclaimPolicy: Retain
parameters:
  type: pd-ssd
  csi.storage.k8s.io/fstype: xfs


nvcf-cc-sc.yaml


kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nvcf-cc-sc
provisioner: pd.csi.storage.gke.io
allowVolumeExpansion: true
volumeBindingMode: Immediate
reclaimPolicy: Retain
parameters:
  type: pd-ssd
  csi.storage.k8s.io/fstype: xfs


Note

GCP currently allows only 10 VMs to mount a Persistent Volume in read-only mode.

StorageClass Configurations in Azure

nvcf-sc.yaml


kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nvcf-sc
provisioner: file.csi.azure.com
allowVolumeExpansion: true
volumeBindingMode: Immediate
reclaimPolicy: Retain
parameters:
  skuName: Standard_LRS
  csi.storage.k8s.io/fstype: xfs


nvcf-cc-sc.yaml


kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nvcf-cc-sc
provisioner: file.csi.azure.com
allowVolumeExpansion: true
volumeBindingMode: Immediate
reclaimPolicy: Retain
parameters:
  skuName: Standard_LRS
  csi.storage.k8s.io/fstype: xfs


Apply the StorageClass Configurations

Save the StorageClass templates to the files nvcf-sc.yaml and nvcf-cc-sc.yaml, then apply them:


kubectl create -f nvcf-sc.yaml
kubectl create -f nvcf-cc-sc.yaml
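
You can then confirm that both StorageClasses exist:

kubectl get storageclass nvcf-sc nvcf-cc-sc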

Install the Cluster Agent

cluster-setup-install-operator.png

After configuring the cluster, an NGC Cluster Key will be generated for authenticating to NGC, and you will be presented with a command snippet for installing the NVIDIA Cluster Agent Operator. Please refer to this command snippet for the most up-to-date installation instructions.
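
The UI-generated snippet is authoritative. Purely as an illustration of the shape such an installation can take, here is a hedged sketch: the chart reference is a placeholder, and the --set keys are taken from the NVCA Operator Parameters table later on this page.

helm install nvca-operator <chart-reference> \
  --namespace nvca-operator --create-namespace \
  --set ncaID=<your-nca-id> \
  --set ngcConfig.serviceKey=<your-ngc-cluster-key>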

Note

The NGC Cluster Key has a default expiration of 90 days. Either on a regular cadence or when nearing expiration, you must rotate your NGC Cluster Key.

Once the Cluster Agent Operator installation is complete, the operator will automatically install the desired NVIDIA Cluster Agent version, and the status of the cluster in the Cluster Page will become “Ready”.

Afterwards, you can modify the configuration at any time; only the cluster name and the SSA client ID (available only for NVIDIA internal clusters) are not reconfigurable. Follow any additional reconfiguration instructions shown in the UI. Once the configuration is updated, the Cluster Agent Operator, which polls for changes every 15 minutes, will apply the new configuration.

Verify Cluster Agent Installation via UI

At any time, you can view the clusters you have begun registering, or registered, along with their status, in the Settings page.

cluster-setup-resume-registration.png

cluster-setup-resume-registration-2.png

  • A status of Ready indicates the Cluster Agent has registered the cluster with NVCF successfully.

  • A status of Not Ready indicates the registration command has either just been applied and is in progress, or that registration is failing.

If registration is failing, use the following command to retrieve additional details:


kubectl get nvcfbackend -n nvca-operator
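
For fuller detail (events and status conditions), the standard kubectl describe verb can also help:

kubectl describe nvcfbackend -n nvca-operator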

When a cluster is Not Ready, you can resume registration at any time to finish installation.

The “GPU Utilization” column is the ratio of GPUs occupied to GPUs available within the cluster. The “Last Connected” column indicates when the NVCF control plane last received a status update from the Cluster Agent.

Verify Cluster Agent Installation via Terminal

Verify that the installation was successful with the following command; you should see a “healthy” response, as in this example:


> kubectl get nvcfbackend -n nvca-operator
NAME                    AGE     VERSION   HEALTH
nvcf-trt-mgpu-cluster   3d16h   2.30.4    healthy

Monitoring Data

Metrics

The cluster agent and operator emit Prometheus-style metrics. The following metrics and labels are available by default.

nvca_event_queue_length: The length of a named event queue.

nvca_event_process_latency: The amount of time taken to process an event in NVCA.

nvca_event_name: The name of the event.

nvca_nca_id: The NCA ID of this NVCA instance.

nvca_cluster_name: The NVCA cluster name.

nvca_cluster_group: The NVCA cluster group.

nvca_version: The NVCA version.
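
As a quick local check that these metrics are being emitted, you can port-forward to the agent’s metrics service and grep for them. The service name, port name, and namespace below are assumptions taken from the ServiceMonitor example that follows; the local port 9464 is arbitrary.

kubectl -n nvca-system port-forward svc/nvca 9464:nvca &
curl -s localhost:9464/metrics | grep nvca_event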

Cluster maintainers can scrape the available metrics; the following PodMonitor (for the NVCA Operator) and ServiceMonitor (for NVCA) examples are provided for reference:

Sample NVCA Operator PodMonitor


apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: prometheus-agent
    app.kubernetes.io/name: metrics-nvca-operator
    jobLabel: metrics-nvca-operator
    release: prometheus-agent
    prometheus.agent/podmonitor-discover: "true"
  name: metrics-nvca-operator
  namespace: monitoring
spec:
  podMetricsEndpoints:
    - port: http
      scheme: http
      path: /metrics
  jobLabel: jobLabel
  selector:
    matchLabels:
      app.kubernetes.io/name: nvca-operator
  namespaceSelector:
    matchNames:
      - nvca-operator

Sample NVCA ServiceMonitor


apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: prometheus-agent
    app.kubernetes.io/name: metrics-nvca
    jobLabel: metrics-nvca
    release: prometheus-agent
    prometheus.agent/servicemonitor-discover: "true"
  name: prometheus-agent-nvca
  namespace: monitoring
spec:
  endpoints:
    - port: nvca
  jobLabel: jobLabel
  selector:
    matchLabels:
      app.kubernetes.io/name: nvca
  namespaceSelector:
    matchNames:
      - nvca-system
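
Assuming you save these manifests to files (the filenames here are hypothetical), apply them with kubectl:

kubectl apply -f nvca-operator-podmonitor.yaml
kubectl apply -f nvca-servicemonitor.yaml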

Logs

Both the Cluster Agent and Cluster Agent Operator emit logs locally by default.

Local logs for the NVIDIA Cluster Agent Operator can be obtained via kubectl:


kubectl logs -l app.kubernetes.io/instance=nvca-operator -n nvca-operator --tail 20

Similarly, NVIDIA Cluster Agent logs can be obtained with the following command via kubectl:


kubectl logs -l app.kubernetes.io/instance=nvca -n nvca-system --tail 20

Warning

Function-level inference container logs are currently not supported for functions deployed on non-NVIDIA-managed clusters. Customers are encouraged to emit logs directly from their inference containers to any third-party tool; there are no public egress limitations for containers.

Tracing

The NVIDIA Cluster Agent provides OpenTelemetry integration for exporting traces and events to compatible collectors. As of agent version 2.0, the only supported collector is Lightstep. See Advanced: NVCA Operator Configuration Options.

Cluster Key Rotation

To regenerate or rotate a cluster’s key, choose the “Regenerate Key” option from the Clusters table in the Settings page, then refer to the generated command snippet for the most up-to-date instructions.

Warning

Updating your Service Key may interrupt any in-progress updates or deployments of existing functions, so it is important to pause deployments before rotating the key.

cluster-setup-regenerate-key.png

Advanced: NVCA Operator Configuration Options

The following are additional configuration options, for reference purposes.

NVCA Operator Parameters

image.repository: NVCA Operator container registry path, without tag. Default: nvcr.io/nvidia/nvcf-byoc/nvca-operator

image.tag: NVCA Operator container image tag; defaults to the chart version. Default: ""

image.pullPolicy: K8s ImagePullPolicy. Default: IfNotPresent

nvcaImage.repositoryOverride: (Optional) Full NVCA container registry path, without tag. Only set this if the default needs to be overridden, for example “nvcr.io/nvidia/nvcf-byoc/nvca”. The tag is set in the cluster config. Default: ""

nvcaImage.pullPolicy: K8s ImagePullPolicy. Default: IfNotPresent

replicaCount: Replica count for the operator deployment. Default: 1

systemNamespace: Namespace in which NVCFBackend objects are created. Default: nvca-operator

logLevel: Logging level for the module. Default: info

ncaID: NVIDIA Cloud Account ID of the primary account. Default: ""

clusterID: ID of the cluster for this NVCA instance to manage. Default: ""

clusterName: Cluster name, used for metrics and telemetry. Default: ""

k8sVersionOverride: Override the K8s version that NVCA registers with. Default: ""

priorityClassName: K8s PriorityClassName for pod preference during evictions. Default: ""

skipFluxInit: Skip the Flux install if the admin already has one installed. Default: false
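
As a reference sketch only, a Helm values override using a few of the documented parameters might look like the following; all values shown are placeholders.

# values-override.yaml (hypothetical filename)
replicaCount: 1
logLevel: debug                               # assumed verbosity level; the documented default is "info"
ncaID: "<your-nca-id>"
clusterName: "my-gpu-cluster"                 # placeholder; used for metrics and telemetry
priorityClassName: "system-cluster-critical"  # example built-in K8s priority class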

NGC Configuration

ngcConfig.username: Username for registry authentication. Default: $oauthtoken

ngcConfig.serviceKey: ServiceKey (password) for authentication. Default: ""

ngcConfig.apiURL: NGC API URL for requesting auth tokens. Default: https://api.ngc.nvidia.com

Node Selector Configuration

nodeSelector.key: Node-selector label key. Default: node.kubernetes.io/instance-type

nodeSelector.value: Node-selector label value. Default: ""

OpenTelemetry Configuration

otel.enabled: Enable OpenTelemetry. Default: false

otel.lightstep.serviceName: The name of the Lightstep service to push telemetry data to. Default: ""

otel.lightstep.accessToken: The access token for accessing the Lightstep API. Default: ""
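
Enabling tracing thus amounts to setting the documented otel values. Below is a sketch in Helm values form; the service name is a placeholder, and as noted under Tracing, Lightstep is the only supported collector as of agent version 2.0.

otel:
  enabled: true
  lightstep:
    serviceName: "my-nvcf-cluster"              # placeholder service name
    accessToken: "<your-lightstep-access-token>"
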
Manual Instance Configuration

Warning

It is highly recommended to rely on Dynamic GPU Discovery, and therefore the NVIDIA GPU Operator, as manual instance configuration is error-prone.

This type of configuration is only necessary when the cluster’s cloud provider does not support the NVIDIA GPU Operator.

To enable manual instance configuration, remove the “Dynamic GPU Discovery” capability.

cluster-setup-advanced-configuration.png

All fields in the generated example configuration in the UI are required. Start by choosing “Apply Example” to copy over the example configuration, then modify it to match your cluster’s instance specifications.

© Copyright 2024, NVIDIA Corporation. Last updated on Jun 7, 2024.