Cluster Setup & Management#

Cloud Functions admins can install the NVIDIA Cluster Agent to enable existing GPU Clusters to act as deployment targets for NVCF functions. The NVIDIA Cluster Agent is a function deployment orchestrator that communicates with the NVCF control plane. This page describes how to do the following:

  • Register a cluster with NVCF using the NVIDIA Cluster Agent.

  • Configure the cluster by defining GPU instance types, configurations, regions, and authorized NCA (NVIDIA Cloud Account) IDs.

  • Verify the cluster setup was successful.

After installing the NVIDIA Cluster Agent on a cluster:

  • The registered cluster will appear as a deployment option in the GET /v2/nvcf/clusterGroups API response and in the Cloud Functions deployment menu (a sample API check follows this list).

  • Any functions under the cluster’s authorized NCA IDs can now deploy on the cluster.
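As a minimal check, the following call lists the cluster groups visible to your account; it assumes the REST endpoint is served at api.nvcf.nvidia.com and that a bearer token exported as NVCF_API_KEY is accepted (adjust the host and authentication to your environment). The registered cluster group should appear in the response once registration completes.

curl -s "https://api.nvcf.nvidia.com/v2/nvcf/clusterGroups" \
    -H "Authorization: Bearer $NVCF_API_KEY" | jq .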

Prerequisites#

  • Access to a Kubernetes cluster including GPU-enabled nodes (“GPU cluster”)

  • Registering the cluster requires kubectl and helm to be installed.

  • The user registering the cluster must have the cluster-admin role privileges to install the NVIDIA Cluster Agent Operator (nvca-operator).

  • The user registering the cluster must have the Cloud Functions Admin role within their NGC organization.

Supported Kubernetes Versions#

  • Minimum Kubernetes Supported Version: v1.25.0

  • Maximum Kubernetes Supported Version: v1.31.x (see the version check after this list)
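A quick way to confirm that your cluster's server version falls in this range (on recent kubectl versions, with jq installed):

kubectl version -o json | jq -r '.serverVersion.gitVersion'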

Considerations#

  • The NVIDIA Cluster Agent currently supports caching only if the cluster has the required StorageClass configurations. If the “Caching Support” capability is enabled, the agent makes a best effort to detect suitable storage during deployments and falls back to non-cached workflows otherwise.

  • All NVIDIA-managed clusters fully support autoscaling for all heuristics. Clusters registered with NVCF via the agent, however, support autoscaling only via the function queue depth heuristic.

Register the Cluster#

Reach the cluster registration page by navigating to Cloud Functions in the NGC product dropdown, and choosing “Settings” on the left-hand menu. You must be a Cloud Functions Admin to see this page. Choose “Register Cluster” to begin the registration process.

../_images/cluster-setup-list.png

Configuration#

../_images/cluster-setup-page.png

See below for descriptions of all cluster configuration options.

Field

Description

Cluster Name

The name for the cluster. This field is not changeable once configured.

Cluster Group

The name of the cluster group. This is usually identical to the cluster name, except in cases when there are multiple clusters you’d like to group. This would be done to enable a function to deploy on any of the clusters when the group is selected (for example, due to identical hardware support).

Compute Platform

The cloud platform the cluster is deployed on. This field is a standard part of the node name label format that the cluster agent uses: <Platform>.GPU.<GPUName>

Region

The region the cluster is deployed in. This field is required for enabling future optimization and configuration when deploying functions.

Cluster Description

Optional description for the cluster. It provides additional context about the cluster and is returned in the cluster list under the Settings page and in the /listClusters API response.

Other Attributes

Tag your cluster with additional properties:

CacheOptimized: Enables rapid instance spin-up. Requires extra storage configuration and the Caching Support capability in the advanced cluster setup (see Advanced Settings).

KataRunTimeIsolation: The cluster is equipped with an enhanced setup to ensure superior workload isolation using Kata Containers.

By default, the cluster will be authorized to the NCA ID of the NGC organization used during cluster configuration. If you choose to share the cluster with other NGC organizations, you will need to retrieve their corresponding NCA IDs. Sharing the cluster allows other NVCF accounts to deploy cloud functions on it, with no limit on how many of the cluster’s GPUs they can use.

Note

NVCF “accounts” are directly tied to, and defined by, NCA IDs (“NVIDIA Cloud Account”). Each NGC organization with access to the Cloud Functions UI has a corresponding NGC Organization Name and NCA ID. See the NGC Organization Profile Page to find these details.

Warning

Once functions from other NGC organizations have been deployed on the cluster, removing those organizations’ NCA IDs from the authorized NCA IDs list, or disabling sharing for the cluster entirely, can cause disruption of service. Ideally, any functions tied to other NCA IDs should be undeployed before those NCA IDs are removed from the authorized NCA IDs list.

Advanced Settings#

../_images/cluster-setup-advanced-settings.png

See below for descriptions of all capability options in the “Advanced Settings” section of the cluster configuration. Note that for customer-managed clusters (registered via the Cluster Agent), Dynamic GPU Discovery is enabled by default. For NVIDIA-managed clusters, Collect Function Logs is also enabled by default.

Capability

Description

Dynamic GPU Discovery

Enables automatic detection and management of allocatable GPU capacity within the cluster via the NVIDIA GPU Operator. This capability is strongly recommended and would only be disabled in cases where Manual Instance Configuration is required.

Collect Function Logs

This capability enables the emission of comprehensive Cluster Agent logs, which are then forwarded to the NVIDIA internal team, aiding in diagnosing and resolving issues effectively. When enabled, these logs are not visible in the UI, but they can always be retrieved by running commands directly on the cluster.

Caching Support

Enhances application performance by storing frequently accessed data (models, resources and containers) in a cache. See Caching Support.

Note

Disabling Dynamic GPU Discovery requires manual instance configuration. See Manual Instance Configuration.

Caching Support#

Enabling caching for models, resources and containers is recommended for optimal performance. You must create StorageClass configurations for caching within your cluster to fully enable “Caching Support” with the Cluster Agent. See examples below:

StorageClass Configurations in GCP

nvcf-sc.yaml#
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
    name: nvcf-sc
provisioner: pd.csi.storage.gke.io
allowVolumeExpansion: true
volumeBindingMode: Immediate
reclaimPolicy: Retain
parameters:
    type: pd-ssd
    csi.storage.k8s.io/fstype: xfs
nvcf-cc-sc.yaml#
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
    name: nvcf-cc-sc
provisioner: pd.csi.storage.gke.io
allowVolumeExpansion: true
volumeBindingMode: Immediate
reclaimPolicy: Retain
parameters:
    type: pd-ssd
    csi.storage.k8s.io/fstype: xfs

Note

GCP currently allows only 10 VMs to mount a Persistent Volume in read-only mode.

StorageClass Configurations in Azure

nvcf-sc.yaml#
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
    name: nvcf-sc
provisioner: file.csi.azure.com
allowVolumeExpansion: true
volumeBindingMode: Immediate
reclaimPolicy: Retain
parameters:
    skuName: Standard_LRS
    csi.storage.k8s.io/fstype: xfs
nvcf-cc-sc.yaml#
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
    name: nvcf-cc-sc
provisioner: file.csi.azure.com
allowVolumeExpansion: true
volumeBindingMode: Immediate
reclaimPolicy: Retain
parameters:
    skuName: Standard_LRS
    csi.storage.k8s.io/fstype: xfs

StorageClass Configurations in Oracle Cloud

nvcf-sc.yaml#
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nvcf-sc
provisioner: blockvolume.csi.oraclecloud.com
allowVolumeExpansion: true
volumeBindingMode: Immediate
reclaimPolicy: Retain
parameters:
  csi.storage.k8s.io/fstype: xfs
nvcf-cc-sc.yaml#
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nvcf-cc-sc
provisioner: blockvolume.csi.oraclecloud.com
allowVolumeExpansion: true
volumeBindingMode: Immediate
reclaimPolicy: Retain
parameters:
  csi.storage.k8s.io/fstype: xfs

Apply the StorageClass Configurations

Save the StorageClass templates to the files nvcf-sc.yaml and nvcf-cc-sc.yaml, then apply them:

kubectl create -f nvcf-sc.yaml
kubectl create -f nvcf-cc-sc.yaml
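You can confirm that both StorageClasses were created before proceeding:

kubectl get storageclass nvcf-sc nvcf-cc-sc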

Install the Cluster Agent#

../_images/cluster-setup-install-operator.png

After configuring the cluster, an NGC Cluster Key will be generated for authenticating to NGC, and you will be presented with a command snippet for installing the NVIDIA Cluster Agent Operator. Please refer to this command snippet for the most up-to-date installation instructions.

Note

The NGC Cluster Key has a default expiration of 90 days. You must rotate your NGC Cluster Key on a regular cadence or before it nears expiration.

Once the Cluster Agent Operator installation is complete, the operator will automatically install the desired NVIDIA Cluster Agent version and the Status of the cluster in the Cluster Page will become “Ready”.

Afterward, you will be able to modify the configuration at any time; only the cluster name and SSA client ID (the latter available only for NVIDIA managed clusters) are not reconfigurable. Follow any additional instructions shown in the UI when reconfiguring. Once the configuration is updated, the Cluster Agent Operator, which polls for changes every 15 minutes, will apply the new configuration.

View & Validate Cluster Setup#

Verify Cluster Agent Installation via UI#

At any time, you can view the clusters you have begun registering, or registered, along with their status, on the Settings page.

../_images/cluster-setup-resume-registration.png ../_images/cluster-setup-resume-registration-2.png
  • A status of Ready indicates the Cluster Agent has registered the cluster with NVCF successfully.

  • A status of Not Ready indicates the registration command has either just been applied and is in progress, or that registration is failing.

Note

While your cluster status may show as Ready, there may be a delay of up to 5 minutes before the “Cluster Agent Version” column updates from “Update in Progress”. You can verify the actual cluster health and version immediately using the terminal commands described in the next section.

In cases when registration is failing, please use the following command to retrieve additional details:

kubectl get nvcfbackend -n nvca-operator

When a cluster is Not Ready, you can resume registration at any time to finish the installation.

The “GPU Utilization” column shows the number of GPUs occupied relative to the number of GPUs available within the cluster. The “Last Connected” column indicates when the last status update was received by the NVCF control plane from the Cluster Agent.
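As a rough cluster-side cross-check of these numbers (assuming jq is installed; the UI value may differ slightly depending on how NVCF counts pending deployments):

# Total allocatable GPUs across all nodes
kubectl get nodes -o json | jq -r '.items[].status.allocatable."nvidia.com/gpu"' | grep -v null | awk '{s+=$1} END {print s}'
# GPUs currently requested by running pods
kubectl get pods -A -o json | jq '[.items[] | select(.status.phase=="Running") | .spec.containers[].resources.requests."nvidia.com/gpu" // "0" | tonumber] | add'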

Verify Cluster Agent Installation via Terminal#

Verify that the installation was successful with the following command; you should see a “healthy” response, as in this example:

> kubectl get nvcfbackend -n nvca-operator
NAME                    AGE     VERSION   HEALTH
nvcf-trt-mgpu-cluster   3d16h   2.30.4    healthy

Upgrade the Cluster Agent#

Available NVCA upgrades are indicated in the UI by status icons next to the cluster’s NVCA version.

../_images/cluster-upgrade-indicators.png

Note

When triggering the update through the UI, it may take up to 1 minute for the status icon to change to the “updating” status. This delay is expected.

Operator Upgrade Required#

When an Operator upgrade is required, a helm install instruction will be generated in the UI. You must run this first in order to enable the NVCA upgrade.

../_images/view-upgrade-instructions.png

Example:

helm upgrade nvca-operator -n nvca-operator --create-namespace -i --reuse-values --wait \
    "https://helm.stg.ngc.nvidia.com/nvidia/nvcf-byoc/charts/nvca-operator-1.13.2.tgz" \
    --username='$oauthtoken' \
    --password=$(helm get values -n nvca-operator nvca-operator -o json | jq -r '.ngcConfig.serviceKey')

You can verify that the upgrade is successful by running the following command and noting the “Image” field:

kubectl get pods -n nvca-operator -o yaml | grep -i image

After the NVCA operator upgrade succeeds:

  1. Press “Update” in the UI to proceed with the NVCA upgrade

  2. The NVCA operator will periodically check for the new version of NVCA and apply it when available

  3. This may take up to 10 minutes to fully complete (you can watch progress as shown below)
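To follow the rollout from the cluster side, you can watch the nvcfbackend resource until the VERSION column shows the new NVCA version (press Ctrl+C to stop watching):

kubectl get nvcfbackend -n nvca-operator -w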

NVCA-Only Upgrade#

When an upgrade to the NVCA Operator is not required:

  1. Simply trigger the update of NVCA through the UI

  2. The operator will check for the new desired version of NVCA and apply it

Verify that the operator has rolled out the upgrade successfully by running the command below and checking the spec version and status version to confirm the version of the CRD:

kubectl get nvcfbackends -n nvca-operator -o yaml

Verify that NVCA is healthy and that its version matches the desired version:

kubectl get nvcfbackend -n nvca-operator

Deregister the Cluster#

Removing a configured cluster as a deployment target for NVCF functions is a two-step process: delete the cluster from NGC, then execute a series of commands on the cluster to remove the installed Cluster Agent and the NVCA operator.

Prerequisites#

First, ensure all functions have been undeployed from the cluster. This can be done via the UI, API or CLI.

To force-delete all function pods in a namespace, use the following command. This is not a graceful operation.

kubectl delete pods -l FUNCTION_ID -n nvcf-backend

Verify all pods have been terminated with the following command:

kubectl get pods -n nvcf-backend

This should return no pods if deletion was successful. Additionally, check the logs of the Cluster Agent pod in the nvca-system namespace to ensure there are no more “Pod is being terminated” messages:

kubectl logs -n nvca-system $(kubectl get pods -n nvca-system | grep nvca | awk '{print $1}')

If pods are still hanging during termination, you can force the deletion of the namespace with the following command. This command removes the finalizers from the namespace metadata, which are hooks that prevent namespace deletion until certain cleanup tasks are complete. By removing these finalizers, you bypass the normal cleanup process and force the namespace to be deleted immediately:

kubectl get namespace nvcf-backend -o json | jq '.spec.finalizers=[]' | kubectl replace --raw /api/v1/namespaces/nvcf-backend/finalize -f -

Delete the Cluster via UI#

Next, reach the cluster registration page by navigating to Cloud Functions in the NGC product dropdown, and choosing “Settings” on the left-hand menu. You must be a Cloud Functions Admin to see this page.

Select “Delete Cluster” from the “Actions” dropdown for the cluster that you want to remove.

../_images/cluster-delete-select.png

Following this, a dialog box will appear asking you to confirm the deletion of the cluster. Click “Delete” to remove the cluster as a deployment target.

../_images/cluster-delete-confirm.png

Delete the Cluster Agent and the NVCA Operator#

Finally, on the registered cluster, execute the following commands to complete the deregistration process.

kubectl delete nvcfbackends -A --all
kubectl delete ns nvca-system
helm uninstall -n nvca-operator nvca-operator
kubectl delete ns nvca-operator

Note

If the deletion of the Kubernetes CRD nvcfbackend blocks, the finalizer (nvca-operator.finalizers.nvidia.io) needs to be deleted manually, for example using the following command:

kubectl get nvcfbackend -n nvca-operator -o json | jq '.items[].metadata.finalizers = []' | kubectl replace -f -

Cluster Agent Monitoring and Reliability#

Monitoring Data#

Metrics#

Prerequisites

To use the PodMonitor and ServiceMonitor examples below, you must first install the Prometheus Operator. Follow the Prometheus Operator installation guide to set this up in your cluster.

The cluster agent and operator emit Prometheus-style metrics. The following metric labels are available by default. The full list of available metrics is updated regularly and is therefore not listed here.

Metric Label

Metric Label Description

nvca_event_name

The name of the event

nvca_nca_id

The NCA ID of this NVCA instance

nvca_cluster_name

The NVCA cluster name

nvca_cluster_group

The NVCA cluster group

nvca_version

The NVCA version

Cluster maintainers can scrape the available metrics. See a full example of how to do this with an OpenTelemetry Collector in the cluster here.

Use the following examples of a PodMonitor for NVCA Operator and ServiceMonitor for NVCA for reference:

Sample NVCA Operator PodMonitor

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
    labels:
        app.kubernetes.io/component: metrics
        app.kubernetes.io/instance: prometheus-agent
        app.kubernetes.io/name: metrics-nvca-operator
        jobLabel: metrics-nvca-operator
        release: prometheus-agent
        prometheus.agent/podmonitor-discover: "true"
    name: metrics-nvca-operator
    namespace: monitoring
spec:
    podMetricsEndpoints:
    - port: http
      scheme: http
      path: /metrics
    jobLabel: jobLabel
    selector:
        matchLabels:
            app.kubernetes.io/name: nvca-operator
    namespaceSelector:
        matchNames:
        - nvca-operator

Sample NVCA ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
    labels:
        app.kubernetes.io/component: metrics
        app.kubernetes.io/instance: prometheus-agent
        app.kubernetes.io/name: metrics-nvca
        jobLabel: metrics-nvca
        release: prometheus-agent
        prometheus.agent/servicemonitor-discover: "true"
    name: prometheus-agent-nvca
    namespace: monitoring
spec:
    endpoints:
    - port: nvca
    jobLabel: jobLabel
    selector:
        matchLabels:
            app.kubernetes.io/name: nvca
    namespaceSelector:
        matchNames:
        - nvca-system

Logs#

Both the Cluster Agent and Cluster Agent Operator emit logs locally by default.

Local logs for the NVIDIA Cluster Agent Operator can be obtained via kubectl:

kubectl logs -l app.kubernetes.io/instance=nvca-operator -n nvca-operator --tail 20

Similarly, NVIDIA Cluster Agent logs can be obtained with the following command via kubectl:

kubectl logs -l app.kubernetes.io/instance=nvca -n nvca-system --tail 20

Warning

Function-level inference container logs are currently not supported for functions deployed on non-NVIDIA-managed clusters. Customers are encouraged to emit logs directly from their inference containers running on their own clusters to any third-party tool; there are no public egress limitations for containers.

Tracing#

The NVIDIA Cluster Agent provides OpenTelemetry integration for exporting traces and events to compatible collectors. As of agent version 2.0, the only supported collector receiver is Lightstep.

Enable Tracing with Lightstep

  1. Get your Lightstep access token from the Lightstep UI and set it as the LS_ACCESS_TOKEN environment variable.

  2. Get the NVCF cluster name:

nvcf_cluster_name="$(kubectl get nvcfbackends -n nvca-operator -o name | cut -d'/' -f2)"
  3. Apply the tracing configuration (a verification command follows this step):

kubectl patch nvcfbackends.nvcf.nvidia.io -n nvca-operator "$nvcf_cluster_name" --type='json' -p="[{\"op\": \"replace\", \"path\": \"/spec/overrides/featureGate/otelConfig\", \"value\": { \"exporter\": \"lightstep\", \"serviceName\": \"nvcf-nvca\", \"accessToken\": \"${LS_ACCESS_TOKEN}\"}}]"
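One way to confirm that the patch was applied is to read the field back; the output includes the access token, so treat it as sensitive (assumes jq is installed):

kubectl get nvcfbackends.nvcf.nvidia.io -n nvca-operator "$nvcf_cluster_name" -o json | jq '.spec.overrides.featureGate.otelConfig'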

Cluster Key Rotation#

To regenerate or rotate a cluster’s key, choose the “Regenerate Key” option from the Clusters table on the Settings page. Please refer to this command snippet for the most up-to-date upgrade instructions.

Warning

Updating your Service Key may interrupt any in-progress updates or deployments to existing functions, therefore it’s important to pause deployments before upgrading.

../_images/cluster-setup-regenerate-key.png

Network Configuration#

The NVCA operator requires outbound network connectivity to pull images, charts, and report logs and metrics. During installation, the operator pre-configures the nvca-namespace-networkpolicies configmap with the following network policies:

Policy Name

Description

allow-egress-gxcache

Allows egress traffic to the GX Cache namespace for caching operations (only relevant for NVIDIA managed clusters)

allow-egress-internet-no-internal-no-api

Allows egress traffic to the public internet (0.0.0.0/0) but blocks traffic to common private IP ranges. Also allows DNS resolution via kube-dns.

allow-egress-nvcf-cache

Allows egress traffic to NVCF cache services (only relevant for NVIDIA managed clusters)

allow-egress-prometheus-nvcf-byoo

Allows egress traffic to Prometheus monitoring endpoints (only relevant for NVIDIA managed clusters)

allow-ingress-monitoring

Allows ingress traffic for monitoring services

allow-ingress-monitoring-dcgm

Allows ingress traffic for DCGM monitoring

allow-ingress-monitoring-gxcache

Allows ingress traffic for GX Cache monitoring (only relevant for NVIDIA managed clusters)

Key Network Requirements#

  1. Kubernetes API Access

  • NVCA requires access to the Kubernetes API

  • Consult your cloud provider’s documentation (e.g., Azure, AWS, GCP) for the Kubernetes API endpoint

  2. Container Registry and NVCF Control Plane Access

  • Access to nvcr.io and helm.ngc.nvidia.com is required to pull container images, resources, and helm charts.

  • NVCA requires access to NVIDIA control plane services for coordination of function and task deployments and invocation (see the reachability sketch after this list); this includes:

    • connect.pnats.nvcf.nvidia.com

    • grpc.api.nvcf.nvidia.com

    • *.api.nvcf.nvidia.com

    • sqs.*.amazonaws.com

    • spot.gdn.nvidia.com

    • ess.ngc.nvidia.com

    • api.ngc.nvidia.com

  • For invocation with assets, AWS S3 endpoints must be whitelisted for the cluster (these endpoints are dynamically generated).

  3. Monitoring and Logging

  • If your environment requires advanced monitoring or logging (e.g., sending logs to external endpoints), ensure your cluster’s NetworkPolicy or firewall rules allow egress to the required monitoring/logging domains
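As a rough sketch, the loop below checks TLS reachability for the HTTPS endpoints listed above; some endpoints (for example the NATS and SQS endpoints) use other ports, protocols, or wildcard hostnames, so a failure here is not necessarily a misconfiguration:

for host in grpc.api.nvcf.nvidia.com api.ngc.nvidia.com ess.ngc.nvidia.com spot.gdn.nvidia.com nvcr.io helm.ngc.nvidia.com; do
    # A non-2xx HTTP response still counts as reachable; only connection failures are flagged.
    if curl -sS --connect-timeout 5 -o /dev/null "https://${host}"; then
        echo "${host}: reachable"
    else
        echo "${host}: NOT reachable"
    fi
done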

Network Policy Customization via ConfigMap#

The NVCA operator pre-configures the nvca-namespace-networkpolicies configmap during installation. If you need to customize these policies for your cluster, you can use a configmap to override the default policies.

To customize a network policy:

  1. Create a configmap with your custom network policy, for example:

    patchcm.yaml#
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: demopatch-configmap
      namespace: nvca-operator
      labels:
        nvca.nvcf.nvidia.io/operator-kustomization: enabled
    data:
      patches: |
        - target:
            group: ""
            version: v1
            kind: ConfigMap
            name: nvca-namespace-networkpolicies
          patch: |-
            - op: replace
              path: /data/allow-egress-internet-no-internal-no-api
              value: |
                apiVersion: networking.k8s.io/v1
                kind: NetworkPolicy
                metadata:
                  name: allow-egress-internet-no-internal-no-api
                  labels:
                    app.kubernetes.io/name: nvca
                    app.kubernetes.io/instance: nvca
                    app.kubernetes.io/version: "1.0"
                    app.kubernetes.io/managed-by: nvca-operator
                spec:
                  podSelector: {}
                  policyTypes:
                    - Egress
                  egress:
                    - to:
                      - namespaceSelector: {}
                        podSelector:
                          matchLabels:
                            k8s-app: kube-dns
                    - to:
                      - namespaceSelector:
                          matchLabels:
                            kubernetes.io/metadata.name: gxcache
                      ports:
                        - port: 8888
                          protocol: TCP
                        - port: 8889
                          protocol: TCP
    
  2. Apply the configmap:

    kubectl apply -f patchcm.yaml
    
  3. Verify the changes:

    kubectl logs -n nvca-operator -l app.kubernetes.io/name=nvca-operator
    

    You should see a message indicating successful patching: configmap patched successfully

The changes will be applied to the nvcf-backend namespace and will be used for all new namespaces’ network policies. The network policies will also be updated across all helm chart namespaces.

Network Policy Customization via clusterNetworkCIDRs Flag#

You can customize the allow-egress-internet-no-internal-no-api policy with helm by adding the networkPolicy.clusterNetworkCIDRs flag. For example:

helm upgrade nvca-operator -n nvca-operator --create-namespace -i --reuse-values --wait "https://helm.stg.ngc.nvidia.com/nvidia/nvcf-byoc/charts/nvca-operator-1.14.0.tgz" --username='$oauthtoken' --password=$(helm get values -n nvca-operator nvca-operator -o json | jq -r '.ngcConfig.serviceKey') --set networkPolicy.clusterNetworkCIDRs="{10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,100.64.0.0/12}"

This command overrides the default Kubernetes networking CIDRs specified in allow-egress-internet-no-internal-no-api with your input.
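After running the upgrade, you can inspect the rendered policy to confirm the CIDR change took effect; this assumes the policy is materialized in the nvcf-backend namespace, as described for the configmap-based customization above:

kubectl get networkpolicy allow-egress-internet-no-internal-no-api -n nvcf-backend -o yaml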

Advanced: NVCA Operator Configuration Options#

Below are additional configuration options for reference purposes.

Node Selection for Cloud Functions#

By default, the cluster agent uses all nodes discovered with GPU resources to schedule Cloud Functions, and no additional configuration is required.

To limit the nodes that can run Cloud Functions, apply the nvca.nvcf.nvidia.io/schedule=true label to the specific nodes.

If there are no nodes in the cluster with the nvca.nvcf.nvidia.io/schedule=true label set, the cluster agent will switch to the default behavior of using all nodes with GPUs.

For example, to mark specific nodes as schedulable in a cluster:

kubectl label node <node-name> nvca.nvcf.nvidia.io/schedule=true

To mark a single node from the above set as unschedulable for NVCF workloads, remove the label:

kubectl label node <node-name> nvca.nvcf.nvidia.io/schedule-
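To list the nodes currently marked as schedulable for Cloud Functions:

kubectl get nodes -l nvca.nvcf.nvidia.io/schedule=true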

Managing Feature Flags#

The NVIDIA Cluster Agent supports various feature flags that can be enabled or disabled to customize its behavior. The following are some commonly used feature flags that can be enabled or disabled:

Feature Flag

Description

DynamicGPUDiscovery

Dynamically discover GPUs and instance types on this cluster. This is enabled by default for customer-managed clusters.

HelmSharedStorage

Configure Helm functions and tasks with shared read-only storage for ESS secrets. This is required for enabling Helm-based tasks in your cluster. Note that turning on this feature flag requires additional configuration; see the Enable Helm Shared Storage section below.

LogPosting

Post instance logs to SIS directly. This is enabled by default for NVIDIA managed clusters.

MultiNodeWorkloads

Instruct NVCA to report multi-node instance types to SIS during registration.

To manage feature flags directly:

  1. Get the NVCF cluster name:

nvcf_cluster_name="$(kubectl get nvcfbackends -n nvca-operator -o name | cut -d'/' -f2)"
  2. View the current feature flags and determine which ones you want to preserve versus modify:

kubectl get nvcfbackends -n nvca-operator -o yaml | grep -A 5 "featureGate:"
  3. To modify feature flags, use the patch command. Note that this will override all feature flags.

Warning

When modifying feature flags, you must preserve any existing feature flags you want to keep. The patch command will override all feature flags, so you need to include all desired feature flags in the value array.

kubectl patch nvcfbackends.nvcf.nvidia.io -n nvca-operator "$nvcf_cluster_name" --type='json' -p='[{"op": "replace", "path": "/spec/overrides/featureGate/values", "value":["LogPosting","CachingSupport"]}]'

As an alternative to the patch command, you can also modify the feature flags using the edit command:

kubectl edit nvcfbackend -n nvca-operator
...
spec:
  featureGate:
    values:
    - LogPosting                # Existing feature flag
  overrides:
    featureGate:
      values:
      - LogPosting              # Existing feature flag copied over
      - -CachingSupport         # Caching support disabled
      ...
  4. Verify the changes:

kubectl get pods -n nvca-system -o yaml | grep -i feature

Enable Helm Shared Storage#

The NVIDIA Cluster Agent supports shared storage for Helm charts through the SMB CSI driver. This feature is required for enabling Helm-based tasks in your cluster.

Note

The Helm shared storage feature must be enabled before you can use Helm-based tasks in your cluster. This feature provides the necessary storage infrastructure for Helm chart operations.

Warning

When enabling the Helm shared storage feature flag, you must preserve any existing feature flags. The patch command will override all feature flags, so you need to include all desired feature flags in the value array. If you already have other feature flags enabled, you should include them along with “HelmSharedStorage” in the value array.

  1. First, install the SMB CSI driver using Helm:

helm repo add csi-driver-smb https://raw.githubusercontent.com/kubernetes-csi/csi-driver-smb/master/charts
helm install csi-driver-smb csi-driver-smb/csi-driver-smb --namespace kube-system --version v1.16.0
  2. Get the NVCF cluster name:

nvcf_cluster_name="$(kubectl get nvcfbackends -n nvca-operator -o name | cut -d'/' -f2)"
  3. Enable the Helm shared storage feature flag:

kubectl patch nvcfbackends.nvcf.nvidia.io -n nvca-operator "$nvcf_cluster_name" --type='json' -p='[{"op": "replace", "path": "/spec/overrides/featureGate/values", "value":["LogPosting","HelmSharedStorage", "CachingSupport"]}]'
  4. Verify that the feature flag is enabled:

kubectl get pods -n nvca-system -o yaml | grep HelmSharedStorage

Advanced: Manual Instance Configuration#

Warning

It is highly recommended to rely on Dynamic GPU Discovery, and therefore the NVIDIA GPU Operator, as manual instance configuration is error-prone.

This type of configuration is only necessary when the cluster Cloud Provider does not support the NVIDIA GPU Operator.

To enable manual instance configuration, remove the “Dynamic GPU Discovery” capability.

../_images/cluster-setup-advanced-configuration.png

All fields in the generated example configuration in the UI are required. Start by choosing “Apply Example” to copy over the example configuration, and then modify it to your cluster’s instance specifications.

Example Configuration

[
    {
        "name": "A100",
        "capacity": "20",
        "instanceTypes": [
            {
                "name": "Standard_ND96amsr_A100_v4_1x",
                "value": "Standard_ND96amsr_A100_v4",
                "description": "Single 80 GB A100 GPU",
                "default": "true",
                "cpuCores": "4",
                "systemMemory": "16G",
                "gpuMemory": "80G",
                "gpuCount": "1"
            },
            {
                "name": "Standard_ND96amsr_A100_v4_2x",
                "value": "Standard_ND96amsr_A100_v4",
                "description": "Two 80 GB A100 GPU",
                "cpuCores": "4",
                "systemMemory": "16G",
                "gpuMemory": "80G",
                "gpuCount": "2"
            }
        ]
    }
]

Manual Instance Type Configuration#

Prerequisites#

Since you are not using the GPU Operator, you must ensure each GPU node has the instance-type label that matches the “value” field in your manual configuration:

kubectl label nodes <node-name> nvca.nvcf.nvidia.io/instance-type=<instance-type-value>

For example, if your configuration specifies "value": "OCI.GPU.A10", you would label the node with:

kubectl label nodes gpu-node-1 nvca.nvcf.nvidia.io/instance-type=OCI.GPU.A10
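To review the instance-type labels across all nodes before saving the manual configuration:

kubectl get nodes -L nvca.nvcf.nvidia.io/instance-type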

Configuration Fields#

The following fields are critical for proper cluster registration and function deployment. Incorrect values will cause NVCA installation or function deployment failures:

Field

Description

name

The GPU model name that matches the NVIDIA GPU in your cluster nodes. This must match exactly what is reported by nvidia-smi. For example: “A100”, “H100”, “L40S”

capacity

The total number of GPUs of this type available across all nodes in your cluster. You can get this by running: kubectl get node -o json | jq -r '.items[].status.capacity."nvidia.com/gpu"' | grep -v null | awk '{s+=$1} END {print s}'

value

The value that matches what you set for the nvca.nvcf.nvidia.io/instance-type node label, for example “OCI.GPU.A10”. This must exactly match the label value you set with kubectl label nodes <node-name> nvca.nvcf.nvidia.io/instance-type=<value>

gpuCount

The number of GPUs allocated to each instance of this type. Must match the actual GPU count on the node, which can be verified with: kubectl get node <node-name> -o json | jq '.status.capacity."nvidia.com/gpu"'

instanceTypes -> name

A unique identifier for this instance type configuration. Should be descriptive of the GPU count and node type, for example: “Standard_ND96amsr_A100_v4_1x” for a single GPU configuration

Warning

Double check these critical values against your actual cluster configuration. Mismatches will prevent the NVIDIA Cluster Agent from properly managing GPU resources.

Cloud Provider Specific Requirements#

Oracle Cloud Infrastructure (OCI)#

When using Oracle Container Engine for Kubernetes (OKE), ensure that:

  • Your compute nodes and GPU nodes are in the same availability domain

  • This is required for proper network connectivity between the NVIDIA Cluster Agent and GPU nodes

  • Flannel CNI is currently the recommended and validated CNI for OKE cluster networking, rather than the OCI native CNI.