Cluster Setup & Management

Cloud Functions admins can install the NVIDIA Cluster Agent to enable existing GPU clusters to act as deployment targets for NVCF functions. The NVIDIA Cluster Agent is a function deployment orchestrator that communicates with the NVCF control plane. This page describes how to do the following:

  • Register a cluster with NVCF using the NVIDIA Cluster Agent.

  • Configure the cluster by defining GPU instance types, configurations, regions, and authorized NCA (NVIDIA Cloud Account) IDs.

  • Verify the cluster setup was successful.

After installing the NVIDIA Cluster Agent on a cluster:

  • The registered cluster will show as a deployment option in the GET /v2/nvcf/clusterGroups API response and in the Cloud Functions deployment menu (see the example call after this list).

  • Any functions under the cluster’s authorized NCA IDs can now deploy on the cluster.
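
For illustration, a minimal sketch of querying that endpoint with curl; the base URL and the bearer token in $NVCF_TOKEN are assumptions, so consult the NVCF API reference for the authoritative form:

# Hypothetical sketch: list the cluster groups visible to your account.
# Base URL and auth header are assumptions; see the NVCF API reference.
curl -s -H "Authorization: Bearer $NVCF_TOKEN" \
     "https://api.nvcf.nvidia.com/v2/nvcf/clusterGroups"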

Prerequisites

  • Access to a Kubernetes cluster including GPU-enabled nodes (“GPU cluster”)

  • kubectl and helm installed on the machine used to register the cluster (a quick verification sketch follows this list).

  • The user registering the cluster must have cluster-admin privileges on the cluster to install the NVIDIA Cluster Agent Operator (nvca-operator).

  • The user registering the cluster must have the Cloud Functions Admin role within their NGC organization.
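
A quick way to verify the tooling and permissions above before registering; a minimal sketch:

# Check that kubectl and helm are installed and the cluster is reachable.
kubectl version
helm version
# Verify you effectively hold cluster-admin privileges (should print "yes").
kubectl auth can-i '*' '*' --all-namespaces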

Note

For provisioning single Kubernetes nodes for local development (or edge deployments), we encourage using NVIDIA Fleet Command.

Supported Kubernetes Versions

  • Minimum supported Kubernetes version: v1.25.0

  • Maximum supported Kubernetes version: v1.29.x
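
To confirm your cluster's server version falls within this range (assuming jq is available for JSON parsing):

# Print the Kubernetes server version (must be >= v1.25.0 and <= v1.29.x).
kubectl version -o json | jq -r '.serverVersion.gitVersion'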

Considerations

  • The NVIDIA Cluster Agent currently supports caching only if the cluster has StorageClass configurations in place. When the “Caching Support” capability is enabled, the agent makes a best effort to detect storage during deployments and falls back to non-cached workflows if none is found.

  • NVIDIA-managed clusters fully support autoscaling across all heuristics. Clusters registered to NVCF via the agent, however, support autoscaling only via the function queue depth heuristic.

Register the Cluster

Reach the cluster registration page by navigating to Cloud Functions in the NGC product dropdown, and choosing “Settings” on the left-hand menu. You must be a Cloud Functions Admin to see this page. Choose “Register Cluster” to begin the registration process.

[Image: cluster-setup-list.png]

Configuration

[Image: cluster-setup-page.png]

See below for descriptions of all cluster configuration options.

  • Cluster Name: The name for the cluster. This field cannot be changed once configured.

  • Cluster Group: The name of the cluster group. This is usually identical to the cluster name, except when you want to group multiple clusters so that a function can deploy on any of them when the group is selected (for example, because they have identical hardware support).

  • Compute Platform: The cloud platform the cluster is deployed on. This field is a standard part of the node name label format that the Cluster Agent uses: <Platform>.GPU.<GPUName>

  • Region: The region the cluster is deployed in. This field is required to enable future optimization and configuration when deploying functions.

  • Cluster Description: Optional description for the cluster. It provides additional context about the cluster and is returned in the cluster list under the Settings page and in the /listClusters API response.

  • Other Attributes: Tag your cluster with additional properties:

    ◦ CacheOptimized: Enables rapid instance spin-up. Requires extra storage configuration and the Caching Support capability in the advanced cluster setup; see Advanced Settings.

    ◦ KataRunTimeIsolation: The cluster is equipped with an enhanced setup that ensures superior workload isolation using Kata Containers.

By default, the cluster is authorized to the NCA ID of the NGC organization used during cluster configuration. If you choose to share the cluster with other NGC organizations, you will need to retrieve their corresponding NCA IDs. Sharing the cluster allows other NVCF accounts to deploy cloud functions on it, with no limit on how many of the cluster’s GPUs they deploy on.

Note

NVCF “accounts” are directly tied to, and defined by, NCA (“NVIDIA Cloud Account”) IDs. Each NGC organization with access to the Cloud Functions UI has a corresponding NGC organization name and NCA ID. See the NGC Organization Profile page to find these details.

Warning

Once functions from other NGC organizations have been deployed on the cluster, removing those organizations’ NCA IDs from the authorized list, or disabling sharing for the cluster entirely, can cause disruption of service. Ideally, any functions tied to another NCA ID should be undeployed before that NCA ID is removed from the authorized NCA IDs list.

Advanced Settings

[Image: cluster-setup-advanced-settings.png]

See below for descriptions of all capability options in the “Advanced Settings” section of the cluster configuration. Note that for customer-managed clusters (registered via the Cluster Agent), Dynamic GPU Discovery is enabled by default. For NVIDIA internal clusters, Collect Function Logs is also enabled by default.

  • Dynamic GPU Discovery: Enables automatic detection and management of allocatable GPU capacity within the cluster via the NVIDIA GPU Operator. This capability is strongly recommended and should be disabled only when Manual Instance Configuration is required.

  • Collect Function Logs: Enables the emission of comprehensive Cluster Agent logs, which are forwarded to the NVIDIA internal team to aid in diagnosing and resolving issues. When enabled, these logs are not visible in the UI, but they are always available by running commands to retrieve them directly on the cluster.

  • Caching Support: Enhances application performance by storing frequently accessed data (models, resources, and containers) in a cache. See Caching Support.

Note

Disabling Dynamic GPU Discovery requires manual instance configuration. See Manual Instance Configuration.
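
Because Dynamic GPU Discovery relies on the NVIDIA GPU Operator, one quick sanity check is to confirm the operator is running and that GPUs are allocatable; the gpu-operator namespace here is the operator’s common default and may differ in your installation:

# Check GPU Operator pods (namespace may differ in your installation).
kubectl get pods -n gpu-operator
# Show allocatable GPUs per node as seen by Kubernetes.
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'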

Caching Support

Enabling caching for models, resources and containers is recommended for optimal performance. You must create StorageClass configurations for caching within your cluster to fully enable “Caching Support” with the Cluster Agent. See examples below:

StorageClass Configurations in GCP

nvcf-sc.yaml

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
    name: nvcf-sc
provisioner: pd.csi.storage.gke.io
allowVolumeExpansion: true
volumeBindingMode: Immediate
reclaimPolicy: Retain
parameters:
    type: pd-ssd
    csi.storage.k8s.io/fstype: xfs

nvcf-cc-sc.yaml

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
    name: nvcf-cc-sc
provisioner: pd.csi.storage.gke.io
allowVolumeExpansion: true
volumeBindingMode: Immediate
reclaimPolicy: Retain
parameters:
    type: pd-ssd
    csi.storage.k8s.io/fstype: xfs

Note

GCP currently allows only 10 VMs to mount a Persistent Volume in read-only mode.

StorageClass Configurations in Azure

nvcf-sc.yaml

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
    name: nvcf-sc
provisioner: file.csi.azure.com
allowVolumeExpansion: true
volumeBindingMode: Immediate
reclaimPolicy: Retain
parameters:
    skuName: Standard_LRS
    csi.storage.k8s.io/fstype: xfs

nvcf-cc-sc.yaml

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
    name: nvcf-cc-sc
provisioner: file.csi.azure.com
allowVolumeExpansion: true
volumeBindingMode: Immediate
reclaimPolicy: Retain
parameters:
    skuName: Standard_LRS
    csi.storage.k8s.io/fstype: xfs

Apply the StorageClass Configurations

Save the StorageClass templates to files nvcf-sc.yaml and nvcf-cc-sc.yaml, then apply them:

kubectl create -f nvcf-sc.yaml
kubectl create -f nvcf-cc-sc.yaml
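
You can then confirm both StorageClasses were created, along with their provisioner and reclaim policy:

# List the two caching StorageClasses created above.
kubectl get storageclass nvcf-sc nvcf-cc-sc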

Install the Cluster Agent

[Image: cluster-setup-install-operator.png]

After configuring the cluster, an NGC Cluster Key will be generated for authenticating to NGC, and you will be presented with a command snippet for installing the NVIDIA Cluster Agent Operator. Please refer to this command snippet for the most up-to-date installation instructions.

Note

The NGC Cluster Key has a default expiration of 90 days. Either on a regular cadence or when nearing expiration, you must rotate your NGC Cluster Key.

Once the Cluster Agent Operator installation is complete, the operator will automatically install the desired NVIDIA Cluster Agent version, and the status of the cluster on the Clusters page will become “Ready”.
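
While the installation settles, you can watch the rollout from the terminal; a minimal sketch, assuming the operator is installed into the nvca-operator namespace used elsewhere on this page:

# Watch the operator pods come up (Ctrl-C to stop).
kubectl get pods -n nvca-operator -w
# Once the agent is installed, the backend object should report healthy.
kubectl get nvcfbackend -n nvca-operator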

Afterward, you can modify the configuration at any time; only the cluster name and SSA client ID (available only for NVIDIA internal clusters) are not reconfigurable. Follow any additional installation instructions shown in the UI when reconfiguring. Once the configuration is updated, the Cluster Agent Operator, which polls for changes every 15 minutes, will apply the new configuration.

View & Validate Cluster Setup

Verify Cluster Agent Installation via UI

At any time, you can view the clusters you have begun registering, or registered, along with their status, on the Settings page.

[Image: cluster-setup-resume-registration.png] [Image: cluster-setup-resume-registration-2.png]

  • A status of Ready indicates the Cluster Agent has registered the cluster with NVCF successfully.

  • A status of Not Ready indicates the registration command has either just been applied and is in progress, or that registration is failing.

If registration is failing, use the following command to retrieve additional details:

kubectl get nvcfbackend -n nvca-operator

When a cluster is Not Ready, you can resume registration at any time to finish the installation.
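
For more detail than the get output provides, you can describe the backend object to surface its conditions and events; exact fields may vary by agent version:

# Show conditions and events for the registered backend object.
kubectl describe nvcfbackend -n nvca-operator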

The “GPU Utilization” column shows the ratio of occupied GPUs to available GPUs within the cluster. The “Last Connected” column indicates when the NVCF control plane last received a status update from the Cluster Agent.

Verify Cluster Agent Installation via Terminal

Verify the installation was successful via the following command; you should see a “healthy” response, as in this example:

> kubectl get nvcfbackend -n nvca-operator
NAME                    AGE     VERSION   HEALTH
nvcf-trt-mgpu-cluster   3d16h   2.30.4    healthy

Cluster Agent Monitoring and Reliability

Monitoring Data

Metrics

The cluster agent and operator emit Prometheus-style metrics. The following metrics and labels are available by default.

  • nvca_event_queue_length: The length of a named event queue.

  • nvca_event_process_latency: The amount of time spent processing an event in NVCA.

The following labels are attached to these metrics:

  • nvca_event_name: The name of the event.

  • nvca_nca_id: The NCA ID of this NVCA instance.

  • nvca_cluster_name: The NVCA cluster name.

  • nvca_cluster_group: The NVCA cluster group.

  • nvca_version: The NVCA version.

Cluster maintainers can scrape the available metrics; the following sample PodMonitor (for the NVCA Operator) and ServiceMonitor (for NVCA) are provided for reference:

Sample NVCA Operator PodMonitor

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
    labels:
        app.kubernetes.io/component: metrics
        app.kubernetes.io/instance: prometheus-agent
        app.kubernetes.io/name: metrics-nvca-operator
        jobLabel: metrics-nvca-operator
        release: prometheus-agent
        prometheus.agent/podmonitor-discover: "true"
    name: metrics-nvca-operator
    namespace: monitoring
spec:
    podMetricsEndpoints:
    - port: http
      scheme: http
      path: /metrics
    jobLabel: jobLabel
    selector:
        matchLabels:
            app.kubernetes.io/name: nvca-operator
    namespaceSelector:
        matchNames:
        - nvca-operator

Sample NVCA ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
    labels:
        app.kubernetes.io/component: metrics
        app.kubernetes.io/instance: prometheus-agent
        app.kubernetes.io/name: metrics-nvca
        jobLabel: metrics-nvca
        release: prometheus-agent
        prometheus.agent/servicemonitor-discover: "true"
    name: prometheus-agent-nvca
    namespace: monitoring
spec:
    endpoints:
    - port: nvca
    jobLabel: jobLabel
    selector:
        matchLabels:
            app.kubernetes.io/name: nvca
    namespaceSelector:
        matchNames:
        - nvca-system
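
To apply these monitors and spot-check the agent’s metrics endpoint by hand, something like the following sketch; the file names, local port, and <svc>/<port> placeholders are assumptions, while the label selector mirrors the ServiceMonitor above:

# Apply the sample monitors (saved locally under the names below).
kubectl apply -f podmonitor.yaml -f servicemonitor.yaml
# Find the NVCA service using the same label the ServiceMonitor selects on.
kubectl get svc -n nvca-system -l app.kubernetes.io/name=nvca
# Port-forward to it and grep for NVCA metrics (replace <svc> and <port>).
kubectl port-forward -n nvca-system svc/<svc> 8080:<port> &
curl -s localhost:8080/metrics | grep '^nvca_'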

Logs

Both the Cluster Agent and Cluster Agent Operator emit logs locally by default.

Local logs for the NVIDIA Cluster Agent Operator can be obtained via kubectl:

kubectl logs -l app.kubernetes.io/instance=nvca-operator -n nvca-operator --tail 20

Similarly, NVIDIA Cluster Agent logs can be obtained with the following command via kubectl:

kubectl logs -l app.kubernetes.io/instance=nvca -n nvca-system --tail 20

Warning

Function-level inference container logs are currently not supported for functions deployed on non-NVIDIA-managed clusters. Customers are encouraged to emit logs directly from their inference containers to any third-party tool; there are no public egress limitations for containers.

Tracing

The NVIDIA Cluster Agent provides OpenTelemetry integration for exporting traces and events to compatible collectors. As of agent version 2.0, the only supported collector is Lightstep. See Advanced: NVCA Operator Configuration Options.

Cluster Key Rotation

To regenerate or rotate a cluster’s key, choose the “Regenerate Key” option in the Clusters table on the Settings page, and refer to the presented command snippet for the most up-to-date instructions.

Warning

Updating your Service Key may interrupt any in-progress updates or deployments to existing functions; pause deployments before rotating the key.

[Image: cluster-setup-regenerate-key.png]

Advanced: NVCA Operator Configuration Options

Below are additional configuration options for reference purposes.

NVCA Operator Parameters

  • image.repository: NVCA Operator container registry path, without tag (default: nvcr.io/nvidia/nvcf-byoc/nvca-operator).

  • image.tag: NVCA Operator container image tag; defaults to the chart version (default: "").

  • image.pullPolicy: K8s ImagePullPolicy (default: IfNotPresent).

  • nvcaImage.repositoryOverride: (Optional) Full NVCA container registry path, without tag, for example "nvcr.io/nvidia/nvcf-byoc/nvca". Only set this if the default needs to be overridden; the tag is set in the cluster config (default: "").

  • nvcaImage.pullPolicy: K8s ImagePullPolicy (default: IfNotPresent).

  • replicaCount: Replica count for the operator deployment (default: 1).

  • systemNamespace: Namespace in which NVCFBackend objects are created (default: nvca-operator).

  • logLevel: Logging level for the module (default: info).

  • ncaID: NVIDIA Cloud Account ID of the primary account (default: "").

  • clusterID: ID of the cluster for this NVCA instance to manage (default: "").

  • clusterName: Cluster name used for metrics and telemetry (default: "").

  • k8sVersionOverride: Override the K8s version that NVCA registers with (default: "").

  • priorityClassName: K8s PriorityClassName for pod preference during evictions (default: "").

  • skipFluxInit: Skip the Flux install if the admin already has one installed (default: false).
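
If the operator was installed as a Helm release, you can inspect and override these parameters with helm; the release name and namespace below are assumptions based on this page’s defaults, so substitute the values from your installation snippet:

# Show the values currently applied to the release (names may differ).
helm get values nvca-operator -n nvca-operator
# Override a parameter in place, reusing all other values.
helm upgrade nvca-operator <chart-reference> -n nvca-operator \
    --reuse-values --set logLevel=debug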

NGC Configuration

  • ngcConfig.username: Username for registry authentication (default: $oauthtoken).

  • ngcConfig.serviceKey: Service key (password) for authentication (default: "").

  • ngcConfig.apiURL: NGC API URL for requesting auth tokens (default: https://api.ngc.nvidia.com).

Node Selector Configuration

  • nodeSelector.key: Node-selector label key (default: node.kubernetes.io/instance-type).

  • nodeSelector.value: Node-selector label value (default: "").
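
To see which instance-type label values your nodes carry under the default key:

# List nodes with the default node-selector label shown as a column.
kubectl get nodes -L node.kubernetes.io/instance-type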

OpenTelemetry Configuration

  • otel.enabled: Enable OpenTelemetry (default: false).

  • otel.lightstep.serviceName: The name of the Lightstep service to push telemetry data to (default: "").

  • otel.lightstep.accessToken: The access token for accessing the Lightstep API (default: "").
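
As a hedged sketch, the Lightstep exporter could be enabled through Helm overrides of the options above; the release name, chart reference placeholder, and service name are assumptions to replace with your own:

# Enable OpenTelemetry export to Lightstep via the otel.* options.
helm upgrade nvca-operator <chart-reference> -n nvca-operator --reuse-values \
    --set otel.enabled=true \
    --set otel.lightstep.serviceName=my-nvca \
    --set otel.lightstep.accessToken=$LIGHTSTEP_TOKEN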

Advanced: Manual Instance Configuration

Warning

Relying on Dynamic GPU Discovery, and therefore the NVIDIA GPU Operator, is highly recommended, as manual instance configuration is error-prone.

This type of configuration is only necessary when the cluster Cloud Provider does not support the NVIDIA GPU Operator.

To enable manual instance configuration, remove the “Dynamic GPU Discovery” capability.

[Image: cluster-setup-advanced-configuration.png]

All fields in the generated example configuration in the UI are required. Start by choosing “Apply Example” to copy over the example configuration, and then modify it to match your cluster’s instance specifications, which you can read directly from the cluster as shown below.
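
A minimal sketch for inspecting a node’s capacity when filling in the manual configuration, with <node-name> as a placeholder:

# Show a node's capacity (GPU count, CPU, memory) to mirror in the manual config.
kubectl describe node <node-name> | grep -A 8 'Capacity:'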