NGC-Managed Clusters
NGC-Managed Clusters
NGC-Managed Clusters
NGC-managed clusters are registered and configured through the NGC UI. In this mode, configuration changes are applied through the web interface, and the NGC Cluster Key is used for authentication.
Reach the cluster registration page by navigating to Cloud Functions in the NGC product dropdown, and choosing “Clusters” on the left-hand menu. You must be a Cloud Functions Admin to see this page. Choose “Register Cluster” to begin the registration process.


See below for descriptions of all cluster configuration options.
Cluster Attributes:
By default, the cluster will be authorized to the NCA ID of the current NGC organization being used during cluster configuration. If you choose to share the cluster with other NGC organizations, you will need to retrieve their corresponding NCA IDs. Sharing the cluster will allow other NVCF accounts to deploy cloud functions on it, with no limitations on how many GPUs within the cluster they deploy on.
NVCF “accounts” are directly tied to, and defined by, NCA IDs (“NVIDIA Cloud Account”). Each NGC organization, with access to the Cloud Functions UI, has a corresponding NGC Organization Name and NCA ID. Please see the NGC Organization Profile Page to find these details.
Once functions from other NGC organizations have been deployed on the cluster, removing them from the authorized NCA IDs list, or removing sharing completely from the cluster, can cause disruption of service. Ideally, any functions tied to other NCA IDs should be undeployed before the NCA ID is removed from the authorized NCA IDs list.
For cluster configuration options that apply to all management modes (NGC-managed, Helm-managed, and Self-hosted), see NVCA Configuration. This includes:
Supported in Cluster Agent Versions 2.47.7 or higher
Cluster Agent can be optionally configured to run in maintenance mode if needed using the following capabilities from the Advanced Settings. Clusters configured in maintenance will not be listed as targets for workload deployments.

In this mode existing workloads will continue to run uninterrupted on the cluster.
In this mode all existing workloads in the cluster will be terminated. No update to the state of the workload will be effective at this time.
If using feature flags overrides on the cluster, the configuration for maintenance mode needs to be via feature flags CordonMaintenance or CordonAndDrainMaintenance respectively.
Once the maintenance mode is configured, it can take up to 10 minutes for the agent reconfiguration. Cluster that has successfully applied the maintenance mode capability will be shown with a MAINTENANCE mode label on cluster listing and on cluster overview page as below.
Cluster Listing

Cluster Overview


After configuring the cluster, an NGC Cluster Key will be generated for authenticating to NGC, and you will be presented with a command snippet for installing the NVIDIA Cluster Agent Operator. Please refer to this command snippet for the most up-to-date installation instructions.
The NGC Cluster Key has a default expiration of 90 days. Either on a regular cadence or when nearing expiration, you must rotate your NGC Cluster Key.
Once the Cluster Agent Operator installation is complete, the operator will automatically install the desired NVIDIA Cluster Agent version and the Status of the cluster in the Cluster Page will become “Ready”.
Afterward, you will be able to modify the configuration at any time. The cluster name and SSA client ID (only available for NVIDIA managed clusters) are not reconfigurable. Please refer to any additional installation instructions for reconfiguration in the UI. Once the configuration is updated, the Cluster Agent Operator, which polls for changes every 15 minutes, will apply the new configuration.
At any time, you can view the clusters you have begun registering, or registered, along with their status, on the Settings page.


Ready indicates the Cluster Agent has registered the cluster with NVCF successfully.Not Ready indicates the registration command has either just been applied and is in progress, or that registration is failing.While your cluster status may show as Ready, there may be a delay of up to 5 minutes before the “Cluster Agent Version” column updates from “Update in Progress”. You can verify the actual cluster health and version immediately using the terminal commands described in the next section.
In cases when registration is failing, please use the following command to retrieve additional details:
When a cluster is Not Ready, you can resume registration at any time to finish the installation.
The “GPU Utilization” column is based on the number of GPUs occupied over the number of GPUs available within the cluster. The “Last Connected” column indicates when the last status update was received from the Cluster Agent to the NVCF control plane.
Verify the installation was successful via the following command, you should see a “healthy” response, as in this example:
NVCA upgrades available are indicated in the UI by status icons next to the cluster’s NVCA version.

When triggering the update through the UI, it may take up to 1 minute for the status icon to change to the “updating” status. This delay is expected.
When an Operator upgrade is required, a helm install instruction will be generated in the UI. You must run this first in order to enable the NVCA upgrade.

Example:
You can verify that the upgrade is successful by running the following command and noting the “Image” field:
After the NVCA operator upgrade succeeds:
When an upgrade to the NVCA Operator is not required:
Verify that the operator has rolled out a successful upgrade by running the below command and looking for the spec version and status version to confirm the version of the CRD:
Verify NVCA is healthy and version matches the desired version:
In some cases, you may need to force NVCA to rollout immediately rather than waiting for the operator’s periodic reconciliation. This can be useful after configuration changes or when troubleshooting upgrade issues.
To force an immediate rollout:
This command updates the forcedRolloutAt annotation on all NVCFBackend resources, triggering the operator to immediately reconcile and apply any pending changes.
After forcing the rollout, you can monitor the progress:
Once the rollout is complete, verify the NVCA version has been updated:
The output should show the updated version in the VERSION column, and the HEALTH status should be healthy.
Cluster Keys are migrating to a organization-scoped service-key resource. If your cluster is using a personal key, it is recommended to follow the UI instructions for business continuity.
To regenerate or rotate a cluster’s key, choose the “Rotate Key” option from the Clusters listing on the Settings page. Please refer to this command snippet for the most up-to-date upgrade instructions.

Updating your Service Key may interrupt any in-progress updates or deployments to existing functions, therefore it’s important to pause deployments before upgrading.

A cluster key expiring soon will have an appropriate warning displayed as below and follow the UI instructions to Rotate Key as above.

My Cluster Key is a personal key, but it is valid for long time, should I still rotate ?
While the Cluster Key may still be valid, if you observe a directive it is recommended to rotate the key for business continuity
How many clusters can be registered in the Org ?
Number of cluster keys are capped at 50 per org and this includes any other service keys in use in the Org. If you observe error while registering cluster, contact the Org Admin using the Contact Admin option.

Removing a configured cluster as a deployment target for NVCF functions is a two step process. It involves deleting the cluster from NGC followed by executing a series of commands on the cluster that remove the installed Cluster Agent and the NVCA operator.
First, ensure all functions have been undeployed from the cluster. This can be done via the UI, API or CLI.
To delete all function pods in a namespace by force, use the following command. This is not a graceful operation.
Verify all pods have been terminated with the following command:
This should return no pods if deletion was successful. Additionally, check the logs of the Cluster Agent pod in the nvca-system namespace to ensure there are no more “Pod is being terminated” messages:
If pods are still hanging during termination, you can force the deletion of the namespace with the following command. This command removes the finalizers from the namespace metadata, which are hooks that prevent namespace deletion until certain cleanup tasks are complete. By removing these finalizers, you bypass the normal cleanup process and force the namespace to be deleted immediately.:
Deleting the Cluster will also delete the Cluster Key used by the cluster and is irreversible.
Next, reach the cluster registration page by navigating to Cloud Functions in the NGC product dropdown, and choosing “Settings” on the left-hand menu. You must be a Cloud Functions Admin to see this page.
Select “Delete Cluster” from the “Actions” dropdown for the cluster that you want to remove.

Following this, a dialog box will appear that asks to you confirm the deletion of the cluster. Click on “Delete” to remove the cluster as a deployment target.

Finally, on the registered cluster, execute the following commands to complete the deregistration process.
If the deletion of the Kubernetes CRD nvcfbackend blocks, then the finalizer (nvca-operator.finalizers.nvidia.io) needs to be manually deleted, for example using the following command: