Fake GPU Operator (Development / Testing)
For development, staging, load testing, or CI environments that lack physical NVIDIA GPUs, you can install a fake GPU operator to simulate GPU resources on cluster nodes. This allows the NVCA agent to discover GPUs and manage function deployments without actual GPU hardware.
The fake GPU operator is for non-production use only. For production deployments with real GPUs, install the NVIDIA GPU Operator.
Prerequisites: kubectl access and helm >= 3.12.

The fake GPU operator depends on KWOK (Kubernetes Without Kubelet) to manage simulated GPU device plugins on nodes. Install KWOK before the fake GPU operator:
Verify the KWOK controller is running:
The KWOK install may produce a FlowSchema error
(creation or update of FlowSchema object ... is not allowed).
This is non-critical and can be safely ignored.
Add the RunAI helm repository and install the fake GPU operator:
This configures one node pool named default with 8 simulated H100 GPUs per node.
topology.nodePools must be a map, not an array.
Using array index syntax (--set 'topology.nodePools[0].gpuCount=8') will create a
YAML array instead of a map and cause the status-updater to fail with:
Always use named keys: topology.nodePools.default.gpuCount=8.
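As a minimal sketch of the named-key form, the helper below builds a --set argument that addresses a pool by its map key. The function name pool_set_flag is hypothetical and is not part of helm or the chart; only the topology.nodePools.<pool>.gpuCount key is taken from the text above.

```shell
# Hypothetical helper: prints a helm --set flag that addresses a node pool
# by its map key (never by array index).
pool_set_flag() {
  pool="$1"   # map key under topology.nodePools, e.g. "default"
  count="$2"  # number of simulated GPUs per node
  printf '%s\n' "--set topology.nodePools.${pool}.gpuCount=${count}"
}
```

Calling pool_set_flag default 8 prints --set topology.nodePools.default.gpuCount=8, which can be appended to the helm install command.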
The fake GPU operator watches for nodes with the label
run.ai/simulated-gpu-node-pool=<pool-name> and patches their status to advertise
fake nvidia.com/gpu extended resources. You must label the nodes that should receive
simulated GPUs:
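A minimal sketch of the labeling step follows. The helper name label_fake_gpu_node is hypothetical, and the node name is whatever kubectl get nodes reports for your cluster; the label key is the one described above.

```shell
# Sketch: attach a node to a simulated GPU pool.
# label_fake_gpu_node is a hypothetical helper name, not part of any tool.
label_fake_gpu_node() {
  node="$1"             # node name as shown by `kubectl get nodes`
  pool="${2:-default}"  # pool name; falls back to "default" when omitted
  kubectl label node "$node" "run.ai/simulated-gpu-node-pool=${pool}" --overwrite
}
```

For a node named worker-1 (an assumed name), label_fake_gpu_node worker-1 applies the default pool label. The --overwrite flag lets the command be rerun when the node already carries the label.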
The pool name (default) must match a key in topology.nodePools from the helm install.
The NVCA agent uses several GPU metadata labels for dynamic discovery. On real GPU nodes these are set by the NVIDIA GPU Operator. To suppress warnings from NVCA on fake GPU nodes, add the following labels:
Adjust the values to match the GPU product you configured (e.g., ampere for A100,
ada for L40S).
The fake GPU operator chart creates RuntimeClass/nvidia and several namespaced resources in gpu-operator. Helm fails with invalid ownership metadata if one of those objects already exists and is not owned by release gpu-operator in namespace gpu-operator.
For local k3d development, the recovery workflow is to rerun make build-and-deploy-cluster in tools/ncp-local-cluster/, which removes known stale fake GPU operator resources without deleting the cluster. For manual chart debugging, inspect ownership before deleting anything. If another Helm release owns the resource, remove that release instead of deleting the resource directly.
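One way to inspect ownership is to read the annotations Helm writes on the objects it manages (meta.helm.sh/release-name and meta.helm.sh/release-namespace). The sketch below does this for RuntimeClass/nvidia; the function name show_runtimeclass_owner is hypothetical.

```shell
# Sketch: print the Helm release name and release namespace recorded on the
# cluster-scoped RuntimeClass "nvidia", one value per line.
# Dots inside the annotation keys are escaped for kubectl's jsonpath parser.
show_runtimeclass_owner() {
  kubectl get runtimeclass nvidia \
    -o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-name}{"\n"}{.metadata.annotations.meta\.helm\.sh/release-namespace}{"\n"}'
}
```

If both lines read gpu-operator, the object already belongs to the expected release. Empty lines suggest the object was created outside Helm, and any other value names the release that should be removed first.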
Check that the fake GPU operator pods are running:
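A minimal sketch of this check, assuming the chart was installed into the gpu-operator namespace as described earlier (the function name show_fake_gpu_pods is hypothetical):

```shell
# Sketch: list the pods in the namespace used by the fake GPU operator release.
show_fake_gpu_pods() {
  kubectl get pods -n gpu-operator
}
```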
Confirm that labeled nodes now advertise GPU resources:
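One way to confirm this is to print the allocatable nvidia.com/gpu value for every node. The sketch below uses kubectl custom columns; the function name show_node_gpus and the column titles are arbitrary choices.

```shell
# Sketch: list each node with its allocatable nvidia.com/gpu count.
# The dot in "nvidia.com" is escaped so it is not read as a path separator.
show_node_gpus() {
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

Labeled nodes should report the GPU count configured for their pool.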
If GPUs do not appear, verify the node has the run.ai/simulated-gpu-node-pool=default
label and that the status-updater pod is not in an error state.
For the smoothest experience, install the fake GPU operator before running
helmfile sync. This way the NVCA agent discovers GPUs on its first boot and no
re-registration is needed.
The recommended sequence is:
nvidia.com/gpu appears in node allocatable resources

If you add the fake GPU operator to a cluster that already has NVCF deployed, the NVCA agent will be crash-looping because it cannot find GPUs. After installing the fake GPU operator and verifying that GPUs appear on nodes, re-register the cluster and restart the operator:
The operator restart will re-run the bootstrap init container, recreate the NVCFBackend resource, and spawn a fresh NVCA agent pod that discovers the simulated GPUs.
For details on the bootstrap process, see Self-Managed Clusters (Manual Cluster Registration).
Adjust the GPU count, product name, and memory per node pool:
Define multiple pools with different GPU configurations by using different map keys:
Then label nodes with the corresponding pool name:
To remove the fake GPU operator and all simulated GPU resources:
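The sketch below shows one possible removal sequence. It assumes the release is named gpu-operator in namespace gpu-operator, as implied by the ownership note earlier. Whether the pool label also has to be removed is an assumption, and both function names are hypothetical.

```shell
# Sketch: uninstall the chart release (assumed name and namespace: gpu-operator).
remove_fake_gpu_operator() {
  helm uninstall gpu-operator -n gpu-operator
}

# Sketch: drop the simulated pool label from one node.
# A trailing "-" after the label key tells kubectl to remove that label.
unlabel_fake_gpu_node() {
  kubectl label node "$1" run.ai/simulated-gpu-node-pool-
}
```

Run remove_fake_gpu_operator once, then unlabel_fake_gpu_node for each node that was labeled.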
After you remove the fake GPU operator, the NVCA agent loses GPU visibility and begins crash-looping. Either install the NVIDIA GPU Operator on nodes with physical GPUs or uninstall the NVCA operator.