Configure Cluster GPUs
Learn how to configure your Kubernetes cluster so NeMo Customizer fine-tuning jobs schedule on the correct GPU nodes based on GPU type.
Configure GPU Node Selection
The following are the steps to configure GPU node selection:
1. Identify GPU node labels.
List all nodes and their labels:
kubectl get nodes --show-labels
To see labels for a specific node:
kubectl describe node <NODE_NAME> | grep nvidia.com/gpu.product
2. Label a node (if needed).
If your GPU nodes aren’t labeled, you can add a label. For example, to label a node as an A100 GPU node:
kubectl label node <NODE_NAME> nvidia.com/gpu.product=NVIDIA-A100
3. Edit Helm values or job template.

In your values.yaml or job spec, set the tolerations and nodeSelectors fields to match the GPU type you want to target. This is a global, app-wide setting; all jobs inherit these values by default. However, you can override these settings for specific jobs by specifying tolerations and nodeSelectors in a config template, such as in customizationConfigTemplates. For example:

customizationConfigTemplates:
  overrideExistingTemplates: true
  templates:
    meta/llama-3.2-3b-instruct@v1.0.0+A100:
      training_options:
        - training_type: sft
          finetuning_type: lora
          num_gpus: 1
          num_nodes: 1
          tensor_parallel_size: 1
          micro_batch_size: 1
          max_seq_length: 4096
      pod_spec:
        tolerations:
          - key: app
            operator: Equal
            value: customizer
            effect: NoSchedule

When a job launches from a template whose pod_spec defines tolerations and nodeSelectors, those values override the global settings for that specific job.

The following examples show the global customizer tolerations and nodeSelectors values for A100, H100, and L40 GPU nodes.

A100:

customizer:
  tolerations:
    - key: "nvidia.com/gpu.product"
      operator: "Equal"
      value: "NVIDIA-A100"
      effect: "NoSchedule"
  nodeSelectors:
    nvidia.com/gpu.product: "NVIDIA-A100"

H100:

customizer:
  tolerations:
    - key: "nvidia.com/gpu.product"
      operator: "Equal"
      value: "NVIDIA-H100"
      effect: "NoSchedule"
  nodeSelectors:
    nvidia.com/gpu.product: "NVIDIA-H100"

L40:

customizer:
  tolerations:
    - key: "nvidia.com/gpu.product"
      operator: "Equal"
      value: "NVIDIA-L40"
      effect: "NoSchedule"
  nodeSelectors:
    nvidia.com/gpu.product: "NVIDIA-L40"
4. Apply the configuration.
helm upgrade --install <release-name> <chart-path> -f values.yaml
5. Verify pod scheduling.
Check that your pods run on the intended GPU nodes:
kubectl get pods -o wide
Check the node a specific pod runs on:
kubectl describe pod <POD_NAME> | grep Node:
If pods don’t schedule as expected, ensure your node labels and configuration are correct.
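One quick way to troubleshoot is to read the scheduler events for the pending pod and confirm that the expected label value is actually present on your nodes, for example:

kubectl describe pod <POD_NAME> | grep -A 10 Events:
kubectl get nodes -L nvidia.com/gpu.product

The Events section typically reports messages such as unsatisfied node selectors or untolerated taints, which point directly at the mismatched setting.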
Configure Job Pod Tolerations
To allow Customizer jobs to run on specific nodes using Kubernetes taints and tolerations, follow the steps below.
1. Prepare the target node. Drain existing workloads from the target node:

kubectl get pods --all-namespaces --field-selector spec.nodeName=<TARGET_NODE>
kubectl drain <TARGET_NODE> --ignore-daemonsets
2. Apply a node taint. Taint the target node so that only pods with a matching toleration can be scheduled on it:
kubectl taint nodes <TARGET_NODE> app=customizer:NoSchedule
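Before moving on, you can verify that the taint is in place with a standard kubectl check:

kubectl describe node <TARGET_NODE> | grep Taints

The output should include app=customizer:NoSchedule.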
3. Add matching tolerations. You can configure tolerations in two ways: globally for all jobs, or locally for specific customization config templates. When both configurations exist, their values combine.
To apply tolerations globally for all jobs, use the global customizerConfig.tolerations setting in values.yaml:

customizerConfig:
  tolerations:
    - key: app
      value: customizer
      operator: Equal
      effect: NoSchedule
To apply tolerations locally for specific customization config templates, use the customizerConfig.customizationConfigTemplates.templates[i].pod_spec.tolerations setting in values.yaml:

customizerConfig:
  customizationConfigTemplates:
    templates:
      meta/llama-3.2-1b-instruct@2.0+A100:
        name: llama-3.2-1b-instruct@2.0+A100
        namespace: meta
        target: meta/llama-3.2-1b-instruct@2.0
        training_options:
          - training_type: sft
            finetuning_type: lora
            num_gpus: 1
            micro_batch_size: 1
            max_seq_length: 4096
        pod_spec:
          tolerations:
            - key: app
              operator: Equal
              value: customizer
              effect: NoSchedule
        prompt_template: "{prompt} {completion}"
You can also specify tolerations through the Create Customization Config API.
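For example, the request body below mirrors the pod_spec fields shown in the template above. The base URL and endpoint path are assumptions for illustration; adjust them to match your deployment and the current API reference.

curl -X POST "${CUSTOMIZER_BASE_URL}/v1/customization/configs" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama-3.2-1b-instruct@2.0+A100",
    "namespace": "meta",
    "target": "meta/llama-3.2-1b-instruct@2.0",
    "training_options": [
      {
        "training_type": "sft",
        "finetuning_type": "lora",
        "num_gpus": 1,
        "micro_batch_size": 1,
        "max_seq_length": 4096
      }
    ],
    "pod_spec": {
      "tolerations": [
        {
          "key": "app",
          "operator": "Equal",
          "value": "customizer",
          "effect": "NoSchedule"
        }
      ]
    },
    "prompt_template": "{prompt} {completion}"
  }'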
When to use each option:
Use global tolerations when you want all customization jobs to run on the nodes with corresponding taints.
Use local tolerations when only specific model configurations should run on the nodes with corresponding taints.
Use both when you want to combine global and local tolerations.
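As a sketch of how the combination behaves: if the global setting defines the app=customizer toleration shown earlier and a template's pod_spec adds a GPU-specific toleration (a hypothetical addition, not one of the templates above), a job launched from that template carries both tolerations in its pod spec:

tolerations:
  - key: app
    operator: Equal
    value: customizer
    effect: NoSchedule
  - key: nvidia.com/gpu.product
    operator: Equal
    value: NVIDIA-A100
    effect: NoSchedule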
For more details, see Kubernetes taints and tolerations.
Configure Job Pod Node Selectors
To schedule Customizer jobs on specific nodes using Kubernetes node selectors, follow the steps below.
1. Prepare the target node. Add a label to the target node:
kubectl label node <TARGET_NODE> job-type=customizer
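You can confirm the label before continuing; this lists only the nodes that carry it:

kubectl get nodes -l job-type=customizer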
2. Add matching node selectors. You can configure node selectors in two ways: globally for all jobs, or locally for specific customization config templates. When both configurations exist, their values combine.
To apply node selectors globally for all jobs, use the global customizerConfig.nodeSelectors setting in values.yaml:

customizerConfig:
  nodeSelectors:
    job-type: customizer
To apply node selectors locally for specific customization config templates, use the customizerConfig.customizationConfigTemplates.templates[i].pod_spec.nodeSelectors setting in values.yaml:

customizerConfig:
  customizationConfigTemplates:
    templates:
      meta/llama-3.2-1b-instruct@2.0+A100:
        name: llama-3.2-1b-instruct@2.0+A100
        namespace: meta
        target: meta/llama-3.2-1b-instruct@2.0
        training_options:
          - training_type: sft
            finetuning_type: lora
            num_gpus: 1
            micro_batch_size: 1
            max_seq_length: 4096
        pod_spec:
          nodeSelectors:
            job-type: customizer
        prompt_template: "{prompt} {completion}"
You can also specify node selectors through the Create Customization Config API.
For more details, see Kubernetes node selectors.
Configuration Tips
The following tips can help you decide how to configure node selectors and taints for Customizer jobs.
Configuring Node Selectors Globally or Locally
Configure global node selectors to schedule all customization jobs on dedicated nodes.
Configure local node selectors to schedule specific model configurations on dedicated nodes.
Configure both global and local node selectors to combine their scheduling rules.
Configuring Node Selectors and Taints
For exclusive GPU node allocation with workload isolation, implement both node selector and taint-and-toleration configurations (see the sketch after this list).
For non-exclusive GPU node access where multiple workloads can share the node, implement only node selector configuration.
For flexible workload scheduling across different node types, implement only taint-and-toleration configuration.
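For the exclusive-allocation case, a minimal values.yaml sketch, reusing the job-type label and the app=customizer taint from the steps above, pairs global node selectors with matching tolerations:

customizerConfig:
  nodeSelectors:
    job-type: customizer
  tolerations:
    - key: app
      operator: Equal
      value: customizer
      effect: NoSchedule

The node selector steers Customizer jobs onto the labeled nodes, while the taint keeps workloads without a matching toleration off them, which is what provides the isolation.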