Configure Cluster GPUs
Learn how to configure your Kubernetes cluster so NeMo Customizer fine-tuning jobs schedule on the correct GPU nodes based on GPU type.
Configure GPU Node Selection
The following are the steps to configure GPU node selection:
1. Identify GPU node labels.
List all nodes and their labels:
kubectl get nodes --show-labels
To see labels for a specific node:
kubectl describe node <NODE_NAME> | grep nvidia.com/gpu.product
2. Label a node (if needed).
If your GPU nodes aren’t labeled, you can add a label. For example, to label a node as an A100 GPU node:
kubectl label node <NODE_NAME> nvidia.com/gpu.product=NVIDIA-A100
3. Edit Helm values or job template.

In your values.yaml or job spec, set the tolerations and nodeSelectors fields to match the GPU type you want to target. This is a global, app-wide setting; all jobs inherit these values by default. However, you can override these settings for specific jobs by specifying tolerations and nodeSelectors in a config template, such as in customizationConfigTemplates. For example:

customizationConfigTemplates:
  overrideExistingTemplates: true
  templates:
    meta/llama-3.2-3b-instruct@v1.0.0+A100:
      training_options:
        - training_type: sft
          finetuning_type: lora
          num_gpus: 1
          num_nodes: 1
          tensor_parallel_size: 1
          micro_batch_size: 1
          max_seq_length: 4096
      pod_spec:
        tolerations:
          - key: app
            operator: Equal
            value: customizer
            effect: NoSchedule

When a job launches from a template whose pod_spec defines tolerations and nodeSelectors, those values override the global settings for that specific job.

The following examples show the global customizer tolerations and nodeSelectors values for A100, H100, and L40 GPU nodes.

A100:

customizer:
  tolerations:
    - key: "nvidia.com/gpu.product"
      operator: "Equal"
      value: "NVIDIA-A100"
      effect: "NoSchedule"
  nodeSelectors:
    nvidia.com/gpu.product: "NVIDIA-A100"

H100:

customizer:
  tolerations:
    - key: "nvidia.com/gpu.product"
      operator: "Equal"
      value: "NVIDIA-H100"
      effect: "NoSchedule"
  nodeSelectors:
    nvidia.com/gpu.product: "NVIDIA-H100"

L40:

customizer:
  tolerations:
    - key: "nvidia.com/gpu.product"
      operator: "Equal"
      value: "NVIDIA-L40"
      effect: "NoSchedule"
  nodeSelectors:
    nvidia.com/gpu.product: "NVIDIA-L40"
4. Apply the configuration.
helm upgrade --install <release-name> <chart-path> -f values.yaml
5. Verify pod scheduling.
Check that your pods run on the intended GPU nodes:
kubectl get pods -o wide
Check the node a specific pod runs on:
kubectl describe pod <POD_NAME> | grep Node:
If pods don’t schedule as expected, ensure your node labels and configuration are correct.
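One quick way to troubleshoot is to read the scheduler events for the pending pod and confirm that the expected label value is actually present on your nodes, for example:

kubectl describe pod <POD_NAME> | grep -A 10 Events:
kubectl get nodes -L nvidia.com/gpu.product

The Events section typically reports messages such as unsatisfied node selectors or untolerated taints, which point directly at the mismatched setting.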
Configure Job Pod Tolerations
To allow Customizer jobs to run on specific nodes using Kubernetes taints and tolerations, follow the steps below.
1. Prepare the target node. Drain existing workloads from the target node:

kubectl get pods --all-namespaces --field-selector spec.nodeName=<TARGET_NODE>
kubectl drain <TARGET_NODE> --ignore-daemonsets
2. Apply a node taint. Taint the target node so that only pods with a matching toleration can be scheduled on it:
kubectl taint nodes <TARGET_NODE> app=customizer:NoSchedule
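Before moving on, you can verify that the taint is in place with a standard kubectl check:

kubectl describe node <TARGET_NODE> | grep Taints

The output should include app=customizer:NoSchedule.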
3. Add matching tolerations. You can configure tolerations in two ways: globally for all jobs, or locally for specific customization config templates. When both configurations exist, their values combine.
To apply tolerations globally for all jobs, use the global customizerConfig.tolerations setting in values.yaml:

customizerConfig:
  tolerations:
    - key: app
      value: customizer
      operator: Equal
      effect: NoSchedule
To apply tolerations locally for specific customization config templates, use the customizerConfig.customizationConfigTemplates.templates[i].pod_spec.tolerations setting in values.yaml:

customizerConfig:
  customizationConfigTemplates:
    templates:
      meta/llama-3.2-1b-instruct@2.0+A100:
        name: llama-3.2-1b-instruct@2.0+A100
        namespace: meta
        target: meta/llama-3.2-1b-instruct@2.0
        training_options:
          - training_type: sft
            finetuning_type: lora
            num_gpus: 1
            micro_batch_size: 1
            max_seq_length: 4096
        pod_spec:
          tolerations:
            - key: app
              operator: Equal
              value: customizer
              effect: NoSchedule
        prompt_template: "{prompt} {completion}"
You can also specify tolerations through the Create Customization Config API.
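For example, the request body below mirrors the pod_spec fields shown in the template above. The base URL and endpoint path are assumptions for illustration; adjust them to match your deployment and the current API reference.

curl -X POST "${CUSTOMIZER_BASE_URL}/v1/customization/configs" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama-3.2-1b-instruct@2.0+A100",
    "namespace": "meta",
    "target": "meta/llama-3.2-1b-instruct@2.0",
    "training_options": [
      {
        "training_type": "sft",
        "finetuning_type": "lora",
        "num_gpus": 1,
        "micro_batch_size": 1,
        "max_seq_length": 4096
      }
    ],
    "pod_spec": {
      "tolerations": [
        {
          "key": "app",
          "operator": "Equal",
          "value": "customizer",
          "effect": "NoSchedule"
        }
      ]
    },
    "prompt_template": "{prompt} {completion}"
  }'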
When to use each option:
Use global tolerations when you want all customization jobs to run on the nodes with corresponding taints.
Use local tolerations when only specific model configurations should run on the nodes with corresponding taints.
Use both when you want to combine global and local tolerations.
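As a sketch of how the combination behaves: if the global setting defines the app=customizer toleration shown earlier and a template's pod_spec adds a GPU-specific toleration (a hypothetical addition, not one of the templates above), a job launched from that template carries both tolerations in its pod spec:

tolerations:
  - key: app
    operator: Equal
    value: customizer
    effect: NoSchedule
  - key: nvidia.com/gpu.product
    operator: Equal
    value: NVIDIA-A100
    effect: NoSchedule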
For more details, see Kubernetes taints and tolerations.
Configure Job Pod Node Selectors
To schedule Customizer jobs on specific nodes using Kubernetes node selectors, follow the steps below.
1. Prepare the target node. Add a label to the target node:
kubectl label node <TARGET_NODE> job-type=customizer
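You can confirm the label before continuing; this lists only the nodes that carry it:

kubectl get nodes -l job-type=customizer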
2. Add matching node selectors. You can configure node selectors in two ways: globally for all jobs, or locally for specific customization config templates. When both configurations exist, their values combine.
To apply node selectors globally for all jobs, use the global customizerConfig.nodeSelectors setting in values.yaml:

customizerConfig:
  nodeSelectors:
    job-type: customizer
To apply node selectors locally for specific customization config templates, use the customizerConfig.customizationConfigTemplates.templates[i].pod_spec.nodeSelectors setting in values.yaml:

customizerConfig:
  customizationConfigTemplates:
    templates:
      meta/llama-3.2-1b-instruct@2.0+A100:
        name: llama-3.2-1b-instruct@2.0+A100
        namespace: meta
        target: meta/llama-3.2-1b-instruct@2.0
        training_options:
          - training_type: sft
            finetuning_type: lora
            num_gpus: 1
            micro_batch_size: 1
            max_seq_length: 4096
        pod_spec:
          nodeSelectors:
            job-type: customizer
        prompt_template: "{prompt} {completion}"
You can also specify node selectors through the Create Customization Config API.
For more details, see Kubernetes node selectors.
Configuration Tips
The following tips can help you decide how to configure node selectors and taints for Customizer jobs.
Configuring Node Selectors Globally or Locally
Configure global node selectors to schedule all customization jobs on dedicated nodes.
Configure local node selectors to schedule specific model configurations on dedicated nodes.
Configure both global and local node selectors to combine their scheduling rules.
Configuring Node Selectors and Taints
For exclusive GPU node allocation with workload isolation, implement both node selector and taint-and-toleration configurations (see the sketch after this list).
For non-exclusive GPU node access where multiple workloads can share the node, implement only node selector configuration.
For flexible workload scheduling across different node types, implement only taint-and-toleration configuration.
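For the exclusive-allocation case, a minimal values.yaml sketch, reusing the job-type label and the app=customizer taint from the steps above, pairs global node selectors with matching tolerations:

customizerConfig:
  nodeSelectors:
    job-type: customizer
  tolerations:
    - key: app
      operator: Equal
      value: customizer
      effect: NoSchedule

The node selector steers Customizer jobs onto the labeled nodes, while the taint keeps workloads without a matching toleration off them, which is what provides the isolation.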