Configure Cluster GPUs#

Learn how to configure your Kubernetes cluster so NeMo Customizer fine-tuning jobs schedule on the correct GPU nodes based on GPU type.

Configure GPU Node Selection#

The following are the steps to configure GPU node selection:

  1. Identify GPU node labels.

    List all nodes and their labels:

    kubectl get nodes --show-labels
    

    To see labels for a specific node:

    kubectl describe node <NODE_NAME> | grep nvidia.com/gpu.product
    
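    Alternatively, display the GPU product label as a column for every node:

    kubectl get nodes -L nvidia.com/gpu.product
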
  2. Label a node (if needed).

    If your GPU nodes aren’t labeled, you can add a label. For example, to label a node as an A100 GPU node:

    kubectl label node <NODE_NAME> nvidia.com/gpu.product=NVIDIA-A100
    
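    To verify the label, query nodes with a label selector:

    kubectl get nodes -l nvidia.com/gpu.product=NVIDIA-A100
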
  3. Edit Helm values or job template.

    In your values.yaml or job spec, set the tolerations and nodeSelectors fields to match the GPU type you want to target. This is a global, app-wide setting; all jobs inherit these values by default.

    However, you can override these settings for specific jobs by specifying tolerations and nodeSelectors in a config template, such as in customizationConfigTemplates. For example:

    customizationConfigTemplates:
      overrideExistingTemplates: true
      templates:
        meta/llama-3.2-3b-instruct@v1.0.0+A100:
          training_options:
            - training_type: sft
              finetuning_type: lora
              num_gpus: 1
              num_nodes: 1
              tensor_parallel_size: 1
              micro_batch_size: 1
          max_seq_length: 4096
          pod_spec:
            tolerations:
              - key: app
                operator: Equal
                value: customizer
                effect: NoSchedule
    

    When a job launches using a template whose pod_spec defines tolerations or nodeSelectors, those values override the global settings for that specific job.

    The following examples show global settings that target a specific GPU type. To target A100 nodes:
    customizer:
      tolerations:
        - key: "nvidia.com/gpu.product"
          operator: "Equal"
          value: "NVIDIA-A100"
          effect: "NoSchedule"
      nodeSelectors:
        nvidia.com/gpu.product: "NVIDIA-A100"
    
    To target H100 nodes:

    customizer:
      tolerations:
        - key: "nvidia.com/gpu.product"
          operator: "Equal"
          value: "NVIDIA-H100"
          effect: "NoSchedule"
      nodeSelectors:
        nvidia.com/gpu.product: "NVIDIA-H100"
    
    To target L40 nodes:

    customizer:
      tolerations:
        - key: "nvidia.com/gpu.product"
          operator: "Equal"
          value: "NVIDIA-L40"
          effect: "NoSchedule"
      nodeSelectors:
        nvidia.com/gpu.product: "NVIDIA-L40"
    
  4. Apply the configuration.

    helm upgrade --install <release-name> <chart-path> -f values.yaml
    
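    To preview the rendered manifests before applying them, you can add --dry-run:

    helm upgrade --install <release-name> <chart-path> -f values.yaml --dry-run
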
  5. Verify pod scheduling.

    1. Check that your pods run on the intended GPU nodes:

    kubectl get pods -o wide
    
    2. Check the node a specific pod runs on:

    kubectl describe pod <POD_NAME> | grep Node:
    
    3. If pods don’t schedule as expected, ensure your node labels and configuration are correct.
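
    Scheduling failures typically surface as FailedScheduling events in the pod description:

    kubectl describe pod <POD_NAME>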


Configure Job Pod Tolerations#

To allow Customizer jobs to run on specific nodes using Kubernetes taints and tolerations, follow the steps below.

  1. Prepare the target node. List the pods running on it, then drain it of existing workloads:

    kubectl get pods --all-namespaces --field-selector spec.nodeName=<TARGET_NODE>
    kubectl drain <TARGET_NODE> --ignore-daemonsets
    
  2. Apply node taint. Add a taint to the node:

    kubectl taint nodes <TARGET_NODE> app=customizer:NoSchedule
    
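    To confirm the taint is in place, inspect the node:

    kubectl describe node <TARGET_NODE> | grep Taints
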
  3. Add matching tolerations. You can configure tolerations in two ways: globally for all jobs or locally for specific customization config templates. When both configurations exist, their values combine.

    1. To apply tolerations globally for all jobs, use the global customizerConfig.tolerations setting in values.yaml.

    customizerConfig:
      tolerations:
        - key: app
          value: customizer
          operator: Equal
          effect: NoSchedule
    
    2. To apply tolerations locally for specific customization config templates, use the customizerConfig.customizationConfigTemplates.templates[i].pod_spec.tolerations setting in values.yaml.

    customizerConfig:
      customizationConfigTemplates:
        templates:
          meta/llama-3.2-1b-instruct@2.0+A100:
            name: llama-3.2-1b-instruct@2.0+A100
            namespace: meta
            target: meta/llama-3.2-1b-instruct@2.0
            training_options:
              - training_type: sft
                finetuning_type: lora
                num_gpus: 1
                micro_batch_size: 1
            max_seq_length: 4096
            pod_spec:
              tolerations:
                - key: app
                  operator: Equal
                  value: customizer
                  effect: NoSchedule
            prompt_template: "{prompt} {completion}"
    

    You can also specify tolerations through the Create Customization Config API.
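
    As a sketch of such a request (the host, endpoint path, and JSON field names below are assumptions that mirror the template fields above; confirm them against the Create Customization Config API reference):

    # Hypothetical endpoint and payload shape
    curl -X POST "http://<CUSTOMIZER_HOST>/v1/customization/configs" \
      -H "Content-Type: application/json" \
      -d '{
        "name": "llama-3.2-1b-instruct@2.0+A100",
        "namespace": "meta",
        "target": "meta/llama-3.2-1b-instruct@2.0",
        "training_options": [
          {"training_type": "sft", "finetuning_type": "lora", "num_gpus": 1, "micro_batch_size": 1}
        ],
        "max_seq_length": 4096,
        "prompt_template": "{prompt} {completion}",
        "pod_spec": {
          "tolerations": [
            {"key": "app", "operator": "Equal", "value": "customizer", "effect": "NoSchedule"}
          ]
        }
      }'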

    When to use each option:

    • Use global tolerations when you want all customization jobs to run on the nodes with corresponding taints.

    • Use local tolerations when only specific model configurations should run on the nodes with corresponding taints.

    • Use both when you want baseline tolerations applied to every job plus additional tolerations for specific model configurations.
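
    When both are set, their values combine. For example, with the global app=customizer toleration above and a template whose pod_spec adds a GPU-specific toleration (illustrative values), the job pod would carry both:

    # Sketch of the effective tolerations on the job pod
    tolerations:
      - key: app                        # from the global customizerConfig.tolerations
        operator: Equal
        value: customizer
        effect: NoSchedule
      - key: nvidia.com/gpu.product     # from the template's pod_spec.tolerations (illustrative)
        operator: Equal
        value: NVIDIA-A100
        effect: NoSchedule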

For more details, see Kubernetes taints and tolerations.


Configure Job Pod Node Selectors#

To schedule Customizer jobs on specific nodes using Kubernetes node selectors, follow the steps below.

  1. Prepare the target node. Add a label to the target node:

    kubectl label node <TARGET_NODE> job-type=customizer
    
  2. Add matching node selectors. You can configure node selectors in two ways: globally for all jobs or locally for specific customization config templates. When both configurations exist, their values combine.

    1. To apply node selectors globally for all jobs, use the global customizerConfig.nodeSelectors setting in values.yaml.

    customizerConfig:
      nodeSelectors:
        job-type: customizer
    
    2. To apply node selectors locally for specific customization config templates, use the customizerConfig.customizationConfigTemplates.templates[i].pod_spec.nodeSelectors setting in values.yaml.

    customizerConfig:
      customizationConfigTemplates:
        templates:
          meta/llama-3.2-1b-instruct@2.0+A100:
            name: llama-3.2-1b-instruct@2.0+A100
            namespace: meta
            target: meta/llama-3.2-1b-instruct@2.0
            training_options:
              - training_type: sft
                finetuning_type: lora
                num_gpus: 1
                micro_batch_size: 1
            max_seq_length: 4096
            pod_spec:
              nodeSelectors:
                job-type: customizer
            prompt_template: "{prompt} {completion}"
    

    You can also specify node selectors through the Create Customization Config API.
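
    As with tolerations, a sketch of the equivalent request (the endpoint and field names are assumptions; see the Create Customization Config API reference):

    # Hypothetical endpoint and payload shape
    curl -X POST "http://<CUSTOMIZER_HOST>/v1/customization/configs" \
      -H "Content-Type: application/json" \
      -d '{
        "name": "llama-3.2-1b-instruct@2.0+A100",
        "namespace": "meta",
        "target": "meta/llama-3.2-1b-instruct@2.0",
        "training_options": [
          {"training_type": "sft", "finetuning_type": "lora", "num_gpus": 1, "micro_batch_size": 1}
        ],
        "max_seq_length": 4096,
        "prompt_template": "{prompt} {completion}",
        "pod_spec": {
          "nodeSelectors": {"job-type": "customizer"}
        }
      }'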

For more details, see Kubernetes node selectors.


Configuration Tips#

The following are tips for configuring node selectors and taints.

Configuring Node Selectors Globally or Locally#

  • Configure global node selectors to schedule all customization jobs on dedicated nodes.

  • Configure local node selectors to schedule specific model configurations on dedicated nodes.

  • Configure both global and local node selectors to combine their scheduling rules.

Configuring Node Selectors and Taints#

  • For exclusive GPU node allocation with workload isolation, implement both node selector and taint-and-toleration configurations (see the sketch after this list).

  • For non-exclusive GPU node access where multiple workloads can share the node, implement only node selector configuration.

  • For flexible workload scheduling across different node types, implement only taint-and-toleration configuration.
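
For example, a minimal sketch of the exclusive-allocation pattern combines a node label and a taint with matching global settings (the node name is a placeholder; the values follow the examples above):

    # Dedicate the node: label it for selection and taint it to repel other workloads
    kubectl label node <TARGET_NODE> job-type=customizer
    kubectl taint nodes <TARGET_NODE> app=customizer:NoSchedule

Then set both fields in values.yaml so job pods select the labeled node and tolerate its taint:

    customizerConfig:
      nodeSelectors:
        job-type: customizer
      tolerations:
        - key: app
          operator: Equal
          value: customizer
          effect: NoSchedule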