Manage GPUs#

Configure and optimize your NeMo microservices deployment for different NVIDIA GPU types, including A100, H100, and L40. Correct GPU configuration ensures that GPU nodes are available in your Kubernetes cluster so fine-tuning jobs can run.

For configuring GPU and training options for specific models, see the Model Configurations page.

About Workloads#

NeMo Customizer workloads are GPU-intensive and will fail or run extremely slowly without access to the requested GPU resources.

Each Customizer fine-tuning job creates one or more Kubernetes pods. The training_options configuration you set (such as num_gpus, num_nodes, nodeSelectors, and tolerations) applies to that job. num_nodes specifies how many nodes are used for fine-tuning and equals the number of Kubernetes pods created. If you run multiple jobs, there is no resource sharing between jobs at the pod level; each pod requests and uses the GPUs you specify. The total number of GPUs used is num_nodes x num_gpus.
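
For example, the following minimal sketch shows how the per-job GPU request adds up, using a hypothetical training_options entry with the field names mentioned above (the exact API schema may differ in your deployment):

```python
# Minimal sketch: compute the total GPU request for a hypothetical
# training_options entry. Field names follow the options described above;
# treat the exact schema as an assumption, not the authoritative API shape.
training_option = {
    "num_gpus": 4,    # GPUs requested per node (per pod)
    "num_nodes": 2,   # nodes (and therefore pods) used for the job
}

# Total GPUs consumed by one job = num_nodes x num_gpus
total_gpus = training_option["num_nodes"] * training_option["num_gpus"]
print(f"This job requests {total_gpus} GPUs in total.")  # -> 8
```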

Proper configuration ensures that Kubernetes schedules these jobs on nodes with available GPUs and that each job gets the right type and number of GPUs. Different models may require different numbers or types of GPUs, so set these values according to your model requirements and available hardware.
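
As a rough illustration, targeting a specific GPU type typically relies on Kubernetes node labels and taints. The snippet below is a hypothetical sketch of such scheduling hints: the label nvidia.com/gpu.product is applied by NVIDIA GPU Feature Discovery when it is installed in the cluster, and the exact labels, taint keys, and values depend on how your nodes are labeled and tainted, so treat them as placeholders.

```python
# Hypothetical sketch of per-job scheduling hints expressed as a Python dict.
# All values are placeholders; check your cluster's actual node labels and
# taints before using anything like this.
gpu_scheduling = {
    "nodeSelectors": {
        # Schedule only on nodes exposing A100 80GB GPUs (placeholder value,
        # set by NVIDIA GPU Feature Discovery if it is installed).
        "nvidia.com/gpu.product": "NVIDIA-A100-SXM4-80GB",
    },
    "tolerations": [
        {
            # Allow the pod onto nodes tainted for GPU workloads.
            "key": "nvidia.com/gpu",
            "operator": "Exists",
            "effect": "NoSchedule",
        }
    ],
}
```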

User-Set Overrides#

Resource configurations, such as GPU type and count, are typically defined by administrators using the Customizer targets and configs APIs. Admins set up the available models and resource options that align with the cluster's hardware and organizational policies.

When submitting a fine-tuning job, users select from these pre-defined configurations rather than specifying arbitrary resource values. This keeps resource usage controlled and consistent with admin policies. If a user needs a different configuration, they must ask an admin to create or modify a config.
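
The following hedged sketch illustrates this pattern from the user side: a job request references an admin-defined config by name instead of setting GPU counts directly. The base URL, endpoint path, field names, and config identifier below are illustrative assumptions; consult your deployment's Customizer API reference for the exact schema.

```python
# Hedged sketch: submit a fine-tuning job that selects a pre-defined config
# by name. Endpoint, fields, and names are placeholders, not the confirmed API.
import requests

CUSTOMIZER_URL = "http://nemo-customizer.example.com"  # placeholder base URL

job_request = {
    # Reference one of the configs an admin has already created.
    "config": "meta/llama-3.1-8b-instruct@v1.0.0+A100",  # example config name
    "dataset": {"name": "my-training-dataset"},          # placeholder dataset
}

response = requests.post(f"{CUSTOMIZER_URL}/v1/customization/jobs", json=job_request)
response.raise_for_status()
print(response.json())
```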


- Model Configurations Matrix: Review the recommended default settings for each supported model.
- Configure Cluster GPUs: Learn how to configure your Kubernetes cluster and Helm charts for specific GPU types.
- Troubleshooting GPU Jobs: Solutions to common issues with GPU scheduling and configuration.