Precompiled Driver Containers

About Precompiled Driver Containers

Note

Technology Preview features are not supported in production environments and are not functionally complete. Technology Preview features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. These releases may not have any documentation, and testing is limited.

Containers with precompiled drivers do not require internet access to download Linux kernel header files, GCC compiler tooling, or operating system packages.

Using precompiled drivers also avoids the burst of compute demand that is required to compile the kernel drivers with the conventional driver containers.

Both benefits help most sites, and they matter most at sites with restricted internet access or resource-constrained hardware.

Limitations and Restrictions

  • Support for deploying the driver containers with precompiled drivers is limited to hosts with the Ubuntu 22.04 operating system and x86_64 architecture.

    For information about using precompiled drivers with OpenShift Container Platform, refer to Precompiled Drivers for the NVIDIA GPU Operator for RHCOS.

  • NVIDIA supports precompiled driver containers for the most recently released long-term servicing branch (LTSB) of the driver, 525.

  • NVIDIA builds images for the generic kernel variant. If your hosts run a different kernel variant, you can build a precompiled driver image and use your own container registry.

  • Precompiled driver containers do not support NVIDIA vGPU or GPUDirect Storage (GDS).
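The operating system, architecture, and kernel variant limitations above can be checked before you deploy. The following is a minimal sketch, not part of the GPU Operator: host_supported is a hypothetical helper name, and the return codes are an assumption made for illustration.

```shell
# Hypothetical helper: report whether a host matches the limitations listed above.
# Arguments: OS ID, OS version, machine architecture, kernel release.
# Returns 0 when an NVIDIA-built image can apply, 2 when the OS and architecture
# match but the kernel is not the generic variant, and 1 otherwise.
host_supported() {
  if [ "$1" != "ubuntu" ] || [ "$2" != "22.04" ] || [ "$3" != "x86_64" ]; then
    return 1
  fi
  case "$4" in
    *-generic) return 0 ;;
    *) return 2 ;;
  esac
}

# Example with literal values. On a real host, read ID and VERSION_ID from
# /etc/os-release and pass "$(uname -m)" and "$(uname -r)".
host_supported ubuntu 22.04 x86_64 5.15.0-69-generic && echo "precompiled image may be available"
```

A return code of 2 suggests that you need to build a custom driver container image for your kernel variant.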

Determining if a Precompiled Driver Container is Available

The precompiled driver containers are named according to the following pattern:

<driver-branch>-<linux-kernel-version>-<os-tag>

For example, 525-5.15.0-69-generic-ubuntu22.04.
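To work out which tag your host needs, you can join the three parts of the pattern yourself. The following is a minimal sketch: driver_tag is a hypothetical helper, and the driver branch 525 and the ubuntu22.04 OS tag are assumed values that you replace with your own.

```shell
# Hypothetical helper: join driver branch, kernel version, and OS tag
# according to the <driver-branch>-<linux-kernel-version>-<os-tag> pattern.
driver_tag() {
  printf '%s-%s-%s\n' "$1" "$2" "$3"
}

# Assumes driver branch 525 and an Ubuntu 22.04 host.
# uname -r reports the release of the running kernel.
driver_tag 525 "$(uname -r)" ubuntu22.04
```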

Use one of the following methods to check whether a driver container is available for your Linux kernel and driver branch:

  • Use a web browser to access the NVIDIA GPU Driver page of the NVIDIA GPU Cloud registry at https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags. Use the search field to filter the tags by your operating system version.

  • Use the NGC CLI tool to list the tags for the driver container:

    $ ngc registry image info nvidia/driver
    

    Example Output

    Image Repository Information
      Name: driver
      Display Name: NVIDIA GPU Driver
      Short Description: Provision NVIDIA GPU Driver as a Container.
      Built By: NVIDIA
      Publisher: NVIDIA
      Multinode Support: False
      Multi-Arch Support: True
      Logo: https://assets.nvidiagrid.net/ngc/logos/Infrastructure.png
      Labels: Multi-Arch, NVIDIA AI Enterprise Supported, Infrastructure Software, Kubernetes Infrastructure
      Public: Yes
      Last Updated: Apr 20, 2023
      Latest Image Size: 688.87 MB
      Latest Tag: 525-5.15.0-69-generic-ubuntu22.04
      Tags:
          525-5.15.0-69-generic-ubuntu22.04
          525-5.15.0-70-generic-ubuntu22.04
          ...
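To avoid scanning a long tag list by eye, you can filter the command output for the exact tag you need. The following is a minimal sketch that assumes the tags are printed one per line, as in the example output above. tag_listed is a hypothetical helper, not an NGC CLI feature.

```shell
# Hypothetical helper: succeed when the tag in $1 appears on a line by itself
# (ignoring surrounding whitespace) in the text read from standard input.
tag_listed() {
  sed 's/^[[:space:]]*//; s/[[:space:]]*$//' | grep -qxF -- "$1"
}

# Sample input copied from the example output above.
printf '  Latest Tag: 525-5.15.0-69-generic-ubuntu22.04\n  Tags:\n      525-5.15.0-69-generic-ubuntu22.04\n      525-5.15.0-70-generic-ubuntu22.04\n' |
  tag_listed 525-5.15.0-70-generic-ubuntu22.04 && echo "available"
```

On a host with the NGC CLI installed, you would pipe the real output instead, for example: ngc registry image info nvidia/driver | tag_listed "525-$(uname -r)-ubuntu22.04"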
    

Enabling Precompiled Driver Container Support During Installation

Refer to the common instructions for installing the Operator with Helm at Installing the NVIDIA GPU Operator. Specify the --set driver.usePrecompiled=true and --set driver.version=<driver-branch> arguments, as in the following example command:

$ helm install --wait gpu-operator \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set driver.usePrecompiled=true \
     --set driver.version="<driver-branch>"

Specify a value like 525 for <driver-branch>. Refer to Chart Customization Options for information about other installation options.

Enabling Support After Installation

Perform the following steps to enable support for precompiled driver containers:

  1. Enable support by modifying the cluster policy:

    $ kubectl patch clusterpolicy/cluster-policy --type='json' \
       -p='[
         {"op":"replace", "path":"/spec/driver/usePrecompiled", "value":true},
         {"op":"replace", "path":"/spec/driver/version", "value":"<driver-branch>"}
       ]'
    

    Specify a value like 525 for <driver-branch>.

    Example Output

    clusterpolicy.nvidia.com/cluster-policy patched
    
  2. (Optional) Confirm that the driver daemonset pods terminate:

    $ kubectl get pods -n gpu-operator
    

    Example Output

    NAME                                                              READY   STATUS        RESTARTS   AGE
    pod/gpu-feature-discovery-pzzr8                                   2/2     Running       0          19m
    pod/gpu-operator-859cb64846-57hfn                                 1/1     Running       0          47m
    pod/gpu-operator-node-feature-discovery-master-6d6649d597-7l8bj   1/1     Running       0          10d
    pod/gpu-operator-node-feature-discovery-worker-v86vj              1/1     Running       0          10d
    pod/nvidia-container-toolkit-daemonset-6ltbv                      1/1     Running       0          19m
    pod/nvidia-cuda-validator-62w6r                                   0/1     Completed     0          17m
    pod/nvidia-dcgm-exporter-fh5wz                                    1/1     Running       0          19m
    pod/nvidia-device-plugin-daemonset-rwslh                          2/2     Running       0          19m
    pod/nvidia-device-plugin-validator-gq4ww                          0/1     Completed     0          17m
    pod/nvidia-driver-daemonset-xqrxk                                 1/1     Terminating   0          20m
    pod/nvidia-operator-validator-78mzv                               1/1     Running       0          19m
    
  3. Confirm that the driver container pods are running:

    $ kubectl get pods -l app=nvidia-driver-daemonset -n gpu-operator
    

    Example Output

    NAME                                                          READY   STATUS    RESTARTS   AGE
    nvidia-driver-daemonset-5.15.0-69-generic-ubuntu22.04-thbts   1/1     Running   0          44s
    

    Ensure that the pod names include a Linux kernel version string like 5.15.0-69-generic.
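The pod name check can be scripted. The following is a minimal sketch that assumes precompiled driver pods are named nvidia-driver-daemonset- followed by a kernel version, as in the example output above. is_precompiled_pod is a hypothetical helper.

```shell
# Hypothetical helper: succeed when a driver pod name carries a kernel version
# such as 5.15.0-69-generic, which indicates a precompiled driver container.
# A leading "pod/" prefix, as printed by kubectl with -o name, is stripped first.
is_precompiled_pod() {
  printf '%s\n' "${1#pod/}" |
    grep -Eq '^nvidia-driver-daemonset-[0-9]+\.[0-9]+\.[0-9]+-'
}

# Pod names copied from the example outputs in this document.
for pod in nvidia-driver-daemonset-5.15.0-69-generic-ubuntu22.04-thbts nvidia-driver-daemonset-qwprp; do
  if is_precompiled_pod "$pod"; then
    echo "$pod: precompiled"
  else
    echo "$pod: conventional"
  fi
done
```

To check a live cluster, pass each name printed by kubectl get pods -l app=nvidia-driver-daemonset -n gpu-operator -o name to the helper.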

Disabling Support for Precompiled Driver Containers

Perform the following steps to disable support for precompiled driver containers:

  1. Disable support by modifying the cluster policy:

    $ kubectl patch clusterpolicy/cluster-policy --type='json' \
        -p='[{"op": "replace", "path": "/spec/driver/usePrecompiled", "value":false}]'
    

    Example Output

    clusterpolicy.nvidia.com/cluster-policy patched
    
  2. Confirm that the conventional driver container pods are running:

    $ kubectl get pods -l app=nvidia-driver-daemonset -n gpu-operator
    

    Example Output

    NAME                            READY   STATUS    RESTARTS   AGE
    nvidia-driver-daemonset-qwprp   1/1     Running   0          10m
    

    Ensure that the pod names do not include a Linux kernel version string.

Building a Custom Driver Container Image

If a precompiled driver container for your Linux kernel variant is not available, you can perform the following steps to build and run a container image.

Note

NVIDIA provides limited support for custom driver container images.

Prerequisites

  • You have access to a private container registry, such as NVIDIA NGC Private Registry, and can push container images to the registry.

  • Your build machine has access to the internet to download operating system packages.

  • You know a CUDA version, such as 12.1.0, that you want to use. The CUDA version only specifies which base image is used to build the driver container. The version does not have any correlation to the version of CUDA that is associated with or supported by the resulting driver container.

    One way to find a supported CUDA version for your operating system is to access the NVIDIA GPU Cloud registry at https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda/tags and view the tags. Use the search field to filter the tags, such as base-ubuntu22.04. The filtered results show the CUDA versions, such as 12.1.0, 12.0.1, 12.0.0, and so on.

  • You know the GPU driver branch, such as 525, that you want to use.

Procedure

  1. Clone the driver container repository and change directory into the repository:

    $ git clone https://gitlab.com/nvidia/container-images/driver
    
    $ cd driver
    
  2. Change to the precompiled directory for your operating system name and version:

    $ cd ubuntu22.04/precompiled
    
  3. Set environment variables for building the driver container image.

    • Specify your private registry URL:

      $ export PRIVATE_REGISTRY=<private-registry-url>
      
    • Specify the KERNEL_VERSION environment variable that matches your kernel variant, such as 5.15.0-1033-aws:

      $ export KERNEL_VERSION=5.15.0-1033-aws
      
    • Specify the version of the CUDA base image to use when building the driver container:

      $ export CUDA_VERSION=12.1.0
      
    • Specify the driver branch, such as 525:

      $ export DRIVER_BRANCH=525
      
    • Specify the OS_TAG environment variable to identify the guest operating system name and version:

      $ export OS_TAG=ubuntu22.04
      

      The value must match the guest operating system version.

  4. Build the driver container image:

    $ sudo docker build \
        --build-arg KERNEL_VERSION=$KERNEL_VERSION \
        --build-arg CUDA_VERSION=$CUDA_VERSION \
        --build-arg DRIVER_BRANCH=$DRIVER_BRANCH \
        -t ${PRIVATE_REGISTRY}/driver:${DRIVER_BRANCH}-${KERNEL_VERSION}-${OS_TAG} .
    
  5. Push the driver container image to your private registry.

    • Log in to your private registry:

      $ sudo docker login ${PRIVATE_REGISTRY} --username=<username>
      

      Enter your password when prompted.

    • Push the driver container image to your private registry:

      $ sudo docker push ${PRIVATE_REGISTRY}/driver:${DRIVER_BRANCH}-${KERNEL_VERSION}-${OS_TAG}
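The build and push commands above both depend on the environment variables from step 3, and an unset variable silently produces a malformed image reference. The following is a minimal sketch of a check you could run first. driver_image_ref is a hypothetical helper, and the variable names are the ones used in the procedure.

```shell
# Hypothetical helper: print the image reference used by the build and push
# commands, or fail with a message when a required variable is unset or empty.
driver_image_ref() {
  for name in PRIVATE_REGISTRY DRIVER_BRANCH KERNEL_VERSION OS_TAG; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "error: $name is not set" >&2
      return 1
    fi
  done
  echo "${PRIVATE_REGISTRY}/driver:${DRIVER_BRANCH}-${KERNEL_VERSION}-${OS_TAG}"
}
```

After you export the variables, running driver_image_ref prints the reference that the docker build -t and docker push commands use, so you can confirm that it follows the <driver-branch>-<linux-kernel-version>-<os-tag> pattern.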
      

Next Steps

  • To use the custom driver container image, follow the steps for enabling support during or after installation.

    If you have not already installed the GPU Operator, in addition to the --set driver.usePrecompiled=true and --set driver.version=${DRIVER_BRANCH} arguments for Helm, also specify the --set driver.repository="$PRIVATE_REGISTRY" argument.

    If the container registry is not public, create an image pull secret in the GPU Operator namespace and specify it in the --set driver.imagePullSecrets=<pull-secret> argument.

    If you already installed the GPU Operator, specify the private registry for the driver in the cluster policy:

    $ kubectl patch clusterpolicy/cluster-policy --type='json' \
        -p='[{"op": "replace", "path": "/spec/driver/repository", "value":"'"$PRIVATE_REGISTRY"'"}]'
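Because the shell does not expand variables inside single quotes, the registry value has to be spliced into the JSON patch from outside them. One way to keep the quoting manageable is to generate the patch with printf. The following is a minimal sketch: repository_patch is a hypothetical helper, registry.example.com/team is a placeholder, and the registry URL is assumed to contain no characters that need JSON escaping.

```shell
# Hypothetical helper: print a JSON patch that sets /spec/driver/repository
# to the registry URL given in $1.
repository_patch() {
  printf '[{"op": "replace", "path": "/spec/driver/repository", "value": "%s"}]\n' "$1"
}

# Placeholder registry URL for illustration.
repository_patch "registry.example.com/team"
```

You can then pass the result to the patch command, for example: kubectl patch clusterpolicy/cluster-policy --type='json' -p="$(repository_patch "$PRIVATE_REGISTRY")"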