Considerations when Installing with Outdated Kernels in the Cluster
The driver container deployed as part of the GPU Operator requires certain packages to be available as part of the driver installation. On GPU nodes where the running kernel is not the latest, the driver container may fail to find the right versions of these packages (e.g. kernel-headers, kernel-devel) that correspond to the running kernel version. In the driver container logs, you will most likely see the following error message:

Could not resolve Linux kernel version
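Before applying the workaround, you can confirm the root cause directly on an affected node: with only the default repositories enabled, the headers for an outdated kernel are typically no longer resolvable. A quick sanity check on a CentOS/RHEL node (package names may differ on other distributions):

$ yum list kernel-headers-$(uname -r) kernel-devel-$(uname -r)

If this command reports that no matching packages were found, the driver container will hit the same resolution failure.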
In general, upgrading your system to the latest kernel should fix this issue. If that is not an option, the following workaround lets you successfully deploy the GPU Operator even when GPU nodes in your cluster are not running the latest kernel.
Add Archived Package Repositories
The workaround is to find the package archive containing the packages for your outdated kernel and to add this repository to the package manager running inside the driver container. To achieve this, we can simply mount a repository list file into the driver container using a ConfigMap. The ConfigMap containing the repository list file needs to be created in the gpu-operator namespace.
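Since the ConfigMap must exist before the chart is installed, create the gpu-operator namespace first if it does not already exist (the helm install step later in this section also passes --create-namespace, but the ConfigMap has to be in place beforehand):

$ kubectl create namespace gpu-operator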
Let us demonstrate this workaround via an example. The system used in this example is running CentOS 7 with an outdated kernel:
$ uname -r
3.10.0-1062.12.1.el7.x86_64
The official archive for older CentOS packages is https://vault.centos.org/. Typically, most archived CentOS repositories are listed in /etc/yum.repos.d/CentOS-Vault.repo, but they are disabled by default. If the appropriate archive repository were enabled, the driver container would resolve the kernel version and be able to install the correct versions of the prerequisite packages.
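To find which vault release carries the packages for your kernel, you can query the archive directly. A quick sketch (the directory layout shown matches how vault.centos.org is currently organized; verify the path in a browser if the command returns nothing):

$ curl -s https://vault.centos.org/7.7.1908/updates/x86_64/Packages/ | grep kernel-headers-3.10.0-1062.12.1

A match confirms that the 7.7.1908 archive provides the headers for this kernel.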
We can simply drop in a replacement for /etc/yum.repos.d/CentOS-Vault.repo to ensure the appropriate CentOS archive is enabled. For the kernel running in this example, the CentOS-7.7.1908 archive contains the kernel-headers version we are looking for. Here is our example drop-in replacement file:
[C7.7.1908-base]
name=CentOS-7.7.1908 - Base
baseurl=http://vault.centos.org/7.7.1908/os/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
enabled=1

[C7.7.1908-updates]
name=CentOS-7.7.1908 - Updates
baseurl=http://vault.centos.org/7.7.1908/updates/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
enabled=1
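As an optional sanity check before wiring this into the operator, you can copy the file onto any CentOS 7 machine running the same kernel and confirm that the packages now resolve:

$ sudo cp <path-to-repo-list-file> /etc/yum.repos.d/CentOS-Vault.repo
$ yum list kernel-headers-$(uname -r) kernel-devel-$(uname -r)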
Once the repo list file is created, we can create a ConfigMap for it:
$ kubectl create configmap repo-config -n gpu-operator --from-file=<path-to-repo-list-file>
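You can verify that the ConfigMap was created and contains the repo file (with kubectl create configmap --from-file, the key name defaults to the file name):

$ kubectl get configmap repo-config -n gpu-operator -o yaml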
Once the ConfigMap is created, update values.yaml with this information to let the GPU Operator mount the repo configuration into the driver container so it can pull the required packages.
For Ubuntu:
driver:
  repoConfig:
    configMapName: repo-config
    destinationDir: /etc/apt/sources.list.d
For RHEL/CentOS/RHCOS:
driver:
  repoConfig:
    configMapName: repo-config
    destinationDir: /etc/yum.repos.d
Deploy the GPU Operator with the updated values.yaml:
$ helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    -f values.yaml
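Alternatively, assuming the chart keys shown above, the same settings can be passed directly on the command line instead of editing values.yaml:

$ helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set driver.repoConfig.configMapName=repo-config \
    --set driver.repoConfig.destinationDir=/etc/yum.repos.d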
Check the status of the pods to ensure all the containers are running:
$ kubectl get pods -n gpu-operator
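If the driver pod still fails, inspect its logs to confirm whether the kernel version now resolves (the label below matches the name the GPU Operator typically assigns to the driver daemonset; adjust it if your deployment differs):

$ kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset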