Software Deployment Workflow#

The CUDA software environment consists of three parts:

  • CUDA Toolkit (libraries, runtime and tools) - User-mode SDK used to build CUDA applications

  • CUDA driver - User-mode driver component used to run CUDA applications (for example, libcuda.so on Linux systems)

  • NVIDIA GPU device driver - Kernel-mode driver component for NVIDIA GPUs

On Linux systems, the CUDA driver and kernel mode components are delivered together in the NVIDIA display driver package. This is shown in the following figure.
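On a provisioned node, the split between the kernel-mode and user-mode components can be observed directly. The following sketch assumes a node where the display driver package is already installed; exact library paths vary by distribution:

```shell
# Kernel-mode component: the NVIDIA kernel modules loaded by the kernel
lsmod | grep '^nvidia'

# User-mode CUDA driver: libcuda.so, delivered by the display driver package
ldconfig -p | grep libcuda.so

# Driver version as reported by the management interface
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```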

Figure 3 NVIDIA CUDA#

The CUDA Toolkit is generally optional on GPU nodes that only run applications (as opposed to developing them), because a CUDA application typically packages the CUDA runtime and libraries it needs by linking against them statically or dynamically.
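Whether a given application bundles the runtime statically or expects it on the node can be checked with ldd. This is an illustrative sketch; ./my_cuda_app is a placeholder binary name:

```shell
# If libcudart.so appears in the output, the app links the CUDA runtime
# dynamically and needs it present on the node; if it is absent, the runtime
# was linked statically. libcuda.so (the CUDA driver) is always resolved at
# run time from the driver package, not from the Toolkit.
ldd ./my_cuda_app | grep -E 'libcudart|libcuda\.so'
```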

Typical Workflow#

A suggested workflow for bootstrapping a GPU node in a cluster:

  1. Install the NVIDIA drivers (do not install CUDA Toolkit as this brings in additional dependencies that may not be necessary or desired)

  2. Install the CUDA Toolkit using meta-packages. This provides additional control over what is installed on the system.

  3. Install other components such as cuDNN or TensorRT as desired depending on the application requirements and dependencies.
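The three steps above can be sketched as a single bootstrap script for an Ubuntu LTS node. The branch number (570), Toolkit release (12-8), and cuDNN package name are illustrative assumptions; substitute the versions your repository and applications require:

```shell
#!/bin/bash
set -euo pipefail

# 1. Drivers only -- the full CUDA Toolkit is not installed here, to avoid
#    pulling in dependencies that may not be necessary or desired
sudo apt-get -y install cuda-drivers-570

# 2. CUDA Toolkit via a meta-package, which has no dependency on the driver
sudo apt-get -y install cuda-toolkit-12-8

# 3. Optional components as required by the application (package name assumed)
sudo apt-get -y install libcudnn9-cuda-12
```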

Data Center Driver Installation#

Note

The full content of this section is available at: https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/index.html.

NVIDIA drivers are available in multiple formats for use with Linux distributions. This section focuses on distribution-specific packages, which customers can use to deploy drivers into a production environment. The driver installation guide linked above provides detailed information and steps for each supported Linux distribution; a summary is provided below.

Installation Using Package Managers#

Using package managers is the recommended method of installing drivers: it provides control over the choice of driver branch, precompiled kernel modules, driver upgrades, and additional dependencies such as Fabric Manager/NSCQ for NVSwitch systems.

On Ubuntu LTS

sudo apt-get -y install cuda-drivers-<branch-number>

where <branch-number> is the data center driver branch of interest (for example, 570).
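If you are unsure which driver branches your configured repository offers, they can be enumerated before installing. A sketch using apt-cache:

```shell
# List the driver-branch meta-packages available in the configured repository
# (package names follow the cuda-drivers-<branch-number> pattern)
apt-cache search --names-only '^cuda-drivers-[0-9]+$'
```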

On RHEL 8

sudo dnf module install nvidia-driver:<stream>/<profile>

For example, nvidia-driver:latest-dkms/fm installs the latest drivers along with the Fabric Manager dependencies needed to bootstrap an NVSwitch system such as HGX A100.

For more information on the supported streams/profiles, refer to the driver installation guide linked above.
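The streams and profiles available on a given RHEL 8 system can also be listed locally before choosing one:

```shell
# Show the nvidia-driver module streams (for example, latest-dkms or a
# branch-specific stream) and their profiles (for example, default, fm)
dnf module list nvidia-driver
```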

CUDA Toolkit Installation#

The CUDA Toolkit packages are modular and offer the user control over what components of the CUDA Toolkit are installed on the system. CUDA supports a number of meta-packages for this purpose, described in the CUDA installation documentation.

Since the cuda or cuda-<release> packages also install the drivers, these packages may not be appropriate for data center deployments.

Instead, use a package such as cuda-toolkit-<release>, which has no dependency on the driver. The following example installs only the CUDA Toolkit 12.8 packages and does not install the driver.

sudo apt-get -y install cuda-toolkit-12-8
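A quick check that the Toolkit landed without pulling in the driver (assuming the default install prefix of /usr/local/cuda-12.8):

```shell
# Compiler shipped by the toolkit package; reports release 12.8 on success
/usr/local/cuda-12.8/bin/nvcc --version

# Confirm no driver packages were installed as dependencies
dpkg -l | grep -E 'nvidia-driver|cuda-drivers' || echo "no driver packages installed"
```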

cuDNN Installation#

NVIDIA cuDNN can also be installed from the CUDA network repository using Linux package managers by using the libcudnn and libcudnn-dev packages. An example is shown below:

Ubuntu LTS

CUDNN_VERSION=8.1.1.33 \
&& sudo apt-get -y install \
libcudnn8=$CUDNN_VERSION-1+cuda11.2 libcudnn8-dev=$CUDNN_VERSION-1+cuda11.2
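Because the command above pins an exact cuDNN version, it can be useful to hold the packages so a routine apt-get upgrade does not replace them. A sketch, using the package names from the example:

```shell
# Prevent routine upgrades from replacing the pinned cuDNN build;
# use apt-mark unhold when you are ready to move to a newer version
sudo apt-mark hold libcudnn8 libcudnn8-dev
```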