Device Nodes and Capabilities#
Currently, the NVIDIA kernel driver exposes its interfaces through a few system-wide device nodes. Each physical GPU
is represented by its own device node - for example, nvidia0, nvidia1 and so on. This is shown below for a 2-GPU system.
/dev
├── nvidiactl
├── nvidia-modeset
├── nvidia-uvm
├── nvidia-uvm-tools
├── nvidia-nvswitchctl
├── nvidia0
└── nvidia1
Starting with CUDA 11/R450, a new abstraction known as nvidia-capabilities has been introduced. The idea is that access
to a specific capability is required to perform certain actions through the driver. If a user has access to the capability,
the action is carried out; if the user does not have access, the action fails. The one exception is the root user (or any
user with CAP_SYS_ADMIN privileges), who implicitly has access to all nvidia-capabilities.
For example, the mig-config capability allows one to create and destroy MIG instances on any MIG-capable GPU (for example,
the A100 GPU). Without this capability, all attempts to create or destroy a MIG instance will fail. Likewise, the fabric-mgmt
capability allows one to run the Fabric Manager as a non-root but privileged daemon. Without this capability, all attempts
to launch the Fabric Manager as a non-root user will fail.
The following sections walk through the system level interface for managing these new nvidia-capabilities, including the
steps necessary to grant and revoke access to them.
System Level Interface#
There are two different system-level interfaces available to work with nvidia-capabilities. The first is via /dev and
the second is via /proc. The /proc based interface relies on user-permissions and mount namespaces to limit access to
a particular capability, while the /dev based interface relies on cgroups. Technically, the /dev based interface
also relies on user-permissions as a second-level access control mechanism (on the actual device node files themselves), but
the primary access control mechanism is cgroups. The current CUDA 11/R450 GA (Linux driver 450.51.06) supports both mechanisms,
but going forward the /dev based interface is the preferred method and the /proc based interface is deprecated. For now, users
can choose the desired interface by using the nv_cap_enable_devfs parameter on the nvidia.ko kernel module:
- When nv_cap_enable_devfs=0, the /proc based interface is enabled.
- When nv_cap_enable_devfs=1, the /dev based interface is enabled.
- A setting of nv_cap_enable_devfs=0 is the default for the R450 driver (as of Linux 450.51.06).
- All future NVIDIA datacenter drivers will have a default of nv_cap_enable_devfs=1.
The following is an example of loading the nvidia kernel module with this parameter set:
$ modprobe nvidia nv_cap_enable_devfs=1
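To make the setting persist across reboots, the same parameter can also be placed in a modprobe configuration file. The file name used below is arbitrary:
$ echo "options nvidia nv_cap_enable_devfs=1" > /etc/modprobe.d/nvidia-caps.conf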
/dev Based nvidia-capabilities#
The system level interface for interacting with /dev based capabilities is actually through a combination of /proc and /dev.
First, a new device major number is now associated with nvidia-caps and can be read from the standard /proc/devices file.
$ cat /proc/devices | grep nvidia-caps
508 nvidia-caps
Second, the exact same set of files exists under /proc/driver/nvidia/capabilities. These files no longer control access to
the capability directly; instead, their contents point at a device node under /dev, through which cgroups
can be used to control access to the capability.
This can be seen in the following example:
$ cat /proc/driver/nvidia/capabilities/mig/config
DeviceFileMinor: 1
DeviceFileMode: 256
DeviceFileModify: 1
The combination of the device major for nvidia-caps and the value of DeviceFileMinor in this file indicates that
the mig-config capability (which allows a user to create and destroy MIG devices) is controlled by the device node with
a major:minor of 508:1. As such, one will need to use cgroups to grant a process read access to this device in
order to configure MIG devices. The purpose of the DeviceFileMode and DeviceFileModify fields in this file is explained
later on in this section.
The standard location for these device nodes is under /dev/nvidia-caps:
$ ls -l /dev/nvidia-caps
total 0
cr-------- 1 root root 508, 1 Nov 21 17:16 nvidia-cap1
cr--r--r-- 1 root root 508, 2 Nov 21 17:16 nvidia-cap2
...
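As a sketch of the cgroups side, the commands below grant read access to the mig-config capability node (major 508, minor 1 in this example), assuming the cgroup v1 devices controller; the cgroup name mig-admin is hypothetical, and the major number should be taken from /proc/devices on your system. Container runtimes such as Docker expose the same mechanism through their device flags.
# Allow read access to major 508, minor 1 for processes in an existing v1 devices cgroup
# (the cgroup name "mig-admin" is hypothetical)
$ echo 'c 508:1 r' > /sys/fs/cgroup/devices/mig-admin/devices.allow

# Or let a container runtime such as Docker inject the node and add the device-cgroup rule
$ docker run --device /dev/nvidia-caps/nvidia-cap1 ...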
Unfortunately, these device nodes cannot be automatically created/deleted by the NVIDIA driver at the same time it creates/deletes
files under /proc/driver/nvidia/capabilities (due to GPL compliance issues). Instead, a user-level program called nvidia-modprobe
is provided that can be invoked from user space to do this. For example:
$ nvidia-modprobe \
-f /proc/driver/nvidia/capabilities/mig/config \
-f /proc/driver/nvidia/capabilities/mig/monitor
$ ls -l /dev/nvidia-caps
total 0
cr-------- 1 root root 508, 1 Nov 21 17:16 nvidia-cap1
cr--r--r-- 1 root root 508, 2 Nov 21 17:16 nvidia-cap2
nvidia-modprobe looks at the DeviceFileMode in each capability file and creates the device node with the permissions
indicated (for example, u+r from a value of 256 (octal 0400) in our mig-config example).
Programs such as nvidia-smi will automatically invoke nvidia-modprobe (when available) to create these device nodes
on your behalf. In other scenarios it is not necessarily required to use nvidia-modprobe to create these device nodes, but
it does make the process simpler.
If you actually want to prevent nvidia-modprobe from ever creating a particular device node on your behalf, you can do the following:
# Give a user write permissions to the capability file under /proc
$ chmod u+w /proc/driver/nvidia/capabilities/mig/config
# Update the file with a "DeviceFileModify" setting of 0
$ echo "DeviceFileModify: 0" > /proc/driver/nvidia/capabilities/mig/config
You will then be responsible for managing creation of the device node referenced by /proc/driver/nvidia/capabilities/mig/config
going forward. If you want to change that in the future, simply reset it to a value of DeviceFileModify: 1 with the same command sequence.
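If you do take over device node management yourself, the node can be created by hand from the nvidia-caps major in /proc/devices and the DeviceFileMinor and DeviceFileMode values shown earlier. A minimal sketch using the numbers from this example:
$ mkdir -p /dev/nvidia-caps
# major 508 (from /proc/devices), minor 1 (DeviceFileMinor) for mig-config
$ mknod /dev/nvidia-caps/nvidia-cap1 c 508 1
# matches DeviceFileMode: 256, i.e. octal 0400 (read-only for root)
$ chmod 0400 /dev/nvidia-caps/nvidia-cap1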
This is important in the context of containers because we may want to give a container access to a certain capability even
if it doesn’t exist in the /proc hierarchy yet.
For example, granting a container the mig-config capability implies that we should also grant it capabilities to access all
possible GPU instances (gis) and Compute Instances (cis) that could be created for any GPU on the system. Otherwise the container
will have no way of working with those gis and cis once they have actually been created.
One final thing to note about /dev based capabilities is that the minor numbers for all possible capabilities are
predetermined and can be queried under various files of the form:
/proc/driver/nvidia-caps/*-minors
For example, all capabilities related to MIG can be looked up as:
$ cat /proc/driver/nvidia-caps/mig-minors
config 1
monitor 2
gpu0/gi0/access 3
gpu0/gi0/ci0/access 4
gpu0/gi0/ci1/access 5
gpu0/gi0/ci2/access 6
...
gpu31/gi14/ci6/access 4321
gpu31/gi14/ci7/access 4322
The format of the per-instance entries is: gpu<device minor>/gi<GPU instance ID>/ci<compute instance ID>/access
Note that the GPU device minor number can be obtained by using either of these mechanisms:
- The NVML API nvmlDeviceGetMinorNumber(), which returns the device minor number.
- The PCI BDF available under /proc/driver/nvidia/gpus/domain:bus:device.function/information. This file contains a "Device Minor" field (see the example following the note below).
Note
The NVML device numbering (such as through nvidia-smi) is not the device minor number.
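For example, the /proc route looks like the following (the PCI address shown is hypothetical; substitute the one for your GPU):
$ grep "Device Minor" /proc/driver/nvidia/gpus/0000:65:00.0/information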
For example, if the MIG geometry was created as below:
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 1 0 0 | 19MiB / 40192MiB | 14 0 | 3 0 3 0 3 |
| | 0MiB / 65535MiB | | |
+------------------+ +-----------+-----------------------+
| 0 1 1 1 | | 14 0 | 3 0 3 0 3 |
| | | | |
+------------------+ +-----------+-----------------------+
| 0 1 2 2 | | 14 0 | 3 0 3 0 3 |
| | | | |
+------------------+----------------------+-----------+-----------------------+
Then the corresponding device nodes /dev/nvidia-cap12, /dev/nvidia-cap13, /dev/nvidia-cap14, and /dev/nvidia-cap15 would be created.
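This mapping can be double-checked against the mig-minors file for the geometry above (GPU 0, GPU instance 1, compute instances 0-2); the minor numbers shown assume the layout from this example:
$ grep -E "^gpu0/gi1(/ci[0-2])?/access" /proc/driver/nvidia-caps/mig-minors
gpu0/gi1/access 12
gpu0/gi1/ci0/access 13
gpu0/gi1/ci1/access 14
gpu0/gi1/ci2/access 15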
/proc based nvidia-capabilities (Deprecated)#
The system level interface for interacting with /proc based nvidia-capabilities is rooted at /proc/driver/nvidia/capabilities.
Files underneath this hierarchy are used to represent each capability, with read access to these files controlling whether a user
has a given capability or not. These files have no content and only exist to represent a given capability.
For example, the mig-config capability (which allows a user to create and destroy MIG devices) is represented as follows:
/proc/driver/nvidia/capabilities
└── mig
└── config
Likewise, the capabilities required to run workloads on a MIG device once it has been created are represented as follows (namely as access to the GPU Instance and Compute Instance that comprise the MIG device):
/proc/driver/nvidia/capabilities
└── gpu0
└── mig
├── gi0
│ ├── access
│ └── ci0
│ └── access
├── gi1
│ ├── access
│ └── ci0
│ └── access
└── gi2
├── access
└── ci0
└── access
And the corresponding file system layout is shown below with read permissions:
$ ls -l /proc/driver/nvidia/capabilities/gpu0/mig/gi*
/proc/driver/nvidia/capabilities/gpu0/mig/gi1:
total 0
-r--r--r-- 1 root root 0 May 24 17:38 access
dr-xr-xr-x 2 root root 0 May 24 17:38 ci0
/proc/driver/nvidia/capabilities/gpu0/mig/gi2:
total 0
-r--r--r-- 1 root root 0 May 24 17:38 access
dr-xr-xr-x 2 root root 0 May 24 17:38 ci0
For a CUDA process to be able to run on top of MIG, it needs access to the Compute Instance capability and its parent GPU Instance. Thus a MIG device is identified by the following format:
MIG-<GPU-UUID>/<GPU instance ID>/<compute instance ID>
As an example, having read access to the following paths would allow one to run workloads on
the MIG device represented by <gpu0, gi0, ci0>:
/proc/driver/nvidia/capabilities/gpu0/mig/gi0/access
/proc/driver/nvidia/capabilities/gpu0/mig/gi0/ci0/access
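A quick way to check whether the current user already holds both capabilities is simply to test read access on the two files:
# Check that the current user can run workloads on <gpu0, gi0, ci0>
$ test -r /proc/driver/nvidia/capabilities/gpu0/mig/gi0/access && \
  test -r /proc/driver/nvidia/capabilities/gpu0/mig/gi0/ci0/access && \
  echo "MIG device <gpu0, gi0, ci0> is accessible"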
Note that there is no access file representing a capability to run workloads on gpu0 (only on gi0 and ci0 that sit underneath gpu0). This is because the traditional mechanism of using cgroups to control access to top level GPU devices (and any required meta devices) is still required. As shown earlier in the document, the cgroups mechanism applies to:
/dev/nvidia0
/dev/nvidiactl
/dev/nvidia-uvm
...
In the context of containers, a new mount namespace should be overlaid on top of the path for /proc/driver/nvidia/capabilities,
and only those capabilities a user wishes to grant to a container should be bind-mounted in. Since the host’s
user/group information is retained across the bind-mount, it must be ensured that the correct user permissions are
set for these capabilities on the host before injecting them into a container.
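The following is a rough sketch of that idea using a plain mount namespace rather than a full container runtime (the NVIDIA Container Toolkit automates the equivalent injection for containers); the staging directory and the set of granted capabilities are illustrative only:
# Overlay a pruned capabilities tree in a new mount namespace so that only
# <gpu0, gi0, ci0> is visible to processes started inside it
$ unshare --mount sh -c '
    src=/proc/driver/nvidia/capabilities
    stage=$(mktemp -d)
    mkdir -p "$stage/gpu0/mig/gi0/ci0"
    touch "$stage/gpu0/mig/gi0/access" "$stage/gpu0/mig/gi0/ci0/access"
    mount --bind "$src/gpu0/mig/gi0/access"     "$stage/gpu0/mig/gi0/access"
    mount --bind "$src/gpu0/mig/gi0/ci0/access" "$stage/gpu0/mig/gi0/ci0/access"
    mount --rbind "$stage" "$src"
    exec bash'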