IMEX Channels#
IMEX channels are a GPU driver feature that provides user-based memory isolation in a multi-user environment within an IMEX domain.
The GPU driver implements IMEX channels by registering nvidia-caps-imex-channels, a new character device. Users are expected to create the IMEX channels, as they are not created by default. Failure to do so will result in dependent CUDA APIs failing with an insufficient permission error.
Note: IMEX channels will not persist through a reboot, and will need to be recreated.
Default Channel Creation#
The device nodes /dev/nvidia-caps-imex-channels/channelN must be created by the administrator (for example, using mknod), where N is the minor number.
Here is an example of how to query the major number from /proc/devices. The major number can differ between nodes, so 234 below is just an example.
$ cat /proc/devices | grep nvidia-caps-imex-channels
234 nvidia-caps-imex-channels
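If you are scripting the setup, the major number can be captured into a shell variable instead of being copied by hand; a minimal sketch, assuming the /proc/devices format shown above:
$ major=$(awk '$2 == "nvidia-caps-imex-channels" {print $1}' /proc/devices)
$ echo "$major"
234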
Run the following commands to create the default IMEX channel /dev node.
$ sudo mkdir /dev/nvidia-caps-imex-channels/
$ sudo mknod /dev/nvidia-caps-imex-channels/channel0 c <major number> 0
Check that the node has been created successfully.
$ sudo ll /dev/nvidia-caps-imex-channels/
total 0
drwxr-xr-x 2 root root 80 Feb 13 19:46 ./
drwxr-xr-x 21 root root 4640 Feb 14 11:43 ../
crw-rw-rw- 1 root root 234, 0 Feb 13 19:46 channel0
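Since this node must be recreated after every reboot (see the note above), the steps can be scripted to run at boot; a minimal sketch, assuming a POSIX shell and that the NVIDIA kernel module has already loaded (the script name and the invocation mechanism, such as a systemd oneshot unit or rc.local, are illustrative):
#!/bin/sh
# create-imex-channel0.sh (hypothetical name): recreate the default IMEX
# channel node at boot, after the NVIDIA kernel module has loaded.
set -e
# Query the dynamically assigned major number from /proc/devices.
major=$(awk '$2 == "nvidia-caps-imex-channels" {print $1}' /proc/devices)
mkdir -p /dev/nvidia-caps-imex-channels
[ -e /dev/nvidia-caps-imex-channels/channel0 ] || mknod /dev/nvidia-caps-imex-channels/channel0 c "$major" 0
# Match the world-accessible permissions shown in the listing above.
chmod 666 /dev/nvidia-caps-imex-channels/channel0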
The default node has to be created each time the node reboots. To simplify the setup steps when all applications share the same channel (for example, in a single-user environment), the driver provides the NVreg_CreateImexChannel0 module parameter. If set, IMEX channel0 will be created automatically when the NVIDIA Open GPU Kernel module is loaded. This module parameter must be specified when configuring the GPU driver settings.
Here are the possible values for NVreg_CreateImexChannel0:
0: Do not create IMEX channel 0 (default)
1: Create IMEX channel 0
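A common way to set this parameter persistently is through a modprobe configuration file; a minimal sketch, assuming the module is loaded as nvidia and that the file name below (which is illustrative) is read from /etc/modprobe.d:
$ echo "options nvidia NVreg_CreateImexChannel0=1" | sudo tee /etc/modprobe.d/nvidia-imex.conf
options nvidia NVreg_CreateImexChannel0=1
Depending on the distribution, the initramfs may need to be regenerated (for example, with update-initramfs -u or dracut -f) before the setting takes effect at boot.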
Multi-user channel usage#
To use IMEX channels in multi-user mode, each user must be allocated a /dev/nvidia-caps-imex-channels/channelN device node, with access to that node granted to the user through cgroups or filesystem permissions. When the GPU driver needs to find the channel associated with a user, it selects the lowest-numbered channel accessible to that user and uses it for all operations. To ensure isolation, each user should therefore have access to exactly one channelN device.
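For example, to allocate channel1 to user1 using filesystem permissions (the user and group names are illustrative, and <major number> is the value queried from /proc/devices as shown earlier):
$ sudo mknod /dev/nvidia-caps-imex-channels/channel1 c <major number> 1
$ sudo chown user1:grp1 /dev/nvidia-caps-imex-channels/channel1
$ sudo chmod 600 /dev/nvidia-caps-imex-channels/channel1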
Example of a misconfigured environment:
$ sudo ll /dev/nvidia-caps-imex-channels/
total 0
drwxr-xr-x 2 root root 80 Feb 13 19:46 ./
drwxr-xr-x 21 root root 4640 Feb 14 11:43 ../
crw-rw-rw- 1 root root 234, 0 Feb 13 19:46 channel0
crw------- 1 user1 grp1 234, 1 Feb 13 19:46 channel1
crw------- 1 user2 grp2 234, 2 Feb 13 19:46 channel2
In this example, even though user1 and user2 have been allocated channel1 and channel2 respectively, they both also have access to channel0. Because channel0 is the lowest-numbered channel accessible to each of them, both users will use it instead of their allocated channels, and they will not be isolated from each other.
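One way to see which channels a user can actually open is to test access as that user; a minimal sketch (user1 is from the example above):
$ for ch in /dev/nvidia-caps-imex-channels/channel*; do sudo -u user1 test -r "$ch" -a -w "$ch" && echo "user1 can open $ch"; done
user1 can open /dev/nvidia-caps-imex-channels/channel0
user1 can open /dev/nvidia-caps-imex-channels/channel1
Here the first line is the problem: channel0 is the lowest accessible channel, so it is the one the driver will select for user1.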
To fix this, it is sufficient either to remove channel0 or to remove user1's and user2's access to it. After removing channel0, the environment looks like this:
$ sudo ll /dev/nvidia-caps-imex-channels/
total 0
drwxr-xr-x 2 root root 80 Feb 13 19:46 ./
drwxr-xr-x 21 root root 4640 Feb 14 11:43 ../
crw------- 1 user1 grp1 234, 1 Feb 13 19:46 channel1
crw------- 1 user2 grp2 234, 2 Feb 13 19:46 channel2
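The removal itself, or the alternative of revoking access while keeping the node, can be done as follows:
$ sudo rm /dev/nvidia-caps-imex-channels/channel0
or, to keep the node but revoke access for non-root users:
$ sudo chmod 600 /dev/nvidia-caps-imex-channels/channel0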
Multi-node channels#
To use IMEX channel-based isolation in a multi-node NVLink environment, each compute node will need to have the same channel allocations for each user on that node.
As an example, let's take a 3-node compute setup, with two users running on partially overlapping sets of nodes:
| Node ID | Users running |
|---|---|
| Node 0 | user1 |
| Node 1 | user1, user2 |
| Node 2 | user2 |
Then a properly configured, isolated setup could look like this:
Node 0:
$ sudo ll /dev/nvidia-caps-imex-channels/
total 0
drwxr-xr-x 2 root root 80 Feb 13 19:46 ./
drwxr-xr-x 21 root root 4640 Feb 14 11:43 ../
crw------- 1 user1 grp1 234, 1 Feb 13 19:46 channel1
Node 1:
$ sudo ll /dev/nvidia-caps-imex-channels/
total 0
drwxr-xr-x 2 root root 80 Feb 13 19:46 ./
drwxr-xr-x 21 root root 4640 Feb 14 11:43 ../
crw------- 1 user1 grp1 236, 1 Feb 13 19:46 channel1
crw------- 1 user2 grp2 236, 2 Feb 13 19:46 channel2
Node 2:
$ sudo ll /dev/nvidia-caps-imex-channels/
total 0
drwxr-xr-x 2 root root 80 Feb 13 19:46 ./
drwxr-xr-x 21 root root 4640 Feb 14 11:43 ../
crw------- 1 user2 grp2 235, 2 Feb 13 19:46 channel2
In this configuration, user1 has access to channel1 on each node where it runs, and is isolated from user2, who has access to channel2.
Note: Misconfiguration of these channels and user assignments will result in ILLEGAL STATE errors on multi-node memory imports.
We recommend automating the management of user and channel allocation/assignment within an NVLink cluster.
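A minimal sketch of such automation, assuming password-less root SSH to each node and an allocation map maintained out of band (the script name, node names, users, groups, and channel numbers are illustrative, mirroring the 3-node example above):
#!/bin/sh
# assign-imex-channels.sh (hypothetical): create per-user IMEX channel nodes
# on every node, keeping each user's channel number consistent cluster-wide.
# Format of each entry: "node user group channel"
for entry in "node0 user1 grp1 1" \
             "node1 user1 grp1 1" \
             "node1 user2 grp2 2" \
             "node2 user2 grp2 2"; do
    set -- $entry
    node=$1 user=$2 group=$3 ch=$4
    ssh "root@$node" "
        # Query the per-node major number; it can differ between nodes.
        major=\$(awk '\$2 == \"nvidia-caps-imex-channels\" {print \$1}' /proc/devices)
        mkdir -p /dev/nvidia-caps-imex-channels
        [ -e /dev/nvidia-caps-imex-channels/channel$ch ] || mknod /dev/nvidia-caps-imex-channels/channel$ch c \$major $ch
        chown $user:$group /dev/nvidia-caps-imex-channels/channel$ch
        chmod 600 /dev/nvidia-caps-imex-channels/channel$ch
    "
done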