IMEX Channels#

IMEX channels are a GPU driver feature that allows for user-based memory isolation in a multi-user environment within an IMEX domain.

The GPU driver implements IMEX channels by registering a new character device, nvidia-caps-imex-channels. The channels are not created by default; users are expected to create them. Failure to do so will cause dependent CUDA APIs to fail with an insufficient-permission error.

Note: IMEX channels do not persist across a reboot and must be recreated.

Default Channel Creation#

The /dev/nvidia-caps-imex-channels/channelN device nodes must be created by the administrator (for example, using mknod), where N is the minor number.

Here is an example of how to query the major number from /proc/devices. The major number can differ between nodes, so the 234 number below is just an example.

$ cat /proc/devices | grep nvidia-caps-imex-channels
234 nvidia-caps-imex-channels
  1. Run the following commands to create the default IMEX channel /dev/ node.

    $ sudo mkdir /dev/nvidia-caps-imex-channels/
    $ sudo mknod /dev/nvidia-caps-imex-channels/channel0 c <major number> 0 
    
  2. Check that the node has been created successfully.

    $ sudo ll /dev/nvidia-caps-imex-channels/
    total 0
    drwxr-xr-x  2 root root     80 Feb 13 19:46 ./
    drwxr-xr-x 21 root root   4640 Feb 14 11:43 ../
    crw-rw-rw-  1 root root 234, 0 Feb 13 19:46 channel0
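
The two steps above can also be combined into a short script, for example to run at boot. The following is a minimal sketch that only assumes the /proc/devices entry shown earlier; the MAJOR variable name is illustrative.

$ # Look up the dynamically assigned major number and create channel0
$ MAJOR=$(awk '/nvidia-caps-imex-channels/ {print $1}' /proc/devices)
$ sudo mkdir -p /dev/nvidia-caps-imex-channels
$ sudo mknod /dev/nvidia-caps-imex-channels/channel0 c "$MAJOR" 0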
    

The default device node has to be recreated each time the node reboots. To simplify the setup steps when all applications share the same channel (for example, in a single-user environment), the driver provides the NVreg_CreateImexChannel0 module parameter. If set, IMEX channel0 is created automatically when the NVIDIA Open GPU Kernel module is loaded.

This module parameter must be specified when configuring the GPU hardware settings.

Here are the possible values for NVreg_CreateImexChannel0:

  • 0: Do not create IMEX channel 0 (default)

  • 1: Create IMEX channel 0
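
For example, the parameter can be set persistently through a modprobe configuration file, the usual mechanism for NVreg_* parameters of the nvidia kernel module (the file name below is only an example), or passed directly when loading the module:

$ echo "options nvidia NVreg_CreateImexChannel0=1" | sudo tee /etc/modprobe.d/nvidia-imex.conf
$ # Or, for the current boot only:
$ sudo modprobe nvidia NVreg_CreateImexChannel0=1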

Multi-User Channel Usage#

To use IMEX channels in multi-user mode, each user must be allocated a /dev/nvidia-caps-imex-channels/channelN device node, with ownership assigned to that user through cgroups or filesystem permissions. When the GPU driver needs to find the channel associated with a user, it selects the lowest-numbered channel that the user can access and uses it for all operations. To ensure isolation, each user should therefore have access to exactly one channelN device.
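
For example, a minimal sketch of allocating channel1 to user1 through filesystem permissions could look like the following (the user and group names are placeholders):

$ MAJOR=$(awk '/nvidia-caps-imex-channels/ {print $1}' /proc/devices)
$ sudo mknod /dev/nvidia-caps-imex-channels/channel1 c "$MAJOR" 1
$ sudo chown user1:grp1 /dev/nvidia-caps-imex-channels/channel1
$ sudo chmod 600 /dev/nvidia-caps-imex-channels/channel1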

Example of a misconfigured environment:

$ sudo ll /dev/nvidia-caps-imex-channels/
total 0
drwxr-xr-x  2 root  root     80 Feb 13 19:46 ./
drwxr-xr-x 21 root  root   4640 Feb 14 11:43 ../
crw-rw-rw-  1 root  root 234, 0 Feb 13 19:46 channel0
crw-------  1 user1 grp1 234, 1 Feb 13 19:46 channel1
crw-------  1 user2 grp2 234, 2 Feb 13 19:46 channel2

In this example, even though user1 and user2 have been allocated channel1 and channel2 respectively, both of them can also access channel0. The driver will therefore assign channel0 to both users, and they will not be isolated from each other.

To fix this, either remove channel0 or remove user1's and user2's access to it:

$ sudo ll /dev/nvidia-caps-imex-channels/
total 0
drwxr-xr-x  2 root  root     80 Feb 13 19:46 ./
drwxr-xr-x 21 root  root   4640 Feb 14 11:43 ../
crw-------  1 user1 grp1 234, 1 Feb 13 19:46 channel1
crw-------  1 user2 grp2 234, 2 Feb 13 19:46 channel2
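
Either of the following would restore isolation; the listing above reflects the first option:

$ # Option 1: remove the world-accessible default channel
$ sudo rm /dev/nvidia-caps-imex-channels/channel0
$ # Option 2: keep channel0 but restrict it to root
$ sudo chmod 600 /dev/nvidia-caps-imex-channels/channel0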

Multi-Node Channels#

To use IMEX channel-based isolation in a multi-node NVLink environment, each compute node must have the same channel allocation for every user running on that node.

As an example, consider a three-node compute setup with two users running in partially overlapping compute environments:

Node ID   Users running
Node 0    user1
Node 1    user1, user2
Node 2    user2

A properly configured, isolated setup could then look like this:

  • Node 0:

    $ sudo ll /dev/nvidia-caps-imex-channels/
    total 0
    drwxr-xr-x  2 root  root     80 Feb 13 19:46 ./
    drwxr-xr-x 21 root  root   4640 Feb 14 11:43 ../
    crw-------  1 user1 grp1 234, 1 Feb 13 19:46 channel1
    
  • Node 1:

    $ sudo ll /dev/nvidia-caps-imex-channels/
    total 0
    drwxr-xr-x  2 root  root     80 Feb 13 19:46 ./
    drwxr-xr-x 21 root  root   4640 Feb 14 11:43 ../
    crw-------  1 user1 grp1 236, 1 Feb 13 19:46 channel1
    crw-------  1 user2 grp2 236, 2 Feb 13 19:46 channel2
    
  • Node 2:

    $ sudo ll /dev/nvidia-caps-imex-channels/
    total 0
    drwxr-xr-x  2 root  root     80 Feb 13 19:46 ./
    drwxr-xr-x 21 root  root   4640 Feb 14 11:43 ../
    crw-------  1 user2 grp2 235, 2 Feb 13 19:46 channel2
    

In this configuration, user1 has access to channel1 on every node it runs on and is isolated from user2, who has access to channel2.

Note: Misconfiguration of these channels and user assignments will result in ILLEGAL STATE errors on multi-node memory imports.

We recommend automating the management of user and channel allocation and assignment within an NVLink cluster.
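
A minimal sketch of such automation is shown below. It assumes a fixed user-to-channel mapping that is applied identically on every node where the listed users run; the script name, user names, group names, and the mapping itself are placeholders.

$ cat create-imex-channels.sh
#!/bin/sh
# Create per-user IMEX channel device nodes from a fixed mapping.
# Run on each node after the NVIDIA driver is loaded (for example, at boot).
set -e
MAJOR=$(awk '/nvidia-caps-imex-channels/ {print $1}' /proc/devices)
mkdir -p /dev/nvidia-caps-imex-channels
# Mapping format: <channel minor> <user> <group>
while read -r MINOR CH_USER CH_GROUP; do
    NODE=/dev/nvidia-caps-imex-channels/channel${MINOR}
    [ -e "$NODE" ] || mknod "$NODE" c "$MAJOR" "$MINOR"
    chown "$CH_USER:$CH_GROUP" "$NODE"
    chmod 600 "$NODE"
done <<EOF
1 user1 grp1
2 user2 grp2
EOF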