
Mixed Coherency Environment#

The DGX Station GB300 system exposes both coherent and non-coherent memory in the same machine. Developers must detect this mixed environment and understand its implications for programming and performance.

Grace–Blackwell coherent CPU+GPU memory — hardware-maintained coherence between Grace CPU and Blackwell Ultra GPU, with a single shared address space accessible by both CPU and GPU without explicit cache management.
Traditional PCIe GPU memory — on optional discrete GPU(s) (such as RTX PRO), with system-allocated memory made available to CUDA through HMM.

CUDA Unified Memory#

CUDA Unified Memory provides the programming model for the mixed coherency environment, allowing managed memory to be accessed from code running on either the CPU or the GPU. This works by allocating memory through dedicated CUDA APIs (such as cudaMallocManaged) and letting the CUDA driver and runtime manage the movement of data between CPU and GPU memory as needed.

This is further extended on supported systems by HMM or ATS, which allow the GPU to access memory allocated through the system allocator (such as malloc). HMM provides this behavior through software page-fault mechanisms. ATS provides the hardware path on NVLink-C2C systems, where the GPU can use host page tables and access CPU memory directly. On DGX Station GB300, the GB300 uses ATS to provide a hardware-coherent view of memory, while the PCIe-attached RTX PRO GPU uses HMM to provide a software-coherent view of system-allocated memory.

The GB300 memory path runs in Coherent Driver-based Memory Management (CDMM) mode by default with the R610 or later driver. In CDMM mode the NVIDIA driver manages GPU memory directly, while CPU memory remains managed by the Linux kernel. CDMM is enabled by setting the NVIDIA driver parameter NVreg_CoherentGPUMemoryMode=driver. To disable CDMM and enable the legacy NUMA mode, remove the parameter or set it to NVreg_CoherentGPUMemoryMode=numa. NUMA mode is not recommended.

Coherent Memory#

The two types of onboard memory (HBM on the GPU, LPDDR5X on the CPU) are physically separate but are exposed as a single coherent address space accessible by both the Grace CPU and the integrated GB300 GPU. Coherence is maintained by the hardware coherence protocol between the CPU and GPU. One caveat: the CPU can cache accesses to GB300 memory, but GB300 accesses to CPU memory are not cached. Under the default CDMM mode, GB300 GPU memory is managed by the NVIDIA driver rather than exposed to the OS as generic (NUMA) memory.

From a programmer’s point of view: data in supported mappings can be read and written by CPU and GB300 concurrently and remain consistent. There is no need to perform explicit cache management; the hardware maintains coherence. System memory pointers can be accessed directly by CUDA kernels, but under the default CDMM mode system-allocated memory is not migrated to GB300 GPU memory. [1]

Non-Coherent Memory#

The RTX PRO GPU(s) in the system have their own dedicated memory without a cache snooping mechanism between the GPU and CPU. Thus, RTX PRO device memory is not hardware-coherent with CPU memory. When sharing mappings between the CPU and RTX PRO GPU, the HMM path provides software coherence for system-allocated memory, with performance implications.

Querying for Addressing Mode Support#

To enumerate the GPUs in the system and find their index, run:

nvidia-smi -L

To detect programmatically whether a GPU uses the ATS or HMM addressing mode, use the NVML API:

#include <nvml.h>
#include <stdio.h>

int main(void)
{
    nvmlReturn_t               ret;
    nvmlDevice_t               device;
    nvmlDeviceAddressingMode_t mode            = {.version = nvmlDeviceAddressingMode_v1};
    unsigned int               deviceCount     = 0;
    char                       deviceName[256] = {0};

    printf("NVML Test\n");

    ret = nvmlInit();
    ret = nvmlDeviceGetCount(&deviceCount);

    for (unsigned int d = 0; d < deviceCount; ++d)
    {
        ret = nvmlDeviceGetHandleByIndex(d, &device);
        ret = nvmlDeviceGetName(device, deviceName, sizeof(deviceName));
        ret = nvmlDeviceGetAddressingMode(device, &mode);

        const char *modeName =
            mode.value == NVML_DEVICE_ADDRESSING_MODE_ATS ? "ATS" :
            mode.value == NVML_DEVICE_ADDRESSING_MODE_HMM ? "HMM" : "None";
        printf("GPU %u: %s uses addressing mode %s\n", d, deviceName, modeName);
    }

    nvmlShutdown();
    return 0;
}

(Error handling omitted for brevity.) Compile with:

gcc -I/usr/local/cuda/include/ -L/usr/local/cuda/lib64/stubs nvmltest.cpp -l nvidia-ml -o nvmltest

This produces output similar to the following:

user@localhost:~$ ./nvmltest
NVML Test
GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition uses addressing mode HMM
GPU 1: NVIDIA GB300 uses addressing mode ATS

Note

The GPU 0 and GPU 1 labels above are NVML device indices. They are not necessarily the same as CUDA device ordinals, which can be remapped by CUDA_VISIBLE_DEVICES (see GPU Selection in the Mixed Coherency Environment below). Do not assume the NVML index equals the CUDA ordinal when selecting a GPU.

UVM Pitfalls#

The DGX Station GB300 mixed memory environment makes a large pool of memory accessible to both the CPU and the GB300 GPU, but using UVM has performance implications. The GB300 GPU and the CPU each have dedicated physical memory, and any access that crosses the boundary between them is limited to the bandwidth of the NVLink-C2C link. It is therefore recommended to use UVM only for ease of programming and when the working set size exceeds the capacity of the GB300 GPU memory. Avoid using UVM as a default for all memory allocations. For highest performance, allocate memory explicitly in the appropriate memory space (such as cudaMalloc for GB300 GPU memory, malloc for CPU memory) and manage data movement explicitly.

Under the default CDMM mode, GB300 GPU memory is driver-managed and malloc allocations remain in CPU memory. (In the legacy NUMA mode, malloc allocations may instead land in GB300 GPU memory.)

GPU Selection in the Mixed Coherency Environment#

When the installed CUDA and driver release supports mixed coherency, CUDA can access both the GB300 and RTX PRO GPUs in the same process. To control which GPUs are visible to an application, set the CUDA_VISIBLE_DEVICES environment variable. Do not assume CUDA_VISIBLE_DEVICES=0 selects the GB300; nvidia-smi enumeration order varies by hardware SKU. The OS defaults described below pin stable CUDA device ordering at boot.

For more on CUDA_VISIBLE_DEVICES, see https://developer.nvidia.com/blog/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/

CUDA provides an additional environment variable CUDA_DEVICE_MODALITY which can be used to select either the ATS-capable GPU or the non-ATS-capable GPU. Valid values for CUDA_DEVICE_MODALITY are ATS and NONATS.
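A program that wants to choose the modality itself, rather than rely on the launching shell, could set the variable from within the process. The following is a minimal sketch under stated assumptions: set_cuda_device_modality is a hypothetical helper, not a CUDA API, and it assumes that CUDA_DEVICE_MODALITY, like CUDA_VISIBLE_DEVICES, is read when CUDA initializes, so the helper has to run before the first CUDA call in the process.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv() */
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of CUDA): validate a modality string and
 * export it as CUDA_DEVICE_MODALITY for the current process.
 * Returns 0 on success, -1 for an invalid value or a setenv() failure. */
int set_cuda_device_modality(const char *modality)
{
    if (modality == NULL)
        return -1;

    /* The text above lists exactly two valid values. */
    if (strcmp(modality, "ATS") != 0 && strcmp(modality, "NONATS") != 0)
        return -1;

    /* Assumption: this must happen before the first CUDA call. */
    return setenv("CUDA_DEVICE_MODALITY", modality, 1) == 0 ? 0 : -1;
}
```

Calling set_cuda_device_modality("ATS") at the top of main, before any CUDA runtime or driver call, would then request the GB300. Rejecting unknown strings up front avoids exporting a misspelled value.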

Device selection for all NVIDIA drivers can also be done using application profile keys. The NVIDIA driver installer puts a default profile in /usr/share/nvidia/nvidia-application-profiles-[verMaj.verMin]-rc and key documentation in /usr/share/nvidia/nvidia-application-profiles-[verMaj.verMin]-key-documentation.

The relevant application profile key for selecting a GPU on the DGX Station is DeviceModalityPreference; its valid values are ATS and NONATS. The profiles section of /usr/share/nvidia/nvidia-application-profiles-[verMaj.verMin]-rc contains:

"profiles" : [
   { "name": "UseATSGpuInMixedCoherencySystems", "settings": [ "DeviceModalityPreference", 1 ] },
   { "name": "UseNonATSGpuInMixedCoherencySystems", "settings": [ "DeviceModalityPreference", 2 ] }
]

Apps can then select the right profile using the rules section:

"rules" : [
   { "pattern": { "feature": "cmdline", "matches": "uvmConformance" }, "profile": "UseNonATSGpuInMixedCoherencySystems" }
]

System-level specifications for keys can be placed in /etc/nvidia/nvidia-application-profiles-rc, while user-level specifications can be placed in ~/.nv/nvidia-application-profiles-rc.

NOTE: User-level specifications override both system and NVIDIA driver defaults; system-level specifications override NVIDIA driver defaults.

DGX Station OS Defaults for Mixed Coherency#

To make typical mixed-coherency workflows work out of the box, the DGX Station ships with two defaults that are applied at every boot:

A boot-time service pins CUDA_VISIBLE_DEVICES so the on-board GB300 GPU is always at CUDA position 0, and – if a discrete RTX PRO add-in card is installed – it is always at CUDA position 1.
A system-level NVIDIA application profile routes any process loading the GLX library to the non-ATS GPU (the RTX PRO).

The net effect is that CUDA-only workloads default to the GB300, graphical (GLX-using) applications default to the RTX PRO, and frameworks that default to cuda:0 land on the GB300 on every boot, regardless of the underlying nvidia-smi enumeration order.

OS-level default for `CUDA_VISIBLE_DEVICES`#

A systemd oneshot (mixed-coherency-gpu-select.service) runs at boot, queries nvidia-smi for the indices of the GB300 and any RTX card, and writes the resulting environment to /etc/mixed-coherency-gpu-select/env, /etc/environment, and a profile.d snippet so both PAM logins and login shells inherit the values:

CUDA_DEVICE_ORDER=PCI_BUS_ID
CUDA_VISIBLE_DEVICES=<nvidia-smi index of GB300>
# or, when an RTX PRO add-in card is installed:
CUDA_VISIBLE_DEVICES=<nvidia-smi index of GB300>,<nvidia-smi index of RTX PRO 6000>

To verify the service ran and inspect the values it produced:

systemctl status mixed-coherency-gpu-select.service
cat /etc/mixed-coherency-gpu-select/env
echo "$CUDA_VISIBLE_DEVICES"   # in a fresh login shell

This service is part of the default OS, Ubuntu with NVIDIA AI Developer Tools; no manual installation is required.

Default application profile for GLX applications#

The DGX Station also ships a system-level NVIDIA application profile at /etc/nvidia/nvidia-application-profiles-rc that selects the non-ATS GPU for any process that loads libGLX.so.0:

{
    "rules": [
        {
            "pattern": {
                "feature": "dso",
                "matches": "libGLX.so.0"
            },
            "profile": "UseNonATSGpuInMixedCoherencySystems"
        }
    ]
}

The profile UseNonATSGpuInMixedCoherencySystems is one of the profiles defined in the NVIDIA driver’s default profiles file (see the example earlier in this section); it sets DeviceModalityPreference to NONATS. The result is that desktop GL applications, Vulkan applications using GLX for window-system integration, and similar GLX-loading workloads default to the RTX PRO – without affecting CUDA workloads, which continue to default to the GB300 via CUDA_VISIBLE_DEVICES.

Overriding the defaults#

Both defaults are environment- and profile-driven, and can be overridden per-application or per-user without disabling the OS-level recipes:

To override CUDA_VISIBLE_DEVICES for a single launch, export it before running the application; the explicit value wins over the system-wide default written to /etc/environment.
To override the application-profile selection, place a user-level profile at ~/.nv/nvidia-application-profiles-rc. As described above, user-level specifications override both system and NVIDIA driver defaults.

Containers / NVIDIA AI Workbench#

The OS default for CUDA_VISIBLE_DEVICES does not affect containers. The values written by mixed-coherency-gpu-select.service reach only host PAM logins and login shells – via /etc/environment (read by PAM) and the /etc/profile.d/ snippet (sourced by login shells). Containers do not inherit them:

A container has its own filesystem – its own /etc/environment and /etc/profile.d/ – so neither host file is present inside it.
The container daemon (dockerd / containerd / podman) is a systemd service, and systemd services do not read /etc/environment (that file is a PAM feature, not a systemd one). Docker/Podman then give each container a clean environment built from the image plus explicit -e/--env flags – not the launching shell’s ambient environment. NVIDIA AI Workbench starts its project containers this way, so the host value never crosses the boundary.

Note

Even if you forwarded the host value into the container, it would be the wrong knob. Inside a container, which physical GPUs exist is governed by the NVIDIA Container Toolkit via NVIDIA_VISIBLE_DEVICES (it mounts the selected device nodes). CUDA_VISIBLE_DEVICES is only a CUDA-library filter applied within a process, over whatever devices are already present.

The container re-indexes GPUs from 0. The host list the OS default produces (for example, 0,1 = GB300 then RTX, ordered by PCI_BUS_ID) is meaningless inside the container, which only sees the subset the runtime exposed. Copying the host value in could even select the wrong device.
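The re-indexing can be pictured with a toy model. The sketch below is illustrative only and is not NVIDIA Container Toolkit code: container_index is a hypothetical function, and it assumes the container numbers the exposed GPUs from 0 in ascending host-index order.

```c
#include <stddef.h>

/* Toy model (an assumption, not toolkit code): the container enumerates only
 * the exposed GPUs, numbered from 0 in ascending host-index order.
 * Returns the in-container index of host GPU `host_index`, or -1 if that
 * GPU was not exposed to the container. */
int container_index(const int *exposed_host, size_t n, int host_index)
{
    int rank  = 0;  /* exposed GPUs with a smaller host index */
    int found = 0;  /* whether host_index itself is exposed */

    for (size_t i = 0; i < n; ++i) {
        if (exposed_host[i] == host_index)
            found = 1;
        else if (exposed_host[i] < host_index)
            rank++;
    }
    return found ? rank : -1;
}
```

Take a host where the RTX PRO is index 0 and the GB300 is index 1, and a container that exposes only host GPU 1. In this model the GB300 becomes index 0 inside the container and the RTX PRO has no index at all, so a forwarded host value such as CUDA_VISIBLE_DEVICES=1 would not name the GB300 there.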

For toolkit background, see the NVIDIA Container Toolkit guide on CUDA_VISIBLE_DEVICES with containers.

What to do for AI Workbench#

Reproduce the OS default intent – both GPUs visible, GB300 first – at the container layer, in two steps:

Expose both physical GPUs with NVIDIA_VISIBLE_DEVICES (the NVIDIA Container Toolkit knob that mounts the device nodes into the container): either NVIDIA_VISIBLE_DEVICES=all, or list them explicitly by UUID with NVIDIA_VISIBLE_DEVICES=GPU-<gb300>,GPU-<rtx>.
Force the in-container order – GB300 as cuda:0, RTX as cuda:1 – with CUDA_VISIBLE_DEVICES, listing by UUID: CUDA_VISIBLE_DEVICES=GPU-<gb300>,GPU-<rtx> (optionally CUDA_DEVICE_ORDER=PCI_BUS_ID).

Listing CUDA_VISIBLE_DEVICES by UUID is the key: a UUID names an absolute GPU, so it sidesteps the “container re-indexes from 0” problem entirely – the order you write is the order CUDA reports, regardless of the container’s device numbers or the host indices. GB300 first means torch.device("cuda") / cuda:0 lands on the GB300, and the RTX is cuda:1.

Get both UUIDs (GB300 first, then RTX):

nvidia-smi --query-gpu=name,uuid --format=csv,noheader
# NVIDIA GB300, GPU-aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa
# NVIDIA RTX PRO 6000 ..., GPU-bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb
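Building the UUID list by hand is error-prone, so a small helper could assemble it from the nvidia-smi output. The sketch below is one way to do that under stated assumptions: find_uuid and build_cuda_visible_devices are hypothetical names, the input is the name,uuid CSV text produced by the command above, and GPUs are recognized by the substrings "GB300" and "RTX" in their names, as in the sample output.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Copy the UUID field of the first CSV line whose name field contains
 * `name_substr`. Lines look like "NVIDIA GB300, GPU-xxxx".
 * Returns 0 on success, -1 if no line matches or the UUID does not fit. */
static int find_uuid(const char *csv, const char *name_substr,
                     char *uuid, size_t uuid_len)
{
    char line[256];

    while (*csv != '\0') {
        size_t len = strcspn(csv, "\n");      /* length of the current line */
        if (len < sizeof line) {
            memcpy(line, csv, len);
            line[len] = '\0';
            char *comma = strchr(line, ',');
            if (comma != NULL) {
                *comma = '\0';                /* line now holds only the name */
                const char *field = comma + 1;
                field += strspn(field, " \t");            /* skip padding */
                size_t flen = strcspn(field, " \t\r");    /* trim the tail */
                if (strstr(line, name_substr) != NULL &&
                    flen > 0 && flen < uuid_len) {
                    memcpy(uuid, field, flen);
                    uuid[flen] = '\0';
                    return 0;
                }
            }
        }
        csv += len;
        if (*csv == '\n')
            csv++;
    }
    return -1;
}

/* Write "GPU-<gb300>" or "GPU-<gb300>,GPU-<rtx>" into `out`, GB300 first.
 * Returns 0 on success, -1 if no GB300 is listed or `out` is too small. */
int build_cuda_visible_devices(const char *csv, char *out, size_t out_len)
{
    char gb300[96], rtx[96];
    int n;

    if (find_uuid(csv, "GB300", gb300, sizeof gb300) != 0)
        return -1;
    if (find_uuid(csv, "RTX", rtx, sizeof rtx) == 0)
        n = snprintf(out, out_len, "%s,%s", gb300, rtx);
    else
        n = snprintf(out, out_len, "%s", gb300);
    return (n >= 0 && (size_t)n < out_len) ? 0 : -1;
}
```

The result places the GB300 UUID first regardless of the order in which nvidia-smi lists the GPUs, which matches the ordering intent described above. Matching on name substrings is a simplification; a system with several GPUs of the same family would need a stricter rule.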

How to inject the variables into a Workbench project#

Containers have an isolated environment and do not automatically inherit host-level environment variables (even those in /etc/environment or ~/.bashrc), so declare them on the project. Two ways:

Option 1 – Workbench UI (easiest). In your project, go to Environment > Variables and add each variable (NVIDIA_VISIBLE_DEVICES, CUDA_VISIBLE_DEVICES, and optionally CUDA_DEVICE_ORDER). Workbench injects these into the container at startup.

Option 2 – edit .project/spec.yaml directly, under spec.resources.variables:

spec:
  resources:
    variables:
      # expose both GPUs to the container
      - name: NVIDIA_VISIBLE_DEVICES
        value: "all"
      # order them GB300 first (cuda:0), RTX second (cuda:1), by UUID
      - name: CUDA_VISIBLE_DEVICES
        value: "GPU-aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,GPU-bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb"
      - name: CUDA_DEVICE_ORDER
        value: "PCI_BUS_ID"

Replace the two GPU-... UUIDs with the GB300 and RTX values from the command above. To expose only the GB300 instead, drop the RTX UUID from both variables.

Vulkan-CUDA Interoperability#

A common setup runs compute-heavy tasks on the GB300 while Vulkan renders on the RTX PRO GPU, so the two GPUs need a way to exchange data.

The GB300 and RTX PRO GPUs do not share device memory, so data must be copied from one to the other using CUDA’s memory transfer APIs. If CUDA was initialized before Vulkan and bound to the GB300 first, Vulkan might fail to initialize on the RTX PRO GPU; watch for VK_ERROR_INITIALIZATION_FAILED returned by vkCreateDevice().

To exchange data between a Vulkan instance running on the RTX PRO and a CUDA context running on the GB300, the following steps can be taken:

Allocate and export memory from Vulkan (RTX PRO). Create a buffer with VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT, allocate device memory with VkExportMemoryAllocateInfo in the pNext chain, then obtain a file descriptor with vkGetMemoryFdKHR.
Import that memory into CUDA on the RTX PRO. Use the CUDA driver API cuImportExternalMemory with CU_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD and the fd from Vulkan, then cuExternalMemoryGetMappedBuffer to get a device pointer valid in the RTX PRO CUDA context.
Copy from RTX PRO to GB300. Allocate the destination buffer on the GB300 with cudaMalloc (with the current device set to the GB300), then use cudaMemcpy[Async] to transfer from the RTX PRO device pointer to the GB300 device pointer. Peer-to-peer transfer APIs are not supported between the two GPUs.
Use CUDA events to synchronize the streams on both GPUs.

The following snippet assumes both GPUs are visible to CUDA using the OS default ordering, where the GB300 is CUDA device 0 and the RTX PRO is CUDA device 1. Vulkan is initialized on the RTX PRO; the fd is obtained from Vulkan and omitted here.

/* 1) Vulkan (on RTX PRO): create buffer with VkExternalMemoryBufferCreateInfo
 *    (handleTypes = VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT), allocate
 *    memory with VkExportMemoryAllocateInfo, bind, then:
 *    vkGetMemoryFdKHR(device, &(VkMemoryGetFdInfoKHR){ .memory = vkMem, ... }, &fd);
 */

/* Headers for this fragment: driver API (cu*) and runtime API (cuda*) */
#include <cuda.h>
#include <cuda_runtime.h>

int fd = -1;   /* replace with the fd returned by vkGetMemoryFdKHR */
size_t size = 1024 * 1024;

/* 2) Import into CUDA on the RTX PRO (device 1) */
const int GB300_DEVICE   = 0;
const int RTX_PRO_DEVICE = 1;

cudaSetDevice(RTX_PRO_DEVICE);
CUexternalMemory extMem;
CUDA_EXTERNAL_MEMORY_HANDLE_DESC handleDesc = {
    .type = CU_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD,
    .handle.fd = fd,
    .size = size   /* size of the exported Vulkan allocation */
};
cuImportExternalMemory(&extMem, &handleDesc);

CUDA_EXTERNAL_MEMORY_BUFFER_DESC bufferDesc = { .offset = 0, .size = size };
CUdeviceptr d_rtx;
cuExternalMemoryGetMappedBuffer(&d_rtx, extMem, &bufferDesc);

/* 3) Allocate on GB300 and copy across GPUs (RTX PRO -> GB300) */
cudaSetDevice(GB300_DEVICE);
void* d_GB300 = NULL;
cudaMalloc(&d_GB300, size);

cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(d_GB300,
                (const void*)d_rtx,
                size,
                cudaMemcpyDeviceToDevice,
                stream);
cudaStreamSynchronize(stream);   /* wait for the copy before reading d_GB300 */