Mixed Coherency Environment#
The DGX Station B300 system exposes both coherent and non-coherent memory in the same machine. Developers must detect this mixed environment and understand its implications for programming and performance.
Grace–Blackwell coherent CPU+GPU memory — hardware-maintained coherence between Grace CPU and Blackwell Ultra GPU, with a single shared address space accessible by both CPU and GPU without explicit cache management.
Traditional non-coherent GPU memory — on optional discrete GPU(s) (such as RTX PRO) connected through PCIe.
Coherent Memory#
The two types of onboard memory (HBM on the GPU, LPDDR5X on the CPU) are exposed as a single coherent memory space accessible by both the Grace CPU and the integrated B300 GPU. This is implemented via a cache snooping protocol between the CPU and GPU. A caveat to keep in mind, though, is that while the CPU can cache accesses to B300’s memory, B300’s access to CPU memory is not cached. Although the onboard memory is physically separate, it is logically unified: both address spaces are added to the system memory pool and usable as regular program memory.
From a programmer’s point of view: data in that pool can be read and written by CPU and GPU concurrently and remain consistent. There is no need to perform explicit cache management; the hardware maintains coherence. System memory pointers can directly be accessed by the CUDA kernels.
Non-Coherent Memory#
The RTX PRO GPU(s) in the system have their own dedicated memory without a cache snooping mechanism between the GPU and CPU. Thus, both memories are non-coherent. When sharing mappings between the CPU and RTX PRO GPU, the developer typically either performs manual cache maintenance (flush/invalidate) or creates these mappings as non-cacheable, with performance implications.
CUDA Unified Virtual Memory#
CUDA Unified Virtual Memory (UVM) is a programming model that promises a single shared memory address space between all processors in a system.
This works by allocating memory through dedicated CUDA APIs (such as cudaMallocManaged) and letting the CUDA driver and runtime manage the movement of data between CPU and GPU memory as needed.
This is further extended by Heterogeneous Memory Management (HMM), which allows the GPU to also access memory allocated through the system allocator (such as malloc).
On non-coherent GPUs, this is implemented in software through page fault mechanisms that mirror the CPU page tables on the GPU. On coherent GPUs,
it is implemented through hardware support in the form of Address Translation Services (ATS), where the GPU can walk the CPU page tables and access CPU memory directly.
Querying for Coherency-Mode Support#
To enumerate the GPUs in the system and find their index, run:
nvidia-smi -L
To detect programmatically whether a GPU has HMM support, use the NVML API:
#include <nvml.h>
#include <stdio.h>
int main(void)
{
nvmlReturn_t ret;
nvmlDevice_t device;
nvmlDeviceAddressingMode_t mode = {.version = nvmlDeviceAddressingMode_v1};
unsigned int deviceCount = 0;
char deviceName[256] = {0};
printf("NVML Test\n");
ret = nvmlInit();
ret = nvmlDeviceGetCount(&deviceCount);
for(unsigned int d = 0; d < deviceCount; ++d)
{
ret = nvmlDeviceGetHandleByIndex(d, &device);
ret = nvmlDeviceGetName(device, deviceName, sizeof(deviceName));
ret = nvmlDeviceGetAddressingMode(device, &mode);
int coherent = (mode.value == NVML_DEVICE_ADDRESSING_MODE_ATS ||
mode.value == NVML_DEVICE_ADDRESSING_MODE_HMM);
printf("GPU %u: %s is: %s (mode %s)\n", d, deviceName,
coherent ? "coherent" : "non-coherent",
mode.value == NVML_DEVICE_ADDRESSING_MODE_ATS ? "ATS" : "HMM");
}
nvmlShutdown();
return 0;
}
(Error handling omitted for brevity.) Compile with:
gcc -I/usr/local/cuda/include/ -L/usr/local/cuda/lib64/stubs nvmltest.cpp -l nvidia-ml -o nvmltest
This produces output similar to the following:
user@localhost:~$ ./nvmltest
NVML Test
GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition is: coherent (mode HMM)
GPU 1: NVIDIA B300 is: coherent (mode ATS)
UVM Pitfalls#
Though the DGX Station B300 NUMA environment offers substantial memory accessible by both CPU and B300 GPU, understand the performance implications of using UVM.
The B300 GPU and CPU each have dedicated physical memory; when crossing the boundary between them, bandwidth is limited to what NVLink can provide.
It is therefore recommended to use UVM only for ease of programming and when the working set size exceeds the capacity of the B300 GPU memory.
Avoid using UVM as a default for all memory allocations. For highest performance, allocate memory explicitly in the appropriate memory space
(such as cudaMalloc for B300 GPU memory, malloc for CPU memory) and manage data movement explicitly.
Since B300 GPU memory is part of the system memory pool, allocations through malloc may inadvertently end up in B300 GPU memory. Use numactl to control the memory allocation policy.
Current Limitations of the Mixed Coherency Environment#
Currently, CUDA cannot handle mixed coherency GPUs in the same process, thus only one of the GPUs (B300 or RTX PRO) in the system can be accessed through CUDA in a process.
To select the GPU to use, set the CUDA_VISIBLE_DEVICES environment variable.
export CUDA_VISIBLE_DEVICES=0 selects the B300 GPU.
For more on CUDA_VISIBLE_DEVICES, see https://developer.nvidia.com/blog/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/
CUDA provides an additional environment variable CUDA_DEVICE_MODALITY which can be used to select either the ATS-capable GPU or the non-ATS-capable GPU. Valid values for CUDA_DEVICE_MODALITY are ATS and NONATS.
Device selection for all NVIDIA drivers can also be done using application profile keys. The NVIDIA driver installer puts a default profile in /usr/share/nvidia/nvidia-application-profiles-[verMaj.verMin]-rc and key documentation in /usr/share/nvidia/nvidia-application-profiles-[verMaj.verMin]-key-documentation.
The relevant app profile key for selecting a GPU on B300 Galaxy Workstation is DeviceModalityPreference; valid values are ATS, NONATS.
/usr/share/nvidia/nvidia-application-profiles-[verMaj.verMin]-rc shows in the profiles section:
"profiles" : [
{ "name": "UseATSGpuInMixedCoherencySystems ", "settings": [ "DeviceModalityPreference", 1 ] },
{ "name": "UseNonATSGpuInMixedCoherencySystems ", "settings": [ "DeviceModalityPreference", 2 ] }
]
Apps can then select the right profile using the rules section:
"rules" : [
{ "pattern": { "feature":"cmdline", "matches": "uvmConformance" } , "profile": "noATS" }
]
System level specifications for keys can be placed in /etc/nvidia/nvidia-application-profiles-rc
while user level specifications can be placed in ~/.nv/nvidia-application-profiles-rc
NOTE: User-level specifications override both system and NVIDIA driver defaults; system-level specifications override NVIDIA driver defaults.
For more information and best practices when using CUDA_VISIBLE_DEVICES with containers, see https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/user-guide.html#cuda-visible-devices-with-containers.
Future Updates to the Mixed Coherency Environment#
A future software update will allow access to both coherent and non-coherent GPUs in CUDA within the same process. This will imply the following changes:
The B300 GPU will by default no longer appear as a NUMA node in the system; it will switch to Coherent Driver-based Memory Management (CDMM) mode instead. In CDMM mode, the CPU memory is managed by the Linux kernel and the GPU memory is managed by the NVIDIA driver. This means the NVIDIA driver, not the OS, is responsible for managing the GPU memory and has full control over how the GPU memory is used, thereby offering greater control and often better application performance.
Both GPUs (B300 and RTX PRO) will be accessible in CUDA in the same process.
The system can still be configured to run in NUMA mode (along with the above limitations), which might be useful in scenarios where there is no dedicated RTX PRO GPU in the system.
Vulkan-CUDA Interoperability#
Since you may want to run compute-heavy tasks on the B300, while running Vulkan on the RTX PRO GPU, it is desirable to be able to share data between the two.
A process using Vulkan will currently not be able to access the B300 GPU - only the RTX PRO GPU will be accessible via CUDA. On top, if CUDA was initialized before Vulkan, it may bind to the B300 first, preventing Vulkan from properly initializing. Look out for VK_ERROR_INITIALIZATION_FAILED returned by vkCreateDevice()!
IPC mechanisms can be used to work around this. This may be cumbersome to implement and inefficient. Thus, if you have the choice, the recommendation is to wait for the aforementioned software update that will enable CUDA to use both coherent and non-coherent GPUs in the same process.
Once the software update is available, one can use the Vulkan-CUDA interoperability feature to share data between the two. The only limitation remaining then will be that the coherent and non-coherent GPUs cannot share the same memory. Instead, data will have to be copied from one to the other. CUDA’s memory transfer APIs can be used to achieve this in an optimal manner.
To share data between a Vulkan instance running on the RTX PRO and a CUDA context running on the B300, the following steps can be taken:
Allocate and export memory from Vulkan (RTX PRO). Create a buffer with
VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT, allocate device memory withVkExportMemoryAllocateInfoin thepNextchain, then obtain a file descriptor withvkGetMemoryFdKHR.Import that memory into CUDA on the RTX PRO. Use the CUDA driver API
cuImportExternalMemorywithCU_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FDand the fd from Vulkan, thencuExternalMemoryGetMappedBufferto get a device pointer valid in the RTX PRO CUDA context.Copy from RTX PRO to B300. Allocate the destination buffer on the B300 with
cudaMalloc(with the current device set to the B300), then usecudaMemcpy[Async]to transfer from the RTX PRO device pointer to the B300 device pointer. Peer-to-peer transfer APIs are not supported between the two GPUs.Use CUDA events to synchronize the streams on both GPUs.
The following snippet assumes both GPUs are visible to CUDA (for example, CUDA_VISIBLE_DEVICES=0,1). Vulkan is initialized on the RTX PRO; the fd is obtained from Vulkan and omitted here.
/* 1) Vulkan (on RTX PRO): create buffer with VkExternalMemoryBufferCreateInfo
* (handleTypes = VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT), allocate
* memory with VkExportMemoryAllocateInfo, bind, then:
* vkGetMemoryFdKHR(device, &(VkMemoryGetFdInfoKHR){ .memory = vkMem, ... }, &fd);
*/
int fd; /* from vkGetMemoryFdKHR */
size_t size = 1024 * 1024;
/* 2) Import into CUDA on the RTX PRO (device 0) */
const int RTX_PRO_DEVICE = 0;
const int B300_DEVICE = 1;
cudaSetDevice(RTX_PRO_DEVICE);
CUexternalMemory extMem;
CU_EXTERNAL_MEMORY_HANDLE_DESC handleDesc = {
.type = CU_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD,
.handle.fd = fd
};
cuImportExternalMemory(&extMem, &handleDesc);
CU_EXTERNAL_MEMORY_BUFFER_DESC bufferDesc = { .offset = 0, .size = size };
CUdeviceptr d_rtx;
cuExternalMemoryGetMappedBuffer(&d_rtx, extMem, &bufferDesc);
/* 3) Allocate on B300 and copy peer-to-peer (RTX PRO -> B300) */
cudaSetDevice(B300_DEVICE);
void* d_B300 = NULL;
cudaMalloc(&d_B300, size);
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(d_B300,
(const void*)d_rtx,
size,
cudaMemcpyDeviceToDevice,
stream);