4.3. Stream-Ordered Memory Allocator#
4.3.1. Introduction#
Managing memory allocations using cudaMalloc and cudaFree causes the GPU
to synchronize across all executing CUDA streams. The stream-ordered memory
allocator enables applications to order memory allocation and deallocation
with other work launched into a CUDA stream such as kernel launches and
asynchronous copies. This improves application memory use by taking advantage
of stream-ordering semantics to reuse memory allocations. The allocator also
allows applications to control the allocator’s memory caching behavior. When
set up with an appropriate release threshold, the caching behavior allows the
allocator to avoid expensive calls into the OS when the application indicates
it is willing to accept a bigger memory footprint. The allocator also supports
easy and secure allocation sharing between processes.
Stream-Ordered Memory Allocator:
Reduces the need for custom memory management abstractions, and makes it easier to create high-performance custom memory management for applications that need it.
Enables multiple libraries to share a common memory pool managed by the driver. This can reduce excess memory consumption.
Allows the driver to perform optimizations based on its awareness of the allocator and other stream management APIs.
Note
Nsight Compute and the Next-Gen CUDA debugger are aware of the allocator as of CUDA 11.3.
4.3.2. Memory Management#
cudaMallocAsync and cudaFreeAsync are the APIs which enable stream-ordered memory management.
cudaMallocAsync returns an allocation and cudaFreeAsync frees an
allocation. Both APIs accept stream arguments that define when the allocation
becomes available for use and when it stops being available.
These functions allow memory operations to be tied to specific CUDA streams,
enabling them to occur without blocking the host or other
streams. Application performance can be improved by avoiding potentially
costly synchronization of cudaMalloc and cudaFree.
These APIs can be used for further performance optimization through memory pools, which manage and reuse large blocks of memory for more efficient allocation and deallocation. Memory pools help reduce overhead and prevent fragmentation, improving performance in scenarios with frequent memory allocation operations.
4.3.2.1. Allocating Memory#
The cudaMallocAsync function triggers asynchronous memory allocation on the
GPU, linked to a specific CUDA stream. cudaMallocAsync allows memory
allocation to occur without hindering the host or other
streams, eliminating the need for expensive synchronization.
Note
cudaMallocAsync ignores the current device/context when determining where
the allocation will reside. Instead, cudaMallocAsync determines the
appropriate device based on the specified memory pool or the supplied stream.
The listing below illustrates a fundamental use pattern: the memory is allocated, used, and then freed back into the same stream.
void *ptr;
size_t size = 512;
cudaMallocAsync(&ptr, size, cudaStreamPerThread);
// do work using the allocation
kernel<<<..., cudaStreamPerThread>>>(ptr, ...);
// An asynchronous free can be specified without synchronizing the CPU and GPU
cudaFreeAsync(ptr, cudaStreamPerThread);
Note
When accessing the allocation from a stream other than the stream that made the allocation, the user must guarantee that the access occurs after the allocation operation; otherwise, the behavior is undefined.
4.3.2.2. Freeing Memory#
cudaFreeAsync() asynchronously frees device memory in a stream-ordered
fashion, meaning the memory deallocation is assigned to a specific CUDA stream
and does not block the host or other streams.
The user must guarantee that the free operation happens after the allocation operation and any uses of the allocation. Any use of the allocation after the free operation starts results in undefined behavior.
Events and/or stream synchronizing operations should be used to guarantee any access to the allocation from other streams is complete before the free operation begins, as illustrated in the following example.
cudaMallocAsync(&ptr, size, stream1);
cudaEventRecord(event1, stream1);
//stream2 must wait for the allocation to be ready before accessing
cudaStreamWaitEvent(stream2, event1);
kernel<<<..., stream2>>>(ptr, ...);
cudaEventRecord(event2, stream2);
// stream3 must wait for stream2 to finish accessing the allocation before
// freeing the allocation
cudaStreamWaitEvent(stream3, event2);
cudaFreeAsync(ptr, stream3);
Memory allocated with cudaMalloc() can be freed with
cudaFreeAsync(). As above, all accesses to the memory must be
complete before the free operation begins.
cudaMalloc(&ptr, size);
kernel<<<..., stream>>>(ptr, ...);
cudaFreeAsync(ptr, stream);
Likewise, memory allocated with cudaMallocAsync can be freed with
cudaFree(). When freeing such allocations through the cudaFree() API,
the driver assumes that all accesses to the allocation are complete and
performs no further synchronization. The user can use cudaStreamQuery /
cudaStreamSynchronize / cudaEventQuery / cudaEventSynchronize /
cudaDeviceSynchronize to guarantee that the appropriate asynchronous work
is complete and that the GPU will not try to access the allocation.
cudaMallocAsync(&ptr, size, stream);
kernel<<<..., stream>>>(ptr, ...);
// synchronize is needed to avoid prematurely freeing the memory
cudaStreamSynchronize(stream);
cudaFree(ptr);
4.3.3. Memory Pools#
Memory pools encapsulate virtual address and physical memory resources that are allocated and managed according to the pool's attributes and properties. The primary aspect of a memory pool is the kind and location of memory it manages.
All calls to cudaMallocAsync use resources from a memory pool. If
a memory pool is not specified, cudaMallocAsync uses the current
memory pool of the supplied stream’s device. The current memory pool for a
device may be set with cudaDeviceSetMemPool and queried with
cudaDeviceGetMemPool. Each device has a default memory pool, which is
active if cudaDeviceSetMemPool has not been called.
The cudaMallocFromPoolAsync API and the C++ overloads of
cudaMallocAsync
allow a user to specify the pool to be used for an allocation without setting
it as the current pool. The APIs cudaDeviceGetDefaultMemPool and
cudaMemPoolCreate return handles to memory pools. cudaMemPoolSetAttribute
and cudaMemPoolGetAttribute control the attributes of memory pools.
Note
The mempool current to a device will be local to that device. So allocating without specifying a memory pool will always yield an allocation local to the stream’s device.
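The following sketch is illustrative only; it assumes memPool has already been created (see Explicit Pools below) and that device, stream, ptr, ptr2, and size are defined. It shows allocating from the device's current pool and allocating from a specific pool without changing the current pool.
// Query the current pool before changing it (optional).
cudaMemPool_t previousPool;
cudaDeviceGetMemPool(&previousPool, device);
// Make memPool the current pool for the device; subsequent cudaMallocAsync
// calls on this device's streams will draw from it.
cudaDeviceSetMemPool(device, memPool);
cudaMallocAsync(&ptr, size, stream);
// Alternatively, allocate from a specific pool without changing the current pool.
cudaMallocFromPoolAsync(&ptr2, size, memPool, stream);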
4.3.3.1. Default/Implicit Pools#
The default memory pool of a device can be retrieved by calling
cudaDeviceGetDefaultMemPool. Allocations from the default memory pool
of a device are non-migratable device allocations located on that device. These
allocations will always be accessible from that device. The accessibility of
the default memory pool can be modified with cudaMemPoolSetAccess and
queried with cudaMemPoolGetAccess. Since the default pools do not need to be
explicitly created, they are sometimes referred to as implicit pools. The
default memory pool of a device does not support IPC.
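As a brief sketch (assuming device and peerDevice are valid device ordinals), the default pool can be retrieved and its accessibility from another device queried:
cudaMemPool_t defaultPool;
cudaDeviceGetDefaultMemPool(&defaultPool, device);
// Query whether allocations from the default pool are accessible from peerDevice.
cudaMemAccessFlags accessFlags;
cudaMemLocation location = {};
location.type = cudaMemLocationTypeDevice;
location.id = peerDevice;
cudaMemPoolGetAccess(&accessFlags, defaultPool, &location);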
4.3.3.2. Explicit Pools#
cudaMemPoolCreate creates an explicit pool. This allows
applications to request properties for their allocations beyond what is
provided by the default/implicit pools, such as IPC capability, a maximum
pool size, or residency on a specific CPU NUMA node on supported platforms.
// create a pool similar to the implicit pool on device 0
int device = 0;
cudaMemPool_t memPool;
cudaMemPoolProps poolProps = { };
poolProps.allocType = cudaMemAllocationTypePinned;
poolProps.location.id = device;
poolProps.location.type = cudaMemLocationTypeDevice;
cudaMemPoolCreate(&memPool, &poolProps);
The following code snippet illustrates an example of creating an IPC capable memory pool on a valid CPU NUMA node.
// create a pool resident on a CPU NUMA node that is capable of IPC sharing (via a file descriptor).
int cpu_numa_id = 0;
cudaMemPool_t ipcMemPool;
cudaMemPoolProps poolProps = { };
poolProps.allocType = cudaMemAllocationTypePinned;
poolProps.location.id = cpu_numa_id;
poolProps.location.type = cudaMemLocationTypeHostNuma;
poolProps.handleTypes = cudaMemHandleTypePosixFileDescriptor;
cudaMemPoolCreate(&ipcMemPool, &poolProps);
4.3.3.3. Device Accessibility for Multi-GPU Support#
Like allocation accessibility controlled through the virtual memory
management APIs, memory pool allocation accessibility does not follow
cudaDeviceEnablePeerAccess or cuCtxEnablePeerAccess. For memory pools, the API
cudaMemPoolSetAccess modifies what devices can access allocations from a
pool. By default, allocations are accessible only from the device where the
allocations are located. This access cannot be revoked. To enable access from
other devices, the accessing device must be peer capable with the memory
pool’s device. This can be verified with cudaDeviceCanAccessPeer. If the peer capability
is not checked, the set access may fail with cudaErrorInvalidDevice. However, if no
allocations had been made from the pool, the cudaMemPoolSetAccess call may
succeed even when the devices are not peer capable. In this case, the next
allocation from the pool will fail.
It is worth noting that cudaMemPoolSetAccess affects all allocations from
the memory pool, not just future ones. Likewise, the accessibility reported by
cudaMemPoolGetAccess applies to all allocations from the pool, not just
future ones. Changing the accessibility settings of a pool for a
given GPU frequently is not recommended. That is, once a pool is made accessible from a
given GPU, it should remain accessible from that GPU for the lifetime of the
pool.
// snippet showing usage of cudaMemPoolSetAccess:
cudaError_t setAccessOnDevice(cudaMemPool_t memPool, int residentDevice,
                              int accessingDevice) {
    cudaMemAccessDesc accessDesc = {};
    accessDesc.location.type = cudaMemLocationTypeDevice;
    accessDesc.location.id = accessingDevice;
    accessDesc.flags = cudaMemAccessFlagsProtReadWrite;

    int canAccess = 0;
    cudaError_t error = cudaDeviceCanAccessPeer(&canAccess, accessingDevice,
                                                residentDevice);
    if (error != cudaSuccess) {
        return error;
    } else if (canAccess == 0) {
        return cudaErrorPeerAccessUnsupported;
    }

    // Make the address accessible
    return cudaMemPoolSetAccess(memPool, &accessDesc, 1);
}
4.3.3.4. Enabling Memory Pools for IPC#
Memory pools can be enabled for interprocess communication (IPC) to allow easy, efficient and secure sharing of GPU memory between processes. CUDA’s IPC memory pools provide the same security benefits as CUDA’s virtual memory management APIs.
There are two steps to sharing memory between processes with memory pools: the processes first need to share access to the pool, then share specific allocations from that pool. The first step establishes and enforces security. The second step coordinates which virtual addresses are used in each process and when mappings need to be valid in the importing process.
4.3.3.4.1. Creating and Sharing IPC Memory Pools#
Sharing access to a pool involves retrieving an OS-native handle to the pool
with cudaMemPoolExportToShareableHandle(), transferring the
handle to the importing process using OS-native IPC mechanisms, and then
creating an imported memory pool with the
cudaMemPoolImportFromShareableHandle() API. For
cudaMemPoolExportToShareableHandle to succeed, the memory pool must have been
created with the requested handle type specified in the pool properties
structure.
Refer to the CUDA samples for the appropriate IPC mechanisms to transfer the OS-native handle between processes. The rest of the procedure is shown in the following code snippets.
// in exporting process
// create an exportable IPC capable pool on device 0
cudaMemPool_t memPool;
cudaMemPoolProps poolProps = { };
poolProps.allocType = cudaMemAllocationTypePinned;
poolProps.location.id = 0;
poolProps.location.type = cudaMemLocationTypeDevice;
// Setting handleTypes to a non zero value will make the pool exportable (IPC capable)
poolProps.handleTypes = cudaMemHandleTypePosixFileDescriptor;
cudaMemPoolCreate(&memPool, &poolProps);
// FD based handles are integer types
int fdHandle = 0;
// Retrieve an OS native handle to the pool.
// Note that a pointer to the handle memory is passed in here.
cudaMemPoolExportToShareableHandle(&fdHandle,
                                   memPool,
                                   cudaMemHandleTypePosixFileDescriptor,
                                   0);
// The handle must be sent to the importing process with the appropriate
// OS-specific APIs.
// in importing process
int fdHandle;
cudaMemPool_t importedMemPool;
// The handle needs to be retrieved from the exporting process with the
// appropriate OS-specific APIs.
// Create an imported pool from the shareable handle.
// Note that the handle is passed by value here.
cudaMemPoolImportFromShareableHandle(&importedMemPool,
                                     (void*)(uintptr_t)fdHandle,
                                     cudaMemHandleTypePosixFileDescriptor,
                                     0);
4.3.3.4.2. Set Access in the Importing Process#
Imported memory pools are initially only accessible from their resident
device. The imported memory pool does not inherit any accessibility set by the
exporting process. The importing process needs to enable access with
cudaMemPoolSetAccess from any GPU it plans to access the memory from.
If the imported memory pool belongs to a device that is not visible to the
importing process, the user must use the cudaMemPoolSetAccess API to enable
access from the GPUs the allocations will be used on
(see Device Accessibility for Multi-GPU Support).
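A minimal sketch of this step, assuming importedMemPool was obtained as in the previous snippet and accessingDevice is the GPU in the importing process that will touch the imported allocations:
cudaMemAccessDesc accessDesc = {};
accessDesc.location.type = cudaMemLocationTypeDevice;
accessDesc.location.id = accessingDevice;
accessDesc.flags = cudaMemAccessFlagsProtReadWrite;
// Enable access to the imported pool from the GPU that will use the
// imported allocations in this process.
cudaMemPoolSetAccess(importedMemPool, &accessDesc, 1);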
4.3.3.4.3. Creating and Sharing Allocations from an Exported Pool#
Once the pool has been shared, allocations made with cudaMallocAsync()
from the pool in the exporting process can be shared with processes that
have imported the pool. Since the pool’s security policy is established and
verified at the pool level, the OS does not need extra bookkeeping to provide
security for specific pool allocations. In other words, the opaque
cudaMemPoolPtrExportData required to import a pool allocation may be sent
to the importing process using any mechanism.
While allocations may be exported and imported without synchronizing with
the allocating stream in any way, the importing process must follow the same
rules as the exporting process when accessing the allocation. Specifically, access
to the allocation must happen after the allocation
operation in the allocating stream executes. The two following code snippets show
cudaMemPoolExportPointer() and cudaMemPoolImportPointer() sharing the
allocation with an IPC event used to guarantee that the allocation isn’t
accessed in the importing process before the allocation is ready.
// preparing an allocation in the exporting process
cudaMemPoolPtrExportData exportData;
cudaEvent_t readyIpcEvent;
cudaIpcEventHandle_t readyIpcEventHandle;

// IPC event for coordinating between processes
// cudaEventInterprocess flag makes the event an IPC event
// cudaEventDisableTiming is set for performance reasons
cudaEventCreateWithFlags(&readyIpcEvent, cudaEventDisableTiming | cudaEventInterprocess);

// allocate from the exporting mem pool
cudaMallocAsync(&ptr, size, exportMemPool, stream);

// event for sharing when the allocation is ready.
cudaEventRecord(readyIpcEvent, stream);
cudaMemPoolExportPointer(&exportData, ptr);
cudaIpcGetEventHandle(&readyIpcEventHandle, readyIpcEvent);

// Share IPC event and pointer export data with the importing process using
// any mechanism. Here we copy the data into shared memory
shmem->ptrData = exportData;
shmem->readyIpcEventHandle = readyIpcEventHandle;
// signal consumers data is ready
// Importing an allocation
cudaMemPoolPtrExportData *importData = &shmem->ptrData;
cudaEvent_t readyIpcEvent;
cudaIpcEventHandle_t *readyIpcEventHandle = &shmem->readyIpcEventHandle;

// Need to retrieve the ipc event handle and the export data from the
// exporting process using any mechanism. Here we are using shmem and just
// need synchronization to make sure the shared memory is filled in.
cudaIpcOpenEventHandle(&readyIpcEvent, *readyIpcEventHandle);

// import the allocation. The operation does not block on the allocation being ready.
cudaMemPoolImportPointer(&ptr, importedMemPool, importData);

// Wait for the prior stream operations in the allocating stream to complete before
// using the allocation in the importing process.
cudaStreamWaitEvent(stream, readyIpcEvent);
kernel<<<..., stream>>>(ptr, ...);
When freeing the allocation, the allocation must be freed in the importing
process before it is freed in the exporting process. The following code
snippet demonstrates the use of CUDA IPC events to provide the required
synchronization between the cudaFreeAsync operations in both processes.
Access to the allocation in the importing process is naturally bounded by the
free operation on the importing process's side. Note that cudaFree can be
used to free the allocation in both processes and that other stream
synchronization APIs may be used instead of CUDA IPC events.
// The free must happen in the importing process before the exporting process
kernel<<<..., stream>>>(ptr, ...);
// Last access in importing process
cudaFreeAsync(ptr, stream);
// Access not allowed in the importing process after the free
cudaEventRecord(finishedIpcEvent, stream);

// Exporting process
// The exporting process needs to coordinate its free with the stream order
// of the importing process's free.
cudaStreamWaitEvent(stream, finishedIpcEvent);
kernel<<<..., stream>>>(ptrInExportingProcess, ...);
// The free in the importing process doesn't stop the exporting process
// from using the allocation.
cudaFreeAsync(ptrInExportingProcess, stream);
4.3.3.4.4. IPC Export Pool Limitations#
IPC pools currently do not support releasing physical blocks back to the OS.
As a result, the cudaMemPoolTrimTo API has no effect and
cudaMemPoolAttrReleaseThreshold is effectively ignored. This behavior is
controlled by the driver, not the runtime, and may change in a future driver
update.
4.3.3.4.5. IPC Import Pool Limitations#
Allocating from an import pool is not allowed; specifically, import pools
cannot be set current and cannot be used in the cudaMallocFromPoolAsync
API. As such, the allocation reuse policy attributes do not have meaning for these
pools.
IPC Import pools, like IPC export pools, currently do not support releasing physical blocks back to the OS.
The resource usage stat attribute queries only reflect the allocations imported into the process and the associated physical memory.
4.3.4. Best Practices and Tuning#
4.3.4.1. Query for Support#
An application can determine whether or not a device supports the stream-ordered
memory allocator by calling cudaDeviceGetAttribute()
with the device attribute cudaDevAttrMemoryPoolsSupported.
IPC memory pool support can be queried with the
cudaDevAttrMemoryPoolSupportedHandleTypes device attribute. This attribute was added
in CUDA 11.3, and older drivers will return cudaErrorInvalidValue when this attribute
is queried.
int driverVersion = 0;
int deviceSupportsMemoryPools = 0;
int poolSupportedHandleTypes = 0;
cudaDriverGetVersion(&driverVersion);
if (driverVersion >= 11020) {
    cudaDeviceGetAttribute(&deviceSupportsMemoryPools,
                           cudaDevAttrMemoryPoolsSupported, device);
}
if (deviceSupportsMemoryPools != 0) {
    // `device` supports the Stream-Ordered Memory Allocator
}
if (driverVersion >= 11030) {
    cudaDeviceGetAttribute(&poolSupportedHandleTypes,
                           cudaDevAttrMemoryPoolSupportedHandleTypes, device);
}
if (poolSupportedHandleTypes & cudaMemHandleTypePosixFileDescriptor) {
    // Pools on the specified device can be created with posix file descriptor-based IPC
}
Performing the driver version check before the query avoids hitting a
cudaErrorInvalidValue error on drivers where the attribute was not yet
defined. One can use cudaGetLastError to clear the error instead of
avoiding it.
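As a sketch of the latter approach, the attribute can be queried unconditionally and a cudaErrorInvalidValue from an older driver cleared with cudaGetLastError (device is assumed to be a valid device ordinal):
int poolSupportedHandleTypes = 0;
cudaError_t status = cudaDeviceGetAttribute(&poolSupportedHandleTypes,
                                            cudaDevAttrMemoryPoolSupportedHandleTypes,
                                            device);
if (status == cudaErrorInvalidValue) {
    // Older driver: the attribute is not defined, so treat the feature as
    // unsupported and clear the error state.
    poolSupportedHandleTypes = 0;
    cudaGetLastError();
}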
4.3.4.2. Physical Page Caching Behavior#
By default, the allocator tries to minimize the physical memory owned by a
pool. To minimize the OS calls to allocate and free physical memory,
applications must configure a memory footprint for each pool. Applications can
do this with the release threshold attribute
(cudaMemPoolAttrReleaseThreshold).
The release threshold is the amount of memory in bytes a pool should hold onto before trying to release memory back to the OS. When more than the release threshold bytes of memory are held by the memory pool, the allocator will try to release memory back to the OS on the next call to stream, event or device synchronize. Setting the release threshold to UINT64_MAX will prevent the driver from attempting to shrink the pool after every synchronization.
cuuint64_t setVal = UINT64_MAX;
cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReleaseThreshold, &setVal);
Applications that set cudaMemPoolAttrReleaseThreshold high enough to
effectively disable memory pool shrinking may wish to explicitly shrink a
memory pool’s memory footprint. cudaMemPoolTrimTo allows applications
to do so. When trimming a memory pool’s footprint, the minBytesToKeep
parameter allows an application to hold onto a specified amount of memory,
for example the amount it expects to need in a subsequent phase of execution.
cuuint64_t setVal = UINT64_MAX;
cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReleaseThreshold, &setVal);
// application phase needing a lot of memory from the stream-ordered allocator
for (int i = 0; i < 10; i++) {
    for (int j = 0; j < 10; j++) {
        cudaMallocAsync(&ptrs[j], size[j], stream);
    }
    kernel<<<..., stream>>>(ptrs, ...);
    for (int j = 0; j < 10; j++) {
        cudaFreeAsync(ptrs[j], stream);
    }
}
// Process does not need as much memory for the next phase.
// Synchronize so that the trim operation will know that the allocations are no
// longer in use.
cudaStreamSynchronize(stream);
cudaMemPoolTrimTo(memPool, 0);
// Some other process/allocation mechanism can now use the physical memory
// released by the trimming operation.
4.3.4.3. Resource Usage Statistics#
Querying the cudaMemPoolAttrReservedMemCurrent attribute of a pool reports
the current total physical GPU memory consumed by the pool. Querying the
cudaMemPoolAttrUsedMemCurrent of a pool returns the total size of all of
the memory allocated from the pool and not available for reuse.
The cudaMemPoolAttr*MemHigh attributes are watermarks recording the maximum
value reached by the respective cudaMemPoolAttr*MemCurrent attribute
since the last reset. They can be reset to the current value using the
cudaMemPoolSetAttribute API.
// sample helper functions for getting the usage statistics in bulk
struct usageStatistics {
    cuuint64_t reserved;
    cuuint64_t reservedHigh;
    cuuint64_t used;
    cuuint64_t usedHigh;
};

void getUsageStatistics(cudaMemPool_t memPool, struct usageStatistics *statistics)
{
    cudaMemPoolGetAttribute(memPool, cudaMemPoolAttrReservedMemCurrent, &statistics->reserved);
    cudaMemPoolGetAttribute(memPool, cudaMemPoolAttrReservedMemHigh, &statistics->reservedHigh);
    cudaMemPoolGetAttribute(memPool, cudaMemPoolAttrUsedMemCurrent, &statistics->used);
    cudaMemPoolGetAttribute(memPool, cudaMemPoolAttrUsedMemHigh, &statistics->usedHigh);
}

// resetting the watermarks will make them take on the current value.
void resetStatistics(cudaMemPool_t memPool)
{
    cuuint64_t value = 0;
    cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReservedMemHigh, &value);
    cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrUsedMemHigh, &value);
}
4.3.4.4. Memory Reuse Policies#
In order to service an allocation request, the driver attempts to reuse memory
that was previously freed via cudaFreeAsync() before attempting to
allocate more memory from the OS. For example, memory freed in a stream can
be reused immediately in a subsequent allocation request on the same stream.
When a stream is synchronized with the CPU, the memory that was
previously freed in that stream becomes available for reuse for an allocation
in any stream.
Reuse policies can be applied to both default and explicit memory pools.
The stream-ordered allocator has a few controllable allocation policies. The
pool attributes cudaMemPoolReuseFollowEventDependencies,
cudaMemPoolReuseAllowOpportunistic, and
cudaMemPoolReuseAllowInternalDependencies control these policies and are detailed below.
These policies can be enabled or disabled through a call to cudaMemPoolSetAttribute.
Upgrading to a newer CUDA driver may change, enhance, augment and/or reorder
the enumeration of the reuse policies.
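As a brief illustrative sketch (assuming memPool is a valid pool handle), each policy is toggled by passing an integer value to cudaMemPoolSetAttribute; 0 disables the policy and a non-zero value enables it.
int enable = 0;  // 0 disables the policy; a non-zero value enables it
cudaMemPoolSetAttribute(memPool, cudaMemPoolReuseFollowEventDependencies, &enable);
cudaMemPoolSetAttribute(memPool, cudaMemPoolReuseAllowOpportunistic, &enable);
cudaMemPoolSetAttribute(memPool, cudaMemPoolReuseAllowInternalDependencies, &enable);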
4.3.4.4.1. cudaMemPoolReuseFollowEventDependencies#
Before allocating more physical GPU memory, the allocator examines dependency information established by CUDA events and tries to allocate from memory freed in another stream.
cudaMallocAsync(&ptr, size, originalStream);
kernel<<<..., originalStream>>>(ptr, ...);
cudaFreeAsync(ptr, originalStream);
cudaEventRecord(event, originalStream);
// waiting on the event that captures the free in another stream
// allows the allocator to reuse the memory to satisfy
// a new allocation request in the other stream when
// cudaMemPoolReuseFollowEventDependencies is enabled.
cudaStreamWaitEvent(otherStream, event);
cudaMallocAsync(&ptr2, size, otherStream);
4.3.4.4.2. cudaMemPoolReuseAllowOpportunistic#
When the cudaMemPoolReuseAllowOpportunistic policy is enabled, the allocator
examines freed allocations to see whether the free operation's stream-order
semantics have been satisfied, for example whether the stream has passed the
point of execution indicated by the free operation. When this policy is
disabled, the allocator will still reuse memory made available when a stream
is synchronized with the CPU. Disabling this policy does not prevent the
cudaMemPoolReuseFollowEventDependencies policy from applying.
cudaMallocAsync(&ptr, size, originalStream);
kernel<<<..., originalStream>>>(ptr, ...);
cudaFreeAsync(ptr, originalStream);
// after some time, the kernel finishes running
wait(10);
// When cudaMemPoolReuseAllowOpportunistic is enabled this allocation request
// can be fulfilled with the prior allocation based on the progress of originalStream.
cudaMallocAsync(&ptr2, size, otherStream);
4.3.4.4.3. cudaMemPoolReuseAllowInternalDependencies#
If the driver fails to allocate and map more physical memory from the OS, it will look for memory whose availability depends on another stream’s pending progress. If such memory is found, the driver will insert the required dependency into the allocating stream and reuse the memory.
cudaMallocAsync(&ptr, size, originalStream);
kernel<<<..., originalStream>>>(ptr, ...);
cudaFreeAsync(ptr, originalStream);
// When cudaMemPoolReuseAllowInternalDependencies is enabled
// and the driver fails to allocate more physical memory, the driver may
// effectively perform a cudaStreamWaitEvent in the allocating stream
// to make sure that future work in ‘otherStream’ happens after the work
// in the original stream that would be allowed to access the original allocation.
cudaMallocAsync(&ptr2, size, otherStream);
4.3.4.4.4. Disabling Reuse Policies#
While the controllable reuse policies improve memory reuse, users may want to
disable them. Allowing opportunistic reuse (such as
cudaMemPoolReuseAllowOpportunistic) introduces run-to-run variance in
allocation patterns based on the interleaving of CPU and GPU execution.
Internal dependency insertion (such as
cudaMemPoolReuseAllowInternalDependencies) can serialize work in
unexpected and potentially non-deterministic ways when the user would rather
explicitly synchronize an event or stream on allocation failure.
4.3.4.5. Synchronization API Actions#
One of the optimizations that comes with the allocator being part of the CUDA
driver is integration with the synchronize APIs. When the user requests that
the CUDA driver synchronize, the driver waits for asynchronous work to
complete. Before returning, the driver determines which free operations the
synchronization guarantees to be complete. The freed allocations become
available for reuse regardless of the specified stream or disabled allocation
policies. The driver also checks cudaMemPoolAttrReleaseThreshold here and
releases any excess physical memory that it can.
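As an illustrative sketch (assuming ptr, ptr2, size, streamA, and streamB are defined), memory freed in one stream can satisfy an allocation in a different stream once a synchronization call guarantees the free has completed, even with the reuse policies disabled.
cudaFreeAsync(ptr, streamA);
// The synchronize guarantees the free in streamA has completed; the freed
// memory is then available to satisfy an allocation in any stream,
// regardless of the reuse policy settings.
cudaDeviceSynchronize();
cudaMallocAsync(&ptr2, size, streamB);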
4.3.5. Addendums#
4.3.5.1. cudaMemcpyAsync Current Context/Device Sensitivity#
In the current CUDA driver, any asynchronous memcpy involving memory from
cudaMallocAsync should be performed with the specified stream's context as the
calling thread's current context. This is not necessary for
cudaMemcpyPeerAsync, because the device primary contexts specified in the API
are referenced instead of the current context.
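A minimal sketch, assuming the allocation resides on device and the copy is issued on a stream of that device (hostBuffer, ptr, size, and stream are assumed to be defined): cudaSetDevice makes the device's primary context current on the calling thread before the copy.
// Make the stream's device (and thus its primary context) current on the
// calling thread before issuing an async copy of a cudaMallocAsync allocation.
cudaSetDevice(device);
cudaMemcpyAsync(hostBuffer, ptr, size, cudaMemcpyDeviceToHost, stream);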
4.3.5.2. cudaPointerGetAttributes Query#
Invoking cudaPointerGetAttributes on an allocation after invoking
cudaFreeAsync on it results in undefined behavior. Specifically, it does
not matter if an allocation is still accessible from a given stream: the
behavior is still undefined.
4.3.5.3. cudaGraphAddMemsetNode#
cudaGraphAddMemsetNode does not work with memory allocated via the
stream-ordered allocator. However, memsets of such allocations can be stream captured.
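As an illustrative sketch (graph instantiation, launch, and cleanup omitted; ptr, size, and stream are assumed to be defined), the memset is recorded into a graph through stream capture with cudaMemsetAsync rather than an explicit memset node.
cudaGraph_t graph;
cudaMallocAsync(&ptr, size, stream);

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
// Captured as a graph node instead of calling cudaGraphAddMemsetNode directly.
cudaMemsetAsync(ptr, 0, size, stream);
cudaStreamEndCapture(stream, &graph);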
4.3.5.4. Pointer Attributes#
The cudaPointerGetAttributes query works on stream-ordered allocations.
Since stream-ordered allocations are not context associated, querying
CU_POINTER_ATTRIBUTE_CONTEXT will succeed but return NULL in *data.
The attribute CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL can be used to determine
the location of the allocation: this can be useful when selecting a context
for making p2h2p copies using cudaMemcpyPeerAsync. The attribute
CU_POINTER_ATTRIBUTE_MEMPOOL_HANDLE was added in CUDA 11.3 and can be
useful for debugging and for confirming which pool an allocation comes from
before doing IPC.
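As a brief sketch using the driver API (assuming dptr is a CUdeviceptr referring to a stream-ordered allocation), these attributes can be queried with cuPointerGetAttribute:
int deviceOrdinal = 0;
CUmemoryPool memPoolHandle;
// Device ordinal where the allocation resides.
cuPointerGetAttribute(&deviceOrdinal, CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL, dptr);
// Pool the allocation was made from (added in CUDA 11.3).
cuPointerGetAttribute(&memPoolHandle, CU_POINTER_ATTRIBUTE_MEMPOOL_HANDLE, dptr);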
4.3.5.5. CPU Virtual Memory#
When using CUDA stream-ordered memory allocator APIs, avoid limiting the CPU virtual address space with “ulimit -v”, as this is not supported.