core.inference.unified_memory#

Module Contents#

Classes#

CompilationState

Enum distinguishing the unified memory (UVM) compilation states.

Functions#

_compile_timeout

Context manager that enforces a timeout on compilation.

compile_allocator

Attempt to compile the UVM allocator.

create_unified_mempool

Create a unified memory mempool using CUDA managed memory.

_get_ctypes_lib

Return a ctypes handle to the compiled UVM extension (.so).

prefetch_managed_tensor

Prefetch a CUDA tensor allocated from the UVM mempool to a specific device.

advise_managed_tensor_preferred_location

Set the preferred physical location hint for a managed tensor.

advise_managed_tensor_accessed_by

Hint that a specific device will access the managed tensor.

prefetch_managed_module_parameters

Prefetch all UVM-allocated parameters (and optionally buffers) of a module.

advise_managed_module_parameters_preferred_location

Set the preferred physical location hint for all UVM parameters in a module.

Data#

API#

class core.inference.unified_memory.CompilationState(*args, **kwds)#

Bases: enum.Enum

Enum distinguishing the unified memory (UVM) compilation states.

Initialization

UNATTEMPTED#

‘auto(…)’

FAILURE#

‘auto(…)’

SUCCESS#

‘auto(…)’
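For orientation, the members above correspond to a minimal sketch like the following (member names and `auto()` values are taken from the rendering above; the actual source may differ):

```python
import enum

class CompilationState(enum.Enum):
    UNATTEMPTED = enum.auto()  # compilation has not been tried yet
    FAILURE = enum.auto()      # a compile was attempted and failed
    SUCCESS = enum.auto()      # the UVM allocator compiled successfully
```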

exception core.inference.unified_memory.UnifiedMemoryUnsupportedError#

Bases: Exception

Unified memory is not supported on this system.

Initialization

Initialize self. See help(type(self)) for accurate signature.

exception core.inference.unified_memory.UnifiedMemoryCompileTimeoutError#

Bases: core.inference.unified_memory.UnifiedMemoryUnsupportedError

Unified memory compilation timed out.

Initialization

Initialize self. See help(type(self)) for accurate signature.

core.inference.unified_memory._compilation_state#

None

core.inference.unified_memory._alloc#

None

core.inference.unified_memory._mod#

None

core.inference.unified_memory._so_path#

None

core.inference.unified_memory._ctypes_lib#

None

core.inference.unified_memory._ctypes_lock#

‘Lock(…)’

core.inference.unified_memory._compilation_error: str | None#

None

core.inference.unified_memory._compile_timeout(timeout_s: int)#

Context manager that enforces a timeout on compilation.

Parameters:

timeout_s (int) – Timeout in seconds.
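A minimal sketch of one plausible implementation, assuming a Unix-style SIGALRM approach (the actual helper may use a different mechanism); `_compile_timeout_sketch` is a hypothetical stand-in name:

```python
import contextlib
import signal

from core.inference.unified_memory import UnifiedMemoryCompileTimeoutError

@contextlib.contextmanager
def _compile_timeout_sketch(timeout_s: int):
    # Unix-only sketch: SIGALRM interrupts compilation after timeout_s seconds.
    def _on_alarm(signum, frame):
        raise UnifiedMemoryCompileTimeoutError()

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout_s)
    try:
        yield
    finally:
        signal.alarm(0)                          # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
```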

core.inference.unified_memory.compile_allocator()#

Attempt to compile the UVM allocator.
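A usage sketch; it assumes a failed compile surfaces as UnifiedMemoryUnsupportedError (the module may instead only record the failure in _compilation_state and _compilation_error):

```python
from core.inference import unified_memory

try:
    unified_memory.compile_allocator()
except unified_memory.UnifiedMemoryUnsupportedError as exc:
    # Fall back to the default CUDA caching allocator.
    print(f"UVM allocator unavailable: {exc}")
```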

core.inference.unified_memory.create_unified_mempool() → torch.cuda.memory.MemPool#

Create a unified memory mempool using CUDA managed memory.

Returns:

(MemPool) Unified memory mempool.
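A usage sketch, assuming a CUDA-capable system, a successfully compiled allocator, and a recent PyTorch (2.5+) that exposes torch.cuda.use_mem_pool for routing allocations through a specific pool:

```python
import torch

from core.inference import unified_memory

pool = unified_memory.create_unified_mempool()

# Tensors allocated inside this context are backed by CUDA managed
# memory, so their pages can later be prefetched or advised.
with torch.cuda.use_mem_pool(pool):
    weights = torch.empty(1024, 1024, device="cuda")
```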

core.inference.unified_memory._get_ctypes_lib() → ctypes.CDLL#

Return a ctypes handle to the compiled UVM extension (.so).
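A sketch of one plausible lazy-loading scheme behind this helper, assuming the handle is cached under the module-level _ctypes_lock; the .so path and the sketch name are hypothetical:

```python
import ctypes
import threading

_ctypes_lock = threading.Lock()
_ctypes_lib = None
_so_path = "/tmp/uvm_alloc.so"  # hypothetical; the real path is recorded at compile time

def _get_ctypes_lib_sketch() -> ctypes.CDLL:
    global _ctypes_lib
    with _ctypes_lock:  # serialize first-time loading across threads
        if _ctypes_lib is None:
            _ctypes_lib = ctypes.CDLL(_so_path)
        return _ctypes_lib
```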

core.inference.unified_memory.prefetch_managed_tensor(tensor, *, device: int, stream=None) → None#

Prefetch a CUDA tensor allocated from the UVM mempool to a specific device.

This uses cudaMemPrefetchAsync to physically migrate the pages backing the tensor. The virtual address (pointer) remains unchanged, making this safe for use with recorded CUDA graphs.

Parameters:
  • tensor (torch.Tensor) – CUDA tensor allocated from the UVM mempool.

  • device (int) – Target device ID. Use -1 (cudaCpuDeviceId) to prefetch to CPU.

  • stream (torch.cuda.Stream, optional) – Stream to use for the asynchronous prefetch. Defaults to the current stream.
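A usage sketch, under the same assumptions as above (CUDA available, allocator compiled):

```python
import torch

from core.inference import unified_memory

pool = unified_memory.create_unified_mempool()
with torch.cuda.use_mem_pool(pool):
    t = torch.empty(4096, 4096, device="cuda")

unified_memory.prefetch_managed_tensor(t, device=0)   # migrate pages to GPU 0
unified_memory.prefetch_managed_tensor(t, device=-1)  # migrate pages back to the CPU
torch.cuda.synchronize()                              # prefetches run asynchronously
```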

core.inference.unified_memory.advise_managed_tensor_preferred_location(
tensor,
*,
device: int,
) → None#

Set the preferred physical location hint for a managed tensor.

This uses cudaMemAdviseSetPreferredLocation. It tells the CUDA driver where the pages should ideally reside. Unlike prefetch, this is a hint and does not immediately trigger migration unless the driver decides it is necessary.

Parameters:
  • tensor (torch.Tensor) – CUDA tensor allocated from the UVM mempool.

  • device (int) – Preferred device ID. Use -1 (cudaCpuDeviceId) for CPU.
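A usage sketch: prefer CPU residency for an oversized buffer so it migrates only when the driver deems it necessary (same setup assumptions as above; the buffer is illustrative):

```python
import torch

from core.inference import unified_memory

pool = unified_memory.create_unified_mempool()
with torch.cuda.use_mem_pool(pool):
    kv_cache = torch.zeros(1 << 24, device="cuda")  # hypothetical oversized buffer

# Hint only: no migration happens unless the driver decides it must.
unified_memory.advise_managed_tensor_preferred_location(kv_cache, device=-1)
```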

core.inference.unified_memory.advise_managed_tensor_accessed_by(tensor, *, device: int) → None#

Hint that a specific device will access the managed tensor.

This uses cudaMemAdviseSetAccessedBy. It ensures that the mapping for this memory region is established in the page tables of the specified device, reducing page fault latency when the device first touches the data.

Parameters:
  • tensor (torch.Tensor) – CUDA tensor allocated from the UVM mempool.

  • device (int) – Device ID that will access the tensor. Must be a GPU ID.
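A usage sketch pairing the two advice calls: keep pages preferentially on the CPU while pre-mapping them for GPU 0, so the first GPU touch avoids a page fault (same setup assumptions as above):

```python
import torch

from core.inference import unified_memory

pool = unified_memory.create_unified_mempool()
with torch.cuda.use_mem_pool(pool):
    table = torch.zeros(1 << 20, device="cuda")

unified_memory.advise_managed_tensor_preferred_location(table, device=-1)
unified_memory.advise_managed_tensor_accessed_by(table, device=0)  # GPU ID, not -1
```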

core.inference.unified_memory.prefetch_managed_module_parameters(
module,
*,
device: int,
include_buffers: bool = False,
) → int#

Prefetch all UVM-allocated parameters (and optionally buffers) of a module.

Iterates over the module's parameters (and, if requested, its buffers) and initiates an asynchronous migration to the target device. This is typically used to offload weights to the CPU during training or to prefetch them to the GPU before inference.

Parameters:
  • module (torch.nn.Module) – The module containing UVM parameters.

  • device (int) – Target device ID (-1 for CPU).

  • include_buffers (bool, optional) – Whether to also prefetch module buffers. Defaults to False.

Returns:

The total number of bytes for which prefetch was initiated.

Return type:

int
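A usage sketch of the offload/prefetch round trip for a module whose parameters were allocated from the unified mempool (assumptions as above; the Linear layer is illustrative):

```python
import torch

from core.inference import unified_memory

pool = unified_memory.create_unified_mempool()
with torch.cuda.use_mem_pool(pool):
    model = torch.nn.Linear(8192, 8192, device="cuda")

# Offload weights to the CPU, then bring them back before inference.
unified_memory.prefetch_managed_module_parameters(model, device=-1)
moved = unified_memory.prefetch_managed_module_parameters(
    model, device=0, include_buffers=True
)
print(f"prefetch initiated for {moved} bytes")
```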

core.inference.unified_memory.advise_managed_module_parameters_preferred_location(
module,
*,
device: int,
include_buffers: bool = False,
) → None#

Set the preferred physical location hint for all UVM parameters in a module.

Parameters:
  • module (torch.nn.Module) – The module containing UVM parameters.

  • device (int) – Preferred device ID (-1 for CPU).

  • include_buffers (bool, optional) – Whether to also advise on module buffers. Defaults to False.
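A usage sketch, under the same assumptions: mark every UVM parameter (and buffer) of a module as preferring CPU residency to relieve GPU memory pressure while keeping the weights accessible from the GPU:

```python
import torch

from core.inference import unified_memory

pool = unified_memory.create_unified_mempool()
with torch.cuda.use_mem_pool(pool):
    model = torch.nn.Linear(4096, 4096, device="cuda")

unified_memory.advise_managed_module_parameters_preferred_location(
    model, device=-1, include_buffers=True
)
```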