core.inference.unified_memory#

Module Contents#

Classes#

CompilationState

Enum distinguishing the unified memory (UVM) compilation states.

Functions#

_compile_timeout

Context manager that enforces a timeout on compilation.

compile_allocator

Attempt to compile the UVM allocator.

create_unified_mempool

Create a unified memory mempool using CUDA managed memory.

_get_ctypes_lib

Return a ctypes handle to the compiled UVM extension (.so).

prefetch_managed_tensor

Prefetch a CUDA tensor allocated from the UVM mempool to a specific device.

advise_managed_tensor_preferred_location

Set the preferred physical location hint for a managed tensor.

advise_managed_tensor_accessed_by

Hint that a specific device will access the managed tensor.

prefetch_managed_module_parameters

Prefetch all UVM-allocated parameters (and optionally buffers) of a module.

advise_managed_module_parameters_preferred_location

Set the preferred physical location hint for all UVM parameters in a module.

Data#

API#

class core.inference.unified_memory.CompilationState(*args, **kwds)#

Bases: enum.Enum

Enum distinguishing the unified memory (UVM) compilation states.

Initialization

UNATTEMPTED#

‘auto(…)’

FAILURE#

‘auto(…)’

SUCCESS#

‘auto(…)’
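For orientation, the members above correspond to a minimal sketch like the following (member names and `auto()` values are taken from the rendering above; the actual source may differ):

```python
import enum

class CompilationState(enum.Enum):
    UNATTEMPTED = enum.auto()  # compilation has not been tried yet
    FAILURE = enum.auto()      # a compile was attempted and failed
    SUCCESS = enum.auto()      # the UVM allocator compiled successfully
```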

exception core.inference.unified_memory.UnifiedMemoryUnsupportedError#

Bases: Exception

Unified memory is not supported on this system.

Initialization

Initialize self. See help(type(self)) for accurate signature.

exception core.inference.unified_memory.UnifiedMemoryCompileTimeoutError#

Bases: core.inference.unified_memory.UnifiedMemoryUnsupportedError

Unified memory compilation timed out.

Initialization

Initialize self. See help(type(self)) for accurate signature.

core.inference.unified_memory._compilation_state#

None

core.inference.unified_memory._alloc#

None

core.inference.unified_memory._mod#

None

core.inference.unified_memory._so_path#

None

core.inference.unified_memory._ctypes_lib#

None

core.inference.unified_memory._ctypes_lock#

‘Lock(…)’

core.inference.unified_memory._compilation_error: str | None#

None

core.inference.unified_memory._compile_timeout(timeout_s: int)#

Context manager that enforces a timeout on compilation.

Parameters:

timeout_s (int) – Timeout in seconds.
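A minimal sketch of one plausible implementation, assuming a Unix-style SIGALRM approach (the actual helper may use a different mechanism); `_compile_timeout_sketch` is a hypothetical stand-in name:

```python
import contextlib
import signal

from core.inference.unified_memory import UnifiedMemoryCompileTimeoutError

@contextlib.contextmanager
def _compile_timeout_sketch(timeout_s: int):
    # Unix-only sketch: SIGALRM interrupts compilation after timeout_s seconds.
    def _on_alarm(signum, frame):
        raise UnifiedMemoryCompileTimeoutError()

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout_s)
    try:
        yield
    finally:
        signal.alarm(0)                          # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
```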

core.inference.unified_memory.compile_allocator()#

Attempt to compile the UVM allocator.
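A usage sketch; it assumes a failed compile surfaces as UnifiedMemoryUnsupportedError (the module may instead only record the failure in _compilation_state and _compilation_error):

```python
from core.inference import unified_memory

try:
    unified_memory.compile_allocator()
except unified_memory.UnifiedMemoryUnsupportedError as exc:
    # Fall back to the default CUDA caching allocator.
    print(f"UVM allocator unavailable: {exc}")
```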

core.inference.unified_memory.create_unified_mempool() → torch.cuda.memory.MemPool#

Create a unified memory mempool using CUDA managed memory.

Returns:

(MemPool) Unified memory mempool.
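A usage sketch, assuming a CUDA-capable system, a successfully compiled allocator, and a recent PyTorch (2.5+) that exposes torch.cuda.use_mem_pool for routing allocations through a specific pool:

```python
import torch

from core.inference import unified_memory

pool = unified_memory.create_unified_mempool()

# Tensors allocated inside this context are backed by CUDA managed
# memory, so their pages can later be prefetched or advised.
with torch.cuda.use_mem_pool(pool):
    weights = torch.empty(1024, 1024, device="cuda")
```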

core.inference.unified_memory._get_ctypes_lib() → ctypes.CDLL#

Return a ctypes handle to the compiled UVM extension (.so).
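A sketch of one plausible lazy-loading scheme behind this helper, assuming the handle is cached under the module-level _ctypes_lock; the .so path and the sketch name are hypothetical:

```python
import ctypes
import threading

_ctypes_lock = threading.Lock()
_ctypes_lib = None
_so_path = "/tmp/uvm_alloc.so"  # hypothetical; the real path is recorded at compile time

def _get_ctypes_lib_sketch() -> ctypes.CDLL:
    global _ctypes_lib
    with _ctypes_lock:  # serialize first-time loading across threads
        if _ctypes_lib is None:
            _ctypes_lib = ctypes.CDLL(_so_path)
        return _ctypes_lib
```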

core.inference.unified_memory.prefetch_managed_tensor(tensor, *, device: int, stream=None) → None#

Prefetch a CUDA tensor allocated from the UVM mempool to a specific device.

This uses cudaMemPrefetchAsync to physically migrate the pages backing the tensor. The virtual address (pointer) remains unchanged, making this safe for use with recorded CUDA graphs.

Parameters:
  • tensor (torch.Tensor) – CUDA tensor allocated from the UVM mempool.

  • device (int) – Target device ID. Use -1 (cudaCpuDeviceId) to prefetch to CPU.

  • stream (torch.cuda.Stream, optional) – Stream to use for the asynchronous prefetch. Defaults to the current stream.
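A usage sketch, under the same assumptions as above (CUDA available, allocator compiled):

```python
import torch

from core.inference import unified_memory

pool = unified_memory.create_unified_mempool()
with torch.cuda.use_mem_pool(pool):
    t = torch.empty(4096, 4096, device="cuda")

unified_memory.prefetch_managed_tensor(t, device=0)   # migrate pages to GPU 0
unified_memory.prefetch_managed_tensor(t, device=-1)  # migrate pages back to the CPU
torch.cuda.synchronize()                              # prefetches run asynchronously
```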

core.inference.unified_memory.advise_managed_tensor_preferred_location(
tensor,
*,
device: int,
) → None#

Set the preferred physical location hint for a managed tensor.

This uses cudaMemAdviseSetPreferredLocation. It tells the CUDA driver where the pages should ideally reside. Unlike prefetch, this is a hint and does not immediately trigger migration unless the driver decides it is necessary.

Parameters:
  • tensor (torch.Tensor) – CUDA tensor allocated from the UVM mempool.

  • device (int) – Preferred device ID. Use -1 (cudaCpuDeviceId) for CPU.
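A usage sketch: prefer CPU residency for an oversized buffer so it migrates only when the driver deems it necessary (same setup assumptions as above; the buffer is illustrative):

```python
import torch

from core.inference import unified_memory

pool = unified_memory.create_unified_mempool()
with torch.cuda.use_mem_pool(pool):
    kv_cache = torch.zeros(1 << 24, device="cuda")  # hypothetical oversized buffer

# Hint only: no migration happens unless the driver decides it must.
unified_memory.advise_managed_tensor_preferred_location(kv_cache, device=-1)
```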

core.inference.unified_memory.advise_managed_tensor_accessed_by(tensor, *, device: int) → None#

Hint that a specific device will access the managed tensor.

This uses cudaMemAdviseSetAccessedBy. It ensures that the mapping for this memory region is established in the page tables of the specified device, reducing page fault latency when the device first touches the data.

Parameters:
  • tensor (torch.Tensor) – CUDA tensor allocated from the UVM mempool.

  • device (int) – Device ID that will access the tensor. Must be a GPU ID.
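A usage sketch pairing the two advice calls: keep pages preferentially on the CPU while pre-mapping them for GPU 0, so the first GPU touch avoids a page fault (same setup assumptions as above):

```python
import torch

from core.inference import unified_memory

pool = unified_memory.create_unified_mempool()
with torch.cuda.use_mem_pool(pool):
    table = torch.zeros(1 << 20, device="cuda")

unified_memory.advise_managed_tensor_preferred_location(table, device=-1)
unified_memory.advise_managed_tensor_accessed_by(table, device=0)  # GPU ID, not -1
```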

core.inference.unified_memory.prefetch_managed_module_parameters(
module,
*,
device: int,
include_buffers: bool = False,
) → int#

Prefetch all UVM-allocated parameters (and optionally buffers) of a module.

Iterates over the module's parameters (and, if requested, its buffers) and initiates an asynchronous migration to the target device. This is typically used to offload weights to the CPU during training or to prefetch them to the GPU before inference.

Parameters:
  • module (torch.nn.Module) – The module containing UVM parameters.

  • device (int) – Target device ID (-1 for CPU).

  • include_buffers (bool, optional) – Whether to also prefetch module buffers. Defaults to False.

Returns:

The total number of bytes for which prefetch was initiated.

Return type:

int
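A usage sketch of the offload/prefetch round trip for a module whose parameters were allocated from the unified mempool (assumptions as above; the Linear layer is illustrative):

```python
import torch

from core.inference import unified_memory

pool = unified_memory.create_unified_mempool()
with torch.cuda.use_mem_pool(pool):
    model = torch.nn.Linear(8192, 8192, device="cuda")

# Offload weights to the CPU, then bring them back before inference.
unified_memory.prefetch_managed_module_parameters(model, device=-1)
moved = unified_memory.prefetch_managed_module_parameters(
    model, device=0, include_buffers=True
)
print(f"prefetch initiated for {moved} bytes")
```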

core.inference.unified_memory.advise_managed_module_parameters_preferred_location(
module,
*,
device: int,
include_buffers: bool = False,
) → None#

Set the preferred physical location hint for all UVM parameters in a module.

Parameters:
  • module (torch.nn.Module) – The module containing UVM parameters.

  • device (int) – Preferred device ID (-1 for CPU).

  • include_buffers (bool, optional) – Whether to also advise on module buffers. Defaults to False.
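A usage sketch, under the same assumptions: mark every UVM parameter (and buffer) of a module as preferring CPU residency to relieve GPU memory pressure while keeping the weights accessible from the GPU:

```python
import torch

from core.inference import unified_memory

pool = unified_memory.create_unified_mempool()
with torch.cuda.use_mem_pool(pool):
    model = torch.nn.Linear(4096, 4096, device="cuda")

unified_memory.advise_managed_module_parameters_preferred_location(
    model, device=-1, include_buffers=True
)
```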