core.resharding.nvshmem_copy_service.core.gpu_resource_manager#

GPU resource management for NVSHMEM operations.

Handles NVSHMEM initialization, CUDA device setup, stream management, and event lifecycle.

Module Contents#

Classes#

GPUResourceManager

Manages GPU resources including NVSHMEM, streams, and events.

Data#

logger

API#

core.resharding.nvshmem_copy_service.core.gpu_resource_manager.logger#

‘getLogger(…)’

class core.resharding.nvshmem_copy_service.core.gpu_resource_manager.GPUResourceManager#

Manages GPU resources including NVSHMEM, streams, and events.

Initialization

init() → None#

Initialize NVSHMEM, CUDA device, and streams.

Expects torch.distributed to be already initialized.
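The ordering requirement above can be guarded explicitly. The following is a minimal sketch, not part of this module: `init_when_distributed_ready` is a hypothetical helper, and `dist` is assumed to be the `torch.distributed` module (any object exposing `is_initialized()` works).

```python
def init_when_distributed_ready(manager, dist):
    """Call manager.init() only once the process group exists.

    Hypothetical helper: `manager` is an already constructed
    GPUResourceManager and `dist` is assumed to be torch.distributed.
    """
    if not dist.is_initialized():
        raise RuntimeError(
            "torch.distributed must be initialized before GPUResourceManager.init()"
        )
    manager.init()
    return manager
```

In a real job this would typically run after `torch.distributed.init_process_group(...)`, passing `torch.distributed` as `dist`.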

get_stream(name: str)#

Get CUDA stream by name.

Parameters:

name – Stream name; one of ‘pack’, ‘unpack’, ‘send’, or ‘copy’

Returns:

CUDA stream object

get_torch_stream(name: str) → Optional[torch.cuda.ExternalStream]#

Get PyTorch ExternalStream by name.

Parameters:

name – Stream name; one of ‘pack’, ‘unpack’, ‘send’, or ‘copy’

Returns:

PyTorch ExternalStream for the named stream. The Optional return type means callers should be prepared for None.
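Because the return type is Optional, callers may want to fail fast rather than pass None along. A minimal sketch under stated assumptions: `require_torch_stream` and `KNOWN_STREAMS` are hypothetical names introduced here, and the conditions under which the manager returns None are not documented on this page.

```python
# Stream names as listed in the parameter description above.
KNOWN_STREAMS = ("pack", "unpack", "send", "copy")


def require_torch_stream(manager, name):
    """Return the named stream, raising instead of returning None.

    Hypothetical wrapper around GPUResourceManager.get_torch_stream().
    """
    if name not in KNOWN_STREAMS:
        raise ValueError(
            f"unknown stream name {name!r}; expected one of {KNOWN_STREAMS}"
        )
    stream = manager.get_torch_stream(name)
    if stream is None:
        # Assumption: None signals that the stream has not been set up.
        raise RuntimeError(f"stream {name!r} is not available")
    return stream
```

The result can then be made current with PyTorch's stream context manager, for example `with torch.cuda.stream(require_torch_stream(manager, "pack")): ...`.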

create_events(num_events: int = 2)#

Create double-buffered CUDA events for pack and unpack operations.

Parameters:

num_events – Number of events to create for each type (default: 2 for double buffering)

Returns:

(pack_events, unpack_events), where each element is a list of num_events torch.cuda.Event objects

Return type:

tuple
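One common way to use a double-buffered event list is to rotate through its slots and wait on a slot's previous event before reusing it. The sketch below illustrates that general pattern only; it is not taken from the copy service. `run_double_buffered` and `launch` are hypothetical names, and the events only need `record()` and `synchronize()`, which `torch.cuda.Event` provides.

```python
def run_double_buffered(events, num_iterations, launch):
    """Rotate through `events`, using one slot per iteration.

    Before a slot is reused, its previous event is synchronized so the
    work that last used that slot is known to have finished.
    Returns the slot index used by each iteration.
    """
    slots_used = []
    for i in range(num_iterations):
        slot = i % len(events)
        if i >= len(events):
            events[slot].synchronize()  # wait for the earlier use of this slot
        launch(i, slot)                 # enqueue the work for this iteration
        events[slot].record()           # mark the point after that work
        slots_used.append(slot)
    return slots_used
```

With a real manager, `pack_events, unpack_events = manager.create_events()` would supply the event lists, and `launch` would enqueue the pack or unpack work for the given slot.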

finalize() → None#

Clean up resources (streams are automatically managed by CUDA).
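To make sure finalize() runs even when an error occurs, the init()/finalize() pair can be wrapped in a context manager. This is a minimal sketch; `managed_gpu_resources` is a hypothetical helper, not something this module provides.

```python
from contextlib import contextmanager


@contextmanager
def managed_gpu_resources(manager):
    """Pair manager.init() with manager.finalize().

    Hypothetical helper: finalize() runs on exit from the `with` block,
    including when the body raises. If init() itself raises, finalize()
    is not called.
    """
    manager.init()
    try:
        yield manager
    finally:
        manager.finalize()
```

Usage would look like `with managed_gpu_resources(manager) as m: ...`, with torch.distributed already initialized as init() expects.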