core.resharding.nvshmem_copy_service.core.gpu_resource_manager#
GPU resource management for NVSHMEM operations.
Handles NVSHMEM initialization, CUDA device setup, stream management, and event lifecycle.
Module Contents#
Classes#
GPUResourceManager | Manages GPU resources including NVSHMEM, streams, and events.
Data#
API#
- core.resharding.nvshmem_copy_service.core.gpu_resource_manager.logger#
‘getLogger(…)’
- class core.resharding.nvshmem_copy_service.core.gpu_resource_manager.GPUResourceManager#
Manages GPU resources including NVSHMEM, streams, and events.
Initialization
- init() None#
Initialize NVSHMEM, CUDA device, and streams.
torch.distributed must already be initialized before this method is called.
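Because init() depends on an existing torch.distributed process group, it can help to fail early with a clear message when that precondition is not met. The following is a minimal sketch, not part of this module: `ensure_distributed_ready` is a hypothetical helper, and it takes the `torch.distributed` module as an argument so the check can be exercised without a GPU.

```python
def ensure_distributed_ready(dist) -> None:
    """Raise RuntimeError unless the given distributed module is initialized.

    `dist` is expected to be `torch.distributed` (or any object exposing
    `is_available()` and `is_initialized()`). This helper is hypothetical
    and is not provided by gpu_resource_manager.
    """
    if not dist.is_available():
        raise RuntimeError("torch.distributed is not available in this build")
    if not dist.is_initialized():
        raise RuntimeError(
            "torch.distributed must be initialized before GPUResourceManager.init()"
        )


# Intended usage (assumes a process group was created elsewhere):
#   import torch.distributed as dist
#   ensure_distributed_ready(dist)
#   manager.init()
```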
- get_stream(name: str)#
Get CUDA stream by name.
- Parameters:
name – Stream name (‘pack’, ‘unpack’, ‘send’, ‘copy’)
- Returns:
CUDA stream object
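The documentation lists four stream names but does not say what happens when an unknown name is passed. As a sketch under that uncertainty, a caller could validate the name before the lookup. `STREAM_NAMES` and `checked_stream` are hypothetical names introduced here for illustration; only `get_stream(name)` comes from the documented interface.

```python
# The four names listed in the get_stream documentation.
STREAM_NAMES = ("pack", "unpack", "send", "copy")


def checked_stream(manager, name: str):
    """Return manager.get_stream(name) after validating the name.

    Hypothetical wrapper: the manager's own behavior for unknown names
    is not documented, so this raises ValueError before delegating.
    """
    if name not in STREAM_NAMES:
        raise ValueError(
            f"unknown stream name {name!r}; expected one of {STREAM_NAMES}"
        )
    return manager.get_stream(name)
```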
- get_torch_stream(name: str)#
Get PyTorch ExternalStream by name.
- Parameters:
name – Stream name (‘pack’, ‘unpack’, ‘send’, ‘copy’)
- Returns:
PyTorch ExternalStream
- create_events(num_events: int = 2)#
Create double-buffered CUDA events for pack and unpack operations.
- Parameters:
num_events – Number of events to create for each type (default: 2 for double buffering)
- Returns:
A tuple (pack_events, unpack_events), where each element is a list of torch.cuda.Event
- Return type:
tuple
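With the default of two events per type, double buffering typically means alternating between the two slots on successive iterations, so that one buffer can be in flight while the other is being prepared. How the copy service actually indexes its events is not documented here; the sketch below assumes a simple round-robin choice. `slot_for` is a hypothetical helper, and the string placeholders stand in for the torch.cuda.Event objects so the snippet runs without a GPU.

```python
def slot_for(iteration: int, num_events: int = 2) -> int:
    """Pick the event slot for an iteration by round-robin (an assumption)."""
    if num_events <= 0:
        raise ValueError("num_events must be positive")
    return iteration % num_events


# Placeholders for the list returned as pack_events by create_events().
pack_events = ["pack_event_0", "pack_event_1"]

# Iterations 0..3 alternate between the two slots.
schedule = [pack_events[slot_for(i, len(pack_events))] for i in range(4)]
```

In real use, the selected event would be recorded on the corresponding stream after the pack step and waited on before the buffer is reused.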
- finalize() None#
Clean up resources (streams are automatically managed by CUDA).
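Since init() and finalize() bracket the manager's lifetime, one way to make sure cleanup runs even when the work in between fails is a try/finally wrapper. This is a sketch under stated assumptions: `run_with_manager` is a hypothetical helper, and it relies only on the documented `init()` and `finalize()` methods.

```python
def run_with_manager(manager, work):
    """Call manager.init(), run work(manager), and always call finalize().

    Hypothetical helper. If init() itself raises, finalize() is not
    called, because no resources are assumed to have been acquired.
    """
    manager.init()
    try:
        return work(manager)
    finally:
        manager.finalize()


# Intended usage (assumes torch.distributed is already initialized):
#   manager = GPUResourceManager()
#   run_with_manager(manager, lambda m: m.create_events())
```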