nemo_automodel.components.moe.uccl_ep._utils#
Module Contents#
Classes#

- EventOverlap — A wrapper class to manage CUDA events, also for better overlapping convenience.

Functions#

- check_nvlink_connections — Check NVLink connection between every pair of GPUs.

- detect_ib_hca — Detect InfiniBand/RDMA HCA device.
API#
- nemo_automodel.components.moe.uccl_ep._utils.calc_diff(x: torch.Tensor, y: torch.Tensor)#
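The source does not show `calc_diff`'s body. Helpers with this name in expert-parallel test utilities commonly compute a cosine-similarity-style relative error between two tensors; the pure-Python sketch below illustrates that kind of metric on flat sequences (the exact formula is an assumption, not the library's implementation):

```python
def calc_diff_sketch(x, y):
    """Illustrative relative-difference metric between two equal-length
    sequences: 1 - (2*<x, y> / (<x, x> + <y, y>)). This formula is an
    assumption about the metric, not the library's actual code."""
    num = 2.0 * sum(a * b for a, b in zip(x, y))
    den = sum(a * a for a in x) + sum(b * b for b in y)
    return 1.0 - num / den

# Identical inputs give zero difference; orthogonal inputs give 1.0.
print(calc_diff_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # → 0.0
```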
- nemo_automodel.components.moe.uccl_ep._utils.hash_tensor(t: torch.Tensor)#
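`hash_tensor`'s body is likewise not shown. A typical way to fingerprint a tensor is to hash its raw byte storage; the sketch below does this over a plain float buffer with `hashlib` (the name `hash_buffer_sketch` and the choice of SHA-256 are illustrative assumptions):

```python
import hashlib
import struct

def hash_buffer_sketch(values):
    """Hypothetical analogue of hash_tensor: fingerprint the raw bytes
    of a float buffer. The real helper presumably operates on a
    torch.Tensor's storage; here we pack plain floats for illustration."""
    raw = struct.pack(f"{len(values)}f", *values)
    return hashlib.sha256(raw).hexdigest()

digest = hash_buffer_sketch([1.0, 2.0, 3.0])
print(digest[:16])  # stable 64-hex-character digest, truncated for display
```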
- nemo_automodel.components.moe.uccl_ep._utils.init_dist(local_rank: int, num_local_ranks: int)#
- nemo_automodel.components.moe.uccl_ep._utils.init_dist_under_torchrun(local_rank: int, num_local_ranks: int)#
- nemo_automodel.components.moe.uccl_ep._utils._gather_peer_ips(group)#
- nemo_automodel.components.moe.uccl_ep._utils.get_peer_ip(
- rank: int,
- num_ranks: int,
- group: torch.distributed.ProcessGroup,
- )#
- nemo_automodel.components.moe.uccl_ep._utils.get_cpu_proxies_meta(
- proxies,
- rank,
- scratch_ptr,
- scratch_bytes,
- num_ranks,
- group,
- )#
- nemo_automodel.components.moe.uccl_ep._utils.check_nvlink_connections(group: torch.distributed.ProcessGroup)#
Check NVLink connection between every pair of GPUs.
- Parameters:
group – the communication group.
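Outside this helper, pairwise NVLink connectivity can also be inspected with the matrix printed by `nvidia-smi topo -m`, where `NV#` entries mark NVLink paths. The sketch below parses such a matrix from a hypothetical sample string; on a real node you would capture the command's output instead (the parsing logic is illustrative, not the library's method):

```python
# Hypothetical two-GPU topology matrix in the `nvidia-smi topo -m` format:
# 'X' marks a device against itself, 'NV12' a 12-link NVLink connection.
SAMPLE_TOPO = "\tGPU0\tGPU1\nGPU0\tX\tNV12\nGPU1\tNV12\tX\n"

def all_pairs_nvlinked(topo_text):
    """Return True when every off-diagonal entry is an NVLink ('NV*') path."""
    rows = [line.split("\t") for line in topo_text.strip().splitlines()[1:]]
    for row in rows:
        for cell in row[1:]:
            if cell != "X" and not cell.startswith("NV"):
                return False
    return True

print(all_pairs_nvlinked(SAMPLE_TOPO))  # → True
```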
- class nemo_automodel.components.moe.uccl_ep._utils.EventOverlap(
- event: Optional[uccl.ep.EventHandle] = None,
- extra_tensors: Optional[Tuple[torch.Tensor]] = None,
- )#
A wrapper class to manage CUDA events, also for better overlapping convenience.
.. attribute:: event

   The CUDA event captured.

.. attribute:: extra_tensors

   An easier way to simulate the PyTorch tensor ``record_stream`` behavior; may be useful with CUDA graphs.

Initialization

Initialize the class.

- Parameters:
  event – the CUDA event captured.
  extra_tensors – an easier way to simulate the PyTorch tensor ``record_stream`` behavior; may be useful with CUDA graphs.
- current_stream_wait() → None#
The current stream (``torch.cuda.current_stream()``) waits for the event to finish.
- __enter__() → Any#
Utility for overlapping and the Python ``with`` syntax. You can overlap the kernels on the current stream with the following example:

```python
event_overlap = event_after_all_to_all_kernels()
with event_overlap():
    do_something_on_current_stream()
# After exiting the `with` scope, the current stream will wait for the event to finish.
```
- __exit__(
- exc_type: Any,
- exc_val: Any,
- exc_tb: Any,
- )#
Utility for overlapping and the Python ``with`` syntax. Please follow the example in the ``__enter__`` function.
- nemo_automodel.components.moe.uccl_ep._utils.detect_ib_hca()#
Detect InfiniBand/RDMA HCA device.
Returns the first RDMA device name found (mlx5 for Mellanox, irdma for Intel), or None if no InfiniBand devices are available.
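On Linux, RDMA-capable HCAs are exposed under `/sys/class/infiniband`, so detection of this kind can be sketched as a directory listing. The helper's actual logic may differ; the function name below is hypothetical:

```python
import os

def detect_ib_hca_sketch():
    """Sketch of RDMA device detection via sysfs: list
    /sys/class/infiniband and return the first device name
    (e.g. 'mlx5_0'), or None when the directory is absent or empty."""
    ib_dir = "/sys/class/infiniband"
    try:
        devices = sorted(os.listdir(ib_dir))
    except FileNotFoundError:
        return None
    return devices[0] if devices else None

print(detect_ib_hca_sketch())
```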
- nemo_automodel.components.moe.uccl_ep._utils.per_token_cast_back(x_fp8: torch.Tensor, x_scales: torch.Tensor)#
- class nemo_automodel.components.moe.uccl_ep._utils.suppress_stdout_stderr#
- __enter__()#
- __exit__(*_)#
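The body of `suppress_stdout_stderr` is not shown; the common pattern for such context managers is file-descriptor-level redirection, which silences output from C extensions as well as Python. A self-contained sketch of that pattern (the class name is hypothetical, and this is an assumption about the technique, not the library's code):

```python
import os
import sys

class SuppressOutputSketch:
    """Sketch of fd-level stdout/stderr suppression: duplicate the
    original fds, point 1 and 2 at /dev/null, restore them on exit."""

    def __enter__(self):
        sys.stdout.flush(); sys.stderr.flush()
        self._devnull = os.open(os.devnull, os.O_WRONLY)
        self._saved = (os.dup(1), os.dup(2))
        os.dup2(self._devnull, 1)
        os.dup2(self._devnull, 2)
        return self

    def __exit__(self, *_):
        # Flush buffered writes while fds still point at /dev/null,
        # so suppressed text cannot leak after restoration.
        sys.stdout.flush(); sys.stderr.flush()
        os.dup2(self._saved[0], 1)
        os.dup2(self._saved[1], 2)
        for fd in (*self._saved, self._devnull):
            os.close(fd)

with SuppressOutputSketch():
    print("this line is swallowed")
print("visible again")
```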
- nemo_automodel.components.moe.uccl_ep._utils.bench(fn, num_warmups: int = 50, num_tests: int = 50, post_fn=None)#
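The `bench` signature suggests a warmup-then-time loop with an optional per-iteration callback. A CPU-side analogue using `time.perf_counter` (the real helper presumably synchronizes CUDA streams around the timed region, which this sketch omits):

```python
import time

def bench_sketch(fn, num_warmups=50, num_tests=50, post_fn=None):
    """Hypothetical CPU-side analogue of bench(): warm up, then time
    num_tests invocations and return mean seconds per call."""
    for _ in range(num_warmups):
        fn()
    start = time.perf_counter()
    for _ in range(num_tests):
        fn()
        if post_fn is not None:
            post_fn()
    return (time.perf_counter() - start) / num_tests

avg = bench_sketch(lambda: sum(range(1000)), num_warmups=5, num_tests=10)
print(f"{avg * 1e6:.1f} us/call")
```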
- nemo_automodel.components.moe.uccl_ep._utils.bench_kineto(
- fn,
- kernel_names: Union[str, tuple],
- num_tests: int = 30,
- suppress_kineto_output: bool = False,
- trace_path: Optional[str] = None,
- barrier_comm_profiling: bool = False,
- num_kernels_per_period: int = 1,
- )#
- nemo_automodel.components.moe.uccl_ep._utils.initialize_uccl(
- scratch_ptr,
- scratch_nbytes,
- rank,
- num_ranks,
- group,
- num_experts=0,
- is_intranode=False,
- use_normal_mode=False,
- rdma_buffer_is_host_allocated=False,
- )#
- nemo_automodel.components.moe.uccl_ep._utils.destroy_uccl(proxies, workers)#
- nemo_automodel.components.moe.uccl_ep._utils.per_token_cast_to_fp8(x: torch.Tensor)#
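`per_token_cast_to_fp8` and `per_token_cast_back` form a quantize/dequantize pair: low-precision values plus per-group scales, reconstructed by multiplying back. Since plain Python has no FP8 type, the sketch below uses int8-range values to illustrate the same scaled round-trip; the group size and scaling scheme are assumptions, not the library's code:

```python
def cast_to_scaled_int8(x, group=4):
    """Illustrative per-token scaled quantization (int8 stands in for
    FP8): each group of `group` values shares one scale = max(|v|)/127."""
    out, scales = [], []
    for i in range(0, len(x), group):
        chunk = x[i:i + group]
        scale = max(abs(v) for v in chunk) / 127.0 or 1.0  # avoid /0 on all-zero chunks
        scales.append(scale)
        out.extend(round(v / scale) for v in chunk)
    return out, scales

def cast_back(q, scales, group=4):
    """Inverse of the sketch above, mirroring per_token_cast_back."""
    return [v * scales[i // group] for i, v in enumerate(q)]

x = [0.5, -1.0, 2.0, 0.25]
q, s = cast_to_scaled_int8(x)
recon = cast_back(q, s)
print(max(abs(a - b) for a, b in zip(x, recon)))  # small quantization error
```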
- nemo_automodel.components.moe.uccl_ep._utils.create_grouped_scores(
- scores: torch.Tensor,
- group_idx: torch.Tensor,
- num_groups: int,
- )#
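The `create_grouped_scores` signature matches group-limited MoE routing, where each token's expert scores are masked down to its selected expert group. A pure-Python sketch of that idea, assuming experts are split evenly across contiguous groups (the semantics are inferred from the signature, not confirmed by the source):

```python
def create_grouped_scores_sketch(scores, group_idx, num_groups):
    """Sketch of group-limited gating: keep only the scores whose expert
    index falls inside each token's selected group, zeroing the rest.
    `scores` is a per-token list over experts; `group_idx` gives the
    kept group per token."""
    experts_per_group = len(scores[0]) // num_groups
    out = []
    for row, g in zip(scores, group_idx):
        lo, hi = g * experts_per_group, (g + 1) * experts_per_group
        out.append([v if lo <= j < hi else 0.0 for j, v in enumerate(row)])
    return out

# Token selects group 1 of 2, so only experts 2 and 3 keep their scores.
print(create_grouped_scores_sketch([[0.1, 0.2, 0.3, 0.4]], [1], num_groups=2))
# → [[0.0, 0.0, 0.3, 0.4]]
```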
- nemo_automodel.components.moe.uccl_ep._utils.inplace_unique(x: torch.Tensor, num_slots: int)#
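`inplace_unique(x, num_slots)` reads like per-row deduplication into a bounded slot count, a common step when compacting routed expert indices. The sketch below assumes values lie in `[0, num_slots)`, keeps first occurrences, and pads with -1; this is an assumed semantics, not the library's kernel:

```python
def inplace_unique_sketch(row, num_slots):
    """Sketch: deduplicate one row of indices using a num_slots-wide
    flag array, preserving first-occurrence order and padding with -1."""
    seen = [False] * num_slots  # one flag per possible value
    out = []
    for v in row:
        if 0 <= v < num_slots and not seen[v]:
            seen[v] = True
            out.append(v)
    out += [-1] * (len(row) - len(out))
    return out

print(inplace_unique_sketch([3, 1, 3, 2, 1, 0], num_slots=4))  # → [3, 1, 2, 0, -1, -1]
```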