nemo_automodel.components.moe.uccl_ep._utils#

Module Contents#

Classes#

EventOverlap

A wrapper class for managing CUDA events, with conveniences for overlapping work across streams.

empty_suppress

suppress_stdout_stderr

Functions#

API#

nemo_automodel.components.moe.uccl_ep._utils.calc_diff(x: torch.Tensor, y: torch.Tensor)#
nemo_automodel.components.moe.uccl_ep._utils.hash_tensor(t: torch.Tensor)#
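The implementation of calc_diff is not shown here; a common definition for this kind of relative-difference metric (used to compare a computed tensor against a reference) is based on a normalized inner product, where 0 means identical and 2 means exactly opposite. A minimal sketch under that assumption (calc_diff_sketch is a hypothetical name, not the module's function):

```python
import torch

def calc_diff_sketch(x: torch.Tensor, y: torch.Tensor) -> float:
    # Relative difference via a normalized inner product:
    # 0.0 means identical tensors, larger values mean more disagreement.
    x, y = x.double(), y.double()
    denominator = (x * x + y * y).sum()
    sim = 2 * (x * y).sum() / denominator
    return (1 - sim).item()

a = torch.arange(6, dtype=torch.float32).reshape(2, 3)
print(calc_diff_sketch(a, a.clone()))  # identical tensors -> ~0.0
```

This form is scale-aware but cheap to compute, which makes it convenient for validating kernel outputs against a reference implementation.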
nemo_automodel.components.moe.uccl_ep._utils.init_dist(local_rank: int, num_local_ranks: int)#
nemo_automodel.components.moe.uccl_ep._utils.init_dist_under_torchrun(local_rank: int, num_local_ranks: int)#
nemo_automodel.components.moe.uccl_ep._utils._gather_peer_ips(group)#
nemo_automodel.components.moe.uccl_ep._utils.get_peer_ip(
rank: int,
num_ranks: int,
group: torch.distributed.ProcessGroup,
)#
nemo_automodel.components.moe.uccl_ep._utils.get_cpu_proxies_meta(
proxies,
rank,
scratch_ptr,
scratch_bytes,
num_ranks,
group,
)#

Check NVLink connection between every pair of GPUs.

Parameters:

group – the communication group.

class nemo_automodel.components.moe.uccl_ep._utils.EventOverlap(
event: Optional[uccl.ep.EventHandle] = None,
extra_tensors: Optional[Tuple[torch.Tensor]] = None,
)#

A wrapper class for managing CUDA events, with conveniences for overlapping work across streams.

Attributes:
  • event – the CUDA event captured.

  • extra_tensors – a convenient way to simulate PyTorch's tensor record_stream; may be useful with CUDA graphs.

Initialization

Initialize the class.

Parameters:
  • event – the CUDA event captured.

  • extra_tensors – a convenient way to simulate PyTorch's tensor record_stream; may be useful with CUDA graphs.

current_stream_wait() None#

Makes the current stream (torch.cuda.current_stream()) wait for the event to finish.

__enter__() Any#

Utility for overlapping, using Python's with syntax.

You can overlap the kernels on the current stream with the following example:

event_overlap = event_after_all_to_all_kernels()
with event_overlap:
    do_something_on_current_stream()
# After exiting the `with` scope, the current stream will wait for the event to finish.
__exit__(
exc_type: Any,
exc_val: Any,
exc_tb: Any,
) None#

Utility for overlapping, using Python's with syntax.

Please follow the example in the __enter__ function.

nemo_automodel.components.moe.uccl_ep._utils.detect_ib_hca()#

Detect InfiniBand/RDMA HCA device.

Returns the first RDMA device name found (mlx5 for Mellanox, irdma for Intel), or None if no InfiniBand devices are available.
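One plausible detection strategy (an assumption about the approach, not the actual implementation) is to list the RDMA devices the kernel exposes under /sys/class/infiniband and return the first recognized one. A self-contained sketch (detect_ib_hca_sketch is a hypothetical name):

```python
import os
from typing import Optional

def detect_ib_hca_sketch() -> Optional[str]:
    # RDMA-capable devices appear as entries under this sysfs directory.
    sysfs = "/sys/class/infiniband"
    if not os.path.isdir(sysfs):
        return None
    for dev in sorted(os.listdir(sysfs)):
        # Mellanox NICs show up as mlx5_*, Intel NICs as irdma*.
        if dev.startswith(("mlx5", "irdma")):
            return dev
    return None

print(detect_ib_hca_sketch())  # e.g. "mlx5_0", or None on hosts without RDMA hardware
```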

nemo_automodel.components.moe.uccl_ep._utils.per_token_cast_back(x_fp8: torch.Tensor, x_scales: torch.Tensor)#
class nemo_automodel.components.moe.uccl_ep._utils.empty_suppress#
__enter__()#
__exit__(*_)#
class nemo_automodel.components.moe.uccl_ep._utils.suppress_stdout_stderr#
__enter__()#
__exit__(*_)#
nemo_automodel.components.moe.uccl_ep._utils.bench(fn, num_warmups: int = 50, num_tests: int = 50, post_fn=None)#
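The general shape of a bench helper like this is warmup iterations followed by timed iterations, with an optional post_fn hook run after each test. A host-timer sketch of that pattern (the real helper presumably synchronizes with CUDA events when timing GPU kernels; this simplified version is an assumption):

```python
import time

def bench_sketch(fn, num_warmups: int = 50, num_tests: int = 50, post_fn=None) -> float:
    # Warm up caches, JIT, allocator, etc. before measuring.
    for _ in range(num_warmups):
        fn()
    start = time.perf_counter()
    for _ in range(num_tests):
        fn()
        if post_fn is not None:
            post_fn()  # e.g. validate or reset state between iterations
    return (time.perf_counter() - start) / num_tests  # average seconds per call

avg = bench_sketch(lambda: sum(range(1000)), num_warmups=5, num_tests=10)
print(f"{avg * 1e6:.1f} us per call")
```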
nemo_automodel.components.moe.uccl_ep._utils.bench_kineto(
fn,
kernel_names: Union[str, tuple],
num_tests: int = 30,
suppress_kineto_output: bool = False,
trace_path: Optional[str] = None,
barrier_comm_profiling: bool = False,
num_kernels_per_period: int = 1,
)#
nemo_automodel.components.moe.uccl_ep._utils.initialize_uccl(
scratch_ptr,
scratch_nbytes,
rank,
num_ranks,
group,
num_experts=0,
is_intranode=False,
use_normal_mode=False,
rdma_buffer_is_host_allocated=False,
)#
nemo_automodel.components.moe.uccl_ep._utils.destroy_uccl(proxies, workers)#
nemo_automodel.components.moe.uccl_ep._utils.per_token_cast_to_fp8(x: torch.Tensor)#
nemo_automodel.components.moe.uccl_ep._utils.create_grouped_scores(
scores: torch.Tensor,
group_idx: torch.Tensor,
num_groups: int,
)#
nemo_automodel.components.moe.uccl_ep._utils.inplace_unique(x: torch.Tensor, num_slots: int)#
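In group-limited MoE routing, a helper like create_grouped_scores typically zeroes out the scores of experts that fall outside each token's selected groups, so a subsequent top-k only picks experts from those groups. A sketch of that assumed semantics, treating experts as num_groups contiguous groups (the body is illustrative, not the module's implementation):

```python
import torch

def create_grouped_scores_sketch(scores: torch.Tensor, group_idx: torch.Tensor, num_groups: int) -> torch.Tensor:
    # scores: (num_tokens, num_experts); group_idx: selected group ids per token.
    num_tokens, num_experts = scores.shape
    grouped = scores.view(num_tokens, num_groups, -1)
    mask = torch.zeros(num_tokens, num_groups, dtype=torch.bool)
    mask.scatter_(1, group_idx, True)  # mark each token's chosen groups
    # Zero the scores of experts in unselected groups.
    return (grouped * mask.unsqueeze(-1)).view(num_tokens, num_experts)

scores = torch.tensor([[0.1, 0.2, 0.3, 0.4]])
out = create_grouped_scores_sketch(scores, torch.tensor([[0]]), num_groups=2)
print(out)  # scores in the unselected group (experts 2, 3) are zeroed
```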