nemo_automodel.components.moe.uccl_ep._utils#

Module Contents#

Classes#

EventOverlap

A wrapper class for managing CUDA events, with conveniences for overlapping work across streams.

empty_suppress

suppress_stdout_stderr

Functions#

API#

nemo_automodel.components.moe.uccl_ep._utils.calc_diff(x: torch.Tensor, y: torch.Tensor)#
nemo_automodel.components.moe.uccl_ep._utils.hash_tensor(t: torch.Tensor)#
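The implementation of calc_diff is not shown here; a common definition for this kind of relative-difference metric (used to compare a computed tensor against a reference) is based on a normalized inner product, where 0 means identical and 2 means exactly opposite. A minimal sketch under that assumption (calc_diff_sketch is a hypothetical name, not the module's function):

```python
import torch

def calc_diff_sketch(x: torch.Tensor, y: torch.Tensor) -> float:
    # Relative difference via a normalized inner product:
    # 0.0 means identical tensors, larger values mean more disagreement.
    x, y = x.double(), y.double()
    denominator = (x * x + y * y).sum()
    sim = 2 * (x * y).sum() / denominator
    return (1 - sim).item()

a = torch.arange(6, dtype=torch.float32).reshape(2, 3)
print(calc_diff_sketch(a, a.clone()))  # identical tensors -> ~0.0
```

This form is scale-aware but cheap to compute, which makes it convenient for validating kernel outputs against a reference implementation.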
nemo_automodel.components.moe.uccl_ep._utils.init_dist(local_rank: int, num_local_ranks: int)#
nemo_automodel.components.moe.uccl_ep._utils.init_dist_under_torchrun(local_rank: int, num_local_ranks: int)#
nemo_automodel.components.moe.uccl_ep._utils._gather_peer_ips(group)#
nemo_automodel.components.moe.uccl_ep._utils.get_peer_ip(
rank: int,
num_ranks: int,
group: torch.distributed.ProcessGroup,
)#
nemo_automodel.components.moe.uccl_ep._utils.get_cpu_proxies_meta(
proxies,
rank,
scratch_ptr,
scratch_bytes,
num_ranks,
group,
)#

Check NVLink connection between every pair of GPUs.

Parameters:

group – the communication group.

class nemo_automodel.components.moe.uccl_ep._utils.EventOverlap(
event: Optional[uccl.ep.EventHandle] = None,
extra_tensors: Optional[Tuple[torch.Tensor]] = None,
)#

A wrapper class for managing CUDA events, with conveniences for overlapping work across streams.

Attributes:
  • event – the CUDA event captured.

  • extra_tensors – a convenient way to simulate PyTorch's tensor record_stream; may be useful with CUDA graphs.

Initialization

Initialize the class.

Parameters:
  • event – the CUDA event captured.

  • extra_tensors – a convenient way to simulate PyTorch's tensor record_stream; may be useful with CUDA graphs.

current_stream_wait() None#

Makes the current stream (torch.cuda.current_stream()) wait for the event to finish.

__enter__() Any#

Utility for overlapping, using Python's with syntax.

You can overlap the kernels on the current stream with the following example:

event_overlap = event_after_all_to_all_kernels()
with event_overlap:
    do_something_on_current_stream()
# After exiting the `with` scope, the current stream will wait for the event to finish.
__exit__(
exc_type: Any,
exc_val: Any,
exc_tb: Any,
) None#

Utility for overlapping, using Python's with syntax.

Please follow the example in the __enter__ function.

nemo_automodel.components.moe.uccl_ep._utils.detect_ib_hca()#

Detect InfiniBand/RDMA HCA device.

Returns the first RDMA device name found (mlx5 for Mellanox, irdma for Intel), or None if no InfiniBand devices are available.
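One plausible detection strategy (an assumption about the approach, not the actual implementation) is to list the RDMA devices the kernel exposes under /sys/class/infiniband and return the first recognized one. A self-contained sketch (detect_ib_hca_sketch is a hypothetical name):

```python
import os
from typing import Optional

def detect_ib_hca_sketch() -> Optional[str]:
    # RDMA-capable devices appear as entries under this sysfs directory.
    sysfs = "/sys/class/infiniband"
    if not os.path.isdir(sysfs):
        return None
    for dev in sorted(os.listdir(sysfs)):
        # Mellanox NICs show up as mlx5_*, Intel NICs as irdma*.
        if dev.startswith(("mlx5", "irdma")):
            return dev
    return None

print(detect_ib_hca_sketch())  # e.g. "mlx5_0", or None on hosts without RDMA hardware
```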

nemo_automodel.components.moe.uccl_ep._utils.per_token_cast_back(x_fp8: torch.Tensor, x_scales: torch.Tensor)#
class nemo_automodel.components.moe.uccl_ep._utils.empty_suppress#
__enter__()#
__exit__(*_)#
class nemo_automodel.components.moe.uccl_ep._utils.suppress_stdout_stderr#
__enter__()#
__exit__(*_)#
nemo_automodel.components.moe.uccl_ep._utils.bench(fn, num_warmups: int = 50, num_tests: int = 50, post_fn=None)#
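The general shape of a bench helper like this is warmup iterations followed by timed iterations, with an optional post_fn hook run after each test. A host-timer sketch of that pattern (the real helper presumably synchronizes with CUDA events when timing GPU kernels; this simplified version is an assumption):

```python
import time

def bench_sketch(fn, num_warmups: int = 50, num_tests: int = 50, post_fn=None) -> float:
    # Warm up caches, JIT, allocator, etc. before measuring.
    for _ in range(num_warmups):
        fn()
    start = time.perf_counter()
    for _ in range(num_tests):
        fn()
        if post_fn is not None:
            post_fn()  # e.g. validate or reset state between iterations
    return (time.perf_counter() - start) / num_tests  # average seconds per call

avg = bench_sketch(lambda: sum(range(1000)), num_warmups=5, num_tests=10)
print(f"{avg * 1e6:.1f} us per call")
```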
nemo_automodel.components.moe.uccl_ep._utils.bench_kineto(
fn,
kernel_names: Union[str, tuple],
num_tests: int = 30,
suppress_kineto_output: bool = False,
trace_path: Optional[str] = None,
barrier_comm_profiling: bool = False,
num_kernels_per_period: int = 1,
)#
nemo_automodel.components.moe.uccl_ep._utils.initialize_uccl(
scratch_ptr,
scratch_nbytes,
rank,
num_ranks,
group,
num_experts=0,
is_intranode=False,
use_normal_mode=False,
rdma_buffer_is_host_allocated=False,
)#
nemo_automodel.components.moe.uccl_ep._utils.destroy_uccl(proxies, workers)#
nemo_automodel.components.moe.uccl_ep._utils.per_token_cast_to_fp8(x: torch.Tensor)#
nemo_automodel.components.moe.uccl_ep._utils.create_grouped_scores(
scores: torch.Tensor,
group_idx: torch.Tensor,
num_groups: int,
)#
nemo_automodel.components.moe.uccl_ep._utils.inplace_unique(x: torch.Tensor, num_slots: int)#
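In group-limited MoE routing, a helper like create_grouped_scores typically zeroes out the scores of experts that fall outside each token's selected groups, so a subsequent top-k only picks experts from those groups. A sketch of that assumed semantics, treating experts as num_groups contiguous groups (the body is illustrative, not the module's implementation):

```python
import torch

def create_grouped_scores_sketch(scores: torch.Tensor, group_idx: torch.Tensor, num_groups: int) -> torch.Tensor:
    # scores: (num_tokens, num_experts); group_idx: selected group ids per token.
    num_tokens, num_experts = scores.shape
    grouped = scores.view(num_tokens, num_groups, -1)
    mask = torch.zeros(num_tokens, num_groups, dtype=torch.bool)
    mask.scatter_(1, group_idx, True)  # mark each token's chosen groups
    # Zero the scores of experts in unselected groups.
    return (grouped * mask.unsqueeze(-1)).view(num_tokens, num_experts)

scores = torch.tensor([[0.1, 0.2, 0.3, 0.4]])
out = create_grouped_scores_sketch(scores, torch.tensor([[0]]), num_groups=2)
print(out)  # scores in the unselected group (experts 2, 3) are zeroed
```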