core.nccl_allocator#
Module Contents#
Classes#
- nccl_mem: An NCCL memory allocator, which inherits the APEX nccl_allocator implementation.
- MultiGroupMemPoolAllocator: A custom allocator class that registers a single memory pool with multiple communication groups.
Functions#
- get_func_args: Get the argument names of a function.
- create_nccl_mem_pool: Create a memory pool using the NCCL allocator.
- init: Initialize the NCCL allocator.
Data#
- _allocator
API#
- core.nccl_allocator._allocator#
None
- core.nccl_allocator._build_nccl_allocator()#
- core.nccl_allocator.get_func_args(func)#
Get the argument names of a function.
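A hypothetical usage sketch; the exact return format (a list of parameter names) is an assumption, since this page only states that the function returns argument names.

import megatron.core.nccl_allocator as nccl_allocator

def scale(tensor, factor=1.0):
    return tensor * factor

# Assumed to return the parameter names of `scale`, e.g. ["tensor", "factor"].
names = nccl_allocator.get_func_args(scale)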
- core.nccl_allocator.create_nccl_mem_pool(symmetric=None)#
Create a memory pool using the NCCL allocator.
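A short pool-creation sketch. Whether init() must run before create_nccl_mem_pool() is not stated on this page, so the sketch calls it first to be safe; the meaning of the symmetric flag is assumed from the parameter name.

import megatron.core.nccl_allocator as nccl_allocator

nccl_allocator.init()
pool = nccl_allocator.create_nccl_mem_pool()                    # default registration mode
sym_pool = nccl_allocator.create_nccl_mem_pool(symmetric=True)  # explicitly request symmetric registration (assumed semantics)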
- core.nccl_allocator.init() → None#
Initialize the NCCL allocator.
PyTorch tracks memory registration at the pool level, not per allocation. If a pool already contains allocations from a previous context, attempting to register it again will re-register all existing allocations and may trigger NCCL errors. To avoid this, the pool is explicitly deregistered on entry and re-registered on exit for each context use.
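A hedged sketch of the behavior described above, assuming torch.distributed has already been initialized with the NCCL backend: the same pool is entered twice, and the deregister-on-entry / re-register-on-exit handling keeps the second context use from re-registering the first allocation.

import torch
import megatron.core.nccl_allocator as nccl_allocator

nccl_allocator.init()
pool = nccl_allocator.create_nccl_mem_pool()
group = torch.distributed.new_group(backend="nccl")

# First context use: allocations made here live in `pool` and are registered.
with nccl_allocator.nccl_mem(pool, group=group):
    a = torch.zeros(1024, dtype=torch.float32, device="cuda")

# Second context use of the same pool: the pool is deregistered on entry and
# re-registered on exit, so the pre-existing allocation `a` is not registered twice.
with nccl_allocator.nccl_mem(pool, group=group):
    b = torch.zeros(1024, dtype=torch.float32, device="cuda")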
- class core.nccl_allocator.nccl_mem(pool, enabled=True, device=None, group=None, symmetric=True)#
An NCCL memory allocator, which inherits the APEX nccl_allocator implementation.
Initialization
- __enter__()#
- __exit__(*args)#
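A minimal usage sketch based on the signature above; the behavior of enabled=False (falling back to the default CUDA allocator) is an assumption, and the group setup is illustrative only.

import torch
import megatron.core.nccl_allocator as nccl_allocator

nccl_allocator.init()
pool = nccl_allocator.create_nccl_mem_pool()
group = torch.distributed.new_group(ranks=[0, 1, 2, 3], backend="nccl")

# Buffers allocated inside the context come from `pool` and are registered
# with `group`; symmetric=True matches the default in the signature.
with nccl_allocator.nccl_mem(pool, group=group, symmetric=True):
    send_buf = torch.zeros(4096, dtype=torch.float32, device="cuda")

# With enabled=False the context is assumed to be a no-op, so this allocation
# goes through the default allocator instead of `pool`.
with nccl_allocator.nccl_mem(pool, enabled=False):
    scratch = torch.empty(1024, device="cuda")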
- class core.nccl_allocator.MultiGroupMemPoolAllocator(pool, groups, symmetric=True)#
A custom allocator class that registers a single memory pool with multiple communication groups.
Use cases:
[FSDP + EP] With FSDP and expert parallelism (EP), the expert layers (expert-dp) and non-expert layers (dp) use different communicator groups. The same memory pool has to be registered with both groups.
[Hybrid FSDP/DP] With hybrid FSDP/DP, there are an inter-dp group and an intra-dp group. The same memory pool has to be registered with both groups.
[Hybrid FSDP/DP + EP] With hybrid FSDP/DP plus EP, there are inter-dp, intra-dp, and expert-dp groups. The same memory pool has to be registered with all three groups.
Example
# Assumes torch.distributed.init_process_group(backend="nccl") has already been called.
import torch
import megatron.core.nccl_allocator as nccl_allocator
from megatron.core.nccl_allocator import MultiGroupMemPoolAllocator

nccl_allocator.init()
pool = nccl_allocator.create_nccl_mem_pool()
group_1 = torch.distributed.new_group(ranks=[0, 1, 2, 3, 4, 5, 6, 7], backend="nccl")
group_2 = torch.distributed.new_group(ranks=[0, 2, 4, 6], backend="nccl")
with MultiGroupMemPoolAllocator(pool, [group_1, group_2]):
    a = torch.zeros(1024, dtype=torch.float32, device="cuda")
    b = torch.zeros(1024, dtype=torch.float32, device="cuda")
Initialization
- __enter__()#
- __exit__(*args)#