nemo_deploy.llm.inference.tron_utils#

Module Contents#

Classes#

RNGConfig

Configuration settings for random number generation.

DistributedInitConfig

Configuration settings for distributed training initialization.

Functions#

get_rank_safe

Get the rank from torch.distributed or environment variable.

get_world_size_safe

Get the world size from torch.distributed or environment variable.

get_local_rank_preinit

Get the local rank from the environment variable, intended for use before full init.

print_rank_0

Print a message only on global rank 0.

torch_distributed_init

Initialize torch.distributed using a TCP init method and env-provided ranks.

initialize_distributed

Initialize core model parallel.

_set_random_seed

Set random seed for reproducibility.

_initialize_tp_communicators

Initialize communicators with user buffers for high-performance tensor-model-parallel communication overlap.

_get_model_type

Determine the model type from the model configuration.

get_model_from_config

Get a model from the given configuration.

Data#

API#

nemo_deploy.llm.inference.tron_utils.LOGGER = 'getLogger(...)'#
class nemo_deploy.llm.inference.tron_utils.RNGConfig#

Configuration settings for random number generation.

seed: int = 1234#

Random seed used for Python, NumPy, PyTorch, and CUDA.

te_rng_tracker: bool = False#

Use the Transformer Engine version of the random number generator. Required for CUDA graphs support.

inference_rng_tracker: bool = False#

Use a random number generator configured for inference.

data_parallel_random_init: bool = False#

Enable random initialization of params across data parallel ranks.
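
A minimal usage sketch, assuming RNGConfig is constructed directly with the fields documented above; the values are typically forwarded to _set_random_seed (documented below):

    from nemo_deploy.llm.inference.tron_utils import RNGConfig

    # Override only the seed; all other fields keep the defaults shown above.
    rng_config = RNGConfig(seed=42)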

class nemo_deploy.llm.inference.tron_utils.DistributedInitConfig#

Configuration settings for distributed training initialization.

distributed_backend: Literal['nccl', 'gloo'] = 'nccl'#

Which backend to use for distributed training.

distributed_timeout_minutes: int = 10#

Timeout minutes for torch.distributed.

align_grad_reduce: bool = True#

If not set, all PP stages launch gradient reductions simultaneously; otherwise, each PP stage launches them independently as needed.

local_rank: int = 'field(...)'#

Local rank passed from the distributed launcher.

lazy_mpu_init: bool = False#

If set to True, initialize_megatron() skips DDP initialization and returns a function to complete it instead. Also turns on the --use-cpu-initialization flag. This is for an external DDP manager.

use_torch_fsdp2: bool = False#

Use the torch FSDP2 implementation. FSDP2 does not currently work with pipeline parallelism, and it is not yet in a stable release, so it may contain bugs or other issues.

nccl_communicator_config_path: Optional[str] = None#

Path to the YAML file with NCCL communicator configurations. The min/max number of thread groups and the thread group cluster size of each communicator can be configured by setting min_ctas, max_ctas, and cga_cluster_size.

use_tp_pp_dp_mapping: bool = False#

If set, the distributed rank initialization order is changed from tp-dp-pp to tp-pp-dp. Make sure EP and CP aren’t used with this option enabled.

use_gloo_process_groups: bool = True#

If set, create Gloo process groups for communications.
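
A minimal usage sketch, assuming DistributedInitConfig is constructed directly with the fields documented above:

    from nemo_deploy.llm.inference.tron_utils import DistributedInitConfig

    # NCCL backend with a longer timeout; all other fields keep their defaults.
    dist_config = DistributedInitConfig(
        distributed_backend="nccl",
        distributed_timeout_minutes=30,
    )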

nemo_deploy.llm.inference.tron_utils.get_rank_safe() int#

Get the rank from torch.distributed or environment variable.

Returns:

The global rank of the current process.

Return type:

int

nemo_deploy.llm.inference.tron_utils.get_world_size_safe() int#

Get the world size from torch.distributed or environment variable.

Returns:

The total number of processes in the distributed setup.

Return type:

int

nemo_deploy.llm.inference.tron_utils.get_local_rank_preinit() int#

Get the local rank from the environment variable, intended for use before full init.

Returns:

The local rank of the current process.

Return type:

int

nemo_deploy.llm.inference.tron_utils.print_rank_0(message: str) None#

Print a message only on global rank 0.

Parameters:

message (str) – The message string to print.
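
The rank helpers above can be combined as in the following sketch; they fall back to the rank-related environment variables (see the list under torch_distributed_init) when torch.distributed is not yet initialized:

    from nemo_deploy.llm.inference.tron_utils import (
        get_local_rank_preinit,
        get_rank_safe,
        get_world_size_safe,
        print_rank_0,
    )

    rank = get_rank_safe()
    world_size = get_world_size_safe()
    local_rank = get_local_rank_preinit()

    # Only global rank 0 prints; other ranks stay silent.
    print_rank_0(f"rank {rank} of {world_size} (local rank {local_rank})")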

nemo_deploy.llm.inference.tron_utils.torch_distributed_init(
dist_config: nemo_deploy.llm.inference.tron_utils.DistributedInitConfig,
)#

Initialize torch.distributed using a TCP init method and env-provided ranks.

This function is idempotent: if torch.distributed is already initialized it logs and returns. Otherwise, it sets the current CUDA device based on LOCAL_RANK (when GPUs are available), constructs the TCP init_method from MASTER_ADDR and MASTER_PORT, and initializes the process group with the backend and timeout specified in dist_config. After init, it issues a barrier scoped to the current device.

Parameters:

dist_config (DistributedInitConfig) – Configuration including backend and timeout used for the process group initialization.

Environment variables:
  • MASTER_ADDR: Master node address (default: “localhost”).
  • MASTER_PORT: Master node port (default: “6000”).
  • WORLD_SIZE: Total number of ranks (default: “1”).
  • RANK: Global rank of this process (default: “0”).
  • LOCAL_RANK: Local rank on the node, used as the GPU index (default: “0”).
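
A single-process sketch; in practice the rendezvous variables below are supplied by the launcher (for example torchrun) rather than set by hand:

    import os

    from nemo_deploy.llm.inference.tron_utils import (
        DistributedInitConfig,
        torch_distributed_init,
    )

    # Defaults mirror the environment variables documented above.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "6000")
    os.environ.setdefault("WORLD_SIZE", "1")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("LOCAL_RANK", "0")

    torch_distributed_init(DistributedInitConfig())  # no-op if already initialized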

nemo_deploy.llm.inference.tron_utils.initialize_distributed(
model_config: Union[nemo.collections.llm.gpt.model.base.GPTConfig, nemo.collections.llm.t5.model.t5.T5Config],
dist_config: nemo_deploy.llm.inference.tron_utils.DistributedInitConfig,
num_distributed_optimizer_instances: int,
get_embedding_ranks: Optional[Callable[[List[int], Optional[int]], List[int]]],
get_position_embedding_ranks: Optional[Callable[[List[int], Optional[int]], List[int]]],
) None#

Initialize core model parallel.

Parameters:
  • model_config (Union[GPTConfig, T5Config]) – Configuration for the model architecture

  • dist_config (DistributedInitConfig) – Configuration for distributed initialization

  • num_distributed_optimizer_instances (int) – Number of optimizer instances for distributed training

  • get_embedding_ranks (Optional[Callable[[List[int], Optional[int]], List[int]]]) – Function to get the ranks for embedding parallel

  • get_position_embedding_ranks (Optional[Callable[[List[int], Optional[int]], List[int]]]) – Function to get the ranks for position embedding parallel
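
A sketch of a typical call; the GPTConfig field values below follow Megatron’s TransformerConfig conventions and are illustrative, not a prescribed minimal configuration:

    from nemo.collections.llm.gpt.model.base import GPTConfig
    from nemo_deploy.llm.inference.tron_utils import (
        DistributedInitConfig,
        initialize_distributed,
    )

    model_config = GPTConfig(
        num_layers=2,
        hidden_size=256,
        num_attention_heads=4,
        seq_length=512,
    )

    initialize_distributed(
        model_config=model_config,
        dist_config=DistributedInitConfig(),
        num_distributed_optimizer_instances=1,
        get_embedding_ranks=None,            # fall back to default embedding-rank selection
        get_position_embedding_ranks=None,
    )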

nemo_deploy.llm.inference.tron_utils._set_random_seed(
seed_: int,
data_parallel_random_init: bool = False,
te_rng_tracker: bool = False,
inference_rng_tracker: bool = False,
) None#

Set random seed for reproducibility.

Parameters:
  • seed_ (int) – Base random seed to use

  • data_parallel_random_init (bool, optional) – Whether to use different seeds for different data parallel ranks. Defaults to False.

  • te_rng_tracker (bool, optional) – Whether to use Transformer Engine random number generator. Defaults to False.

  • inference_rng_tracker (bool, optional) – Whether to use a random number generator configured for inference. Defaults to False.
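
The RNGConfig fields map directly onto the arguments of this function, as sketched below:

    from nemo_deploy.llm.inference.tron_utils import RNGConfig, _set_random_seed

    rng_config = RNGConfig(seed=1234)

    _set_random_seed(
        seed_=rng_config.seed,
        data_parallel_random_init=rng_config.data_parallel_random_init,
        te_rng_tracker=rng_config.te_rng_tracker,
        inference_rng_tracker=rng_config.inference_rng_tracker,
    )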

nemo_deploy.llm.inference.tron_utils._initialize_tp_communicators(
model_config: Union[nemo.collections.llm.gpt.model.base.GPTConfig, nemo.collections.llm.t5.model.t5.T5Config],
micro_batch_size: int,
) None#

Initialize communicators with user buffers for high-performance tensor-model-parallel communication overlap.

Parameters:
  • model_config (Union[GPTConfig, T5Config]) – Configuration for the model architecture

  • micro_batch_size (int) – Size of the micro batch
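
A sketch of how the arguments line up, under the assumption that Transformer Engine with userbuffers support is installed and the model config has tensor-parallel communication overlap enabled; without that environment the call is expected to fail:

    from nemo.collections.llm.gpt.model.base import GPTConfig
    from nemo_deploy.llm.inference.tron_utils import _initialize_tp_communicators

    # Illustrative config; communication-overlap settings must already be enabled
    # on the config for userbuffer initialization to succeed (assumption).
    model_config = GPTConfig(num_layers=2, hidden_size=256, num_attention_heads=4, seq_length=512)

    _initialize_tp_communicators(model_config=model_config, micro_batch_size=1)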

nemo_deploy.llm.inference.tron_utils._get_model_type(
model_config: Union[nemo.collections.llm.gpt.model.base.GPTConfig, nemo.collections.llm.t5.model.t5.T5Config],
) megatron.core.enums.ModelType#

Determine the model type from the model configuration.

Parameters:

model_config (Union[GPTConfig, T5Config]) – The model configuration object

Returns:

The model type enum value (encoder_and_decoder or encoder_or_decoder)

Return type:

ModelType
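
For example, a decoder-only GPT configuration is expected to map to ModelType.encoder_or_decoder, while a T5 configuration maps to ModelType.encoder_and_decoder; the GPTConfig field values below are illustrative:

    from megatron.core.enums import ModelType
    from nemo.collections.llm.gpt.model.base import GPTConfig
    from nemo_deploy.llm.inference.tron_utils import _get_model_type

    gpt_config = GPTConfig(num_layers=2, hidden_size=256, num_attention_heads=4, seq_length=512)
    assert _get_model_type(gpt_config) is ModelType.encoder_or_decoder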

nemo_deploy.llm.inference.tron_utils.get_model_from_config(
model_config: Union[nemo.collections.llm.gpt.model.base.GPTConfig, nemo.collections.llm.t5.model.t5.T5Config],
ddp_config: megatron.core.distributed.DistributedDataParallelConfig,
overlap_param_gather_with_optimizer_step: bool = False,
wrap_with_ddp: bool = True,
data_parallel_random_init: bool = True,
tokenizer=None,
) List[megatron.core.transformer.module.MegatronModule]#

Get a model from the given configuration.

This function should only be called after distributed initialization (for example via initialize_distributed()).

Parameters:
  • model_config (Union[GPTConfig, T5Config]) – The model configuration

  • ddp_config (DistributedDataParallelConfig) – The distributed data parallel configuration

  • overlap_param_gather_with_optimizer_step (bool, optional) – Whether to overlap parameter gathering with optimizer step. Defaults to False.

  • wrap_with_ddp (bool, optional) – Whether to wrap the model with DistributedDataParallel. Defaults to True.

  • data_parallel_random_init (bool, optional) – Whether to initialize data parallel ranks with random seeds. Defaults to True.

  • tokenizer (optional) – The tokenizer to pass to configure_model. Defaults to None.

Returns:

List of model modules, potentially wrapped with DistributedDataParallel

Return type:

List[MegatronModule]
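
A sketch of building an unwrapped model for inference after distributed initialization; the GPTConfig field values are illustrative, and wrap_with_ddp is disabled because no gradient synchronization is needed:

    from megatron.core.distributed import DistributedDataParallelConfig
    from nemo.collections.llm.gpt.model.base import GPTConfig
    from nemo_deploy.llm.inference.tron_utils import (
        DistributedInitConfig,
        get_model_from_config,
        initialize_distributed,
    )

    model_config = GPTConfig(num_layers=2, hidden_size=256, num_attention_heads=4, seq_length=512)

    # Distributed state must be set up before building the model.
    initialize_distributed(
        model_config=model_config,
        dist_config=DistributedInitConfig(),
        num_distributed_optimizer_instances=1,
        get_embedding_ranks=None,
        get_position_embedding_ranks=None,
    )

    # Returns a list of MegatronModule instances; not wrapped with DDP
    # because wrap_with_ddp=False.
    modules = get_model_from_config(
        model_config=model_config,
        ddp_config=DistributedDataParallelConfig(),
        wrap_with_ddp=False,
        data_parallel_random_init=False,
    )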