nemo_deploy.llm.inference.tron_utils#

Module Contents#

Classes#

RNGConfig

Configuration settings for random number generation.

DistributedInitConfig

Configuration settings for distributed training initialization.

Functions#

get_rank_safe

Get the rank from torch.distributed or environment variable.

get_world_size_safe

Get the world size from torch.distributed or environment variable.

get_local_rank_preinit

Get the local rank from the environment variable, intended for use before full init.

print_rank_0

Print a message only on global rank 0.

torch_distributed_init

Initialize torch.distributed using a TCP init method and env-provided ranks.

initialize_distributed

Initialize core model parallel.

_set_random_seed

Set random seed for reproducibility.

_initialize_tp_communicators

Initialize communicators with user buffers for high-performance tensor-model-parallel communication overlap.

_get_model_type

Determine the model type from the model configuration.

get_model_from_config

Get a model from the given configuration.

Data#

API#

nemo_deploy.llm.inference.tron_utils.LOGGER = 'getLogger(...)'#
class nemo_deploy.llm.inference.tron_utils.RNGConfig#

Configuration settings for random number generation.

seed: int = 1234#

Random seed used for Python, NumPy, PyTorch, and CUDA.

te_rng_tracker: bool = False#

Use the Transformer Engine version of the random number generator. Required for CUDA graphs support.

inference_rng_tracker: bool = False#

Use a random number generator configured for inference.

data_parallel_random_init: bool = False#

Enable random initialization of params across data parallel ranks.
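
A minimal usage sketch, assuming RNGConfig is constructed directly with the fields documented above; the values are typically forwarded to _set_random_seed (documented below):

    from nemo_deploy.llm.inference.tron_utils import RNGConfig

    # Override only the seed; all other fields keep the defaults shown above.
    rng_config = RNGConfig(seed=42)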

class nemo_deploy.llm.inference.tron_utils.DistributedInitConfig#

Configuration settings for distributed training initialization.

distributed_backend: Literal['nccl', 'gloo'] = 'nccl'#

Which backend to use for distributed training.

distributed_timeout_minutes: int = 10#

Timeout minutes for torch.distributed.

align_grad_reduce: bool = True#

If not set, all PP stages launch gradient reductions simultaneously; otherwise, each PP stage launches them independently as needed.

local_rank: int = 'field(...)'#

Local rank passed from the distributed launcher.

lazy_mpu_init: bool = False#

If set to True, initialize_megatron() skips DDP initialization and returns a function to complete it instead. Also turns on the --use-cpu-initialization flag. This is for an external DDP manager.

use_torch_fsdp2: bool = False#

Use the torch FSDP2 implementation. FSDP2 does not currently work with pipeline parallelism, and it is not yet in a stable release, so it may contain bugs or other issues.

nccl_communicator_config_path: Optional[str] = None#

Path to the YAML file with NCCL communicator configurations. The min/max number of thread groups and the thread group cluster size of each communicator can be configured by setting min_ctas, max_ctas, and cga_cluster_size.

use_tp_pp_dp_mapping: bool = False#

If set, the distributed rank initialization order is changed from tp-dp-pp to tp-pp-dp. Make sure EP and CP aren’t used with this option enabled.

use_gloo_process_groups: bool = True#

If set, create Gloo process groups for communications.
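
A minimal usage sketch, assuming DistributedInitConfig is constructed directly with the fields documented above:

    from nemo_deploy.llm.inference.tron_utils import DistributedInitConfig

    # NCCL backend with a longer timeout; all other fields keep their defaults.
    dist_config = DistributedInitConfig(
        distributed_backend="nccl",
        distributed_timeout_minutes=30,
    )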

nemo_deploy.llm.inference.tron_utils.get_rank_safe() int#

Get the rank from torch.distributed or environment variable.

Returns:

The global rank of the current process.

Return type:

int

nemo_deploy.llm.inference.tron_utils.get_world_size_safe() int#

Get the world size from torch.distributed or environment variable.

Returns:

The total number of processes in the distributed setup.

Return type:

int

nemo_deploy.llm.inference.tron_utils.get_local_rank_preinit() int#

Get the local rank from the environment variable, intended for use before full init.

Returns:

The local rank of the current process.

Return type:

int

nemo_deploy.llm.inference.tron_utils.print_rank_0(message: str) None#

Print a message only on global rank 0.

Parameters:

message (str) – The message string to print.
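
The rank helpers above can be combined as in the following sketch; they fall back to the rank-related environment variables (see the list under torch_distributed_init) when torch.distributed is not yet initialized:

    from nemo_deploy.llm.inference.tron_utils import (
        get_local_rank_preinit,
        get_rank_safe,
        get_world_size_safe,
        print_rank_0,
    )

    rank = get_rank_safe()
    world_size = get_world_size_safe()
    local_rank = get_local_rank_preinit()

    # Only global rank 0 prints; other ranks stay silent.
    print_rank_0(f"rank {rank} of {world_size} (local rank {local_rank})")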

nemo_deploy.llm.inference.tron_utils.torch_distributed_init(
dist_config: nemo_deploy.llm.inference.tron_utils.DistributedInitConfig,
)#

Initialize torch.distributed using a TCP init method and env-provided ranks.

This function is idempotent: if torch.distributed is already initialized it logs and returns. Otherwise, it sets the current CUDA device based on LOCAL_RANK (when GPUs are available), constructs the TCP init_method from MASTER_ADDR and MASTER_PORT, and initializes the process group with the backend and timeout specified in dist_config. After init, it issues a barrier scoped to the current device.

Parameters:

dist_config (DistributedInitConfig) – Configuration including backend and timeout used for the process group initialization.

Environment variables:
  • MASTER_ADDR: Master node address (default: “localhost”).
  • MASTER_PORT: Master node port (default: “6000”).
  • WORLD_SIZE: Total number of ranks (default: “1”).
  • RANK: Global rank of this process (default: “0”).
  • LOCAL_RANK: Local rank on the node, used as the GPU index (default: “0”).
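
A single-process sketch; in practice the rendezvous variables below are supplied by the launcher (for example torchrun) rather than set by hand:

    import os

    from nemo_deploy.llm.inference.tron_utils import (
        DistributedInitConfig,
        torch_distributed_init,
    )

    # Defaults mirror the environment variables documented above.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "6000")
    os.environ.setdefault("WORLD_SIZE", "1")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("LOCAL_RANK", "0")

    torch_distributed_init(DistributedInitConfig())  # no-op if already initialized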

nemo_deploy.llm.inference.tron_utils.initialize_distributed(
model_config: Union[nemo.collections.llm.gpt.model.base.GPTConfig, nemo.collections.llm.t5.model.t5.T5Config],
dist_config: nemo_deploy.llm.inference.tron_utils.DistributedInitConfig,
num_distributed_optimizer_instances: int,
get_embedding_ranks: Optional[Callable[[List[int], Optional[int]], List[int]]],
get_position_embedding_ranks: Optional[Callable[[List[int], Optional[int]], List[int]]],
) None#

Initialize core model parallel.

Parameters:
  • model_config (Union[GPTConfig, T5Config]) – Configuration for the model architecture

  • dist_config (DistributedInitConfig) – Configuration for distributed initialization

  • num_distributed_optimizer_instances (int) – Number of optimizer instances for distributed training

  • get_embedding_ranks (Optional[Callable[[List[int], Optional[int]], List[int]]]) – Function to get the ranks for embedding parallel

  • get_position_embedding_ranks (Optional[Callable[[List[int], Optional[int]], List[int]]]) – Function to get the ranks for position embedding parallel
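
A sketch of a typical call; the GPTConfig field values below follow Megatron’s TransformerConfig conventions and are illustrative, not a prescribed minimal configuration:

    from nemo.collections.llm.gpt.model.base import GPTConfig
    from nemo_deploy.llm.inference.tron_utils import (
        DistributedInitConfig,
        initialize_distributed,
    )

    model_config = GPTConfig(
        num_layers=2,
        hidden_size=256,
        num_attention_heads=4,
        seq_length=512,
    )

    initialize_distributed(
        model_config=model_config,
        dist_config=DistributedInitConfig(),
        num_distributed_optimizer_instances=1,
        get_embedding_ranks=None,            # fall back to default embedding-rank selection
        get_position_embedding_ranks=None,
    )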

nemo_deploy.llm.inference.tron_utils._set_random_seed(
seed_: int,
data_parallel_random_init: bool = False,
te_rng_tracker: bool = False,
inference_rng_tracker: bool = False,
) None#

Set random seed for reproducibility.

Parameters:
  • seed_ (int) – Base random seed to use

  • data_parallel_random_init (bool, optional) – Whether to use different seeds for different data parallel ranks. Defaults to False.

  • te_rng_tracker (bool, optional) – Whether to use Transformer Engine random number generator. Defaults to False.

  • inference_rng_tracker (bool, optional) – Whether to use a random number generator configured for inference. Defaults to False.
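
The RNGConfig fields map directly onto the arguments of this function, as sketched below:

    from nemo_deploy.llm.inference.tron_utils import RNGConfig, _set_random_seed

    rng_config = RNGConfig(seed=1234)

    _set_random_seed(
        seed_=rng_config.seed,
        data_parallel_random_init=rng_config.data_parallel_random_init,
        te_rng_tracker=rng_config.te_rng_tracker,
        inference_rng_tracker=rng_config.inference_rng_tracker,
    )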

nemo_deploy.llm.inference.tron_utils._initialize_tp_communicators(
model_config: Union[nemo.collections.llm.gpt.model.base.GPTConfig, nemo.collections.llm.t5.model.t5.T5Config],
micro_batch_size: int,
) None#

Initialize communicators with user buffers for high-performance tensor-model-parallel communication overlap.

Parameters:
  • model_config (Union[GPTConfig, T5Config]) – Configuration for the model architecture

  • micro_batch_size (int) – Size of the micro batch
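
A sketch of how the arguments line up, under the assumption that Transformer Engine with userbuffers support is installed and the model config has tensor-parallel communication overlap enabled; without that environment the call is expected to fail:

    from nemo.collections.llm.gpt.model.base import GPTConfig
    from nemo_deploy.llm.inference.tron_utils import _initialize_tp_communicators

    # Illustrative config; communication-overlap settings must already be enabled
    # on the config for userbuffer initialization to succeed (assumption).
    model_config = GPTConfig(num_layers=2, hidden_size=256, num_attention_heads=4, seq_length=512)

    _initialize_tp_communicators(model_config=model_config, micro_batch_size=1)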

nemo_deploy.llm.inference.tron_utils._get_model_type(
model_config: Union[nemo.collections.llm.gpt.model.base.GPTConfig, nemo.collections.llm.t5.model.t5.T5Config],
) megatron.core.enums.ModelType#

Determine the model type from the model configuration.

Parameters:

model_config (Union[GPTConfig, T5Config]) – The model configuration object

Returns:

The model type enum value (encoder_and_decoder or encoder_or_decoder)

Return type:

ModelType
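
For example, a decoder-only GPT configuration is expected to map to ModelType.encoder_or_decoder, while a T5 configuration maps to ModelType.encoder_and_decoder; the GPTConfig field values below are illustrative:

    from megatron.core.enums import ModelType
    from nemo.collections.llm.gpt.model.base import GPTConfig
    from nemo_deploy.llm.inference.tron_utils import _get_model_type

    gpt_config = GPTConfig(num_layers=2, hidden_size=256, num_attention_heads=4, seq_length=512)
    assert _get_model_type(gpt_config) is ModelType.encoder_or_decoder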

nemo_deploy.llm.inference.tron_utils.get_model_from_config(
model_config: Union[nemo.collections.llm.gpt.model.base.GPTConfig, nemo.collections.llm.t5.model.t5.T5Config],
ddp_config: megatron.core.distributed.DistributedDataParallelConfig,
overlap_param_gather_with_optimizer_step: bool = False,
wrap_with_ddp: bool = True,
data_parallel_random_init: bool = True,
tokenizer=None,
) List[megatron.core.transformer.module.MegatronModule]#

Get a model from the given configuration.

This function should only be called after distributed initialization (for example via initialize_distributed()).

Parameters:
  • model_config (Union[GPTConfig, T5Config]) – The model configuration

  • ddp_config (DistributedDataParallelConfig) – The distributed data parallel configuration

  • overlap_param_gather_with_optimizer_step (bool, optional) – Whether to overlap parameter gathering with optimizer step. Defaults to False.

  • wrap_with_ddp (bool, optional) – Whether to wrap the model with DistributedDataParallel. Defaults to True.

  • data_parallel_random_init (bool, optional) – Whether to initialize data parallel ranks with random seeds. Defaults to True.

  • tokenizer (optional) – The tokenizer to pass to configure_model. Defaults to None.

Returns:

List of model modules, potentially wrapped with DistributedDataParallel

Return type:

List[MegatronModule]
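
A sketch of building an unwrapped model for inference after distributed initialization; the GPTConfig field values are illustrative, and wrap_with_ddp is disabled because no gradient synchronization is needed:

    from megatron.core.distributed import DistributedDataParallelConfig
    from nemo.collections.llm.gpt.model.base import GPTConfig
    from nemo_deploy.llm.inference.tron_utils import (
        DistributedInitConfig,
        get_model_from_config,
        initialize_distributed,
    )

    model_config = GPTConfig(num_layers=2, hidden_size=256, num_attention_heads=4, seq_length=512)

    # Distributed state must be set up before building the model.
    initialize_distributed(
        model_config=model_config,
        dist_config=DistributedInitConfig(),
        num_distributed_optimizer_instances=1,
        get_embedding_ranks=None,
        get_position_embedding_ranks=None,
    )

    # Returns a list of MegatronModule instances; not wrapped with DDP
    # because wrap_with_ddp=False.
    modules = get_model_from_config(
        model_config=model_config,
        ddp_config=DistributedDataParallelConfig(),
        wrap_with_ddp=False,
        data_parallel_random_init=False,
    )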