nemo_deploy.llm.inference.inference_base#

Module Contents#

Classes#

MCoreEngineWithCleanup

Wrapper around MCoreEngine that ensures proper cleanup of distributed resources.

Functions#

_load_dist_shards_into_model

Load a NeMo-2 distributed checkpoint (torch_dist .distcp shards) into an already-constructed Megatron model list.

cleanup_distributed

Clean up the distributed environment by destroying the process group.

initialize_megatron_for_inference

Initialize the Megatron-Tron runtime components required for inference.

peel

Recursively unwrap a wrapped torch.nn.Module and return the underlying module.

load_nemo_checkpoint_to_tron_model

Load NeMo checkpoint weights into a Tron model.

setup_megatron_model_and_tokenizer_for_inference

Initialize a Megatron model and tokenizer for inference from a Megatron-LM/MBridge checkpoint.

setup_model_and_tokenizer_for_inference

Initialize a Megatron-Core model and tokenizer for inference from a NeMo-2.0 checkpoint.

create_mcore_engine

Set up the model, tokenizer and MCoreEngine for inference.

Data#

API#

nemo_deploy.llm.inference.inference_base.logger = 'getLogger(...)'#
nemo_deploy.llm.inference.inference_base.LOGGER = 'getLogger(...)'#
nemo_deploy.llm.inference.inference_base._load_dist_shards_into_model(
model: List[megatron.core.transformer.module.MegatronModule],
weights_dir: pathlib.Path,
legacy_ckpt: bool = False,
) → None#

Load a NeMo-2 distributed checkpoint (torch_dist .distcp shards) into an already-constructed Megatron model list.

Parameters:
  • model (List[MegatronModule]) – The list of Megatron model modules

  • weights_dir (Path) – Path to the weights directory containing shards

  • legacy_ckpt (bool) – Whether to use legacy checkpoint format

nemo_deploy.llm.inference.inference_base.cleanup_distributed() → None#

Clean up the distributed environment by destroying the process group.

This prevents resource leaks and warnings about destroy_process_group() not being called.
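
A minimal usage sketch (the body of the try block is a placeholder, not part of this module): pairing inference work with cleanup_distributed() in a finally block guarantees the process group is destroyed even if generation fails.

    from nemo_deploy.llm.inference.inference_base import cleanup_distributed

    try:
        # ... set up the model and run generation here ...
        pass
    finally:
        # Always destroy the torch.distributed process group so no resources
        # leak and no destroy_process_group() warning is emitted.
        cleanup_distributed()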

nemo_deploy.llm.inference.inference_base.initialize_megatron_for_inference(
model_config,
dist_config: nemo_deploy.llm.inference.tron_utils.DistributedInitConfig,
rng_config: nemo_deploy.llm.inference.tron_utils.RNGConfig,
micro_batch_size: int,
) → None#

Initialize the Megatron-Tron runtime components required for inference.

Parameters:
  • model_config – The model configuration object that specifies tensor/pipeline parallel sizes and model architecture details

  • dist_config (DistributedInitConfig) – Distributed launcher configuration that controls torch.distributed process groups

  • rng_config (RNGConfig) – Configuration for random number generation behavior and seed

  • micro_batch_size (int) – The micro batch size used during model execution
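
A hedged sketch of the call. It assumes DistributedInitConfig and RNGConfig (from nemo_deploy.llm.inference.tron_utils) can be constructed with their default values and that model_config has already been loaded from a checkpoint; neither assumption is guaranteed by this page.

    from nemo_deploy.llm.inference.tron_utils import DistributedInitConfig, RNGConfig
    from nemo_deploy.llm.inference.inference_base import initialize_megatron_for_inference

    dist_config = DistributedInitConfig()  # assumption: default construction is valid
    rng_config = RNGConfig()               # assumption: default construction is valid

    initialize_megatron_for_inference(
        model_config=model_config,  # assumption: model config obtained from a checkpoint
        dist_config=dist_config,
        rng_config=rng_config,
        micro_batch_size=1,
    )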

nemo_deploy.llm.inference.inference_base.peel(m: torch.nn.Module) → torch.nn.Module#

Recursively unwrap a wrapped torch.nn.Module and return the underlying module.

Parameters:

m (torch.nn.Module) – The (possibly wrapped) PyTorch module

Returns:

The innermost unwrapped module

Return type:

torch.nn.Module
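
A short usage sketch, assuming model is the module list returned by one of the setup functions documented below:

    from nemo_deploy.llm.inference.inference_base import peel

    # model[0] may be wrapped (for example by a DDP or float16 wrapper);
    # peel returns the innermost torch.nn.Module for direct inspection.
    base_module = peel(model[0])
    print(type(base_module).__name__)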

nemo_deploy.llm.inference.inference_base.load_nemo_checkpoint_to_tron_model(
model: List[megatron.core.transformer.module.MegatronModule],
path: pathlib.Path,
legacy_ckpt: bool = False,
) → None#

Load NeMo checkpoint weights into a Tron model.

Parameters:
  • model (List[MegatronModule]) – Tron model modules list (from get_model_from_config)

  • path (Path) – Path to NeMo checkpoint directory

  • legacy_ckpt (bool) – Whether to use legacy checkpoint format
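
A hedged sketch, assuming model is a module list already built with parallel settings matching the checkpoint (for example via get_model_from_config, as noted above); the path is a placeholder.

    from pathlib import Path
    from nemo_deploy.llm.inference.inference_base import load_nemo_checkpoint_to_tron_model

    checkpoint_dir = Path("/path/to/nemo2_checkpoint")  # placeholder path
    load_nemo_checkpoint_to_tron_model(
        model=model,  # assumption: pre-built list of MegatronModule instances
        path=checkpoint_dir,
        legacy_ckpt=False,
    )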

nemo_deploy.llm.inference.inference_base.setup_megatron_model_and_tokenizer_for_inference(
checkpoint_path: Union[str, pathlib.Path],
tensor_model_parallel_size: Optional[int] = None,
pipeline_model_parallel_size: Optional[int] = None,
context_parallel_size: Optional[int] = None,
expert_model_parallel_size: Optional[int] = None,
micro_batch_size: Optional[int] = None,
model_type: str = 'gpt',
) → Tuple[List[megatron.core.transformer.module.MegatronModule], megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer]#

Initialize a Megatron model and tokenizer for inference from a Megatron-LM/MBridge checkpoint.

This function initializes torch.distributed (NCCL), applies requested parallel sizes on top of values stored in the checkpoint, sets up the Megatron runtime for inference, builds the model, and loads the corresponding tokenizer.

Parameters:
  • checkpoint_path (Union[str, Path]) – Path to the Megatron-LM checkpoint directory or file.

  • tensor_model_parallel_size (Optional[int]) – Desired tensor-parallel world size. Defaults to the value stored in the checkpoint when not provided.

  • pipeline_model_parallel_size (Optional[int]) – Desired pipeline-parallel world size. Defaults to the checkpoint value when not provided.

  • context_parallel_size (Optional[int]) – Desired context-parallel world size. Defaults to the checkpoint value when not provided.

  • expert_model_parallel_size (Optional[int]) – Desired expert-parallel world size. Defaults to the checkpoint value when not provided.

  • micro_batch_size (Optional[int]) – Micro-batch size to use during runtime initialization.

  • model_type (str) – Model family to build (for example, “gpt”).

Returns:

  • List of instantiated Megatron modules (multiple entries when virtual pipeline parallelism is used)

  • Tokenizer instance compatible with the model

  • Additional Megatron-LM args loaded from the checkpoint (mlm_args)

Return type:

Tuple[List[MegatronModule], MegatronTokenizer, Any]
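
A hedged sketch that follows the three-item Returns description above (note the signature annotation lists only two values). The checkpoint path is a placeholder, and the call is typically launched with torchrun when more than one rank is used.

    from nemo_deploy.llm.inference.inference_base import setup_megatron_model_and_tokenizer_for_inference

    model, tokenizer, mlm_args = setup_megatron_model_and_tokenizer_for_inference(
        checkpoint_path="/path/to/megatron_checkpoint",  # placeholder path
        tensor_model_parallel_size=1,
        pipeline_model_parallel_size=1,
        micro_batch_size=1,
        model_type="gpt",
    )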

nemo_deploy.llm.inference.inference_base.setup_model_and_tokenizer_for_inference(
checkpoint_path: Union[str, pathlib.Path],
tensor_model_parallel_size: Optional[int] = None,
pipeline_model_parallel_size: Optional[int] = None,
context_parallel_size: Optional[int] = None,
expert_model_parallel_size: Optional[int] = None,
params_dtype: Optional[torch.dtype] = None,
micro_batch_size: Optional[int] = None,
enable_flash_decode: bool = False,
enable_cuda_graphs: bool = False,
legacy_ckpt: bool = False,
**model_config_kwargs,
) → Tuple[List[megatron.core.transformer.module.MegatronModule], nemo.collections.llm.inference.base.MCoreTokenizerWrappper]#

Initialize a Megatron-Core model and tokenizer for inference from a NeMo-2.0 checkpoint.

Parameters:
  • checkpoint_path (Union[str, Path]) – Path to the NeMo checkpoint directory

  • tensor_model_parallel_size (Optional[int]) – Desired tensor-parallel world size (defaults to checkpoint value)

  • pipeline_model_parallel_size (Optional[int]) – Desired pipeline-parallel world size (defaults to checkpoint value)

  • context_parallel_size (Optional[int]) – Desired context-parallel world size (defaults to checkpoint value)

  • expert_model_parallel_size (Optional[int]) – Desired expert parallel world size (defaults to checkpoint value)

  • params_dtype (Optional[torch.dtype]) – Data type for model parameters (defaults to checkpoint dtype)

  • micro_batch_size (Optional[int]) – Micro batch size for model execution (defaults to 1)

  • enable_flash_decode (bool) – Whether to enable flash attention decoding

  • enable_cuda_graphs (bool) – Whether to enable CUDA graphs optimization

  • legacy_ckpt (bool) – Whether to use legacy checkpoint format

Returns:

Tuple containing:

  • List of instantiated Megatron-Core modules

  • Tokenizer wrapper with encode/decode interface

Return type:

Tuple[List[MegatronModule], MCoreTokenizerWrappper]

Raises:

ValueError – If checkpoint_path is not a valid NeMo-2.0 checkpoint
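
A hedged single-rank sketch; the checkpoint path is a placeholder, and the encode call assumes the tokenizer wrapper exposes an encode method as described in the Returns section above.

    from nemo_deploy.llm.inference.inference_base import setup_model_and_tokenizer_for_inference

    model, tokenizer = setup_model_and_tokenizer_for_inference(
        checkpoint_path="/path/to/nemo2_checkpoint",  # placeholder path
        tensor_model_parallel_size=1,
        pipeline_model_parallel_size=1,
        micro_batch_size=1,
        enable_flash_decode=False,
    )
    prompt_ids = tokenizer.encode("Hello")  # assumption: wrapper exposes encode/decode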

class nemo_deploy.llm.inference.inference_base.MCoreEngineWithCleanup(
mcore_engine: megatron.core.inference.engines.mcore_engine.MCoreEngine,
model_inference_wrapper: megatron.core.inference.model_inference_wrappers.gpt.gpt_inference_wrapper.GPTInferenceWrapper,
tokenizer: Union[nemo.collections.llm.inference.base.MCoreTokenizerWrappper, megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer],
)#

Wrapper around MCoreEngine that ensures proper cleanup of distributed resources.

This class delegates all operations to the underlying MCoreEngine while ensuring that distributed resources are properly cleaned up when the engine is destroyed.

Initialization

Initialize the MCoreEngineWithCleanup.

Parameters:
  • mcore_engine (MCoreEngine) – The underlying MCoreEngine instance

  • model_inference_wrapper (GPTInferenceWrapper) – The model inference wrapper

  • tokenizer (Union[MCoreTokenizerWrappper, MegatronTokenizer]) – The tokenizer instance

__del__()#
__getattr__(name)#
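
A brief sketch of the delegation behavior: attribute lookups not defined on the wrapper fall through to the underlying MCoreEngine via __getattr__, and cleanup_distributed() runs when the wrapper is destroyed. The engine variable is assumed to come from create_mcore_engine below, and the generate call is an assumption about the underlying MCoreEngine interface rather than something this page guarantees.

    # engine is an MCoreEngineWithCleanup returned by create_mcore_engine (below).
    # The wrapper forwards unknown attributes to the wrapped MCoreEngine, so it
    # can be used as a drop-in replacement for the engine itself.
    results = engine.generate(prompts=["Hello"])  # assumption: MCoreEngine exposes generate()

    # When the wrapper is garbage-collected, __del__ tears down the
    # distributed process group via cleanup_distributed().
    del engine
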
nemo_deploy.llm.inference.inference_base.create_mcore_engine(
path: pathlib.Path,
num_devices: Optional[int] = None,
num_nodes: Optional[int] = None,
params_dtype: torch.dtype = torch.bfloat16,
inference_batch_times_seqlen_threshold: int = 32768,
inference_max_seq_length: int = 4096,
max_batch_size: int = 8,
random_seed: Optional[int] = None,
tensor_model_parallel_size: Optional[int] = None,
pipeline_model_parallel_size: Optional[int] = None,
context_parallel_size: Optional[int] = None,
expert_model_parallel_size: Optional[int] = None,
enable_flash_decode: bool = False,
enable_cuda_graphs: bool = False,
legacy_ckpt: bool = False,
model_type: str = 'gpt',
model_format: str = 'nemo',
micro_batch_size: Optional[int] = None,
**model_config_kwargs,
) → Tuple[nemo_deploy.llm.inference.inference_base.MCoreEngineWithCleanup, megatron.core.inference.model_inference_wrappers.gpt.gpt_inference_wrapper.GPTInferenceWrapper, Union[nemo.collections.llm.inference.base.MCoreTokenizerWrappper, megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer]]#

Set up the model, tokenizer and MCoreEngine for inference.

Parameters:
  • path (Path) – Path to the checkpoint file

  • num_devices (Optional[int]) – Number of devices to use for inference

  • num_nodes (Optional[int]) – Number of nodes to use for inference

  • params_dtype (torch.dtype) – Data type for model parameters (default: torch.bfloat16)

  • inference_batch_times_seqlen_threshold (int) – Threshold for batch size times sequence length

  • inference_max_seq_length (int) – Maximum sequence length for inference

  • max_batch_size (int) – Maximum batch size for inference

  • random_seed (Optional[int]) – Random seed for reproducibility

  • tensor_model_parallel_size (Optional[int]) – Size of tensor model parallelism

  • pipeline_model_parallel_size (Optional[int]) – Size of pipeline model parallelism

  • context_parallel_size (Optional[int]) – Size of context parallelism

  • expert_model_parallel_size (Optional[int]) – Size of expert model parallelism

  • enable_flash_decode (bool) – Whether to enable flash attention decoding

  • enable_cuda_graphs (bool) – Whether to enable CUDA graphs optimization

  • legacy_ckpt (bool) – Whether to use legacy checkpoint format

  • model_type (str) – Type of model to load (default: “gpt”)

  • model_format (str) – Format of model to load (default: “nemo”)

  • micro_batch_size (Optional[int]) – Micro batch size for model execution

Returns:

Tuple containing:

  • MCoreEngineWithCleanup – Engine for text generation with proper cleanup

  • GPTInferenceWrapper – Inference-wrapped model

  • Union[MCoreTokenizerWrappper, MegatronTokenizer] – Tokenizer instance

Return type:

Tuple[MCoreEngineWithCleanup, GPTInferenceWrapper, Union[MCoreTokenizerWrappper, MegatronTokenizer]]
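
A hedged end-to-end sketch for a single-GPU NeMo-2.0 checkpoint. The path is a placeholder, and the generate call is forwarded to the underlying MCoreEngine; treat its exact signature as an assumption to verify against the installed Megatron-Core version.

    from pathlib import Path
    import torch
    from nemo_deploy.llm.inference.inference_base import create_mcore_engine

    engine, inference_wrapped_model, tokenizer = create_mcore_engine(
        path=Path("/path/to/nemo2_checkpoint"),  # placeholder path
        params_dtype=torch.bfloat16,
        inference_max_seq_length=4096,
        max_batch_size=8,
        tensor_model_parallel_size=1,
        model_format="nemo",
    )

    outputs = engine.generate(prompts=["Write a haiku about GPUs."])  # assumption: generate() is available on MCoreEngine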