nemo_deploy.llm.inference.inference_base#

Module Contents#

Classes#

MCoreEngineWithCleanup

Wrapper around MCoreEngine that ensures proper cleanup of distributed resources.

Functions#

_load_dist_shards_into_model

Load a NeMo-2 distributed checkpoint (torch_dist .distcp shards) into an already-constructed Megatron model list.

cleanup_distributed

Clean up the distributed environment by destroying the process group.

initialize_megatron_for_inference

Initialize the Megatron-Tron runtime components required for inference.

peel

Recursively unwrap a wrapped torch.nn.Module and return the underlying module.

load_nemo_checkpoint_to_tron_model

Load NeMo checkpoint weights into a Tron model.

setup_megatron_model_and_tokenizer_for_inference

Initialize a Megatron model and tokenizer for inference from a Megatron-LM/MBridge checkpoint.

setup_model_and_tokenizer_for_inference

Initialize a Megatron-Core model and tokenizer for inference from a NeMo-2.0 checkpoint.

create_mcore_engine

Set up the model, tokenizer and MCoreEngine for inference.

Data#

API#

nemo_deploy.llm.inference.inference_base.logger = 'getLogger(...)'#
nemo_deploy.llm.inference.inference_base.LOGGER = 'getLogger(...)'#
nemo_deploy.llm.inference.inference_base._load_dist_shards_into_model(
model: List[megatron.core.transformer.module.MegatronModule],
weights_dir: pathlib.Path,
legacy_ckpt: bool = False,
) → None#

Load a NeMo-2 distributed checkpoint (torch_dist .distcp shards) into an already-constructed Megatron model list.

Parameters:
  • model (List[MegatronModule]) – The list of Megatron model modules

  • weights_dir (Path) – Path to the weights directory containing shards

  • legacy_ckpt (bool) – Whether to use legacy checkpoint format

nemo_deploy.llm.inference.inference_base.cleanup_distributed() → None#

Clean up the distributed environment by destroying the process group.

This prevents resource leaks and warnings about destroy_process_group() not being called.
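
A minimal usage sketch (the body of the try block is a placeholder, not part of this module): pairing inference work with cleanup_distributed() in a finally block guarantees the process group is destroyed even if generation fails.

    from nemo_deploy.llm.inference.inference_base import cleanup_distributed

    try:
        # ... set up the model and run generation here ...
        pass
    finally:
        # Always destroy the torch.distributed process group so no resources
        # leak and no destroy_process_group() warning is emitted.
        cleanup_distributed()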

nemo_deploy.llm.inference.inference_base.initialize_megatron_for_inference(
model_config,
dist_config: nemo_deploy.llm.inference.tron_utils.DistributedInitConfig,
rng_config: nemo_deploy.llm.inference.tron_utils.RNGConfig,
micro_batch_size: int,
) → None#

Initialize the Megatron-Tron runtime components required for inference.

Parameters:
  • model_config – The model configuration object that specifies tensor/pipeline parallel sizes and model architecture details

  • dist_config (DistributedInitConfig) – Distributed launcher configuration that controls torch.distributed process groups

  • rng_config (RNGConfig) – Configuration for random number generation behavior and seed

  • micro_batch_size (int) – The micro batch size used during model execution
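
A hedged sketch of the call. It assumes DistributedInitConfig and RNGConfig (from nemo_deploy.llm.inference.tron_utils) can be constructed with their default values and that model_config has already been loaded from a checkpoint; neither assumption is guaranteed by this page.

    from nemo_deploy.llm.inference.tron_utils import DistributedInitConfig, RNGConfig
    from nemo_deploy.llm.inference.inference_base import initialize_megatron_for_inference

    dist_config = DistributedInitConfig()  # assumption: default construction is valid
    rng_config = RNGConfig()               # assumption: default construction is valid

    initialize_megatron_for_inference(
        model_config=model_config,  # assumption: model config obtained from a checkpoint
        dist_config=dist_config,
        rng_config=rng_config,
        micro_batch_size=1,
    )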

nemo_deploy.llm.inference.inference_base.peel(m: torch.nn.Module) → torch.nn.Module#

Recursively unwrap a wrapped torch.nn.Module and return the underlying module.

Parameters:

m (torch.nn.Module) – The (possibly wrapped) PyTorch module

Returns:

The innermost unwrapped module

Return type:

torch.nn.Module
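
A short usage sketch, assuming model is the module list returned by one of the setup functions documented below:

    from nemo_deploy.llm.inference.inference_base import peel

    # model[0] may be wrapped (for example by a DDP or float16 wrapper);
    # peel returns the innermost torch.nn.Module for direct inspection.
    base_module = peel(model[0])
    print(type(base_module).__name__)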

nemo_deploy.llm.inference.inference_base.load_nemo_checkpoint_to_tron_model(
model: List[megatron.core.transformer.module.MegatronModule],
path: pathlib.Path,
legacy_ckpt: bool = False,
) → None#

Load NeMo checkpoint weights into a Tron model.

Parameters:
  • model (List[MegatronModule]) – Tron model modules list (from get_model_from_config)

  • path (Path) – Path to NeMo checkpoint directory

  • legacy_ckpt (bool) – Whether to use legacy checkpoint format
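
A hedged sketch, assuming model is a module list already built with parallel settings matching the checkpoint (for example via get_model_from_config, as noted above); the path is a placeholder.

    from pathlib import Path
    from nemo_deploy.llm.inference.inference_base import load_nemo_checkpoint_to_tron_model

    checkpoint_dir = Path("/path/to/nemo2_checkpoint")  # placeholder path
    load_nemo_checkpoint_to_tron_model(
        model=model,  # assumption: pre-built list of MegatronModule instances
        path=checkpoint_dir,
        legacy_ckpt=False,
    )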

nemo_deploy.llm.inference.inference_base.setup_megatron_model_and_tokenizer_for_inference(
checkpoint_path: Union[str, pathlib.Path],
tensor_model_parallel_size: Optional[int] = None,
pipeline_model_parallel_size: Optional[int] = None,
context_parallel_size: Optional[int] = None,
expert_model_parallel_size: Optional[int] = None,
micro_batch_size: Optional[int] = None,
model_type: str = 'gpt',
) → Tuple[List[megatron.core.transformer.module.MegatronModule], megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer]#

Initialize a Megatron model and tokenizer for inference from a Megatron-LM/MBridge checkpoint.

This function initializes torch.distributed (NCCL), applies requested parallel sizes on top of values stored in the checkpoint, sets up the Megatron runtime for inference, builds the model, and loads the corresponding tokenizer.

Parameters:
  • checkpoint_path (Union[str, Path]) – Path to the Megatron-LM checkpoint directory or file.

  • tensor_model_parallel_size (Optional[int]) – Desired tensor-parallel world size. Defaults to the value stored in the checkpoint when not provided.

  • pipeline_model_parallel_size (Optional[int]) – Desired pipeline-parallel world size. Defaults to the checkpoint value when not provided.

  • context_parallel_size (Optional[int]) – Desired context-parallel world size. Defaults to the checkpoint value when not provided.

  • expert_model_parallel_size (Optional[int]) – Desired expert-parallel world size. Defaults to the checkpoint value when not provided.

  • micro_batch_size (Optional[int]) – Micro-batch size to use during runtime initialization.

  • model_type (str) – Model family to build (for example, “gpt”).

Returns:

  • List of instantiated Megatron modules (multiple entries when virtual pipeline parallelism is used)

  • Tokenizer instance compatible with the model

  • Additional Megatron-LM args loaded from the checkpoint (mlm_args)

Return type:

Tuple[List[MegatronModule], MegatronTokenizer, Any]
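
A hedged sketch that follows the three-item Returns description above (note the signature annotation lists only two values). The checkpoint path is a placeholder, and the call is typically launched with torchrun when more than one rank is used.

    from nemo_deploy.llm.inference.inference_base import setup_megatron_model_and_tokenizer_for_inference

    model, tokenizer, mlm_args = setup_megatron_model_and_tokenizer_for_inference(
        checkpoint_path="/path/to/megatron_checkpoint",  # placeholder path
        tensor_model_parallel_size=1,
        pipeline_model_parallel_size=1,
        micro_batch_size=1,
        model_type="gpt",
    )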

nemo_deploy.llm.inference.inference_base.setup_model_and_tokenizer_for_inference(
checkpoint_path: Union[str, pathlib.Path],
tensor_model_parallel_size: Optional[int] = None,
pipeline_model_parallel_size: Optional[int] = None,
context_parallel_size: Optional[int] = None,
expert_model_parallel_size: Optional[int] = None,
params_dtype: Optional[torch.dtype] = None,
micro_batch_size: Optional[int] = None,
enable_flash_decode: bool = False,
enable_cuda_graphs: bool = False,
legacy_ckpt: bool = False,
**model_config_kwargs,
) → Tuple[List[megatron.core.transformer.module.MegatronModule], nemo.collections.llm.inference.base.MCoreTokenizerWrappper]#

Initialize a Megatron-Core model and tokenizer for inference from a NeMo-2.0 checkpoint.

Parameters:
  • checkpoint_path (Union[str, Path]) – Path to the NeMo checkpoint directory

  • tensor_model_parallel_size (Optional[int]) – Desired tensor-parallel world size (defaults to checkpoint value)

  • pipeline_model_parallel_size (Optional[int]) – Desired pipeline-parallel world size (defaults to checkpoint value)

  • context_parallel_size (Optional[int]) – Desired context-parallel world size (defaults to checkpoint value)

  • expert_model_parallel_size (Optional[int]) – Desired expert parallel world size (defaults to checkpoint value)

  • params_dtype (Optional[torch.dtype]) – Data type for model parameters (defaults to checkpoint dtype)

  • micro_batch_size (Optional[int]) – Micro batch size for model execution (defaults to 1)

  • enable_flash_decode (bool) – Whether to enable flash attention decoding

  • enable_cuda_graphs (bool) – Whether to enable CUDA graphs optimization

  • legacy_ckpt (bool) – Whether to use legacy checkpoint format

Returns:

Tuple containing:

  • List of instantiated Megatron-Core modules

  • Tokenizer wrapper with encode/decode interface

Return type:

Tuple[List[MegatronModule], MCoreTokenizerWrappper]

Raises:

ValueError – If checkpoint_path is not a valid NeMo-2.0 checkpoint
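
A hedged single-rank sketch; the checkpoint path is a placeholder, and the encode call assumes the tokenizer wrapper exposes an encode method as described in the Returns section above.

    from nemo_deploy.llm.inference.inference_base import setup_model_and_tokenizer_for_inference

    model, tokenizer = setup_model_and_tokenizer_for_inference(
        checkpoint_path="/path/to/nemo2_checkpoint",  # placeholder path
        tensor_model_parallel_size=1,
        pipeline_model_parallel_size=1,
        micro_batch_size=1,
        enable_flash_decode=False,
    )
    prompt_ids = tokenizer.encode("Hello")  # assumption: wrapper exposes encode/decode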

class nemo_deploy.llm.inference.inference_base.MCoreEngineWithCleanup(
mcore_engine: megatron.core.inference.engines.mcore_engine.MCoreEngine,
model_inference_wrapper: megatron.core.inference.model_inference_wrappers.gpt.gpt_inference_wrapper.GPTInferenceWrapper,
tokenizer: Union[nemo.collections.llm.inference.base.MCoreTokenizerWrappper, megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer],
)#

Wrapper around MCoreEngine that ensures proper cleanup of distributed resources.

This class delegates all operations to the underlying MCoreEngine while ensuring that distributed resources are properly cleaned up when the engine is destroyed.

Initialization

Initialize the MCoreEngineWithCleanup.

Parameters:
  • mcore_engine (MCoreEngine) – The underlying MCoreEngine instance

  • model_inference_wrapper (GPTInferenceWrapper) – The model inference wrapper

  • tokenizer (Union[MCoreTokenizerWrappper, MegatronTokenizer]) – The tokenizer instance

__del__()#
__getattr__(name)#
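
A brief sketch of the delegation behavior: attribute lookups not defined on the wrapper fall through to the underlying MCoreEngine via __getattr__, and cleanup_distributed() runs when the wrapper is destroyed. The engine variable is assumed to come from create_mcore_engine below, and the generate call is an assumption about the underlying MCoreEngine interface rather than something this page guarantees.

    # engine is an MCoreEngineWithCleanup returned by create_mcore_engine (below).
    # The wrapper forwards unknown attributes to the wrapped MCoreEngine, so it
    # can be used as a drop-in replacement for the engine itself.
    results = engine.generate(prompts=["Hello"])  # assumption: MCoreEngine exposes generate()

    # When the wrapper is garbage-collected, __del__ tears down the
    # distributed process group via cleanup_distributed().
    del engine
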
nemo_deploy.llm.inference.inference_base.create_mcore_engine(
path: pathlib.Path,
num_devices: Optional[int] = None,
num_nodes: Optional[int] = None,
params_dtype: torch.dtype = torch.bfloat16,
inference_batch_times_seqlen_threshold: int = 32768,
inference_max_seq_length: int = 4096,
max_batch_size: int = 8,
random_seed: Optional[int] = None,
tensor_model_parallel_size: Optional[int] = None,
pipeline_model_parallel_size: Optional[int] = None,
context_parallel_size: Optional[int] = None,
expert_model_parallel_size: Optional[int] = None,
enable_flash_decode: bool = False,
enable_cuda_graphs: bool = False,
legacy_ckpt: bool = False,
model_type: str = 'gpt',
model_format: str = 'nemo',
micro_batch_size: Optional[int] = None,
**model_config_kwargs,
) → Tuple[nemo_deploy.llm.inference.inference_base.MCoreEngineWithCleanup, megatron.core.inference.model_inference_wrappers.gpt.gpt_inference_wrapper.GPTInferenceWrapper, Union[nemo.collections.llm.inference.base.MCoreTokenizerWrappper, megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer]]#

Set up the model, tokenizer and MCoreEngine for inference.

Parameters:
  • path (Path) – Path to the checkpoint file

  • num_devices (Optional[int]) – Number of devices to use for inference

  • num_nodes (Optional[int]) – Number of nodes to use for inference

  • params_dtype (torch.dtype) – Data type for model parameters (default: torch.bfloat16)

  • inference_batch_times_seqlen_threshold (int) – Threshold for batch size times sequence length

  • inference_max_seq_length (int) – Maximum sequence length for inference

  • max_batch_size (int) – Maximum batch size for inference

  • random_seed (Optional[int]) – Random seed for reproducibility

  • tensor_model_parallel_size (Optional[int]) – Size of tensor model parallelism

  • pipeline_model_parallel_size (Optional[int]) – Size of pipeline model parallelism

  • context_parallel_size (Optional[int]) – Size of context parallelism

  • expert_model_parallel_size (Optional[int]) – Size of expert model parallelism

  • enable_flash_decode (bool) – Whether to enable flash attention decoding

  • enable_cuda_graphs (bool) – Whether to enable CUDA graphs optimization

  • legacy_ckpt (bool) – Whether to use legacy checkpoint format

  • model_type (str) – Type of model to load (default: “gpt”)

  • model_format (str) – Format of model to load (default: “nemo”)

  • micro_batch_size (Optional[int]) – Micro batch size for model execution

Returns:

Tuple containing:

  • MCoreEngineWithCleanup – Engine for text generation with proper cleanup

  • GPTInferenceWrapper – Inference-wrapped model

  • Union[MCoreTokenizerWrappper, MegatronTokenizer] – Tokenizer instance

Return type:

Tuple[MCoreEngineWithCleanup, GPTInferenceWrapper, Union[MCoreTokenizerWrappper, MegatronTokenizer]]
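
A hedged end-to-end sketch for a single-GPU NeMo-2.0 checkpoint. The path is a placeholder, and the generate call is forwarded to the underlying MCoreEngine; treat its exact signature as an assumption to verify against the installed Megatron-Core version.

    from pathlib import Path
    import torch
    from nemo_deploy.llm.inference.inference_base import create_mcore_engine

    engine, inference_wrapped_model, tokenizer = create_mcore_engine(
        path=Path("/path/to/nemo2_checkpoint"),  # placeholder path
        params_dtype=torch.bfloat16,
        inference_max_seq_length=4096,
        max_batch_size=8,
        tensor_model_parallel_size=1,
        model_format="nemo",
    )

    outputs = engine.generate(prompts=["Write a haiku about GPUs."])  # assumption: generate() is available on MCoreEngine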