nemo_deploy.nlp.megatronllm_deployable
Module Contents
Classes
| MegatronLLMDeploy | A factory class for creating deployable instances of Megatron LLM models. |
| MegatronLLMDeployableNemo2 | Triton inference server compatible deploy class for a .nemo model file. |
Functions
| Serializes dict to str. |
Data
API
- nemo_deploy.nlp.megatronllm_deployable.LOGGER = 'getLogger(...)'
- class nemo_deploy.nlp.megatronllm_deployable.MegatronLLMDeploy
- A factory class for creating deployable instances of Megatron LLM models.
- This class provides a method to get the appropriate deployable instance based on the version of the NeMo checkpoint model used.
- static get_deployable(
- nemo_checkpoint_filepath: str,
- num_devices: int = None,
- num_nodes: int = None,
- tensor_model_parallel_size: int = 1,
- pipeline_model_parallel_size: int = 1,
- expert_model_parallel_size: int = 1,
- context_parallel_size: int = 1,
- max_batch_size: int = 32,
- random_seed: Optional[int] = None,
- enable_flash_decode: bool = False,
- enable_cuda_graphs: bool = False,
- legacy_ckpt: bool = False,
- )
- Returns the appropriate deployable instance for the given NeMo checkpoint.
- Parameters:
- nemo_checkpoint_filepath (str) – Path to the .nemo checkpoint file.
- num_devices (int) – Number of devices to use for deployment.
- num_nodes (int) – Number of nodes to use for deployment.
- tensor_model_parallel_size (int) – Size of the tensor model parallelism. Defaults to 1.
- pipeline_model_parallel_size (int) – Size of the pipeline model parallelism. Defaults to 1.
- expert_model_parallel_size (int) – Size of the expert model parallelism. Defaults to 1.
- context_parallel_size (int) – Size of the context parallelism. Defaults to 1.
- max_batch_size (int) – Maximum batch size for inference. Defaults to 32.
- random_seed (Optional[int]) – Random seed for inference. Defaults to None.
- enable_flash_decode (bool) – Whether to enable flash decode for inference. Defaults to False.
- enable_cuda_graphs (bool) – Whether to enable CUDA graphs for inference. Defaults to False.
- legacy_ckpt (bool) – Whether to use legacy checkpoint format. Defaults to False.
 
- Returns:
- An instance of a deployable class compatible with Triton inference server. 
- Return type:
 
 
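The factory call documented above can be sketched as follows. This is a minimal illustration, not part of the library: `build_deployable_kwargs` and `load_deployable` are hypothetical helpers, the checkpoint path is a placeholder, and the defaults are copied from the `get_deployable` signature shown on this page. The import of `MegatronLLMDeploy` is deferred to call time because it presumably needs a working NeMo installation and GPUs.

```python
from typing import Any, Dict, Optional


def build_deployable_kwargs(
    nemo_checkpoint_filepath: str,
    num_devices: Optional[int] = None,
    num_nodes: Optional[int] = None,
    **overrides: Any,
) -> Dict[str, Any]:
    """Collect keyword arguments for MegatronLLMDeploy.get_deployable.

    Hypothetical helper: the defaults mirror the documented signature.
    """
    kwargs: Dict[str, Any] = {
        "nemo_checkpoint_filepath": nemo_checkpoint_filepath,
        "num_devices": num_devices,
        "num_nodes": num_nodes,
        "tensor_model_parallel_size": 1,
        "pipeline_model_parallel_size": 1,
        "expert_model_parallel_size": 1,
        "context_parallel_size": 1,
        "max_batch_size": 32,
        "random_seed": None,
        "enable_flash_decode": False,
        "enable_cuda_graphs": False,
        "legacy_ckpt": False,
    }
    # Reject names that are not in the documented signature.
    unknown = set(overrides) - set(kwargs)
    if unknown:
        raise TypeError(f"unexpected arguments: {sorted(unknown)}")
    kwargs.update(overrides)
    return kwargs


def load_deployable(kwargs: Dict[str, Any]):
    """Call the factory; the import is lazy so this file loads without NeMo."""
    from nemo_deploy.nlp.megatronllm_deployable import MegatronLLMDeploy

    return MegatronLLMDeploy.get_deployable(**kwargs)


kwargs = build_deployable_kwargs(
    "/path/to/model.nemo",  # placeholder path, replace with a real checkpoint
    num_devices=2,
    num_nodes=1,
    tensor_model_parallel_size=2,
)
# deployable = load_deployable(kwargs)  # requires NeMo, GPUs, and a checkpoint
```

Validating the argument names before the call keeps typos from surfacing only after the slow model-loading step has started.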
- class nemo_deploy.nlp.megatronllm_deployable.MegatronLLMDeployableNemo2(
- num_devices: int = None,
- num_nodes: int = None,
- nemo_checkpoint_filepath: str = None,
- tensor_model_parallel_size: int = 1,
- pipeline_model_parallel_size: int = 1,
- context_parallel_size: int = 1,
- expert_model_parallel_size: int = 1,
- params_dtype: torch.dtype = torch.bfloat16,
- inference_batch_times_seqlen_threshold: int = 32768,
- inference_max_seq_length: int = 4096,
- enable_flash_decode: bool = False,
- enable_cuda_graphs: bool = False,
- max_batch_size: int = 8,
- random_seed: Optional[int] = None,
- legacy_ckpt: bool = False,
- )
- Bases: nemo_deploy.ITritonDeployable
- Triton inference server compatible deploy class for a .nemo model file.
- Parameters:
- nemo_checkpoint_filepath (str) – Path to the .nemo checkpoint.
- num_devices (int) – Number of GPUs.
- num_nodes (int) – Number of nodes.
- tensor_model_parallel_size (int) – Tensor parallelism size. Defaults to 1.
- pipeline_model_parallel_size (int) – Pipeline parallelism size. Defaults to 1.
- context_parallel_size (int) – Context parallelism size. Defaults to 1.
- expert_model_parallel_size (int) – Expert parallelism size. Defaults to 1.
- params_dtype (torch.dtype) – Data type of the model parameters. Defaults to torch.bfloat16.
- inference_batch_times_seqlen_threshold (int) – Threshold on batch size times sequence length used during inference. Defaults to 32768.
- inference_max_seq_length (int) – Maximum sequence length for inference. Required by MCoreEngine (>=0.12). Defaults to 4096.
- max_batch_size (int) – Maximum batch size for inference. Defaults to 8.
- random_seed (Optional[int]) – Random seed for inference. Defaults to None.
- enable_flash_decode (bool) – Whether to enable flash decode for inference. Defaults to False.
- enable_cuda_graphs (bool) – Whether to enable CUDA graphs for inference. Defaults to False.
- legacy_ckpt (bool) – Whether to use legacy checkpoint format. Defaults to False.
 
- Initialization
 - generate(
- prompts: List[str],
- inference_params: Optional[megatron.core.inference.common_inference_params.CommonInferenceParams] = None,
- )
- Generates text based on the provided input prompts.
- Parameters:
- prompts (List[str]) – A list of input strings.
- inference_params (Optional[CommonInferenceParams]) – Parameters for controlling the inference process.
 
- Returns:
- A list containing the generated results. 
- Return type:
- List[InferenceRequest] 
 
 - apply_chat_template(messages, add_generation_prompt=True)
- Load the chat template.
- Works when the model's tokenizer has a chat template (typically chat models).
 - remove_eos_token(text)
- Removes the EOS token if it exists in the output, otherwise does nothing.
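The behavior described for `remove_eos_token` can be illustrated with a standalone sketch. This is not the library's implementation: `strip_eos` is a hypothetical function, the `"</s>"` marker is an assumed EOS string (the real one comes from the model's tokenizer), and the choice to operate on a list of texts is also an assumption.

```python
from typing import List


def strip_eos(texts: List[str], eos_token: str = "</s>") -> List[str]:
    """Cut each text at the last occurrence of eos_token, if present.

    Hypothetical illustration; "</s>" is an assumed end-of-sequence marker.
    """
    cleaned = []
    for text in texts:
        if eos_token in text:
            # Keep everything before the last EOS marker.
            cleaned.append(text.rsplit(eos_token, 1)[0])
        else:
            # No marker present: leave the text untouched.
            cleaned.append(text)
    return cleaned


outputs = strip_eos(["Hello world</s>", "No marker here"])
```

Texts without the marker pass through unchanged, which matches the "otherwise does nothing" wording above.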
 - property get_triton_input
 - property get_triton_output
 - _infer_fn(
- prompts,
- temperature=0.0,
- top_k=0.0,
- top_p=0.0,
- num_tokens_to_generate=256,
- log_probs=False,
- apply_chat_template=False,
- text_only=True,
- top_logprobs=0,
- echo=False,
- )
- Private helper function that handles the core inference logic shared between Triton and Ray inference.
- Parameters:
- prompts (List[str]) – List of input prompts
- max_batch_size (int) – Maximum batch size for inference
- random_seed (int) – Random seed for reproducibility
- temperature (float) – Sampling temperature
- top_k (int) – Top-k sampling parameter
- top_p (float) – Top-p sampling parameter
- num_tokens_to_generate (int) – Maximum number of tokens to generate
- log_probs (bool) – Whether to compute log probabilities
- apply_chat_template (bool) – Whether to apply chat template
- text_only (bool) – Whether to return only text or full results
 
- Returns:
- sentences and required log probs. 
- Return type:
- dict 
 
 - ray_infer_fn(inputs: dict)
- Ray-compatible inference function that takes a dictionary of inputs and returns a dictionary of outputs.
- Parameters:
- inputs (dict) – Dictionary containing the following optional keys:
- prompts (List[str]): List of input prompts
- max_batch_size (int): Maximum batch size for inference (default: 32) 
- random_seed (int): Random seed for reproducibility (default: None) 
- temperature (float): Sampling temperature (default: 1.0) 
- top_k (int): Top-k sampling parameter (default: 1) 
- top_p (float): Top-p sampling parameter (default: 0.0) 
- max_length (int): Maximum number of tokens to generate (default: 256) 
- logprobs (int): Whether to compute log probabilities (default: 0) 
- apply_chat_template (bool): Whether to apply chat template (default: False) 
 
- Returns:
- Dictionary containing:
- sentences (List[str]): List of generated texts
- log_probs (List[float], optional): List of log probabilities if compute_logprob is True
- Return type:
- dict
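The input dictionary for `ray_infer_fn` can be assembled as in the sketch below. `build_ray_inputs` and `RAY_INPUT_DEFAULTS` are hypothetical names introduced for illustration; the keys and default values are taken from the list above. Because every key is documented as optional, spelling out the defaults is a readability choice rather than a requirement.

```python
from typing import Any, Dict, List

# Defaults as listed in the ray_infer_fn documentation above.
RAY_INPUT_DEFAULTS: Dict[str, Any] = {
    "max_batch_size": 32,
    "random_seed": None,
    "temperature": 1.0,
    "top_k": 1,
    "top_p": 0.0,
    "max_length": 256,
    "logprobs": 0,
    "apply_chat_template": False,
}


def build_ray_inputs(prompts: List[str], **overrides: Any) -> Dict[str, Any]:
    """Build an inputs dict for ray_infer_fn (hypothetical helper)."""
    unknown = set(overrides) - set(RAY_INPUT_DEFAULTS)
    if unknown:
        raise KeyError(f"undocumented keys: {sorted(unknown)}")
    inputs: Dict[str, Any] = {"prompts": list(prompts), **RAY_INPUT_DEFAULTS}
    inputs.update(overrides)
    return inputs


inputs = build_ray_inputs(["What is Triton?"], temperature=0.7, max_length=64)

# With a constructed deployable (requires NeMo and GPUs), the call would be:
# outputs = deployable.ray_infer_fn(inputs)
# print(outputs["sentences"])
```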