nemo_deploy.nlp.megatronllm_deployable#

Module Contents#

Classes#

MegatronLLMDeploy

A factory class for creating deployable instances of Megatron LLM models.

MegatronLLMDeployableNemo2

Triton inference server compatible deploy class for a .nemo model file.

Functions#

dict_to_str

Serializes dict to str.

Data#

API#

nemo_deploy.nlp.megatronllm_deployable.LOGGER = 'getLogger(...)'#
class nemo_deploy.nlp.megatronllm_deployable.MegatronLLMDeploy#

A factory class for creating deployable instances of Megatron LLM models.

This class provides a method to get the appropriate deployable instance based on the version of the NeMo checkpoint model used.

static get_deployable(
nemo_checkpoint_filepath: str,
num_devices: int = None,
num_nodes: int = None,
tensor_model_parallel_size: int = 1,
pipeline_model_parallel_size: int = 1,
expert_model_parallel_size: int = 1,
context_parallel_size: int = 1,
max_batch_size: int = 32,
random_seed: Optional[int] = None,
enable_flash_decode: bool = False,
enable_cuda_graphs: bool = False,
legacy_ckpt: bool = False,
)#

Returns the appropriate deployable instance for the given NeMo checkpoint.

Parameters:
  • nemo_checkpoint_filepath (str) – Path to the .nemo checkpoint file.

  • num_devices (int) – Number of devices to use for deployment.

  • num_nodes (int) – Number of nodes to use for deployment.

  • tensor_model_parallel_size (int) – Size of the tensor model parallelism.

  • pipeline_model_parallel_size (int) – Size of the pipeline model parallelism.

  • expert_model_parallel_size (int) – Size of the expert model parallelism.

  • context_parallel_size (int) – Size of the context parallelism.

  • max_batch_size (int) – Maximum batch size for inference. Defaults to 32.

  • random_seed (Optional[int]) – Random seed for inference. Defaults to None.

  • enable_flash_decode (bool) – Whether to enable flash decode for inference.

  • enable_cuda_graphs (bool) – Whether to enable CUDA graphs for inference.

  • legacy_ckpt (bool) – Whether to use legacy checkpoint format. Defaults to False.

Returns:

An instance of a deployable class compatible with Triton inference server.

Return type:

ITritonDeployable
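
A minimal usage sketch of the factory (the checkpoint path and parallelism values below are placeholders; adjust them to your checkpoint and hardware):

```python
from nemo_deploy.nlp.megatronllm_deployable import MegatronLLMDeploy

# Placeholder checkpoint path; parallel sizes must match the available GPUs.
model = MegatronLLMDeploy.get_deployable(
    nemo_checkpoint_filepath="/models/my_model.nemo",
    num_devices=2,
    num_nodes=1,
    tensor_model_parallel_size=2,
    pipeline_model_parallel_size=1,
    max_batch_size=32,
)
```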

nemo_deploy.nlp.megatronllm_deployable.dict_to_str(messages)#

Serializes dict to str.
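
For example, a chat-style message dictionary can be serialized before sending it as a single string field (the payload below is illustrative):

```python
from nemo_deploy.nlp.megatronllm_deployable import dict_to_str

# Illustrative payload: an OpenAI-style message list wrapped in a dict.
serialized = dict_to_str({"messages": [{"role": "user", "content": "Hello"}]})
```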

class nemo_deploy.nlp.megatronllm_deployable.MegatronLLMDeployableNemo2(
num_devices: int = None,
num_nodes: int = None,
nemo_checkpoint_filepath: str = None,
tensor_model_parallel_size: int = 1,
pipeline_model_parallel_size: int = 1,
context_parallel_size: int = 1,
expert_model_parallel_size: int = 1,
params_dtype: torch.dtype = torch.bfloat16,
inference_batch_times_seqlen_threshold: int = 32768,
inference_max_seq_length: int = 4096,
enable_flash_decode: bool = False,
enable_cuda_graphs: bool = False,
max_batch_size: int = 8,
random_seed: Optional[int] = None,
legacy_ckpt: bool = False,
megatron_checkpoint_filepath: str = None,
model_type: str = 'gpt',
model_format: str = 'nemo',
micro_batch_size: Optional[int] = None,
**model_config_kwargs,
)#

Bases: nemo_deploy.ITritonDeployable

Triton inference server compatible deploy class for a .nemo model file.

Parameters:
  • nemo_checkpoint_filepath (str) – path for the nemo checkpoint.

  • num_devices (int) – number of GPUs.

  • num_nodes (int) – number of nodes.

  • tensor_model_parallel_size (int) – tensor parallelism.

  • pipeline_model_parallel_size (int) – pipeline parallelism.

  • context_parallel_size (int) – context parallelism.

  • expert_model_parallel_size (int) – expert parallelism.

  • params_dtype (torch.dtype) – data type of the model parameters. Defaults to torch.bfloat16.

  • inference_batch_times_seqlen_threshold (int) – batch-size times sequence-length threshold for inference. Defaults to 32768.

  • inference_max_seq_length (int) – max_seq_length for inference. Required by MCoreEngine (>=0.12). Defaults to 4096.

  • max_batch_size (int) – max batch size for inference. Defaults to 8.

  • random_seed (Optional[int]) – random seed for inference. Defaults to None.

  • enable_flash_decode (bool) – enable flash decode for inference. Defaults to False.

  • enable_cuda_graphs (bool) – enable CUDA graphs for inference. Defaults to False.

  • legacy_ckpt (bool) – use legacy checkpoint format. Defaults to False.

  • megatron_checkpoint_filepath (str) – path for the megatron checkpoint.

  • model_type (str) – type of model to load. Defaults to "gpt". (Only for Megatron models)

  • model_format (str) – format of model to load. Defaults to "nemo".

  • micro_batch_size (Optional[int]) – micro batch size for model execution. Defaults to None. (Only for Megatron models)

Initialization
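
The sketch below constructs a deployable directly with a handful of the documented arguments (the checkpoint path is a placeholder and the remaining options keep their defaults):

```python
import torch

from nemo_deploy.nlp.megatronllm_deployable import MegatronLLMDeployableNemo2

# Placeholder checkpoint path; single GPU, single node, default parallelism.
deployable = MegatronLLMDeployableNemo2(
    nemo_checkpoint_filepath="/models/my_model.nemo",
    num_devices=1,
    num_nodes=1,
    tensor_model_parallel_size=1,
    pipeline_model_parallel_size=1,
    params_dtype=torch.bfloat16,
    inference_max_seq_length=4096,
    max_batch_size=8,
)
```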

generate(
prompts: List[str],
inference_params: Optional[megatron.core.inference.common_inference_params.CommonInferenceParams] = None,
) → List[megatron.core.inference.inference_request.InferenceRequest]#

Generates text based on the provided input prompts.

Parameters:
  • prompts (List[str]) – A list of input strings.

  • inference_params (Optional[CommonInferenceParams]) – Parameters for controlling the inference process.

Returns:

A list containing the generated results.

Return type:

List[InferenceRequest]
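
A short sketch of calling generate with explicit sampling settings (deployable refers to an instance constructed as in the example above; the CommonInferenceParams values are illustrative):

```python
from megatron.core.inference.common_inference_params import CommonInferenceParams

# Illustrative sampling configuration for a single prompt.
params = CommonInferenceParams(
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    num_tokens_to_generate=128,
)
requests = deployable.generate(
    prompts=["What is model parallelism?"],
    inference_params=params,
)
```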

generate_other_ranks()#

Generate function for ranks other than rank 0.

apply_chat_template(messages, add_generation_prompt=True)#

Load the chat template.

Works when the model's tokenizer has a chat template (typically chat models).
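
A small sketch of building a prompt from chat messages (works only if the underlying tokenizer ships a chat template; the messages are illustrative):

```python
# OpenAI-style messages; deployable is a MegatronLLMDeployableNemo2 instance.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize what a NeMo checkpoint is."},
]
prompt = deployable.apply_chat_template(messages, add_generation_prompt=True)
```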

remove_eos_token(text)#

Removes eos token if it exists in the output, otherwise does nothing.

str_to_dict(json_str)#

Convert str to dict.

property get_triton_input#
property get_triton_output#
triton_infer_fn(**inputs: numpy.ndarray)#
_infer_fn(
prompts,
temperature=0.0,
top_k=0.0,
top_p=0.0,
num_tokens_to_generate=256,
log_probs=False,
apply_chat_template=False,
text_only=True,
top_logprobs=0,
echo=False,
)#

Private helper function that handles the core inference logic shared between triton and ray inference.

Parameters:
  • prompts (List[str]) – List of input prompts

  • max_batch_size (int) – Maximum batch size for inference

  • random_seed (int) – Random seed for reproducibility

  • temperature (float) – Sampling temperature

  • top_k (int) – Top-k sampling parameter

  • top_p (float) – Top-p sampling parameter

  • num_tokens_to_generate (int) – Maximum number of tokens to generate

  • log_probs (bool) – Whether to compute log probabilities

  • apply_chat_template (bool) – Whether to apply chat template

  • text_only (bool) – Whether to return only text or full results

  • top_logprobs (int) – Number of top logprobs to return

  • echo (bool) – If True, returns the prompt and generated text. If log_probs is True, returns the prompt and generated log_probs. If top_logprobs is > 0, returns the prompt and generated top_logprobs.

Returns:

Dictionary of generated sentences and, if requested, the corresponding log probabilities.

Return type:

dict
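
For local experimentation, the helper can be exercised directly (argument values are illustrative; in normal operation it is reached through triton_infer_fn or ray_infer_fn):

```python
# deployable is a MegatronLLMDeployableNemo2 instance; sampling values are examples.
result = deployable._infer_fn(
    prompts=["Write one sentence about GPUs."],
    temperature=0.7,
    top_k=1,
    top_p=0.0,
    num_tokens_to_generate=64,
    log_probs=False,
    apply_chat_template=False,
    text_only=True,
)
print(result)
```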

ray_infer_fn(inputs: dict)#

Ray-compatible inference function that takes a dictionary of inputs and returns a dictionary of outputs.

Parameters:

inputs (dict) –

Dictionary containing the following optional keys:

  • prompts (List[str]): List of input prompts

  • max_batch_size (int): Maximum batch size for inference (default: 32)

  • random_seed (int): Random seed for reproducibility (default: None)

  • temperature (float): Sampling temperature (default: 1.0)

  • top_k (int): Top-k sampling parameter (default: 1)

  • top_p (float): Top-p sampling parameter (default: 0.0)

  • max_length (int): Maximum number of tokens to generate (default: 256)

  • compute_logprob (bool): Whether to compute log probabilities (default: False)

  • apply_chat_template (bool): Whether to apply chat template (default: False)

  • n_top_logprobs (int): Number of top log probabilities to include in the response, if applicable (default: 0)

  • echo (bool): Whether to return the input text as part of the response. (default: False)

Returns:

Dictionary containing:

  • sentences (List[str]): List of generated texts

  • log_probs (List[float], optional): List of log probabilities if compute_logprob is True

Return type:

dict
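
A usage sketch mirroring the documented input dictionary (all values are illustrative):

```python
# deployable is a MegatronLLMDeployableNemo2 instance.
outputs = deployable.ray_infer_fn(
    {
        "prompts": ["Explain tensor parallelism in one sentence."],
        "temperature": 1.0,
        "top_k": 1,
        "top_p": 0.0,
        "max_length": 64,
        "compute_logprob": False,
        "apply_chat_template": False,
    }
)
print(outputs["sentences"][0])
```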