nemo_deploy.nlp.megatronllm_deployable#

Module Contents#

Classes#

MegatronLLMDeploy

A factory class for creating deployable instances of Megatron LLM models.

MegatronLLMDeployableNemo2

Triton inference server compatible deploy class for a .nemo model file.

Functions#

dict_to_str

Serializes dict to str.

Data#

API#

nemo_deploy.nlp.megatronllm_deployable.LOGGER = 'getLogger(...)'#
class nemo_deploy.nlp.megatronllm_deployable.MegatronLLMDeploy#

A factory class for creating deployable instances of Megatron LLM models.

This class provides a method to get the appropriate deployable instance based on the version of the NeMo checkpoint model used.

static get_deployable(
nemo_checkpoint_filepath: str,
num_devices: int = None,
num_nodes: int = None,
tensor_model_parallel_size: int = 1,
pipeline_model_parallel_size: int = 1,
expert_model_parallel_size: int = 1,
context_parallel_size: int = 1,
max_batch_size: int = 32,
random_seed: Optional[int] = None,
enable_flash_decode: bool = False,
enable_cuda_graphs: bool = False,
legacy_ckpt: bool = False,
)#

Returns the appropriate deployable instance for the given NeMo checkpoint.

Parameters:
  • nemo_checkpoint_filepath (str) – Path to the .nemo checkpoint file.

  • num_devices (int) – Number of devices to use for deployment.

  • num_nodes (int) – Number of nodes to use for deployment.

  • tensor_model_parallel_size (int) – Size of the tensor model parallelism.

  • pipeline_model_parallel_size (int) – Size of the pipeline model parallelism.

  • expert_model_parallel_size (int) – Size of the expert model parallelism.

  • context_parallel_size (int) – Size of the context parallelism.

  • max_batch_size (int) – Maximum batch size for inference. Defaults to 32.

  • random_seed (Optional[int]) – Random seed for inference. Defaults to None.

  • enable_flash_decode (bool) – Whether to enable flash decode for inference.

  • enable_cuda_graphs (bool) – Whether to enable CUDA graphs for inference.

  • legacy_ckpt (bool) – Whether to use legacy checkpoint format. Defaults to False.

Returns:

An instance of a deployable class compatible with Triton inference server.

Return type:

ITritonDeployable
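
A minimal usage sketch of the factory (the checkpoint path and parallelism values below are placeholders; adjust them to your checkpoint and hardware):

```python
from nemo_deploy.nlp.megatronllm_deployable import MegatronLLMDeploy

# Placeholder checkpoint path; parallel sizes must match the available GPUs.
model = MegatronLLMDeploy.get_deployable(
    nemo_checkpoint_filepath="/models/my_model.nemo",
    num_devices=2,
    num_nodes=1,
    tensor_model_parallel_size=2,
    pipeline_model_parallel_size=1,
    max_batch_size=32,
)
```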

nemo_deploy.nlp.megatronllm_deployable.dict_to_str(messages)#

Serializes dict to str.
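
For example, a chat-style message dictionary can be serialized before sending it as a single string field (the payload below is illustrative):

```python
from nemo_deploy.nlp.megatronllm_deployable import dict_to_str

# Illustrative payload: an OpenAI-style message list wrapped in a dict.
serialized = dict_to_str({"messages": [{"role": "user", "content": "Hello"}]})
```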

class nemo_deploy.nlp.megatronllm_deployable.MegatronLLMDeployableNemo2(
num_devices: int = None,
num_nodes: int = None,
nemo_checkpoint_filepath: str = None,
tensor_model_parallel_size: int = 1,
pipeline_model_parallel_size: int = 1,
context_parallel_size: int = 1,
expert_model_parallel_size: int = 1,
params_dtype: torch.dtype = torch.bfloat16,
inference_batch_times_seqlen_threshold: int = 32768,
inference_max_seq_length: int = 4096,
enable_flash_decode: bool = False,
enable_cuda_graphs: bool = False,
max_batch_size: int = 8,
random_seed: Optional[int] = None,
legacy_ckpt: bool = False,
megatron_checkpoint_filepath: str = None,
model_type: str = 'gpt',
model_format: str = 'nemo',
micro_batch_size: Optional[int] = None,
**model_config_kwargs,
)#

Bases: nemo_deploy.ITritonDeployable

Triton inference server compatible deploy class for a .nemo model file.

Parameters:
  • nemo_checkpoint_filepath (str) – path for the nemo checkpoint.

  • num_devices (int) – number of GPUs.

  • num_nodes (int) – number of nodes.

  • tensor_model_parallel_size (int) – tensor parallelism.

  • pipeline_model_parallel_size (int) – pipeline parallelism.

  • context_parallel_size (int) – context parallelism.

  • expert_model_parallel_size (int) – expert parallelism.

  • params_dtype (torch.dtype) – data type of the model parameters. Defaults to torch.bfloat16.

  • inference_batch_times_seqlen_threshold (int) – batch-size times sequence-length threshold for inference. Defaults to 32768.

  • inference_max_seq_length (int) – max_seq_length for inference. Required by MCoreEngine (>=0.12). Defaults to 4096.

  • max_batch_size (int) – max batch size for inference. Defaults to 8.

  • random_seed (Optional[int]) – random seed for inference. Defaults to None.

  • enable_flash_decode (bool) – enable flash decode for inference. Defaults to False.

  • enable_cuda_graphs (bool) – enable CUDA graphs for inference. Defaults to False.

  • legacy_ckpt (bool) – use legacy checkpoint format. Defaults to False.

  • megatron_checkpoint_filepath (str) – path for the megatron checkpoint.

  • model_type (str) – type of model to load. Defaults to "gpt". (Only for Megatron models)

  • model_format (str) – format of model to load. Defaults to "nemo".

  • micro_batch_size (Optional[int]) – micro batch size for model execution. Defaults to None. (Only for Megatron models)

Initialization
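
The sketch below constructs a deployable directly with a handful of the documented arguments (the checkpoint path is a placeholder and the remaining options keep their defaults):

```python
import torch

from nemo_deploy.nlp.megatronllm_deployable import MegatronLLMDeployableNemo2

# Placeholder checkpoint path; single GPU, single node, default parallelism.
deployable = MegatronLLMDeployableNemo2(
    nemo_checkpoint_filepath="/models/my_model.nemo",
    num_devices=1,
    num_nodes=1,
    tensor_model_parallel_size=1,
    pipeline_model_parallel_size=1,
    params_dtype=torch.bfloat16,
    inference_max_seq_length=4096,
    max_batch_size=8,
)
```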

generate(
prompts: List[str],
inference_params: Optional[megatron.core.inference.common_inference_params.CommonInferenceParams] = None,
) → List[megatron.core.inference.inference_request.InferenceRequest]#

Generates text based on the provided input prompts.

Parameters:
  • prompts (List[str]) – A list of input strings.

  • inference_params (Optional[CommonInferenceParams]) – Parameters for controlling the inference process.

Returns:

A list containing the generated results.

Return type:

List[InferenceRequest]
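
A short sketch of calling generate with explicit sampling settings (deployable refers to an instance constructed as in the example above; the CommonInferenceParams values are illustrative):

```python
from megatron.core.inference.common_inference_params import CommonInferenceParams

# Illustrative sampling configuration for a single prompt.
params = CommonInferenceParams(
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    num_tokens_to_generate=128,
)
requests = deployable.generate(
    prompts=["What is model parallelism?"],
    inference_params=params,
)
```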

generate_other_ranks()#

Generate function for ranks other than rank 0.

apply_chat_template(messages, add_generation_prompt=True)#

Load the chat template.

Works when the model's tokenizer has a chat template (typically chat models).
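
A small sketch of building a prompt from chat messages (works only if the underlying tokenizer ships a chat template; the messages are illustrative):

```python
# OpenAI-style messages; deployable is a MegatronLLMDeployableNemo2 instance.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize what a NeMo checkpoint is."},
]
prompt = deployable.apply_chat_template(messages, add_generation_prompt=True)
```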

remove_eos_token(text)#

Removes eos token if it exists in the output, otherwise does nothing.

str_to_dict(json_str)#

Convert str to dict.

property get_triton_input#
property get_triton_output#
triton_infer_fn(**inputs: numpy.ndarray)#
_infer_fn(
prompts,
temperature=0.0,
top_k=0.0,
top_p=0.0,
num_tokens_to_generate=256,
log_probs=False,
apply_chat_template=False,
text_only=True,
top_logprobs=0,
echo=False,
)#

Private helper function that handles the core inference logic shared between triton and ray inference.

Parameters:
  • prompts (List[str]) – List of input prompts

  • max_batch_size (int) – Maximum batch size for inference

  • random_seed (int) – Random seed for reproducibility

  • temperature (float) – Sampling temperature

  • top_k (int) – Top-k sampling parameter

  • top_p (float) – Top-p sampling parameter

  • num_tokens_to_generate (int) – Maximum number of tokens to generate

  • log_probs (bool) – Whether to compute log probabilities

  • apply_chat_template (bool) – Whether to apply chat template

  • text_only (bool) – Whether to return only text or full results

  • top_logprobs (int) – Number of top logprobs to return

  • echo (bool) – If True, returns the prompt and generated text. If log_probs is True, returns the prompt and generated log_probs. If top_logprobs is > 0, returns the prompt and generated top_logprobs.

Returns:

Dictionary of generated sentences and, if requested, the corresponding log probabilities.

Return type:

dict
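
For local experimentation, the helper can be exercised directly (argument values are illustrative; in normal operation it is reached through triton_infer_fn or ray_infer_fn):

```python
# deployable is a MegatronLLMDeployableNemo2 instance; sampling values are examples.
result = deployable._infer_fn(
    prompts=["Write one sentence about GPUs."],
    temperature=0.7,
    top_k=1,
    top_p=0.0,
    num_tokens_to_generate=64,
    log_probs=False,
    apply_chat_template=False,
    text_only=True,
)
print(result)
```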

ray_infer_fn(inputs: dict)#

Ray-compatible inference function that takes a dictionary of inputs and returns a dictionary of outputs.

Parameters:

inputs (dict) –

Dictionary containing the following optional keys:

  • prompts (List[str]): List of input prompts

  • max_batch_size (int): Maximum batch size for inference (default: 32)

  • random_seed (int): Random seed for reproducibility (default: None)

  • temperature (float): Sampling temperature (default: 1.0)

  • top_k (int): Top-k sampling parameter (default: 1)

  • top_p (float): Top-p sampling parameter (default: 0.0)

  • max_length (int): Maximum number of tokens to generate (default: 256)

  • compute_logprob (bool): Whether to compute log probabilities (default: False)

  • apply_chat_template (bool): Whether to apply chat template (default: False)

  • n_top_logprobs (int): Number of top log probabilities to include in the response, if applicable (default: 0)

  • echo (bool): Whether to return the input text as part of the response. (default: False)

Returns:

Dictionary containing:

  • sentences (List[str]): List of generated texts

  • log_probs (List[float], optional): List of log probabilities if compute_logprob is True

Return type:

dict
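
A usage sketch mirroring the documented input dictionary (all values are illustrative):

```python
# deployable is a MegatronLLMDeployableNemo2 instance.
outputs = deployable.ray_infer_fn(
    {
        "prompts": ["Explain tensor parallelism in one sentence."],
        "temperature": 1.0,
        "top_k": 1,
        "top_p": 0.0,
        "max_length": 64,
        "compute_logprob": False,
        "apply_chat_template": False,
    }
)
print(outputs["sentences"][0])
```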