nemo_deploy.nlp.megatronllm_deployable#
Module Contents#
Classes#
| Class | Summary |
|---|---|
| MegatronLLMDeploy | A factory class for creating deployable instances of Megatron LLM models. |
| MegatronLLMDeployableNemo2 | Triton inference server compatible deploy class for a .nemo model file. |
Functions#
| Function | Summary |
|---|---|
| dict_to_str | Serializes dict to str. |
Data#
API#
- nemo_deploy.nlp.megatronllm_deployable.LOGGER = getLogger(...)#
- class nemo_deploy.nlp.megatronllm_deployable.MegatronLLMDeploy#
A factory class for creating deployable instances of Megatron LLM models.
This class provides a method to get the appropriate deployable instance based on the version of the NeMo checkpoint model used.
- static get_deployable(
- nemo_checkpoint_filepath: str,
- num_devices: int = None,
- num_nodes: int = None,
- tensor_model_parallel_size: int = 1,
- pipeline_model_parallel_size: int = 1,
- expert_model_parallel_size: int = 1,
- context_parallel_size: int = 1,
- max_batch_size: int = 32,
- random_seed: Optional[int] = None,
- enable_flash_decode: bool = False,
- enable_cuda_graphs: bool = False,
- legacy_ckpt: bool = False,
- )#
Returns the appropriate deployable instance for the given NeMo checkpoint.
- Parameters:
nemo_checkpoint_filepath (str) – Path to the .nemo checkpoint file.
num_devices (int) – Number of devices to use for deployment.
num_nodes (int) – Number of nodes to use for deployment.
tensor_model_parallel_size (int) – Size of the tensor model parallelism.
pipeline_model_parallel_size (int) – Size of the pipeline model parallelism.
expert_model_parallel_size (int) – Size of the expert model parallelism.
context_parallel_size (int) – Size of the context parallelism.
max_batch_size (int) – Maximum batch size for inference. Defaults to 32.
random_seed (Optional[int]) – Random seed for inference. Defaults to None.
enable_flash_decode (bool) – Whether to enable flash decode for inference.
enable_cuda_graphs (bool) – Whether to enable CUDA graphs for inference.
legacy_ckpt (bool) – Whether to use the legacy checkpoint format. Defaults to False.
- Returns:
An instance of a deployable class compatible with Triton inference server.
- Return type:
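A minimal usage sketch for the factory method. The checkpoint path and parallelism settings below are placeholders for a single-GPU setup, not values prescribed by the API:

```python
from nemo_deploy.nlp.megatronllm_deployable import MegatronLLMDeploy

# Placeholder checkpoint path; adjust to your environment.
deployable = MegatronLLMDeploy.get_deployable(
    nemo_checkpoint_filepath="/models/model.nemo",
    num_devices=1,
    num_nodes=1,
    tensor_model_parallel_size=1,
    pipeline_model_parallel_size=1,
    max_batch_size=32,
)
```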
- nemo_deploy.nlp.megatronllm_deployable.dict_to_str(messages)#
Serializes dict to str.
- class nemo_deploy.nlp.megatronllm_deployable.MegatronLLMDeployableNemo2(
- num_devices: int = None,
- num_nodes: int = None,
- nemo_checkpoint_filepath: str = None,
- tensor_model_parallel_size: int = 1,
- pipeline_model_parallel_size: int = 1,
- context_parallel_size: int = 1,
- expert_model_parallel_size: int = 1,
- params_dtype: torch.dtype = torch.bfloat16,
- inference_batch_times_seqlen_threshold: int = 32768,
- inference_max_seq_length: int = 4096,
- enable_flash_decode: bool = False,
- enable_cuda_graphs: bool = False,
- max_batch_size: int = 8,
- random_seed: Optional[int] = None,
- legacy_ckpt: bool = False,
- megatron_checkpoint_filepath: str = None,
- model_type: str = 'gpt',
- model_format: str = 'nemo',
- micro_batch_size: Optional[int] = None,
- **model_config_kwargs,
- )#
Bases:
nemo_deploy.ITritonDeployable
Triton inference server compatible deploy class for a .nemo model file.
- Parameters:
nemo_checkpoint_filepath (str) – path for the nemo checkpoint.
num_devices (int) – number of GPUs.
num_nodes (int) – number of nodes.
tensor_model_parallel_size (int) – tensor parallelism.
pipeline_model_parallel_size (int) – pipeline parallelism.
context_parallel_size (int) – context parallelism.
expert_model_parallel_size (int) – expert parallelism.
params_dtype (torch.dtype) – data type of the model parameters. Defaults to torch.bfloat16.
inference_batch_times_seqlen_threshold (int) – batch-size times sequence-length threshold. Defaults to 32768.
inference_max_seq_length (int) – max_seq_length for inference. Required by MCoreEngine (>=0.12). Defaults to 4096.
max_batch_size (int) – max batch size for inference. Defaults to 8.
random_seed (Optional[int]) – random seed for inference. Defaults to None.
enable_flash_decode (bool) – enable flash decode for inference. Defaults to False.
enable_cuda_graphs (bool) – enable CUDA graphs for inference. Defaults to False.
legacy_ckpt (bool) – use legacy checkpoint format. Defaults to False.
megatron_checkpoint_filepath (str) – path for the megatron checkpoint.
model_type (str) – type of model to load. Defaults to 'gpt'. (Only for Megatron models.)
model_format (str) – format of model to load. Defaults to 'nemo'.
micro_batch_size (Optional[int]) – micro batch size for model execution. Defaults to None. (Only for Megatron models.)
Initialization
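A minimal construction sketch, assuming a single GPU and a placeholder .nemo checkpoint path; only documented constructor arguments are used:

```python
import torch

from nemo_deploy.nlp.megatronllm_deployable import MegatronLLMDeployableNemo2

# Placeholder checkpoint path; adjust to your environment.
deployable = MegatronLLMDeployableNemo2(
    nemo_checkpoint_filepath="/models/model.nemo",
    num_devices=1,
    num_nodes=1,
    tensor_model_parallel_size=1,
    pipeline_model_parallel_size=1,
    params_dtype=torch.bfloat16,
    inference_max_seq_length=4096,
    max_batch_size=8,
)
```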
- generate(
- prompts: List[str],
- inference_params: Optional[megatron.core.inference.common_inference_params.CommonInferenceParams] = None,
- )#
Generates text based on the provided input prompts.
- Parameters:
prompts (List[str]) β A list of input strings.
inference_params (Optional[CommonInferenceParams]) β Parameters for controlling the inference process.
- Returns:
A list containing the generated results.
- Return type:
List[InferenceRequest]
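A usage sketch, assuming `deployable` is an initialized MegatronLLMDeployableNemo2 instance (see the construction example above) and that CommonInferenceParams exposes the sampling fields shown; the sampling values are illustrative:

```python
from megatron.core.inference.common_inference_params import CommonInferenceParams

# Illustrative sampling settings.
params = CommonInferenceParams(
    temperature=0.7,
    top_k=0,
    top_p=0.9,
    num_tokens_to_generate=128,
)

results = deployable.generate(
    prompts=["What is a .nemo checkpoint?"],
    inference_params=params,
)
```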
- generate_other_ranks()#
Generate function for ranks other than rank 0.
- apply_chat_template(messages, add_generation_prompt=True)#
Load the chat template.
Works when the model's tokenizer has a chat template (typically chat models).
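A sketch of the expected call, assuming the tokenizer ships an OpenAI-style role/content chat template (typical for instruct and chat models):

```python
# The message schema depends on the tokenizer's chat template;
# role/content dicts are the common case.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain tensor model parallelism in one sentence."},
]
prompt = deployable.apply_chat_template(messages, add_generation_prompt=True)
```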
- remove_eos_token(text)#
Removes the EOS token if it exists in the output; otherwise does nothing.
- str_to_dict(json_str)#
Convert str to dict.
- property get_triton_input#
- property get_triton_output#
- triton_infer_fn(**inputs: numpy.ndarray)#
- _infer_fn(
- prompts,
- temperature=0.0,
- top_k=0.0,
- top_p=0.0,
- num_tokens_to_generate=256,
- log_probs=False,
- apply_chat_template=False,
- text_only=True,
- top_logprobs=0,
- echo=False,
- )#
Private helper function that handles the core inference logic shared between triton and ray inference.
- Parameters:
prompts (List[str]) – List of input prompts.
max_batch_size (int) – Maximum batch size for inference.
random_seed (int) – Random seed for reproducibility.
temperature (float) – Sampling temperature.
top_k (int) – Top-k sampling parameter.
top_p (float) – Top-p sampling parameter.
num_tokens_to_generate (int) – Maximum number of tokens to generate.
log_probs (bool) – Whether to compute log probabilities.
apply_chat_template (bool) – Whether to apply chat template.
text_only (bool) – Whether to return only text or full results.
top_logprobs (int) – Number of top logprobs to return.
echo (bool) – If True, returns the prompt and generated text. If log_probs is True, returns the prompt and generated log_probs. If top_logprobs is > 0, returns the prompt and generated top_logprobs.
- Returns:
Generated sentences and the requested log probs.
- Return type:
dict
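A call sketch for this private helper, shown only to illustrate the shared argument surface used by triton_infer_fn and ray_infer_fn; argument values are illustrative:

```python
# _infer_fn is private; prefer triton_infer_fn or ray_infer_fn in application code.
output = deployable._infer_fn(
    prompts=["Hello"],
    temperature=0.7,
    top_k=1,
    top_p=0.0,
    num_tokens_to_generate=64,
    log_probs=True,
    text_only=True,
)
```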
- ray_infer_fn(inputs: dict)#
Ray-compatible inference function that takes a dictionary of inputs and returns a dictionary of outputs.
- Parameters:
inputs (dict) β
Dictionary containing the following optional keys:
prompts (List[str]): List of input prompts
max_batch_size (int): Maximum batch size for inference (default: 32)
random_seed (int): Random seed for reproducibility (default: None)
temperature (float): Sampling temperature (default: 1.0)
top_k (int): Top-k sampling parameter (default: 1)
top_p (float): Top-p sampling parameter (default: 0.0)
max_length (int): Maximum number of tokens to generate (default: 256)
compute_logprob (bool): Whether to compute log probabilities (default: False)
apply_chat_template (bool): Whether to apply chat template (default: False)
n_top_logprobs (int): Number of log probabilities to include in the response, if applicable (default: 0)
echo (bool): Whether to return the input text as part of the response. (default: False)
- Returns:
Dictionary containing:
- sentences (List[str]): List of generated texts
- log_probs (List[float], optional): List of log probabilities if compute_logprob is True
- Return type:
dict
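A usage sketch with the documented input keys; values are illustrative, and `deployable` is assumed to be an initialized instance:

```python
# Keys mirror the documented optional inputs; omitted keys fall back to their defaults.
inputs = {
    "prompts": ["Write a haiku about GPUs."],
    "temperature": 0.7,
    "top_k": 1,
    "top_p": 0.0,
    "max_length": 64,
    "compute_logprob": False,
    "apply_chat_template": False,
}

outputs = deployable.ray_infer_fn(inputs)
print(outputs["sentences"][0])
```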