nemo_deploy.multimodal.nemo_multimodal_deployable#

Module Contents#

Classes#

NeMoMultimodalDeployable

Triton inference server compatible deploy class for a NeMo multimodal model file.

Functions#

dict_to_str

Serializes a dict to a str.

Data#

API#

nemo_deploy.multimodal.nemo_multimodal_deployable.LOGGER = 'getLogger(...)'#
nemo_deploy.multimodal.nemo_multimodal_deployable.dict_to_str(messages)#

Serializes a dict to a str.
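The docs above do not state the serialization format. A minimal sketch, assuming a plain JSON dump (the actual implementation may differ):

```python
import json


def dict_to_str(messages):
    """Serialize a dict (e.g. a chat-messages payload) to a str.

    JSON is an assumption here; the real implementation may use a
    different encoding.
    """
    return json.dumps(messages)


serialized = dict_to_str({"role": "user", "content": "Describe this image."})
```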

class nemo_deploy.multimodal.nemo_multimodal_deployable.NeMoMultimodalDeployable(
nemo_checkpoint_filepath: str = None,
tensor_parallel_size: int = 1,
pipeline_parallel_size: int = 1,
params_dtype: torch.dtype = torch.bfloat16,
inference_batch_times_seqlen_threshold: int = 1000,
)#

Bases: nemo_deploy.ITritonDeployable

Triton inference server compatible deploy class for a NeMo multimodal model file.

Parameters:
  • nemo_checkpoint_filepath (str) – path for the nemo checkpoint.

  • tensor_parallel_size (int) – tensor parallelism.

  • pipeline_parallel_size (int) – pipeline parallelism.

  • params_dtype (torch.dtype) – data type for model parameters.

  • inference_batch_times_seqlen_threshold (int) – threshold on batch size multiplied by sequence length; during inference, pipelining is skipped when the product falls below this value.

Initialization

generate(
prompts: List[str],
images: List[PIL.Image],
inference_params: Optional[megatron.core.inference.common_inference_params.CommonInferenceParams] = None,
max_batch_size: int = 4,
random_seed: Optional[int] = None,
apply_chat_template: bool = False,
) → dict#

Generates text based on the provided input prompts and images.

Parameters:
  • prompts (List[str]) – A list of input strings.

  • images (List[Union[Image, List[Image]]]) – A list of input images; each entry is a single image, or a list of images, paired with the prompt at the same index.

  • inference_params (Optional[CommonInferenceParams]) – Parameters for controlling the inference process.

  • max_batch_size (int) – max batch size for inference. Defaults to 4.

  • random_seed (Optional[int]) – random seed for inference. Defaults to None.

  • apply_chat_template (bool) – Whether to apply chat template. Defaults to False.

Returns:

A dictionary containing the generated results.

Return type:

dict
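A real `generate()` call requires a NeMo checkpoint and a GPU, so the snippet below only sketches the call shape and how the returned dict might be consumed; the `sentences` key is an assumption taken from `ray_infer_fn`'s return docs further down this page, not from `generate()` itself:

```python
# Hypothetical call shape (requires a real checkpoint and GPU; not run here):
# deployable = NeMoMultimodalDeployable(nemo_checkpoint_filepath="model.nemo")
# result = deployable.generate(prompts=["Describe this image."], images=[pil_image])

# Stand-in for the returned dict; the "sentences" key is an assumption
# based on ray_infer_fn's documented output shape.
result = {"sentences": ["A photo of a cat sitting on a windowsill."]}

for prompt, sentence in zip(["Describe this image."], result["sentences"]):
    print(f"{prompt!r} -> {sentence!r}")
```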

apply_chat_template(messages, add_generation_prompt=True)#

Apply the chat template using the processor.

Works when the model’s processor has a chat template (typically chat models).
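The expected `messages` schema is not documented here; the sketch below assumes the common Hugging Face chat-template convention of role/content entries, which may not match this processor exactly:

```python
# A messages payload in the common chat format (role/content pairs).
# The exact schema expected by apply_chat_template is an assumption,
# modeled on the Hugging Face chat-template convention.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

# With a loaded model, the call would render this into a prompt string:
# prompt = deployable.apply_chat_template(messages, add_generation_prompt=True)
```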

base64_to_image(image_base64)#

Convert base64-encoded image to PIL Image.
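A minimal sketch of the decode step using only the standard library; the final Pillow call is shown as a comment since Pillow may not be installed, and the placeholder bytes are not a real image:

```python
import base64

# Raw bytes standing in for an encoded image file (a real payload would be
# complete PNG/JPEG bytes). Round-tripping through base64 shows the decode step.
fake_image_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
image_base64 = base64.b64encode(fake_image_bytes).decode("ascii")

# What base64_to_image presumably does first: recover the raw bytes...
decoded = base64.b64decode(image_base64)

# ...then wrap them for Pillow (commented out in case Pillow is absent):
# import io
# from PIL import Image
# pil_image = Image.open(io.BytesIO(decoded))
```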

property get_triton_input#
property get_triton_output#
triton_infer_fn(**inputs: numpy.ndarray)#
_infer_fn(
prompts,
images,
temperature=1.0,
top_k=1,
top_p=0.0,
num_tokens_to_generate=256,
random_seed=None,
max_batch_size=4,
apply_chat_template=False,
)#

Private helper function that handles the core inference logic shared between Triton and Ray inference.

Parameters:
  • prompts (List[str]) – List of input prompts

  • images (List[str]) – List of input base64 encoded images

  • temperature (float) – Sampling temperature

  • top_k (int) – Top-k sampling parameter

  • top_p (float) – Top-p sampling parameter

  • num_tokens_to_generate (int) – Maximum number of tokens to generate

  • random_seed (Optional[int]) – Random seed for inference

  • max_batch_size (int) – Maximum batch size for inference

  • apply_chat_template (bool) – Whether to apply chat template

Returns:

A dictionary containing the generated sentences.

Return type:

dict

ray_infer_fn(inputs: dict)#

Ray-compatible inference function that takes a dictionary of inputs and returns a dictionary of outputs.

Parameters:

inputs (dict) –

Dictionary containing the following optional keys:

  • prompts (List[str]): List of input prompts

  • images (List[str]): List of input base64 encoded images

  • temperature (float): Sampling temperature (default: 1.0)

  • top_k (int): Top-k sampling parameter (default: 1)

  • top_p (float): Top-p sampling parameter (default: 0.0)

  • max_length (int): Maximum number of tokens to generate (default: 50)

  • random_seed (Optional[int]): Random seed for reproducibility (default: None)

  • max_batch_size (int): Maximum batch size for inference (default: 4)

  • apply_chat_template (bool): Whether to apply chat template (default: False)

Returns:

Dictionary containing:

  • sentences (List[str]): List of generated texts

Return type:

dict
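The input dictionary documented above can be built with the standard library alone; the image payload below is a placeholder byte string rather than a real encoded image, and the commented-out call assumes a loaded deployable:

```python
import base64

# Build the inputs dict ray_infer_fn documents above. The image payload is a
# placeholder; a real call would base64-encode actual PNG/JPEG file bytes.
inputs = {
    "prompts": ["Describe this image."],
    "images": [base64.b64encode(b"<image-bytes>").decode("ascii")],
    "temperature": 1.0,
    "top_k": 1,
    "top_p": 0.0,
    "max_length": 50,
    "random_seed": None,
    "max_batch_size": 4,
    "apply_chat_template": False,
}

# With a loaded model:
# outputs = deployable.ray_infer_fn(inputs)
# outputs["sentences"] would hold one generated string per prompt.
```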