nemo_deploy.multimodal.nemo_multimodal_deployable#

Module Contents#

Classes#

NeMoMultimodalDeployable

Triton Inference Server-compatible deployment class for a NeMo multimodal model file.

Functions#

dict_to_str

Serializes a dictionary to a string.

Data#

API#

nemo_deploy.multimodal.nemo_multimodal_deployable.LOGGER = logging.getLogger(...)#
nemo_deploy.multimodal.nemo_multimodal_deployable.dict_to_str(messages)#

Serializes a dictionary to a string.
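The implementation is not documented here. A minimal sketch, assuming the function is a plain JSON dump (the serialization format is an assumption, not confirmed by this page):

    import json

    def dict_to_str(messages):
        # Hypothetical sketch: serialize a message dict to a string.
        # JSON encoding is an assumption; the real function may differ.
        return json.dumps(messages)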

class nemo_deploy.multimodal.nemo_multimodal_deployable.NeMoMultimodalDeployable(
nemo_checkpoint_filepath: str = None,
tensor_parallel_size: int = 1,
pipeline_parallel_size: int = 1,
params_dtype: torch.dtype = torch.bfloat16,
inference_batch_times_seqlen_threshold: int = 1000,
)#

Bases: nemo_deploy.ITritonDeployable

Triton Inference Server-compatible deployment class for a NeMo multimodal model file.

Parameters:
  • nemo_checkpoint_filepath (str) – path to the .nemo checkpoint file.

  • tensor_parallel_size (int) – size of tensor model parallelism.

  • pipeline_parallel_size (int) – size of pipeline model parallelism.

  • params_dtype (torch.dtype) – data type for model parameters.

  • inference_batch_times_seqlen_threshold (int) – if batch size multiplied by sequence length falls below this threshold, inference runs without pipelining.

Initialization
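A hypothetical instantiation sketch; the checkpoint path below is a placeholder, not a real file:

    import torch
    from nemo_deploy.multimodal.nemo_multimodal_deployable import NeMoMultimodalDeployable

    # Placeholder path: substitute a real .nemo checkpoint.
    deployable = NeMoMultimodalDeployable(
        nemo_checkpoint_filepath="/models/multimodal.nemo",
        tensor_parallel_size=1,
        pipeline_parallel_size=1,
        params_dtype=torch.bfloat16,
    )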

generate(
prompts: List[str],
images: List[PIL.Image.Image],
inference_params: Optional[megatron.core.inference.common_inference_params.CommonInferenceParams] = None,
max_batch_size: int = 4,
random_seed: Optional[int] = None,
) → dict#

Generates text based on the provided input prompts and images.

Parameters:
  • prompts (List[str]) – A list of input strings.

  • images (List[Union[Image, List[Image]]]) – A list of input images.

  • inference_params (Optional[CommonInferenceParams]) – Parameters for controlling the inference process.

  • max_batch_size (int) – max batch size for inference. Defaults to 4.

  • random_seed (Optional[int]) – random seed for inference. Defaults to None.

Returns:

A dictionary containing the generated results.

Return type:

dict
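A usage sketch, assuming the `deployable` instance constructed above and a local image file named `cat.jpg` (both placeholders):

    from PIL import Image
    from megatron.core.inference.common_inference_params import CommonInferenceParams

    # "cat.jpg" is a placeholder image path.
    image = Image.open("cat.jpg")
    params = CommonInferenceParams(temperature=1.0, top_k=1, num_tokens_to_generate=64)

    results = deployable.generate(
        prompts=["Describe this image."],
        images=[image],
        inference_params=params,
        max_batch_size=4,
    )
    print(results)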

property get_triton_input#
property get_triton_output#
triton_infer_fn(**inputs: numpy.ndarray)#
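The input tensor names, shapes, and dtypes are defined by get_triton_input; the names used below (prompts, images) are assumptions for illustration only:

    import numpy as np

    # Hypothetical invocation: tensor names and shapes are assumptions,
    # not confirmed by this module's documentation.
    outputs = deployable.triton_infer_fn(
        prompts=np.array([b"Describe this image."]),
        images=np.zeros((1, 224, 224, 3), dtype=np.uint8),
    )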
_infer_fn(
prompts,
images,
temperature=1.0,
top_k=1,
top_p=0.0,
num_tokens_to_generate=256,
random_seed=None,
max_batch_size=4,
)#

Private helper that handles the core inference logic shared between Triton and Ray inference.

Parameters:
  • prompts (List[str]) – List of input prompts.

  • images (List[Union[Image, List[Image]]]) – List of input images.

  • temperature (float) – Sampling temperature.

  • top_k (int) – Top-k sampling parameter.

  • top_p (float) – Top-p sampling parameter.

  • num_tokens_to_generate (int) – Maximum number of tokens to generate.

  • random_seed (Optional[int]) – Random seed for inference.

  • max_batch_size (int) – Maximum batch size for inference.

Returns:

A dictionary containing the generated sentences.

Return type:

dict
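The sampling arguments mirror the fields of megatron.core's CommonInferenceParams. A hedged sketch of how they might be assembled (the helper name is hypothetical; the actual internal plumbing may differ):

    from megatron.core.inference.common_inference_params import CommonInferenceParams

    def build_inference_params(temperature=1.0, top_k=1, top_p=0.0,
                               num_tokens_to_generate=256):
        # Hypothetical helper: maps _infer_fn's sampling arguments onto
        # CommonInferenceParams for the underlying inference engine.
        return CommonInferenceParams(
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            num_tokens_to_generate=num_tokens_to_generate,
        )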