nemo_deploy.multimodal.nemo_multimodal_deployable#
Module Contents#
Classes#
- NeMoMultimodalDeployable: Triton inference server compatible deploy class for a NeMo multimodal model file.
Functions#
- dict_to_str: Serializes a dict to a str.
Data#
- LOGGER
API#
- nemo_deploy.multimodal.nemo_multimodal_deployable.LOGGER = getLogger(...)#
- nemo_deploy.multimodal.nemo_multimodal_deployable.dict_to_str(messages)#
Serializes a dict to a str.
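A hypothetical usage sketch; the chat-style shape of the messages payload is illustrative only:

```python
from nemo_deploy.multimodal.nemo_multimodal_deployable import dict_to_str

# Illustrative chat-style payload; any serializable dict structure works the same way.
messages = [{"role": "user", "content": "Describe this image."}]
payload = dict_to_str(messages)  # str representation of `messages`
```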
- class nemo_deploy.multimodal.nemo_multimodal_deployable.NeMoMultimodalDeployable(
- nemo_checkpoint_filepath: str = None,
- tensor_parallel_size: int = 1,
- pipeline_parallel_size: int = 1,
- params_dtype: torch.dtype = torch.bfloat16,
- inference_batch_times_seqlen_threshold: int = 1000,
)#
Bases:
nemo_deploy.ITritonDeployable
Triton inference server compatible deploy class for a NeMo multimodal model file.
- Parameters:
nemo_checkpoint_filepath (str) – Path to the NeMo checkpoint file.
tensor_parallel_size (int) – Tensor parallelism size.
pipeline_parallel_size (int) – Pipeline parallelism size.
params_dtype (torch.dtype) – Data type for model parameters.
inference_batch_times_seqlen_threshold (int) – Batch size × sequence length threshold above which inference is pipelined.
Initialization
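A minimal construction sketch; the checkpoint path is a placeholder, and the parallelism sizes are assumptions that must match the available GPUs:

```python
import torch
from nemo_deploy.multimodal.nemo_multimodal_deployable import NeMoMultimodalDeployable

# Placeholder checkpoint path; adjust parallelism sizes to your GPU topology.
deployable = NeMoMultimodalDeployable(
    nemo_checkpoint_filepath="/models/multimodal_model.nemo",
    tensor_parallel_size=1,
    pipeline_parallel_size=1,
    params_dtype=torch.bfloat16,
)
```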
- generate(
- prompts: List[str],
- images: List[PIL.Image.Image],
- inference_params: Optional[megatron.core.inference.common_inference_params.CommonInferenceParams] = None,
- max_batch_size: int = 4,
- random_seed: Optional[int] = None,
- apply_chat_template: bool = False,
)#
Generates text based on the provided input prompts and images.
- Parameters:
prompts (List[str]) – A list of input strings.
images (List[Union[Image, List[Image]]]) – A list of input images (one image, or a list of images, per prompt).
inference_params (Optional[CommonInferenceParams]) – Parameters for controlling the inference process.
max_batch_size (int) – Maximum batch size for inference. Defaults to 4.
random_seed (Optional[int]) – Random seed for inference. Defaults to None.
apply_chat_template (bool) – Whether to apply the chat template. Defaults to False.
- Returns:
A dictionary containing the generated results.
- Return type:
dict
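An illustrative call, reusing the deployable instance from the construction sketch above; the image path and the sampling values are placeholders:

```python
from PIL import Image
from megatron.core.inference.common_inference_params import CommonInferenceParams

image = Image.open("example.jpg")  # placeholder image path
results = deployable.generate(
    prompts=["What is shown in this image?"],
    images=[image],
    inference_params=CommonInferenceParams(
        temperature=1.0, top_k=1, num_tokens_to_generate=64
    ),
    max_batch_size=4,
)
```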
- apply_chat_template(messages, add_generation_prompt=True)#
Apply the chat template using the processor.
Works when the model's processor has a chat template (typical for chat models).
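An illustrative messages payload; the HF-style chat schema shown here is an assumption, since the exact format depends on the model's processor template:

```python
# Hypothetical chat-format input; the actual schema depends on the
# model's processor template.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = deployable.apply_chat_template(messages, add_generation_prompt=True)
```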
- base64_to_image(image_base64)#
Convert a base64-encoded image to a PIL Image.
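A round-trip sketch: encode an image file to base64 (as a client would) and decode it back with this helper; the file path is a placeholder:

```python
import base64

with open("example.jpg", "rb") as f:  # placeholder image path
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

pil_image = deployable.base64_to_image(image_b64)  # back to a PIL Image
```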
- property get_triton_input#
- property get_triton_output#
- triton_infer_fn(**inputs: numpy.ndarray)#
- _infer_fn(
- prompts,
- images,
- temperature=1.0,
- top_k=1,
- top_p=0.0,
- num_tokens_to_generate=256,
- random_seed=None,
- max_batch_size=4,
- apply_chat_template=False,
)#
Private helper that handles the core inference logic shared between Triton and Ray inference.
- Parameters:
prompts (List[str]) – List of input prompts.
images (List[str]) – List of input base64-encoded images.
temperature (float) – Sampling temperature.
top_k (int) – Top-k sampling parameter.
top_p (float) – Top-p sampling parameter.
num_tokens_to_generate (int) – Maximum number of tokens to generate.
random_seed (Optional[int]) – Random seed for inference.
max_batch_size (int) – Maximum batch size for inference.
apply_chat_template (bool) – Whether to apply the chat template.
- Returns:
Dictionary with the generated sentences under the sentences key.
- Return type:
dict
- ray_infer_fn(inputs: dict)#
Ray-compatible inference function that takes a dictionary of inputs and returns a dictionary of outputs.
- Parameters:
inputs (dict) –
Dictionary containing the following optional keys:
prompts (List[str]): List of input prompts
images (List[str]): List of input base64-encoded images
temperature (float): Sampling temperature (default: 1.0)
top_k (int): Top-k sampling parameter (default: 1)
top_p (float): Top-p sampling parameter (default: 0.0)
max_length (int): Maximum number of tokens to generate (default: 50)
random_seed (Optional[int]): Random seed for reproducibility (default: None)
max_batch_size (int): Maximum batch size for inference (default: 4)
apply_chat_template (bool): Whether to apply chat template (default: False)
- Returns:
Dictionary containing:
sentences (List[str]): List of generated texts
- Return type:
dict
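A request/response sketch using the keys and defaults documented above; the base64 image payload reuses the earlier encoding sketch:

```python
# Keys follow the docstring above; image_b64 comes from the base64_to_image
# round-trip sketch earlier on this page.
request = {
    "prompts": ["What is in this picture?"],
    "images": [image_b64],
    "temperature": 1.0,
    "top_k": 1,
    "max_length": 64,
    "apply_chat_template": False,
}
response = deployable.ray_infer_fn(request)
print(response["sentences"])  # List[str] of generated texts
```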