core.inference.text_generation_controllers.vlm_text_generation_controller
Module Contents
Classes
VLMTextGenerationController | The text generation controller for VLMs
API
- class core.inference.text_generation_controllers.vlm_text_generation_controller.VLMTextGenerationController(
- inference_wrapped_model: megatron.core.inference.model_inference_wrappers.abstract_model_inference_wrapper.AbstractModelInferenceWrapper,
- tokenizer,
- pp_group: torch.distributed.ProcessGroup = None,
- )

Bases:
megatron.core.inference.text_generation_controllers.text_generation_controller.TextGenerationController

The text generation controller for VLMs
Initialization
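A minimal construction sketch, assuming you already have an AbstractModelInferenceWrapper instance around a vision-language model and its matching tokenizer. The helper function and its arguments are illustrative assumptions, not part of this module.

```python
from megatron.core.inference.model_inference_wrappers.abstract_model_inference_wrapper import (
    AbstractModelInferenceWrapper,
)
from megatron.core.inference.text_generation_controllers.vlm_text_generation_controller import (
    VLMTextGenerationController,
)


def build_vlm_controller(
    wrapped_model: AbstractModelInferenceWrapper,
    tokenizer,
) -> VLMTextGenerationController:
    """Hypothetical helper: wire an existing inference-wrapped VLM and its
    tokenizer into the controller. Both arguments come from elsewhere in
    your setup and are not defined by this module."""
    return VLMTextGenerationController(
        inference_wrapped_model=wrapped_model,
        tokenizer=tokenizer,
        pp_group=None,  # pass a torch.distributed.ProcessGroup when running
                        # with pipeline parallelism; None is the default
    )
```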
- prep_inference_input(
- prompts_tokens: torch.Tensor,
- active_requests: OrderedDict[str, megatron.core.inference.inference_request.InferenceRequest],
- use_attention_mask: bool = False,
- )

Prepares input data for inference using the respective wrapper's prep_inference_input method (see the usage sketch after the parameter list below).
Currently only supports batch size 1 inference.
- Parameters:
prompts_tokens (torch.Tensor) – A tensor of shape [batch_size, max_sequence_length]
active_requests (OrderedDict[str, InferenceRequest]) – The input active requests
use_attention_mask (bool) – Whether to use an attention mask. Should be set to True only when exclusively doing prefill (no decode) with variable prompt lengths.
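A hedged usage sketch for prep_inference_input, reflecting the batch-size-1 restriction noted above. The wrapper function, the request_id, and the pre-built InferenceRequest are assumptions for illustration; constructing an InferenceRequest is outside the scope of this section.

```python
from collections import OrderedDict

import torch

from megatron.core.inference.inference_request import InferenceRequest
from megatron.core.inference.text_generation_controllers.vlm_text_generation_controller import (
    VLMTextGenerationController,
)


def prepare_single_request(
    controller: VLMTextGenerationController,
    request_id: str,
    request: InferenceRequest,
    prompt_tokens: torch.Tensor,  # shape [1, max_sequence_length]; batch size 1 only
):
    """Hypothetical helper: prepare inference input for one active request."""
    active_requests = OrderedDict()
    active_requests[request_id] = request  # single entry, matching batch size 1

    # use_attention_mask should be True only for prefill-only runs with
    # variable prompt lengths, per the parameter description above.
    return controller.prep_inference_input(
        prompts_tokens=prompt_tokens,
        active_requests=active_requests,
        use_attention_mask=False,
    )
```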