core.inference.text_generation_controllers.vlm_text_generation_controller#

Module Contents#

Classes#

VLMTextGenerationController

The text generation controller for VLMs

API#

class core.inference.text_generation_controllers.vlm_text_generation_controller.VLMTextGenerationController(
inference_wrapped_model: megatron.core.inference.model_inference_wrappers.abstract_model_inference_wrapper.AbstractModelInferenceWrapper,
tokenizer,
pp_group: torch.distributed.ProcessGroup = None,
)#

Bases: megatron.core.inference.text_generation_controllers.text_generation_controller.TextGenerationController

The text generation controller for VLMs

Initialization

prep_inference_input(
prompts_tokens: torch.Tensor,
active_requests: OrderedDict[str, megatron.core.inference.inference_request.InferenceRequest],
use_attention_mask: bool = False,
)#

Prepares the input data for inference by delegating to the wrapped model's prep_inference_input method.

Currently only supports batch size 1 inference.

Parameters:
  • prompts_tokens (torch.Tensor) – A tensor of shape [batch_size, max_sequence_length]

  • active_requests (OrderedDict[str, InferenceRequest]) – The input active requests

  • use_attention_mask (bool) – Whether to use an attention mask. Should be set to True only when exclusively doing prefill (no decode) with variable prompt lengths.
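A minimal sketch of the calling convention described above, using stand-in objects rather than the real Megatron classes (`StubInferenceRequest` and the plain nested list standing in for `torch.Tensor` are hypothetical stubs for illustration only):

```python
from collections import OrderedDict

# Stand-in for torch.Tensor: in real use, prompts_tokens is a tensor of
# shape [batch_size, max_sequence_length]. The controller currently
# supports batch_size == 1 only.
prompts_tokens = [[101, 2023, 2003, 1037, 7953, 102]]  # shape [1, 6]

# Hypothetical stub for
# megatron.core.inference.inference_request.InferenceRequest.
class StubInferenceRequest:
    def __init__(self, request_id, prompt_tokens):
        self.request_id = request_id
        self.prompt_tokens = prompt_tokens

# active_requests maps request-id strings to requests; with batch size 1
# it holds exactly one entry.
active_requests = OrderedDict()
active_requests["req-0"] = StubInferenceRequest("req-0", prompts_tokens[0])

batch_size = len(prompts_tokens)
assert batch_size == 1, "VLMTextGenerationController supports batch size 1 only"

# use_attention_mask should be True only for prefill-only runs
# (no decode) with variable-length prompts.
use_attention_mask = False

# The real call would then be:
# controller.prep_inference_input(
#     prompts_tokens, active_requests, use_attention_mask=use_attention_mask
# )
print(batch_size, len(active_requests))
```

This only illustrates the shapes and the batch-size-1 constraint; constructing the actual controller additionally requires an `AbstractModelInferenceWrapper`, a tokenizer, and optionally a pipeline-parallel `ProcessGroup`, as shown in the class signature above.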