bridge.models.hf_pretrained.vlm#

Module Contents#

Classes#

PreTrainedVLM

A generic class for Pretrained Vision-Language Models with lazy loading.

Data#

API#

bridge.models.hf_pretrained.vlm.VLMType#

'TypeVar(…)'

class bridge.models.hf_pretrained.vlm.PreTrainedVLM(
model_name_or_path: Optional[Union[str, pathlib.Path]] = None,
device: Optional[Union[str, torch.device]] = None,
torch_dtype: Optional[torch.dtype] = None,
trust_remote_code: bool = False,
**kwargs,
)#

Bases: megatron.bridge.models.hf_pretrained.base.PreTrainedBase, typing.Generic[bridge.models.hf_pretrained.vlm.VLMType]

A generic class for Pretrained Vision-Language Models with lazy loading.

Allows type-safe access to specific VLM implementations like LlavaForConditionalGeneration.

Examples

Basic usage with image and text:

    from megatron.bridge.models.hf_pretrained.vlm import PreTrainedVLM
    from PIL import Image

    # Create instance - no model loading happens yet
    vlm = PreTrainedVLM.from_pretrained("llava-hf/llava-1.5-7b-hf")

    # Load an image
    image = Image.open("cat.jpg")

    # Process image and text together - processor and model load here
    inputs = vlm.process_images_and_text(
        images=image,
        text="What do you see in this image?",
    )

    # Generate response
    outputs = vlm.generate(**inputs, max_new_tokens=100)
    print(vlm.decode(outputs[0], skip_special_tokens=True))

Batch processing with multiple images:

    # Process multiple images with questions
    images = [Image.open(f"image_{i}.jpg") for i in range(3)]
    questions = [
        "What is the main object in this image?",
        "Describe the scene",
        "What colors do you see?",
    ]

    # Process batch
    inputs = vlm.process_images_and_text(
        images=images,
        text=questions,
        padding=True,
    )

    # Generate responses
    outputs = vlm.generate(**inputs, max_new_tokens=50)
    for i, output in enumerate(outputs):
        print(f"Image {i+1}: {vlm.decode(output, skip_special_tokens=True)}")

Using specific VLM types with type hints:

    import torch
    from transformers import LlavaForConditionalGeneration

    from megatron.bridge.models.hf_pretrained.vlm import PreTrainedVLM

    # Type-safe access to Llava-specific features
    llava: PreTrainedVLM[LlavaForConditionalGeneration] = PreTrainedVLM.from_pretrained(
        "llava-hf/llava-1.5-7b-hf",
        torch_dtype=torch.float16,
        device="cuda",
    )

    # Access model-specific attributes
    vision_tower = llava.model.vision_tower  # Type-safe access

Text-only generation (for multimodal models that support it):

    # Some VLMs can also work with text-only inputs
    text_inputs = vlm.encode_text("Explain what a neural network is.")
    outputs = vlm.generate(**text_inputs, max_length=100)
    print(vlm.decode(outputs[0], skip_special_tokens=True))

Custom preprocessing and generation:

    # Load with custom settings
    vlm = PreTrainedVLM.from_pretrained(
        "Qwen/Qwen-VL-Chat",
        trust_remote_code=True,
        device_map="auto",
        load_in_4bit=True,
    )

    # Custom generation config
    from transformers import GenerationConfig

    vlm.generation_config = GenerationConfig(
        max_new_tokens=200,
        temperature=0.8,
        top_p=0.95,
        do_sample=True,
    )

    # Process with custom parameters
    inputs = vlm.process_images_and_text(
        images=image,
        text="\nDescribe this image in detail.",
        max_length=512,
    )

Manual component setup:

    # Create empty instance
    vlm = PreTrainedVLM()

    # Load components separately
    from transformers import AutoModel, AutoProcessor

    vlm.processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base")
    vlm.model = AutoModel.from_pretrained("microsoft/Florence-2-base")

    # Use for various vision tasks
    task_prompt = "<OD>"  # Florence-2 object detection task prompt
    inputs = vlm.process_images_and_text(images=image, text=task_prompt)
    outputs = vlm.generate(**inputs)

Conversational VLM usage:

    # Multi-turn conversation with images
    conversation = []

    # First turn
    image1 = Image.open("chart.png")
    inputs = vlm.process_images_and_text(
        images=image1,
        text="What type of chart is this?",
    )
    response = vlm.generate(**inputs)
    conversation.append(("user", "What type of chart is this?"))
    conversation.append(("assistant", vlm.decode(response[0])))

    # Follow-up question
    follow_up = "What is the highest value shown?"

    # Format conversation history + new question
    # (format_conversation is a user-defined helper that renders the history as a prompt)
    full_prompt = format_conversation(conversation) + f"\nUser: {follow_up}"
    inputs = vlm.process_images_and_text(images=image1, text=full_prompt)
    response = vlm.generate(**inputs)

Initialization

Initialize a Pretrained VLM with lazy loading.

Parameters:
  • model_name_or_path – HuggingFace model identifier or local path

  • device – Device to load model on (e.g., 'cuda', 'cpu')

  • torch_dtype – Data type to load model in (e.g., torch.float16)

  • trust_remote_code – Whether to trust remote code when loading

  • **kwargs – Additional arguments passed to component loaders
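
A minimal construction sketch using the parameters above (the checkpoint name is taken from the examples and is purely illustrative); nothing is downloaded or loaded until a component is first accessed:

    import torch

    from megatron.bridge.models.hf_pretrained.vlm import PreTrainedVLM

    # Configure lazily; no weights are fetched here
    vlm = PreTrainedVLM(
        model_name_or_path="llava-hf/llava-1.5-7b-hf",
        torch_dtype=torch.float16,
        device="cuda",
        trust_remote_code=False,
    )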

ARTIFACTS#

['processor', 'tokenizer', 'image_processor']

OPTIONAL_ARTIFACTS#

['generation_config']

_load_model() → bridge.models.hf_pretrained.vlm.VLMType#

Lazy load and return the model.

_load_config() → transformers.AutoConfig#

Lazy load and return the model config.

_load_processor() → transformers.ProcessorMixin#

Lazy load and return the processor.

_load_tokenizer() → Optional[transformers.PreTrainedTokenizer]#

Lazy load and return the tokenizer. For VLMs, the tokenizer might be included in the processor.

_load_image_processor() → Optional[Any]#

Lazy load and return the image processor. For VLMs, the image processor might be included in the processor.
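
A minimal sketch of the fallback described for _load_tokenizer and _load_image_processor, assuming the composite processor exposes tokenizer and image_processor attributes (as most Hugging Face multimodal processors do):

    # Reuse components bundled inside the processor when standalone ones are absent
    processor = vlm.processor
    tokenizer = getattr(processor, "tokenizer", None)
    image_processor = getattr(processor, "image_processor", None)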

_load_generation_config() → Optional[transformers.GenerationConfig]#

Lazy load and return the generation config.

property model_name_or_path: Optional[Union[str, pathlib.Path]]#

Return the model name or path.

property model: bridge.models.hf_pretrained.vlm.VLMType#

Lazy load and return the underlying model.

property processor: transformers.ProcessorMixin#

Lazy load and return the processor.

property tokenizer: Optional[transformers.PreTrainedTokenizer]#

Lazy load and return the tokenizer.

property image_processor: Optional[Any]#

Lazy load and return the image processor.

property generation_config: Optional[transformers.GenerationConfig]#

Lazy load and return the generation config.

property kwargs: Dict[str, Any]#

Additional initialization kwargs.

classmethod from_pretrained(
model_name_or_path: Union[str, pathlib.Path],
device: Optional[Union[str, torch.device]] = None,
torch_dtype: Optional[torch.dtype] = None,
trust_remote_code: bool = False,
**kwargs,
) → PreTrainedVLM[VLMType]#

Create a PreTrainedVLM instance for lazy loading.

Parameters:
  • model_name_or_path – HuggingFace model identifier or local path

  • device – Device to load model on

  • torch_dtype – Data type to load model in

  • trust_remote_code – Whether to trust remote code

  • **kwargs – Additional arguments for from_pretrained methods

Returns:

PreTrainedVLM instance configured for lazy loading
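
A short sketch of the lazy-loading behavior this factory method sets up (checkpoint name illustrative):

    from megatron.bridge.models.hf_pretrained.vlm import PreTrainedVLM

    vlm = PreTrainedVLM.from_pretrained("llava-hf/llava-1.5-7b-hf", device="cuda")
    # Only the loading configuration is stored so far; the first attribute
    # access below triggers the actual download and instantiation.
    model = vlm.model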

generate(
**kwargs,
) → Union[torch.LongTensor, transformers.generation.utils.GenerateOutput]#

Generate sequences using the model.

Parameters:

**kwargs – Arguments for the generate method

Returns:

Generated sequences
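
Because the keyword arguments are passed through to the model's generate method, standard Hugging Face generation options can be supplied per call; a brief sketch (assumes inputs was produced by process_images_and_text):

    # Override decoding settings for this call only
    outputs = vlm.generate(**inputs, num_beams=3, do_sample=False, max_new_tokens=64)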

__call__(*args, **kwargs)#

Forward pass through the model.

encode_text(
text: Union[str, List[str]],
**kwargs,
) → Dict[str, torch.Tensor]#

Encode text input using the tokenizer.

Parameters:
  • text – Input text or list of texts

  • **kwargs – Additional tokenizer arguments

Returns:

Encoded inputs ready for the model
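
A small sketch of batch text encoding, assuming extra keyword arguments such as padding are forwarded to the tokenizer:

    inputs = vlm.encode_text(
        ["Describe a transformer model.", "What is attention?"],
        padding=True,
    )
    outputs = vlm.generate(**inputs, max_new_tokens=64)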

decode(token_ids: torch.Tensor, **kwargs) → str#

Decode token IDs to text.

Parameters:
  • token_ids – Token IDs to decode

  • **kwargs – Additional decoding arguments

Returns:

Decoded text

process_images_and_text(
images: Optional[Any] = None,
text: Optional[Union[str, List[str]]] = None,
**kwargs,
) → Dict[str, torch.Tensor]#

Process images and text together using the processor.

Parameters:
  • images – Input images

  • text – Input text

  • **kwargs – Additional processor arguments

Returns:

Processed inputs ready for the model
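
Depending on how the components were loaded, the returned tensors may still live on the CPU; a hedged sketch of moving them to the model's device before generation:

    inputs = vlm.process_images_and_text(images=image, text="What is shown here?")
    # Align input tensors with the model's device if needed
    inputs = {k: v.to(vlm.model.device) for k, v in inputs.items()}
    outputs = vlm.generate(**inputs)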

save_pretrained(save_directory: Union[str, pathlib.Path])#

Save the model and all components to a directory.

Parameters:

save_directory – Directory to save to
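
A round-trip sketch: save the loaded components locally, then reload them lazily from that directory (the path is illustrative):

    vlm.save_pretrained("./my_vlm_checkpoint")

    # Later: reconstruct the wrapper from the saved directory
    restored = PreTrainedVLM.from_pretrained("./my_vlm_checkpoint")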

to(device: Union[str, torch.device]) → PreTrainedVLM[VLMType]#

Move model to a device.

Parameters:

device – Target device

Returns:

Self for chaining

half() → PreTrainedVLM[VLMType]#

Convert model to half precision.

Returns:

Self for chaining
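
Since to() and half() both return self, device and precision changes can be chained; for example:

    # Move to GPU and cast to half precision in one expression
    vlm.to("cuda").half()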

float() → PreTrainedVLM[VLMType]#

Convert model to full precision.

Returns:

Self for chaining

property dtype: Optional[torch.dtype]#

Return the dtype of the model.

num_parameters(only_trainable: bool = False) → int#

Get the number of parameters in the model.

Parameters:

only_trainable – Whether to count only trainable parameters

Returns:

Number of parameters
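
A quick sketch comparing total and trainable parameter counts:

    total = vlm.num_parameters()
    trainable = vlm.num_parameters(only_trainable=True)
    print(f"{trainable:,} / {total:,} parameters are trainable")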

__repr__() → str#

String representation.