bridge.models.hf_pretrained.vlm#

Module Contents#

Classes#

PreTrainedVLM

A generic class for Pretrained Vision-Language Models with lazy loading.

Data#

API#

bridge.models.hf_pretrained.vlm.VLMType#

'TypeVar(…)'

class bridge.models.hf_pretrained.vlm.PreTrainedVLM(
model_name_or_path: Optional[Union[str, pathlib.Path]] = None,
device: Optional[Union[str, torch.device]] = None,
torch_dtype: Optional[torch.dtype] = None,
trust_remote_code: bool = False,
**kwargs,
)#

Bases: megatron.bridge.models.hf_pretrained.base.PreTrainedBase, typing.Generic[bridge.models.hf_pretrained.vlm.VLMType]

A generic class for Pretrained Vision-Language Models with lazy loading.

Allows type-safe access to specific VLM implementations like LlavaForConditionalGeneration.

Examples

Basic usage with image and text:

    from megatron.bridge.models.hf_pretrained.vlm import PreTrainedVLM
    from PIL import Image

    # Create instance - no model loading happens yet
    vlm = PreTrainedVLM.from_pretrained("llava-hf/llava-1.5-7b-hf")

    # Load an image
    image = Image.open("cat.jpg")

    # Process image and text together - processor and model load here
    inputs = vlm.process_images_and_text(
        images=image,
        text="What do you see in this image?",
    )

    # Generate response
    outputs = vlm.generate(**inputs, max_new_tokens=100)
    print(vlm.decode(outputs[0], skip_special_tokens=True))

Batch processing with multiple images:

    # Process multiple images with questions
    images = [Image.open(f"image_{i}.jpg") for i in range(3)]
    questions = [
        "What is the main object in this image?",
        "Describe the scene",
        "What colors do you see?",
    ]

    # Process batch
    inputs = vlm.process_images_and_text(
        images=images,
        text=questions,
        padding=True,
    )

    # Generate responses
    outputs = vlm.generate(**inputs, max_new_tokens=50)
    for i, output in enumerate(outputs):
        print(f"Image {i+1}: {vlm.decode(output, skip_special_tokens=True)}")

Using specific VLM types with type hints:

    import torch
    from transformers import LlavaForConditionalGeneration

    from megatron.bridge.models.hf_pretrained.vlm import PreTrainedVLM

    # Type-safe access to Llava-specific features
    llava: PreTrainedVLM[LlavaForConditionalGeneration] = PreTrainedVLM.from_pretrained(
        "llava-hf/llava-1.5-7b-hf",
        torch_dtype=torch.float16,
        device="cuda",
    )

    # Access model-specific attributes
    vision_tower = llava.model.vision_tower  # Type-safe access

Text-only generation (for multimodal models that support it):

    # Some VLMs can also work with text-only inputs
    text_inputs = vlm.encode_text("Explain what a neural network is.")
    outputs = vlm.generate(**text_inputs, max_length=100)
    print(vlm.decode(outputs[0], skip_special_tokens=True))

Custom preprocessing and generation:

    # Load with custom settings
    vlm = PreTrainedVLM.from_pretrained(
        "Qwen/Qwen-VL-Chat",
        trust_remote_code=True,
        device_map="auto",
        load_in_4bit=True,
    )

    # Custom generation config
    from transformers import GenerationConfig

    vlm.generation_config = GenerationConfig(
        max_new_tokens=200,
        temperature=0.8,
        top_p=0.95,
        do_sample=True,
    )

    # Process with custom parameters
    inputs = vlm.process_images_and_text(
        images=image,
        text="\nDescribe this image in detail.",
        max_length=512,
    )

Manual component setup:

    # Create empty instance
    vlm = PreTrainedVLM()

    # Load components separately
    from transformers import AutoModel, AutoProcessor

    vlm.processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base")
    vlm.model = AutoModel.from_pretrained("microsoft/Florence-2-base")

    # Use for various vision tasks
    task_prompt = "<OD>"  # Florence-2 object detection task prompt
    inputs = vlm.process_images_and_text(images=image, text=task_prompt)
    outputs = vlm.generate(**inputs)

Conversational VLM usage:

    # Multi-turn conversation with images
    conversation = []

    # First turn
    image1 = Image.open("chart.png")
    inputs = vlm.process_images_and_text(
        images=image1,
        text="What type of chart is this?",
    )
    response = vlm.generate(**inputs)
    conversation.append(("user", "What type of chart is this?"))
    conversation.append(("assistant", vlm.decode(response[0])))

    # Follow-up question
    follow_up = "What is the highest value shown?"

    # Format conversation history + new question
    # (format_conversation is a user-defined helper that renders the history as a prompt)
    full_prompt = format_conversation(conversation) + f"\nUser: {follow_up}"
    inputs = vlm.process_images_and_text(images=image1, text=full_prompt)
    response = vlm.generate(**inputs)

Initialization

Initialize a Pretrained VLM with lazy loading.

Parameters:
  • model_name_or_path – HuggingFace model identifier or local path

  • device – Device to load model on (e.g., 'cuda', 'cpu')

  • torch_dtype – Data type to load model in (e.g., torch.float16)

  • trust_remote_code – Whether to trust remote code when loading

  • **kwargs – Additional arguments passed to component loaders
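
A minimal construction sketch using the parameters above (the checkpoint name is taken from the examples and is purely illustrative); nothing is downloaded or loaded until a component is first accessed:

    import torch

    from megatron.bridge.models.hf_pretrained.vlm import PreTrainedVLM

    # Configure lazily; no weights are fetched here
    vlm = PreTrainedVLM(
        model_name_or_path="llava-hf/llava-1.5-7b-hf",
        torch_dtype=torch.float16,
        device="cuda",
        trust_remote_code=False,
    )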

ARTIFACTS#

['processor', 'tokenizer', 'image_processor']

OPTIONAL_ARTIFACTS#

['generation_config']

_load_model() → bridge.models.hf_pretrained.vlm.VLMType#

Lazy load and return the model.

_load_config() → transformers.AutoConfig#

Lazy load and return the model config.

_load_processor() → transformers.ProcessorMixin#

Lazy load and return the processor.

_load_tokenizer() → Optional[transformers.PreTrainedTokenizer]#

Lazy load and return the tokenizer. For VLMs, the tokenizer might be included in the processor.

_load_image_processor() → Optional[Any]#

Lazy load and return the image processor. For VLMs, the image processor might be included in the processor.
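
A minimal sketch of the fallback described for _load_tokenizer and _load_image_processor, assuming the composite processor exposes tokenizer and image_processor attributes (as most Hugging Face multimodal processors do):

    # Reuse components bundled inside the processor when standalone ones are absent
    processor = vlm.processor
    tokenizer = getattr(processor, "tokenizer", None)
    image_processor = getattr(processor, "image_processor", None)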

_load_generation_config() → Optional[transformers.GenerationConfig]#

Lazy load and return the generation config.

property model_name_or_path: Optional[Union[str, pathlib.Path]]#

Return the model name or path.

property model: bridge.models.hf_pretrained.vlm.VLMType#

Lazy load and return the underlying model.

property processor: transformers.ProcessorMixin#

Lazy load and return the processor.

property tokenizer: Optional[transformers.PreTrainedTokenizer]#

Lazy load and return the tokenizer.

property image_processor: Optional[Any]#

Lazy load and return the image processor.

property generation_config: Optional[transformers.GenerationConfig]#

Lazy load and return the generation config.

property kwargs: Dict[str, Any]#

Additional initialization kwargs.

classmethod from_pretrained(
model_name_or_path: Union[str, pathlib.Path],
device: Optional[Union[str, torch.device]] = None,
torch_dtype: Optional[torch.dtype] = None,
trust_remote_code: bool = False,
**kwargs,
) → PreTrainedVLM[VLMType]#

Create a PreTrainedVLM instance for lazy loading.

Parameters:
  • model_name_or_path – HuggingFace model identifier or local path

  • device – Device to load model on

  • torch_dtype – Data type to load model in

  • trust_remote_code – Whether to trust remote code

  • **kwargs – Additional arguments for from_pretrained methods

Returns:

PreTrainedVLM instance configured for lazy loading
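
A short sketch of the lazy-loading behavior this factory method sets up (checkpoint name illustrative):

    from megatron.bridge.models.hf_pretrained.vlm import PreTrainedVLM

    vlm = PreTrainedVLM.from_pretrained("llava-hf/llava-1.5-7b-hf", device="cuda")
    # Only the loading configuration is stored so far; the first attribute
    # access below triggers the actual download and instantiation.
    model = vlm.model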

generate(
**kwargs,
) → Union[torch.LongTensor, transformers.generation.utils.GenerateOutput]#

Generate sequences using the model.

Parameters:

**kwargs – Arguments for the generate method

Returns:

Generated sequences
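
Because the keyword arguments are passed through to the model's generate method, standard Hugging Face generation options can be supplied per call; a brief sketch (assumes inputs was produced by process_images_and_text):

    # Override decoding settings for this call only
    outputs = vlm.generate(**inputs, num_beams=3, do_sample=False, max_new_tokens=64)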

__call__(*args, **kwargs)#

Forward pass through the model.

encode_text(
text: Union[str, List[str]],
**kwargs,
) → Dict[str, torch.Tensor]#

Encode text input using the tokenizer.

Parameters:
  • text – Input text or list of texts

  • **kwargs – Additional tokenizer arguments

Returns:

Encoded inputs ready for the model
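
A small sketch of batch text encoding, assuming extra keyword arguments such as padding are forwarded to the tokenizer:

    inputs = vlm.encode_text(
        ["Describe a transformer model.", "What is attention?"],
        padding=True,
    )
    outputs = vlm.generate(**inputs, max_new_tokens=64)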

decode(token_ids: torch.Tensor, **kwargs) → str#

Decode token IDs to text.

Parameters:
  • token_ids – Token IDs to decode

  • **kwargs – Additional decoding arguments

Returns:

Decoded text

process_images_and_text(
images: Optional[Any] = None,
text: Optional[Union[str, List[str]]] = None,
**kwargs,
) → Dict[str, torch.Tensor]#

Process images and text together using the processor.

Parameters:
  • images – Input images

  • text – Input text

  • **kwargs – Additional processor arguments

Returns:

Processed inputs ready for the model
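
Depending on how the components were loaded, the returned tensors may still live on the CPU; a hedged sketch of moving them to the model's device before generation:

    inputs = vlm.process_images_and_text(images=image, text="What is shown here?")
    # Align input tensors with the model's device if needed
    inputs = {k: v.to(vlm.model.device) for k, v in inputs.items()}
    outputs = vlm.generate(**inputs)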

save_pretrained(save_directory: Union[str, pathlib.Path])#

Save the model and all components to a directory.

Parameters:

save_directory – Directory to save to
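
A round-trip sketch: save the loaded components locally, then reload them lazily from that directory (the path is illustrative):

    vlm.save_pretrained("./my_vlm_checkpoint")

    # Later: reconstruct the wrapper from the saved directory
    restored = PreTrainedVLM.from_pretrained("./my_vlm_checkpoint")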

to(device: Union[str, torch.device]) → PreTrainedVLM[VLMType]#

Move model to a device.

Parameters:

device – Target device

Returns:

Self for chaining

half() → PreTrainedVLM[VLMType]#

Convert model to half precision.

Returns:

Self for chaining
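
Since to() and half() both return self, device and precision changes can be chained; for example:

    # Move to GPU and cast to half precision in one expression
    vlm.to("cuda").half()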

float() → PreTrainedVLM[VLMType]#

Convert model to full precision.

Returns:

Self for chaining

property dtype: Optional[torch.dtype]#

Return the dtype of the model.

num_parameters(only_trainable: bool = False) → int#

Get the number of parameters in the model.

Parameters:

only_trainable – Whether to count only trainable parameters

Returns:

Number of parameters
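
A quick sketch comparing total and trainable parameter counts:

    total = vlm.num_parameters()
    trainable = vlm.num_parameters(only_trainable=True)
    print(f"{trainable:,} / {total:,} parameters are trainable")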

__repr__() → str#

String representation.