bridge.models.hf_pretrained.vlm#
Module Contents#
Classes#
PreTrainedVLM: A generic class for Pretrained Vision-Language Models with lazy loading.
Data#
API#
- bridge.models.hf_pretrained.vlm.VLMType#
TypeVar(...)
- class bridge.models.hf_pretrained.vlm.PreTrainedVLM(
      model_name_or_path: Optional[Union[str, pathlib.Path]] = None,
      device: Optional[Union[str, torch.device]] = None,
      torch_dtype: Optional[torch.dtype] = None,
      trust_remote_code: bool = False,
      **kwargs,
  )#
Bases: megatron.bridge.models.hf_pretrained.base.PreTrainedBase, typing.Generic[bridge.models.hf_pretrained.vlm.VLMType]

A generic class for Pretrained Vision-Language Models with lazy loading.

Allows type-safe access to specific VLM implementations like LlavaForConditionalGeneration.
Examples
Basic usage with image and text:
    from megatron.bridge.models.hf_pretrained.vlm import PreTrainedVLM
    from PIL import Image

    # Create instance - no model loading happens yet
    vlm = PreTrainedVLM.from_pretrained("llava-hf/llava-1.5-7b-hf")

    # Load an image
    image = Image.open("cat.jpg")

    # Process image and text together - processor and model load here
    inputs = vlm.process_images_and_text(
        images=image,
        text="What do you see in this image?",
    )

    # Generate response
    outputs = vlm.generate(**inputs, max_new_tokens=100)
    print(vlm.decode(outputs[0], skip_special_tokens=True))
Batch processing with multiple images:
    # Process multiple images with questions
    images = [Image.open(f"image_{i}.jpg") for i in range(3)]
    questions = [
        "What is the main object in this image?",
        "Describe the scene",
        "What colors do you see?",
    ]

    # Process batch
    inputs = vlm.process_images_and_text(
        images=images,
        text=questions,
        padding=True,
    )

    # Generate responses
    outputs = vlm.generate(**inputs, max_new_tokens=50)
    for i, output in enumerate(outputs):
        print(f"Image {i+1}: {vlm.decode(output, skip_special_tokens=True)}")
Using specific VLM types with type hints:
    import torch
    from transformers import LlavaForConditionalGeneration

    from megatron.bridge.models.hf_pretrained.vlm import PreTrainedVLM

    # Type-safe access to Llava-specific features
    llava: PreTrainedVLM[LlavaForConditionalGeneration] = PreTrainedVLM.from_pretrained(
        "llava-hf/llava-1.5-7b-hf",
        torch_dtype=torch.float16,
        device="cuda",
    )

    # Access model-specific attributes
    vision_tower = llava.model.vision_tower  # Type-safe access
Text-only generation (for multimodal models that support it):
    # Some VLMs can also work with text-only inputs
    text_inputs = vlm.encode_text("Explain what a neural network is.")
    outputs = vlm.generate(**text_inputs, max_length=100)
    print(vlm.decode(outputs[0], skip_special_tokens=True))
Custom preprocessing and generation:
    # Load with custom settings
    vlm = PreTrainedVLM.from_pretrained(
        "Qwen/Qwen-VL-Chat",
        trust_remote_code=True,
        device_map="auto",
        load_in_4bit=True,
    )

    # Custom generation config
    from transformers import GenerationConfig

    vlm.generation_config = GenerationConfig(
        max_new_tokens=200,
        temperature=0.8,
        top_p=0.95,
        do_sample=True,
    )

    # Process with custom parameters
    inputs = vlm.process_images_and_text(
        images=image,
        text="<image>\nDescribe this image in detail.",
        max_length=512,
    )

Manual component setup:
    # Create empty instance
    vlm = PreTrainedVLM()

    # Load components separately
    from transformers import AutoProcessor, AutoModel

    vlm.processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base")
    vlm.model = AutoModel.from_pretrained("microsoft/Florence-2-base")

    # Use for various vision tasks
    task_prompt = "<OD>"  # Object detection task
    inputs = vlm.process_images_and_text(images=image, text=task_prompt)
    outputs = vlm.generate(**inputs)

Conversational VLM usage:
    # Multi-turn conversation with images
    conversation = []

    # First turn
    image1 = Image.open("chart.png")
    inputs = vlm.process_images_and_text(
        images=image1,
        text="What type of chart is this?",
    )
    response = vlm.generate(**inputs)
    conversation.append(("user", "What type of chart is this?"))
    conversation.append(("assistant", vlm.decode(response[0])))

    # Follow-up question
    follow_up = "What is the highest value shown?"

    # Format conversation history + new question
    full_prompt = format_conversation(conversation) + f"\nUser: {follow_up}"
    inputs = vlm.process_images_and_text(images=image1, text=full_prompt)
    response = vlm.generate(**inputs)
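The format_conversation helper used above is not provided by this module. A minimal hypothetical sketch, purely for illustration:

    def format_conversation(turns):
        # Hypothetical helper: join (role, text) pairs into a plain-text prompt.
        lines = []
        for role, text in turns:
            prefix = "User" if role == "user" else "Assistant"
            lines.append(f"{prefix}: {text}")
        return "\n".join(lines)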
Initialization
Initialize a Pretrained VLM with lazy loading.
- Parameters:
model_name_or_path – HuggingFace model identifier or local path
device – Device to load model on (e.g., ‘cuda’, ‘cpu’)
torch_dtype – Data type to load model in (e.g., torch.float16)
trust_remote_code – Whether to trust remote code when loading
**kwargs – Additional arguments passed to component loaders
- ARTIFACTS#
['processor', 'tokenizer', 'image_processor']
- OPTIONAL_ARTIFACTS#
['generation_config']
- _load_model() bridge.models.hf_pretrained.vlm.VLMType #
Lazy load and return the model.
- _load_config() transformers.AutoConfig #
Lazy load and return the model config.
- _load_processor() transformers.ProcessorMixin #
Lazy load and return the processor.
- _load_tokenizer() Optional[transformers.PreTrainedTokenizer] #
Lazy load and return the tokenizer. For VLMs, the tokenizer might be included in the processor.
- _load_image_processor() Optional[Any] #
Lazy load and return the image processor. For VLMs, the image processor might be included in the processor.
- _load_generation_config() Optional[transformers.GenerationConfig] #
Lazy load and return the generation config.
- property model_name_or_path: Optional[Union[str, pathlib.Path]]#
Return the model name or path.
- property model: bridge.models.hf_pretrained.vlm.VLMType#
Lazy load and return the underlying model.
- property processor: transformers.ProcessorMixin#
Lazy load and return the processor.
- property tokenizer: Optional[transformers.PreTrainedTokenizer]#
Lazy load and return the tokenizer.
- property image_processor: Optional[Any]#
Lazy load and return the image processor.
- property generation_config: Optional[transformers.GenerationConfig]#
Lazy load and return the generation config.
- property kwargs: Dict[str, Any]#
Additional initialization kwargs.
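The properties above are resolved lazily: each component is loaded on first access, and nothing is instantiated before that. A minimal sketch, reusing the llava-hf/llava-1.5-7b-hf checkpoint from the examples above:

    from megatron.bridge.models.hf_pretrained.vlm import PreTrainedVLM

    vlm = PreTrainedVLM.from_pretrained("llava-hf/llava-1.5-7b-hf")

    # Each property triggers its loader only on first use.
    processor = vlm.processor        # loads the processor
    tokenizer = vlm.tokenizer        # may come bundled with the processor
    gen_cfg = vlm.generation_config  # optional artifact; may be None for some checkpoints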
- classmethod from_pretrained(
      model_name_or_path: Union[str, pathlib.Path],
      device: Optional[Union[str, torch.device]] = None,
      torch_dtype: Optional[torch.dtype] = None,
      trust_remote_code: bool = False,
      **kwargs,
  )#
Create a PreTrainedVLM instance for lazy loading.
- Parameters:
model_name_or_path – HuggingFace model identifier or local path
device – Device to load model on
torch_dtype – Data type to load model in
trust_remote_code – Whether to trust remote code
**kwargs – Additional arguments for from_pretrained methods
- Returns:
PreTrainedVLM instance configured for lazy loading
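A short sketch of the lazy-loading contract described above, using the same checkpoint as the earlier examples; weights are only materialized when vlm.model is first accessed:

    import torch

    from megatron.bridge.models.hf_pretrained.vlm import PreTrainedVLM

    vlm = PreTrainedVLM.from_pretrained(
        "llava-hf/llava-1.5-7b-hf",
        torch_dtype=torch.float16,
        device="cuda",
    )
    _ = vlm.model  # weights are loaded here, not in from_pretrained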
- generate(**kwargs)#
Generate sequences using the model.
- Parameters:
**kwargs – Arguments for the generate method
- Returns:
Generated sequences
- __call__(*args, **kwargs)#
Forward pass through the model.
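A minimal sketch of a direct forward pass (as opposed to generate), reusing the vlm and image objects from the examples above. It assumes the underlying Hugging Face model returns a standard output object with a logits field:

    inputs = vlm.process_images_and_text(images=image, text="What is shown here?")
    outputs = vlm(**inputs)   # forwards straight to the underlying model
    logits = outputs.logits   # assumption: standard HF model output with logits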
- encode_text(text: Union[str, List[str]], **kwargs)#
Encode text input using the tokenizer.
- Parameters:
text – Input text or list of texts
**kwargs – Additional tokenizer arguments
- Returns:
Encoded inputs ready for the model
- decode(token_ids: torch.Tensor, **kwargs) str #
Decode token IDs to text.
- Parameters:
token_ids – Token IDs to decode
**kwargs – Additional decoding arguments
- Returns:
Decoded text
- process_images_and_text(images: Optional[Any] = None, text: Optional[Union[str, List[str]]] = None, **kwargs)#
Process images and text together using the processor.
- Parameters:
images – Input images
text – Input text
**kwargs – Additional processor arguments
- Returns:
Processed inputs ready for the model
- save_pretrained(save_directory: Union[str, pathlib.Path])#
Save the model and all components to a directory.
- Parameters:
save_directory – Directory to save to
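A brief sketch of round-tripping a checkpoint with save_pretrained; the directory name is illustrative:

    vlm.save_pretrained("./my_vlm_checkpoint")

    # Reload later from the local path with the same lazy-loading behavior.
    restored = PreTrainedVLM.from_pretrained("./my_vlm_checkpoint")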
- to(device: Union[str, torch.device]) PreTrainedVLM[VLMType] #
Move model to a device.
- Parameters:
device – Target device
- Returns:
Self for chaining
- half() PreTrainedVLM[VLMType] #
Convert model to half precision.
- Returns:
Self for chaining
- float() PreTrainedVLM[VLMType] #
Convert model to full precision.
- Returns:
Self for chaining
- property dtype: Optional[torch.dtype]#
Return the dtype of the model.
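A combined sketch of the device and precision helpers above; to, half, and float return self, so calls can be chained:

    vlm = vlm.to("cuda").half()  # move to GPU and cast to half precision
    print(vlm.dtype)             # expected torch.float16 once the model is loaded
    vlm = vlm.float()            # back to full precision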
- num_parameters(only_trainable: bool = False) int #
Get the number of parameters in the model.
- Parameters:
only_trainable – Whether to count only trainable parameters
- Returns:
Number of parameters
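A quick sketch of parameter counting:

    total = vlm.num_parameters()
    trainable = vlm.num_parameters(only_trainable=True)
    print(f"{total:,} parameters ({trainable:,} trainable)")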
- __repr__() str #
String representation.