bridge.models.nemotron_vl.nemotron_vl_utils#

Module Contents#

Functions#

encode_pil_to_jpeg_data_url

Encode a PIL image to a base64-encoded data URL.

sample_video_frames_to_data_urls

Sample frames from a video and return base64-encoded data URLs along with metadata.

maybe_path_or_url_to_data_urls

Convert a path or URL to data URLs, handling videos, images, and remote files.

pil_image_from_base64

Decode a base64-encoded image to a PIL image.

adjust_image_tokens

Ensures the input_ids tensor contains the correct number of tokens as specified by num_tiles. This adjustment is necessary to bridge the gap between the HF processor output and the Megatron LLaVAModel.

API#

bridge.models.nemotron_vl.nemotron_vl_utils.encode_pil_to_jpeg_data_url(pil_image)#

Encode a PIL image to a base64-encoded data URL.
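The implementation is not reproduced here, but a JPEG data URL conventionally has the form `data:image/jpeg;base64,<payload>`. A minimal stdlib-only sketch of that encoding, operating on raw JPEG bytes so that Pillow is not required (the helper name `jpeg_bytes_to_data_url` is an assumption, not part of this module):

```python
import base64


def jpeg_bytes_to_data_url(jpeg_bytes: bytes) -> str:
    """Wrap raw JPEG bytes in a base64 data URL (a sketch of the
    format encode_pil_to_jpeg_data_url presumably produces)."""
    payload = base64.b64encode(jpeg_bytes).decode("ascii")
    return f"data:image/jpeg;base64,{payload}"


# A real call would first serialize the PIL image, roughly:
#   buf = io.BytesIO(); pil_image.save(buf, format="JPEG")
#   url = jpeg_bytes_to_data_url(buf.getvalue())
url = jpeg_bytes_to_data_url(b"\xff\xd8\xff\xe0fake-jpeg")
```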

bridge.models.nemotron_vl.nemotron_vl_utils.sample_video_frames_to_data_urls(
video_path_local,
fps=1,
nframe=0,
nframe_max=-1,
)#

Sample frames from a video and return base64-encoded data URLs along with metadata.

Parameters:
  • video_path_local – Path to the video file

  • fps – Target frames per second for sampling (if > 0, uses fps-based sampling)

  • nframe – Number of frames to sample (used if fps <= 0)

  • nframe_max – Maximum number of frames to sample

Returns:

(frame_data_urls, metadata)

  • frame_data_urls: List of base64-encoded frame images

  • metadata: VideoMetadata dataclass containing info about the sampled frames:

    • total_num_frames: Number of sampled frames

    • fps: Effective frame rate of the sampled frames

    • duration: Duration covered by the sampled frames (in seconds)

    • video_backend: Backend used for video processing ('decord')

Return type:

tuple
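The decord-based decoding itself is not reproduced here, but the interaction of fps, nframe, and nframe_max described above can be sketched with stdlib-only index arithmetic (the function name and the exact rounding/spacing are assumptions):

```python
def pick_frame_indices(total_frames: int, native_fps: float,
                       fps: float = 1, nframe: int = 0,
                       nframe_max: int = -1) -> list:
    """Choose frame indices: fps-based sampling when fps > 0,
    otherwise a fixed count of nframe evenly spaced frames;
    nframe_max (if >= 0) caps the count either way."""
    duration = total_frames / native_fps
    if fps > 0:
        n = max(1, int(duration * fps))   # target count from fps
    else:
        n = max(1, nframe)                # fixed count fallback
    if nframe_max >= 0:
        n = min(n, nframe_max)            # cap the sample count
    step = total_frames / n               # evenly spaced span centers
    return [min(total_frames - 1, int(step * i + step / 2)) for i in range(n)]
```

For a 10-second clip at 30 fps (300 frames), `fps=1` yields 10 indices spread across the clip, while `fps=0, nframe=5` yields exactly 5.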

bridge.models.nemotron_vl.nemotron_vl_utils.maybe_path_or_url_to_data_urls(
path_or_url,
fps=1,
nframe=0,
nframe_max=-1,
)#

Convert a path or URL to data URLs, handling videos, images, and remote files.

Parameters:
  • path_or_url – Path or URL to the media file

  • fps – Target frames per second for video sampling (if > 0, uses fps-based sampling)

  • nframe – Number of frames to sample from video (used if fps <= 0)

  • nframe_max – Maximum number of frames to sample

Returns:

(data_urls, metadata)

  • data_urls: List of base64-encoded data URLs

  • metadata: VideoMetadata dataclass with video metadata or None for images

Return type:

tuple
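The dispatcher above has to decide whether the input is remote and whether it is a video before producing data URLs. A stdlib-only sketch of that classification step (the helper name and the extension set are assumptions):

```python
import os
from urllib.parse import urlparse

# Assumed set of video extensions; the real module may recognize others.
VIDEO_EXTS = {".mp4", ".avi", ".mov", ".mkv", ".webm"}


def classify_media(path_or_url: str):
    """Return (is_remote, is_video): the two routing decisions a
    path-or-URL dispatcher like the one above must make."""
    parsed = urlparse(path_or_url)
    is_remote = parsed.scheme in ("http", "https")
    path = parsed.path if is_remote else path_or_url
    is_video = os.path.splitext(path)[1].lower() in VIDEO_EXTS
    return is_remote, is_video
```

Videos would then be routed through frame sampling (yielding a VideoMetadata), while images would be encoded directly, with metadata None as noted above.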

bridge.models.nemotron_vl.nemotron_vl_utils.pil_image_from_base64(b64_str: str) → PIL.Image.Image#

Decode a base64-encoded image to a PIL image.
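The decoding step can be sketched without Pillow: strip an optional data-URL header, then base64-decode the payload into the bytes a real implementation would hand to `PIL.Image.open`. The helper name is an assumption:

```python
import base64


def decode_base64_payload(b64_str: str) -> bytes:
    """Strip an optional data-URL header and decode the base64
    payload (sketch; the real function returns a PIL image)."""
    if b64_str.startswith("data:"):
        b64_str = b64_str.split(",", 1)[1]  # drop "data:image/...;base64,"
    return base64.b64decode(b64_str)
```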

bridge.models.nemotron_vl.nemotron_vl_utils.adjust_image_tokens(
input_ids: torch.Tensor | Dict[str, torch.Tensor],
num_tiles: int | List[int],
img_start_token_id: int,
img_end_token_id: int,
) → torch.Tensor | Dict[str, torch.Tensor]#

Ensures the input_ids tensor contains the correct number of tokens as specified by num_tiles. This adjustment is necessary to bridge the gap between the HF processor output and the Megatron LLaVAModel.

Example

The decoded input_ids may look like:

    System: … User: …
    Image 1: …   # adjust number of tokens to num_tiles[0]
    Image 2: …   # adjust number of tokens to num_tiles[1]
    …

Parameters:
  • input_ids – The input_ids tensor (output of the HF processor), or a dictionary of tensors that must contain the key "input_ids"; all other tensors in the dictionary must have the same shape as input_ids

  • num_tiles – The number of tokens to ensure, either a single int or a list of ints

  • img_start_token_id – The token id marking the start of an image segment

  • img_end_token_id – The token id marking the end of an image segment

Returns:

The input_ids tensor with the correct number of image tokens, or a dictionary of tensors each with the same shape as the adjusted input_ids
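The tensor implementation is not reproduced here; the following stdlib-only sketch applies the same idea to a plain list of token ids, forcing each start/end-delimited span to the matching entry of num_tiles (the pad-by-repetition strategy and the assumption that each span holds at least one placeholder token are illustrative, not taken from the source):

```python
def adjust_image_tokens_list(ids, num_tiles, img_start, img_end):
    """For each img_start..img_end span, force the number of enclosed
    tokens to the matching num_tiles entry by repeating the last
    enclosed token or truncating (sketch; real code adjusts tensors)."""
    if isinstance(num_tiles, int):
        num_tiles = [num_tiles]
    out, i, span = [], 0, 0
    while i < len(ids):
        tok = ids[i]
        out.append(tok)
        if tok == img_start:
            j = ids.index(img_end, i + 1)      # find matching end marker
            inner = ids[i + 1:j]               # tokens between the markers
            target = num_tiles[span]
            span += 1
            if len(inner) < target:            # pad by repeating last token
                inner = inner + [inner[-1]] * (target - len(inner))
            out.extend(inner[:target])         # ...or truncate to target
            out.append(img_end)
            i = j + 1
        else:
            i += 1
    return out
```

For example, with start/end ids 9/10 and one placeholder token 5, a span `[9, 5, 5, 10]` with `num_tiles=3` becomes `[9, 5, 5, 5, 10]`, and with `num_tiles=1` it becomes `[9, 5, 10]`.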