bridge.models.nemotron_vl.nemotron_vl_utils#

Module Contents#

Functions#

encode_pil_to_jpeg_data_url

Encode a PIL image to a base64-encoded data URL.

sample_video_frames_to_data_urls

Sample frames from a video and return base64-encoded data URLs along with metadata.

maybe_path_or_url_to_data_urls

Convert a path or URL to data URLs, handling videos, images, and remote files.

pil_image_from_base64

Decode a base64-encoded image to a PIL image.

adjust_image_tokens

Ensures the input_ids tensor contains the correct number of tokens as specified by num_tiles. This adjustment is necessary to bridge the gap between the HF processor output and the Megatron LLaVAModel.

API#

bridge.models.nemotron_vl.nemotron_vl_utils.encode_pil_to_jpeg_data_url(pil_image)#

Encode a PIL image to a base64-encoded data URL.
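The implementation is not reproduced here, but a JPEG data URL conventionally has the form `data:image/jpeg;base64,<payload>`. A minimal stdlib-only sketch of that encoding, operating on raw JPEG bytes so that Pillow is not required (the helper name `jpeg_bytes_to_data_url` is an assumption, not part of this module):

```python
import base64


def jpeg_bytes_to_data_url(jpeg_bytes: bytes) -> str:
    """Wrap raw JPEG bytes in a base64 data URL (a sketch of the
    format encode_pil_to_jpeg_data_url presumably produces)."""
    payload = base64.b64encode(jpeg_bytes).decode("ascii")
    return f"data:image/jpeg;base64,{payload}"


# A real call would first serialize the PIL image, roughly:
#   buf = io.BytesIO(); pil_image.save(buf, format="JPEG")
#   url = jpeg_bytes_to_data_url(buf.getvalue())
url = jpeg_bytes_to_data_url(b"\xff\xd8\xff\xe0fake-jpeg")
```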

bridge.models.nemotron_vl.nemotron_vl_utils.sample_video_frames_to_data_urls(
video_path_local,
fps=1,
nframe=0,
nframe_max=-1,
)#

Sample frames from a video and return base64-encoded data URLs along with metadata.

Parameters:
  • video_path_local – Path to the video file

  • fps – Target frames per second for sampling (if > 0, uses fps-based sampling)

  • nframe – Number of frames to sample (used if fps <= 0)

  • nframe_max – Maximum number of frames to sample

Returns:

(frame_data_urls, metadata)

  • frame_data_urls: List of base64-encoded frame images

  • metadata: VideoMetadata dataclass containing info about the sampled frames:

    • total_num_frames: Number of sampled frames

    • fps: Effective frame rate of the sampled frames

    • duration: Duration covered by the sampled frames (in seconds)

    • video_backend: Backend used for video processing ('decord')

Return type:

tuple
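The decord-based decoding itself is not reproduced here, but the interaction of fps, nframe, and nframe_max described above can be sketched with stdlib-only index arithmetic (the function name and the exact rounding/spacing are assumptions):

```python
def pick_frame_indices(total_frames: int, native_fps: float,
                       fps: float = 1, nframe: int = 0,
                       nframe_max: int = -1) -> list:
    """Choose frame indices: fps-based sampling when fps > 0,
    otherwise a fixed count of nframe evenly spaced frames;
    nframe_max (if >= 0) caps the count either way."""
    duration = total_frames / native_fps
    if fps > 0:
        n = max(1, int(duration * fps))   # target count from fps
    else:
        n = max(1, nframe)                # fixed count fallback
    if nframe_max >= 0:
        n = min(n, nframe_max)            # cap the sample count
    step = total_frames / n               # evenly spaced span centers
    return [min(total_frames - 1, int(step * i + step / 2)) for i in range(n)]
```

For a 10-second clip at 30 fps (300 frames), `fps=1` yields 10 indices spread across the clip, while `fps=0, nframe=5` yields exactly 5.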

bridge.models.nemotron_vl.nemotron_vl_utils.maybe_path_or_url_to_data_urls(
path_or_url,
fps=1,
nframe=0,
nframe_max=-1,
)#

Convert a path or URL to data URLs, handling videos, images, and remote files.

Parameters:
  • path_or_url – Path or URL to the media file

  • fps – Target frames per second for video sampling (if > 0, uses fps-based sampling)

  • nframe – Number of frames to sample from video (used if fps <= 0)

  • nframe_max – Maximum number of frames to sample

Returns:

(data_urls, metadata)

  • data_urls: List of base64-encoded data URLs

  • metadata: VideoMetadata dataclass with video metadata or None for images

Return type:

tuple
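The dispatcher above has to decide whether the input is remote and whether it is a video before producing data URLs. A stdlib-only sketch of that classification step (the helper name and the extension set are assumptions):

```python
import os
from urllib.parse import urlparse

# Assumed set of video extensions; the real module may recognize others.
VIDEO_EXTS = {".mp4", ".avi", ".mov", ".mkv", ".webm"}


def classify_media(path_or_url: str):
    """Return (is_remote, is_video): the two routing decisions a
    path-or-URL dispatcher like the one above must make."""
    parsed = urlparse(path_or_url)
    is_remote = parsed.scheme in ("http", "https")
    path = parsed.path if is_remote else path_or_url
    is_video = os.path.splitext(path)[1].lower() in VIDEO_EXTS
    return is_remote, is_video
```

Videos would then be routed through frame sampling (yielding a VideoMetadata), while images would be encoded directly, with metadata None as noted above.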

bridge.models.nemotron_vl.nemotron_vl_utils.pil_image_from_base64(b64_str: str) → PIL.Image.Image#

Decode a base64-encoded image to a PIL image.
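The decoding step can be sketched without Pillow: strip an optional data-URL header, then base64-decode the payload into the bytes a real implementation would hand to `PIL.Image.open`. The helper name is an assumption:

```python
import base64


def decode_base64_payload(b64_str: str) -> bytes:
    """Strip an optional data-URL header and decode the base64
    payload (sketch; the real function returns a PIL image)."""
    if b64_str.startswith("data:"):
        b64_str = b64_str.split(",", 1)[1]  # drop "data:image/...;base64,"
    return base64.b64decode(b64_str)
```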

bridge.models.nemotron_vl.nemotron_vl_utils.adjust_image_tokens(
input_ids: torch.Tensor | Dict[str, torch.Tensor],
num_tiles: int | List[int],
img_start_token_id: int,
img_end_token_id: int,
) → torch.Tensor | Dict[str, torch.Tensor]#

Ensures the input_ids tensor contains the correct number of tokens as specified by num_tiles. This adjustment is necessary to bridge the gap between the HF processor output and the Megatron LLaVAModel.

Example

The decoded input_ids may look like:

    System: … User: …
    Image 1: …   # adjust number of tokens to num_tiles[0]
    Image 2: …   # adjust number of tokens to num_tiles[1]
    …

Parameters:
  • input_ids – The input_ids tensor (output of the HF processor), or a dictionary of tensors that must contain the key "input_ids"; all other tensors in the dictionary must have the same shape as input_ids

  • num_tiles – The number of tokens to ensure, either a single int or a list of ints

  • img_start_token_id – The token id marking the start of an image segment

  • img_end_token_id – The token id marking the end of an image segment

Returns:

The input_ids tensor with the correct number of image tokens, or a dictionary of tensors each with the same shape as the adjusted input_ids
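The tensor implementation is not reproduced here; the following stdlib-only sketch applies the same idea to a plain list of token ids, forcing each start/end-delimited span to the matching entry of num_tiles (the pad-by-repetition strategy and the assumption that each span holds at least one placeholder token are illustrative, not taken from the source):

```python
def adjust_image_tokens_list(ids, num_tiles, img_start, img_end):
    """For each img_start..img_end span, force the number of enclosed
    tokens to the matching num_tiles entry by repeating the last
    enclosed token or truncating (sketch; real code adjusts tensors)."""
    if isinstance(num_tiles, int):
        num_tiles = [num_tiles]
    out, i, span = [], 0, 0
    while i < len(ids):
        tok = ids[i]
        out.append(tok)
        if tok == img_start:
            j = ids.index(img_end, i + 1)      # find matching end marker
            inner = ids[i + 1:j]               # tokens between the markers
            target = num_tiles[span]
            span += 1
            if len(inner) < target:            # pad by repeating last token
                inner = inner + [inner[-1]] * (target - len(inner))
            out.extend(inner[:target])         # ...or truncate to target
            out.append(img_end)
            i = j + 1
        else:
            i += 1
    return out
```

For example, with start/end ids 9/10 and one placeholder token 5, a span `[9, 5, 5, 10]` with `num_tiles=3` becomes `[9, 5, 5, 5, 10]`, and with `num_tiles=1` it becomes `[9, 5, 10]`.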