nemo_automodel.components.datasets.multimodal.transforms


Image transforms for BAGEL’s NaViT-style aspect-ratio-aware resize.

Module Contents

Classes

| Name | Description |
| --- | --- |
| ImageTransform | Full BAGEL image transform: resize + to_tensor + normalize. |
| MaxLongEdgeMinShortEdgeResize | Resize so longest/shortest edges stay within bounds and both edges are stride-divisible. |

Functions

| Name | Description |
| --- | --- |
| _require_torchvision | - |

API

class nemo_automodel.components.datasets.multimodal.transforms.ImageTransform(
max_image_size,
min_image_size,
image_stride,
max_pixels = 14 * 14 * 9 * 1024,
image_mean = (0.5, 0.5, 0.5),
image_std = (0.5, 0.5, 0.5)
)

Full BAGEL image transform: resize + to_tensor + normalize.

Used for both ViT input (stride=14) and VAE input (stride=16, via separate instances). stride is exposed as an attribute so the dataset can compute patch counts without knowing the transform class.
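
The arithmetic implied by the defaults above can be illustrated with a minimal, dependency-free sketch. The helper names `normalize_value` and `patch_count` are hypothetical and are not part of this module; the sketch only assumes the standard per-channel normalization formula `(x - mean) / std` and that both image edges are already stride-divisible.

```python
# Illustrative arithmetic only; these helpers are hypothetical and do not
# come from nemo_automodel.

def normalize_value(x, mean=0.5, std=0.5):
    """Per-channel normalization applied to a value already scaled to [0, 1]."""
    return (x - mean) / std


def patch_count(height, width, stride):
    """Number of stride x stride patches in a stride-divisible image."""
    if height % stride or width % stride:
        raise ValueError("both edges must be divisible by stride")
    return (height // stride) * (width // stride)


# With the default mean/std of 0.5, values in [0, 1] map onto [-1, 1].
low, high = normalize_value(0.0), normalize_value(1.0)

# The default max_pixels (14 * 14 * 9 * 1024) corresponds to 9 * 1024
# patches when stride is 14.
default_max_pixels = 14 * 14 * 9 * 1024
max_patches_at_14 = default_max_pixels // (14 * 14)

# A 448 x 672 image at stride 14 yields 32 * 48 patches.
vit_patches = patch_count(448, 672, 14)
```

Because the patch count depends only on the resized height, width, and the stride, a dataset needs nothing from the transform beyond its stride attribute.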

normalize_transform

resize_transform

to_tensor_transform = tv_transforms.ToTensor()

nemo_automodel.components.datasets.multimodal.transforms.ImageTransform.__call__(
img,
img_num = 1
)

class nemo_automodel.components.datasets.multimodal.transforms.MaxLongEdgeMinShortEdgeResize(
max_size: int,
min_size: int,
stride: int,
max_pixels: int,
interpolation = None,
antialias = True
)

Bases: Module

Resize so longest/shortest edges stay within bounds and both edges are stride-divisible.

Parameters:

max_size
int

Maximum size for the longest edge.

min_size
int

Minimum size for the shortest edge.

stride
int

Value that both edges must be divisible by (for example, the ViT patch size).

max_pixels
int

Maximum total pixels for the full image.

interpolation
Defaults to None

Torchvision interpolation mode. Bicubic is used when this is None.

antialias
Defaults to True

Whether to apply antialiasing.
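
One way to picture how these parameters interact is the sketch below, which computes a target size only and does not resize any image. It is an assumption-laden illustration, not the class's actual implementation: the names `snap` and `target_size`, the rounding choices, and the order of the steps are all hypothetical, and the `img_num` argument of `forward` is ignored.

```python
import math

# Hypothetical sketch; names, rounding, and step order are assumptions and
# may differ from MaxLongEdgeMinShortEdgeResize.


def snap(value, stride, floor=False):
    """Snap value to a multiple of stride (nearest by default, or rounded
    down), never returning less than one stride."""
    units = math.floor(value / stride) if floor else round(value / stride)
    return max(stride, int(units) * stride)


def target_size(width, height, max_size, min_size, stride, max_pixels):
    """Return a (width, height) pair that respects the documented bounds."""
    # 1. Shrink if the longest edge exceeds max_size (no upscaling here),
    #    then upscale if the shortest edge would fall below min_size.
    scale = min(max_size / max(width, height), 1.0)
    scale = max(scale, min_size / min(width, height))
    new_w, new_h = snap(width * scale, stride), snap(height * scale, stride)

    # 2. Enforce the pixel budget, rounding down so the bound holds.
    if new_w * new_h > max_pixels:
        shrink = math.sqrt(max_pixels / (new_w * new_h))
        new_w = snap(new_w * shrink, stride, floor=True)
        new_h = snap(new_h * shrink, stride, floor=True)

    # 3. Re-check the longest edge, which step 1 can overshoot for extreme
    #    aspect ratios or when max_size is not a multiple of stride.
    longest = max(new_w, new_h)
    if longest > max_size:
        shrink = max_size / longest
        new_w = snap(new_w * shrink, stride, floor=True)
        new_h = snap(new_h * shrink, stride, floor=True)

    return new_w, new_h


budget = 14 * 14 * 9 * 1024
downscaled = target_size(2000, 1000, 980, 224, 14, budget)
upscaled = target_size(100, 50, 980, 224, 14, budget)
capped = target_size(2000, 2000, 4096, 224, 16, 1024 * 1024)
```

In this sketch a large image is scaled down so its longest edge fits `max_size`, a small image is scaled up so its shortest edge reaches `min_size`, and the `max_pixels` cap takes priority over both when they conflict.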

nemo_automodel.components.datasets.multimodal.transforms.MaxLongEdgeMinShortEdgeResize._apply_scale(
width,
height,
scale
)
nemo_automodel.components.datasets.multimodal.transforms.MaxLongEdgeMinShortEdgeResize._make_divisible(
value,
stride
)
nemo_automodel.components.datasets.multimodal.transforms.MaxLongEdgeMinShortEdgeResize.forward(
img,
img_num = 1
)
nemo_automodel.components.datasets.multimodal.transforms._require_torchvision()