nemo_curator.utils.windowing_utils

Module Contents

Classes

WindowFrameInfo: Container for frame window information, storing start and end frame indices.

Functions

ceil_by_factor: Return the smallest integer greater than or equal to 'number' that is divisible by 'factor'.
compute_windows: Generate windows by splitting the video into segments of the specified size.
fetch_video: Load and preprocess video frames from a file.
floor_by_factor: Return the largest integer less than or equal to 'number' that is divisible by 'factor'.
read_video_cpu: Read video using PyAV.
round_by_factor: Return the closest integer to 'number' that is divisible by 'factor'.
smart_nframes: Calculate the number of video frames to use as model inputs.
smart_resize: Rescale the image so that divisibility, pixel-count, and aspect-ratio conditions are met.
split_video_into_windows: Calculate windows and return video inputs for a language model from input clips.

Data

FPS

FPS_MAX_FRAMES

FPS_MIN_FRAMES

FRAME_FACTOR

IMAGE_FACTOR

MAX_PIXELS

MAX_RATIO

MIN_PIXELS

OPENAI_CLIP_MEAN

OPENAI_CLIP_STD

VIDEO_MAX_PIXELS

VIDEO_MIN_PIXELS

VIDEO_TOTAL_PIXELS

WINDOW_MIN_FRAMES

API

class nemo_curator.utils.windowing_utils.WindowFrameInfo(
start: int,
end: int
)
Dataclass

Container for frame window information, storing start and end frame indices.

This class represents a window of frames in a video, defined by its start and end frame positions.

end
int
start
int
nemo_curator.utils.windowing_utils.ceil_by_factor(
number: float,
factor: int
) -> int

Return the smallest integer greater than or equal to ‘number’ that is divisible by ‘factor’.
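The description fully determines the behavior, so it can be sketched in one line; this is an assumed implementation, not the module's actual source:

```python
import math

def ceil_by_factor(number: float, factor: int) -> int:
    # Round number up to the nearest multiple of factor.
    return math.ceil(number / factor) * factor
```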

nemo_curator.utils.windowing_utils.compute_windows(
total_frames: int,
window_size: int = 128,
remainder_threshold: int = 64
) -> list[nemo_curator.utils.windowing_utils.WindowFrameInfo]

Generate windows by splitting the video into segments of the specified size.

Parameters:

total_frames
int

Total number of frames in the video.

window_size
intDefaults to 128

The size of each window in number of frames.

remainder_threshold
intDefaults to 64

The minimum number of frames required to create a new window from the remainder.
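The parameters above suggest the following windowing logic. This is a hedged sketch: in particular, whether a short remainder is dropped or folded into the last window is an assumption here, and the `WindowFrameInfo` stand-in mirrors the dataclass documented above:

```python
from dataclasses import dataclass

@dataclass
class WindowFrameInfo:
    start: int
    end: int

def compute_windows(total_frames: int, window_size: int = 128,
                    remainder_threshold: int = 64) -> list[WindowFrameInfo]:
    windows = []
    start = 0
    # Emit full-size windows while they fit.
    while start + window_size <= total_frames:
        windows.append(WindowFrameInfo(start, start + window_size))
        start += window_size
    remainder = total_frames - start
    if remainder >= remainder_threshold:
        # Large enough remainder becomes its own window.
        windows.append(WindowFrameInfo(start, total_frames))
    elif windows and remainder > 0:
        # Assumed behavior: a small remainder is folded into the last window.
        windows[-1] = WindowFrameInfo(windows[-1].start, total_frames)
    return windows
```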

nemo_curator.utils.windowing_utils.fetch_video(
video_path: str,
sampling_fps: float = 2.0,
window_range: list[nemo_curator.utils.windowing_utils.WindowFrameInfo] | None = None,
do_preprocess: bool = False,
preprocess_dtype: str = 'float32',
num_frames_to_use: int = 0,
flip_input: bool = False
) -> tuple[torch.Tensor, list[int]]

Load and preprocess video frames from a file.

Parameters:

video_path
str

Path to the video file.

sampling_fps
floatDefaults to 2.0

Target frames per second for sampling.

window_range
list[WindowFrameInfo] | NoneDefaults to None

List of frame windows to extract.

do_preprocess
boolDefaults to False

Whether to preprocess the frames.

preprocess_dtype
strDefaults to 'float32'

Data type for preprocessing.

num_frames_to_use
intDefaults to 0

Number of frames to extract (0 for all).

flip_input
boolDefaults to False

Whether to flip frames horizontally.

Returns: tuple[torch.Tensor, list[int]]

Tuple of (processed frames tensor, frame indices).

nemo_curator.utils.windowing_utils.floor_by_factor(
number: float,
factor: int
) -> int

Return the largest integer less than or equal to ‘number’ that is divisible by ‘factor’.
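As with `ceil_by_factor`, the description pins down the behavior; the following is an assumed one-line implementation:

```python
import math

def floor_by_factor(number: float, factor: int) -> int:
    # Round number down to the nearest multiple of factor.
    return math.floor(number / factor) * factor
```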

nemo_curator.utils.windowing_utils.read_video_cpu(
video_path: str,
fps: float,
num_frames_to_use: int,
window_range: list[nemo_curator.utils.windowing_utils.WindowFrameInfo]
) -> tuple[torch.Tensor, list[int]]

Read video using PyAV.

Parameters:

video_path
str

Path to the video. Supports "file://", "http://", and "https://" URLs as well as local paths.

fps
float

Target frames per second for sampling.

num_frames_to_use
int

Number of frames to use (0 for all).

window_range
list[WindowFrameInfo]

List of frame windows to extract.

Returns: tuple[torch.Tensor, list[int]]

Tuple of (video tensor with shape (T, C, H, W), frame indices).

nemo_curator.utils.windowing_utils.round_by_factor(
number: float,
factor: int
) -> int

Return the closest integer to ‘number’ that is divisible by ‘factor’.
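A likely implementation, again assumed from the description. Note that Python's built-in `round` uses banker's rounding at exact ties, which may or may not match the module's tie-breaking behavior:

```python
def round_by_factor(number: float, factor: int) -> int:
    # Round number to the nearest multiple of factor.
    return round(number / factor) * factor
```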

nemo_curator.utils.windowing_utils.smart_nframes(
fps: float,
total_frames: int,
video_fps: float
) -> int

Calculate the number of video frames to use as model inputs.
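A plausible sketch of this calculation, using the module-level constants listed under Data. The exact clamping and rounding order is an assumption modeled on similar video-sampling utilities, not this module's verified source:

```python
import math

# Constants as documented in the Data section of this module.
FPS_MIN_FRAMES = 4
FPS_MAX_FRAMES = 768
FRAME_FACTOR = 2

def floor_by_factor(number: float, factor: int) -> int:
    return math.floor(number / factor) * factor

def smart_nframes(fps: float, total_frames: int, video_fps: float) -> int:
    # Target frame count when sampling at `fps` from a video recorded at `video_fps`.
    nframes = total_frames / video_fps * fps
    # Clamp to the documented frame bounds (and never exceed the source frame count).
    nframes = min(max(nframes, FPS_MIN_FRAMES), min(FPS_MAX_FRAMES, total_frames))
    # Make the count divisible by FRAME_FACTOR.
    return floor_by_factor(nframes, FRAME_FACTOR)
```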

nemo_curator.utils.windowing_utils.smart_resize(
height: int,
width: int,
factor: int = IMAGE_FACTOR,
min_pixels: int = MIN_PIXELS,
max_pixels: int = MAX_PIXELS
) -> tuple[int, int]

Rescale the image so that the following conditions are met:

  1. Both dimensions (height and width) are divisible by ‘factor’.

  2. The total number of pixels is within the range [‘min_pixels’, ‘max_pixels’].

  3. The aspect ratio of the image is maintained as closely as possible.
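The three conditions map onto a well-known resizing routine used in Qwen-VL preprocessing. The sketch below is an assumption about this module's implementation, reusing the rounding helpers and the constants from the Data section:

```python
import math

# Constants as documented in the Data section of this module.
IMAGE_FACTOR = 28
MIN_PIXELS = 4 * 28 * 28
MAX_PIXELS = 16384 * 28 * 28
MAX_RATIO = 200

def round_by_factor(number, factor): return round(number / factor) * factor
def ceil_by_factor(number, factor): return math.ceil(number / factor) * factor
def floor_by_factor(number, factor): return math.floor(number / factor) * factor

def smart_resize(height: int, width: int, factor: int = IMAGE_FACTOR,
                 min_pixels: int = MIN_PIXELS,
                 max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    if max(height, width) / min(height, width) > MAX_RATIO:
        raise ValueError("aspect ratio exceeds MAX_RATIO")
    # Condition 1: snap both dimensions to multiples of `factor`.
    h = max(factor, round_by_factor(height, factor))
    w = max(factor, round_by_factor(width, factor))
    # Condition 2: scale into [min_pixels, max_pixels] while
    # (condition 3) approximately preserving the aspect ratio.
    if h * w > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h = floor_by_factor(height / beta, factor)
        w = floor_by_factor(width / beta, factor)
    elif h * w < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h = ceil_by_factor(height * beta, factor)
        w = ceil_by_factor(width * beta, factor)
    return h, w
```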

nemo_curator.utils.windowing_utils.split_video_into_windows(
mp4_bytes: bytes,
window_size: int = 256,
remainder_threshold: int = 128,
sampling_fps: float = 2.0,
model_does_preprocess: bool = False,
preprocess_dtype: str = 'uint8',
flip_input: bool = False,
num_frames_to_use: int = 0,
return_bytes: bool = False,
return_video_frames: bool = True,
num_threads: int = 1
) -> tuple[list[bytes], list[torch.Tensor | None], list[nemo_curator.utils.windowing_utils.WindowFrameInfo]]

Calculate windows and return video inputs for language model from input clips.

Processes the video to determine the windows for a clip, decodes it in one shot, and returns processed frames for each window in a format suitable for consumption by the Qwen model.

Parameters:

mp4_bytes
bytes

input video in bytes


preprocess_dtype
strDefaults to 'uint8'

Data type to use for preprocessing the video/image inputs.

num_frames_to_use
intDefaults to 0

Number of frames to extract from the video. If 0, uses all frames.

flip_input
boolDefaults to False

Whether to flip the input video/image horizontally.

return_bytes
boolDefaults to False

Whether to extract mp4 bytes for each window for use by the PreviewStage.

model_does_preprocess
boolDefaults to False

Whether the model performs its own preprocessing.

num_threads
intDefaults to 1

Number of threads to use for decoding.

remainder_threshold
intDefaults to 128

The minimum number of frames required to create a new window from the remainder.

return_video_frames
boolDefaults to True

Whether to return decoded video frames for each window.

sampling_fps
floatDefaults to 2.0

Target frames per second for sampling.

window_size
intDefaults to 256

The size of each window in number of frames.

Returns: tuple[list[bytes], list[torch.Tensor | None], list[WindowFrameInfo]]

Tuple containing:

  • "window_mp4_bytes": mp4 bytes for each window; only used when the Preview stage is enabled
  • "window_frames": decoded, per-window processed frames ready for use by the Qwen model
  • "window_info": start and end frame indices for each window in a clip
nemo_curator.utils.windowing_utils.FPS = 2.0
nemo_curator.utils.windowing_utils.FPS_MAX_FRAMES = 768
nemo_curator.utils.windowing_utils.FPS_MIN_FRAMES = 4
nemo_curator.utils.windowing_utils.FRAME_FACTOR = 2
nemo_curator.utils.windowing_utils.IMAGE_FACTOR = 28
nemo_curator.utils.windowing_utils.MAX_PIXELS = 16384 * 28 * 28
nemo_curator.utils.windowing_utils.MAX_RATIO = 200
nemo_curator.utils.windowing_utils.MIN_PIXELS = 4 * 28 * 28
nemo_curator.utils.windowing_utils.OPENAI_CLIP_MEAN = [0.48145466, 0.4578275, 0.40821073]
nemo_curator.utils.windowing_utils.OPENAI_CLIP_STD = [0.26862954, 0.26130258, 0.27577711]
nemo_curator.utils.windowing_utils.VIDEO_MAX_PIXELS = 768 * 28 * 28
nemo_curator.utils.windowing_utils.VIDEO_MIN_PIXELS = 128 * 28 * 28
nemo_curator.utils.windowing_utils.VIDEO_TOTAL_PIXELS = 24576 * 28 * 28
nemo_curator.utils.windowing_utils.WINDOW_MIN_FRAMES = 4