nemo_curator.utils.windowing_utils

Module Contents

Classes

WindowFrameInfo: Container for frame window information, storing start and end frame indices.

Functions

ceil_by_factor: Return the smallest integer greater than or equal to 'number' that is divisible by 'factor'.
compute_windows: Generate windows by splitting the video into segments of the specified size.
fetch_video: Load and preprocess video frames from a file.
floor_by_factor: Return the largest integer less than or equal to 'number' that is divisible by 'factor'.
read_video_cpu: Read video using PyAV.
round_by_factor: Return the closest integer to 'number' that is divisible by 'factor'.
smart_nframes: Calculate the number of video frames to use as model inputs.
smart_resize: Rescale the image so that divisibility, pixel-count, and aspect-ratio conditions are met.
split_video_into_windows: Calculate windows and return video inputs for a language model from input clips.

Data

FPS

FPS_MAX_FRAMES

FPS_MIN_FRAMES

FRAME_FACTOR

IMAGE_FACTOR

MAX_PIXELS

MAX_RATIO

MIN_PIXELS

OPENAI_CLIP_MEAN

OPENAI_CLIP_STD

VIDEO_MAX_PIXELS

VIDEO_MIN_PIXELS

VIDEO_TOTAL_PIXELS

WINDOW_MIN_FRAMES

API

class nemo_curator.utils.windowing_utils.WindowFrameInfo(
start: int,
end: int
)
Dataclass

Container for frame window information, storing start and end frame indices.

This class represents a window of frames in a video, defined by its start and end frame positions.

end
int
start
int
nemo_curator.utils.windowing_utils.ceil_by_factor(
number: float,
factor: int
) -> int

Return the smallest integer greater than or equal to ‘number’ that is divisible by ‘factor’.
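The description fully determines the behavior, so it can be sketched in one line; this is an assumed implementation, not the module's actual source:

```python
import math

def ceil_by_factor(number: float, factor: int) -> int:
    # Round number up to the nearest multiple of factor.
    return math.ceil(number / factor) * factor
```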

nemo_curator.utils.windowing_utils.compute_windows(
total_frames: int,
window_size: int = 128,
remainder_threshold: int = 64
) -> list[nemo_curator.utils.windowing_utils.WindowFrameInfo]

Generate windows by splitting the video into segments of the specified size.

Parameters:

total_frames
int

Total number of frames in the video.

window_size
intDefaults to 128

The size of each window in number of frames.

remainder_threshold
intDefaults to 64

The minimum number of frames required to create a new window from the remainder.
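The parameters above suggest the following windowing logic. This is a hedged sketch: in particular, whether a short remainder is dropped or folded into the last window is an assumption here, and the `WindowFrameInfo` stand-in mirrors the dataclass documented above:

```python
from dataclasses import dataclass

@dataclass
class WindowFrameInfo:
    start: int
    end: int

def compute_windows(total_frames: int, window_size: int = 128,
                    remainder_threshold: int = 64) -> list[WindowFrameInfo]:
    windows = []
    start = 0
    # Emit full-size windows while they fit.
    while start + window_size <= total_frames:
        windows.append(WindowFrameInfo(start, start + window_size))
        start += window_size
    remainder = total_frames - start
    if remainder >= remainder_threshold:
        # Large enough remainder becomes its own window.
        windows.append(WindowFrameInfo(start, total_frames))
    elif windows and remainder > 0:
        # Assumed behavior: a small remainder is folded into the last window.
        windows[-1] = WindowFrameInfo(windows[-1].start, total_frames)
    return windows
```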

nemo_curator.utils.windowing_utils.fetch_video(
video_path: str,
sampling_fps: float = 2.0,
window_range: list[nemo_curator.utils.windowing_utils.WindowFrameInfo] | None = None,
do_preprocess: bool = False,
preprocess_dtype: str = 'float32',
num_frames_to_use: int = 0,
flip_input: bool = False
) -> tuple[torch.Tensor, list[int]]

Load and preprocess video frames from a file.

Parameters:

video_path
str

Path to the video file.

sampling_fps
floatDefaults to 2.0

Target frames per second for sampling.

window_range
list[WindowFrameInfo] | NoneDefaults to None

List of frame windows to extract.

do_preprocess
boolDefaults to False

Whether to preprocess the frames.

preprocess_dtype
strDefaults to 'float32'

Data type for preprocessing.

num_frames_to_use
intDefaults to 0

Number of frames to extract (0 for all).

flip_input
boolDefaults to False

Whether to flip frames horizontally.

Returns: tuple[torch.Tensor, list[int]]

Tuple of (processed frames tensor, frame indices).

nemo_curator.utils.windowing_utils.floor_by_factor(
number: float,
factor: int
) -> int

Return the largest integer less than or equal to ‘number’ that is divisible by ‘factor’.
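As with `ceil_by_factor`, the description pins down the behavior; the following is an assumed one-line implementation:

```python
import math

def floor_by_factor(number: float, factor: int) -> int:
    # Round number down to the nearest multiple of factor.
    return math.floor(number / factor) * factor
```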

nemo_curator.utils.windowing_utils.read_video_cpu(
video_path: str,
fps: float,
num_frames_to_use: int,
window_range: list[nemo_curator.utils.windowing_utils.WindowFrameInfo]
) -> tuple[torch.Tensor, list[int]]

Read video using PyAV.

Parameters:

video_path
str

Path to the video. Supports "file://", "http://", and "https://" URLs as well as local paths.

fps
float

Target frames per second for sampling.

num_frames_to_use
int

Number of frames to use (0 for all).

window_range
list[WindowFrameInfo]

List of frame windows to extract.

Returns: tuple[torch.Tensor, list[int]]

Tuple of (video tensor with shape (T, C, H, W), frame indices).

nemo_curator.utils.windowing_utils.round_by_factor(
number: float,
factor: int
) -> int

Return the closest integer to ‘number’ that is divisible by ‘factor’.
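A likely implementation, again assumed from the description. Note that Python's built-in `round` uses banker's rounding at exact ties, which may or may not match the module's tie-breaking behavior:

```python
def round_by_factor(number: float, factor: int) -> int:
    # Round number to the nearest multiple of factor.
    return round(number / factor) * factor
```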

nemo_curator.utils.windowing_utils.smart_nframes(
fps: float,
total_frames: int,
video_fps: float
) -> int

Calculate the number of video frames to use as model inputs.
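A plausible sketch of this calculation, using the module-level constants listed under Data. The exact clamping and rounding order is an assumption modeled on similar video-sampling utilities, not this module's verified source:

```python
import math

# Constants as documented in the Data section of this module.
FPS_MIN_FRAMES = 4
FPS_MAX_FRAMES = 768
FRAME_FACTOR = 2

def floor_by_factor(number: float, factor: int) -> int:
    return math.floor(number / factor) * factor

def smart_nframes(fps: float, total_frames: int, video_fps: float) -> int:
    # Target frame count when sampling at `fps` from a video recorded at `video_fps`.
    nframes = total_frames / video_fps * fps
    # Clamp to the documented frame bounds (and never exceed the source frame count).
    nframes = min(max(nframes, FPS_MIN_FRAMES), min(FPS_MAX_FRAMES, total_frames))
    # Make the count divisible by FRAME_FACTOR.
    return floor_by_factor(nframes, FRAME_FACTOR)
```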

nemo_curator.utils.windowing_utils.smart_resize(
height: int,
width: int,
factor: int = IMAGE_FACTOR,
min_pixels: int = MIN_PIXELS,
max_pixels: int = MAX_PIXELS
) -> tuple[int, int]

Rescale the image so that the following conditions are met:

  1. Both dimensions (height and width) are divisible by ‘factor’.

  2. The total number of pixels is within the range [‘min_pixels’, ‘max_pixels’].

  3. The aspect ratio of the image is maintained as closely as possible.
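The three conditions map onto a well-known resizing routine used in Qwen-VL preprocessing. The sketch below is an assumption about this module's implementation, reusing the rounding helpers and the constants from the Data section:

```python
import math

# Constants as documented in the Data section of this module.
IMAGE_FACTOR = 28
MIN_PIXELS = 4 * 28 * 28
MAX_PIXELS = 16384 * 28 * 28
MAX_RATIO = 200

def round_by_factor(number, factor): return round(number / factor) * factor
def ceil_by_factor(number, factor): return math.ceil(number / factor) * factor
def floor_by_factor(number, factor): return math.floor(number / factor) * factor

def smart_resize(height: int, width: int, factor: int = IMAGE_FACTOR,
                 min_pixels: int = MIN_PIXELS,
                 max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    if max(height, width) / min(height, width) > MAX_RATIO:
        raise ValueError("aspect ratio exceeds MAX_RATIO")
    # Condition 1: snap both dimensions to multiples of `factor`.
    h = max(factor, round_by_factor(height, factor))
    w = max(factor, round_by_factor(width, factor))
    # Condition 2: scale into [min_pixels, max_pixels] while
    # (condition 3) approximately preserving the aspect ratio.
    if h * w > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h = floor_by_factor(height / beta, factor)
        w = floor_by_factor(width / beta, factor)
    elif h * w < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h = ceil_by_factor(height * beta, factor)
        w = ceil_by_factor(width * beta, factor)
    return h, w
```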

nemo_curator.utils.windowing_utils.split_video_into_windows(
mp4_bytes: bytes,
window_size: int = 256,
remainder_threshold: int = 128,
sampling_fps: float = 2.0,
model_does_preprocess: bool = False,
preprocess_dtype: str = 'uint8',
flip_input: bool = False,
num_frames_to_use: int = 0,
return_bytes: bool = False,
return_video_frames: bool = True,
num_threads: int = 1
) -> tuple[list[bytes], list[torch.Tensor | None], list[nemo_curator.utils.windowing_utils.WindowFrameInfo]]

Calculate windows and return video inputs for language model from input clips.

Processes the video to determine the windows for a clip, decodes it in one shot, and returns processed frames for each window in a format suitable for consumption by the Qwen model.

Parameters:

mp4_bytes
bytes

input video in bytes


preprocess_dtype
strDefaults to 'uint8'

Data type to use for preprocessing the video/image inputs.

num_frames_to_use
intDefaults to 0

Number of frames to extract from the video. If 0, uses all frames.

flip_input
boolDefaults to False

Whether to flip the input video/image horizontally.

return_bytes
boolDefaults to False

Whether to extract mp4 bytes for each window for use by the PreviewStage.

model_does_preprocess
boolDefaults to False

Whether the model performs its own preprocessing.

num_threads
intDefaults to 1

Number of threads to use for decoding.

remainder_threshold
intDefaults to 128

The minimum number of frames required to create a new window from the remainder.

return_video_frames
boolDefaults to True

Whether to return decoded video frames for each window.

sampling_fps
floatDefaults to 2.0

Target frames per second for sampling.

window_size
intDefaults to 256

The size of each window in number of frames.

Returns: tuple[list[bytes], list[torch.Tensor | None], list[WindowFrameInfo]]

Tuple containing:

  • "window_mp4_bytes": mp4 bytes for each window; only used when the Preview stage is enabled
  • "window_frames": decoded, per-window processed frames ready for use by the Qwen model
  • "window_info": start and end frame indices for each window in a clip
nemo_curator.utils.windowing_utils.FPS = 2.0
nemo_curator.utils.windowing_utils.FPS_MAX_FRAMES = 768
nemo_curator.utils.windowing_utils.FPS_MIN_FRAMES = 4
nemo_curator.utils.windowing_utils.FRAME_FACTOR = 2
nemo_curator.utils.windowing_utils.IMAGE_FACTOR = 28
nemo_curator.utils.windowing_utils.MAX_PIXELS = 16384 * 28 * 28
nemo_curator.utils.windowing_utils.MAX_RATIO = 200
nemo_curator.utils.windowing_utils.MIN_PIXELS = 4 * 28 * 28
nemo_curator.utils.windowing_utils.OPENAI_CLIP_MEAN = [0.48145466, 0.4578275, 0.40821073]
nemo_curator.utils.windowing_utils.OPENAI_CLIP_STD = [0.26862954, 0.26130258, 0.27577711]
nemo_curator.utils.windowing_utils.VIDEO_MAX_PIXELS = 768 * 28 * 28
nemo_curator.utils.windowing_utils.VIDEO_MIN_PIXELS = 128 * 28 * 28
nemo_curator.utils.windowing_utils.VIDEO_TOTAL_PIXELS = 24576 * 28 * 28
nemo_curator.utils.windowing_utils.WINDOW_MIN_FRAMES = 4