nemo_curator.utils.windowing_utils
Module Contents
Classes
Functions
Data
API
Container for frame window information, storing start and end frame indices.
This class represents a window of frames in a video, defined by its start and end frame positions.
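As a rough illustration of such a container, here is a minimal sketch. The class name `FrameWindow`, the field names `start` and `end`, and the inclusive end index are assumptions made for this example; they are not taken from the module itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameWindow:
    """Hypothetical stand-in for the documented frame-window container."""

    start: int  # index of the first frame in the window
    end: int    # index of the last frame (assumed inclusive in this sketch)

    def num_frames(self) -> int:
        # With an inclusive end index, the window covers end - start + 1 frames.
        return self.end - self.start + 1


window = FrameWindow(start=0, end=255)
print(window.num_frames())  # 256
```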
Return the smallest integer greater than or equal to ‘number’ that is divisible by ‘factor’.
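The rounding-up rule can be sketched as follows. The function name `ceil_to_multiple` is hypothetical and chosen only for this example.

```python
import math


def ceil_to_multiple(number: float, factor: int) -> int:
    """Smallest multiple of `factor` that is >= `number` (illustrative helper)."""
    return math.ceil(number / factor) * factor


print(ceil_to_multiple(30, 28))  # 56
print(ceil_to_multiple(56, 28))  # 56 (already a multiple, so unchanged)
```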
Generate windows by splitting the video into segments of the specified size.
Parameters:
Total number of frames in the video.
The size of each window in number of frames.
The minimum number of frames required to create a new window from the remainder.
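One way to read these parameters is sketched below. The function name, the `(start, end)` tuple representation, the inclusive end index, and in particular the choice to merge a short remainder into the last window are all assumptions; the description above does not say what happens to a remainder below the threshold.

```python
def split_into_windows(
    total_frames: int, window_size: int, remainder_threshold: int
) -> list[tuple[int, int]]:
    """Return (start, end) pairs with inclusive end indices (illustrative sketch)."""
    if total_frames <= 0 or window_size <= 0:
        return []

    windows: list[tuple[int, int]] = []
    start = 0
    # Emit as many full-size windows as fit.
    while total_frames - start >= window_size:
        windows.append((start, start + window_size - 1))
        start += window_size

    remainder = total_frames - start
    if remainder == 0:
        return windows

    if not windows or remainder >= remainder_threshold:
        # Either the video is shorter than one window, or enough frames
        # are left over to justify a window of their own.
        windows.append((start, total_frames - 1))
    else:
        # Assumed behaviour: fold a short tail into the previous window.
        last_start, _ = windows[-1]
        windows[-1] = (last_start, total_frames - 1)
    return windows


print(split_into_windows(700, 256, 128))  # [(0, 255), (256, 511), (512, 699)]
print(split_into_windows(600, 256, 128))  # [(0, 255), (256, 599)]
```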
Load and preprocess video frames from a file.
Parameters:
Path to the video file.
Target frames per second for sampling.
List of frame windows to extract.
Whether to preprocess the frames.
Data type for preprocessing.
Number of frames to extract (0 for all).
Whether to flip frames horizontally.
Returns: tuple[torch.Tensor, list[int]]
Tuple of (processed frames tensor, frame indices).
Return the largest integer less than or equal to ‘number’ that is divisible by ‘factor’.
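The rounding-down rule can be sketched in the same way. The name `floor_to_multiple` is hypothetical.

```python
import math


def floor_to_multiple(number: float, factor: int) -> int:
    """Largest multiple of `factor` that is <= `number` (illustrative helper)."""
    return math.floor(number / factor) * factor


print(floor_to_multiple(55, 28))  # 28
print(floor_to_multiple(56, 28))  # 56
```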
Read a video using PyAV.
Parameters:
Path to the video. Supports “file://”, “http://”, and “https://” URLs as well as local paths.
Frames per second.
Number of frames to use.
Window range.
Returns: tuple[torch.Tensor, list[int]]
torch.Tensor: The video tensor with shape (T, C, H, W), returned as the first element of the tuple.
Return the closest integer to ‘number’ that is divisible by ‘factor’.
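A minimal sketch of rounding to the nearest multiple, using a hypothetical helper name. It relies on Python's built-in `round`, which resolves exact ties to the even quotient; the description above does not say how ties are handled, so that detail is an assumption of this example.

```python
def round_to_multiple(number: float, factor: int) -> int:
    """Multiple of `factor` closest to `number` (illustrative helper)."""
    # Built-in round() uses round-half-to-even on exact ties.
    return round(number / factor) * factor


print(round_to_multiple(40, 28))  # 28
print(round_to_multiple(45, 28))  # 56
print(round_to_multiple(70, 28))  # 56 (70 / 28 == 2.5, tie resolved to the even quotient 2)
```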
Calculate the number of video frames to use as model input.
Rescales the image so that the following conditions are met.
- Both dimensions (height and width) are divisible by ‘factor’.
- The total number of pixels is within the range [‘min_pixels’, ‘max_pixels’].
- The aspect ratio of the image is maintained as closely as possible.
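The three conditions above can be met with a scheme like the following sketch. It is one plausible approach, not necessarily what the module does: the function name `resize_dims`, the use of a square-root scale factor, and the example values for ‘factor’, ‘min_pixels’, and ‘max_pixels’ are assumptions.

```python
import math


def resize_dims(
    height: int, width: int, factor: int, min_pixels: int, max_pixels: int
) -> tuple[int, int]:
    """Return (new_height, new_width), both multiples of `factor` (illustrative sketch)."""
    # Start from the nearest multiples of `factor`, at least one factor per side.
    new_h = max(factor, round(height / factor) * factor)
    new_w = max(factor, round(width / factor) * factor)

    if new_h * new_w > max_pixels:
        # Too large: shrink both sides by the same scale, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        new_h = max(factor, math.floor(height / scale / factor) * factor)
        new_w = max(factor, math.floor(width / scale / factor) * factor)
    elif new_h * new_w < min_pixels:
        # Too small: enlarge both sides by the same scale, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        new_h = math.ceil(height * scale / factor) * factor
        new_w = math.ceil(width * scale / factor) * factor
    return new_h, new_w


# Example values below are chosen for illustration only.
h, w = resize_dims(1080, 1920, factor=28, min_pixels=3136, max_pixels=200_000)
print(h, w, h * w)
```

Scaling both sides by the same factor before snapping to multiples of ‘factor’ is what keeps the aspect ratio close to the original; the snapping step is the only source of distortion.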
Calculate windows and return video inputs for language model from input clips.
Processes a video to determine the windows for a clip, decodes it in one pass, and returns the processed frames for each window in a format suitable for consumption by the Qwen model.
Parameters:
Input video as bytes.
Frames per second of the input video.
Data type to use for preprocessing the video/image inputs.
Number of frames to extract from the video. If 0, uses all frames.
Whether to flip the input video/image horizontally.
Whether to extract MP4 bytes for each window for use by PreviewStage.
Whether the model does the preprocessing.
Number of threads.
Threshold for the remainder.
Whether to return video frames.
Sampling FPS.
Window size.
Returns: tuple[list[bytes], list[torch.Tensor | None], list[WindowFrameInfo]]
Tuple containing:
- “window_mp4_bytes”: MP4 bytes corresponding to each window; only used when the Preview stage is enabled.
- “window_frames”: Decoded and per-window processed frames ready for use by the Qwen model.
- “window info”: Start and end frame indices for each window in a clip.
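The idea of decoding once and then handing out per-window frames can be sketched as below. A plain Python list stands in for the decoded frame tensor, and the function name, the `(start, end)` tuples, and the inclusive end index are assumptions made for this example.

```python
def frames_per_window(frames: list, windows: list[tuple[int, int]]) -> list[list]:
    """Slice an already-decoded frame sequence into one chunk per window."""
    # end is treated as inclusive, hence the + 1 in the slice.
    return [frames[start : end + 1] for start, end in windows]


decoded = list(range(10))  # stand-in for 10 decoded frames
print(frames_per_window(decoded, [(0, 3), (4, 9)]))
# [[0, 1, 2, 3], [4, 5, 6, 7, 8, 9]]
```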