utils.windowing_utils#

Module Contents#

Classes#

WindowFrameInfo

Container for frame window information, storing start and end frame indices.

Functions#

ceil_by_factor

Return the smallest integer greater than or equal to ‘number’ that is divisible by ‘factor’.

compute_windows

Generate windows by splitting the video into segments of the specified size.

fetch_video

Load and preprocess video frames from a file.

floor_by_factor

Return the largest integer less than or equal to ‘number’ that is divisible by ‘factor’.

read_video_cpu

Read video using PyAV.

round_by_factor

Return the closest integer to ‘number’ that is divisible by ‘factor’.

smart_nframes

Calculate the number of video frames to use as model input.

smart_resize

Rescales the image so that the following conditions are met.

split_video_into_windows

Calculate windows and return video inputs for language model from input clips.

Data#

API#

utils.windowing_utils.FPS#

2.0

utils.windowing_utils.FPS_MAX_FRAMES#

768

utils.windowing_utils.FPS_MIN_FRAMES#

4

utils.windowing_utils.FRAME_FACTOR#

2

utils.windowing_utils.IMAGE_FACTOR#

28

utils.windowing_utils.MAX_PIXELS#

None

utils.windowing_utils.MAX_RATIO#

200

utils.windowing_utils.MIN_PIXELS#

None

utils.windowing_utils.OPENAI_CLIP_MEAN#

[0.48145466, 0.4578275, 0.40821073]

utils.windowing_utils.OPENAI_CLIP_STD#

[0.26862954, 0.26130258, 0.27577711]

utils.windowing_utils.VIDEO_MAX_PIXELS#

None

utils.windowing_utils.VIDEO_MIN_PIXELS#

None

utils.windowing_utils.VIDEO_TOTAL_PIXELS#

None

utils.windowing_utils.WINDOW_MIN_FRAMES#

4

class utils.windowing_utils.WindowFrameInfo#

Container for frame window information, storing start and end frame indices.

This class represents a window of frames in a video, defined by its start and end frame positions.

end: int#

None

start: int#

None

utils.windowing_utils.ceil_by_factor(number: float, factor: int) int#

Return the smallest integer greater than or equal to ‘number’ that is divisible by ‘factor’.

utils.windowing_utils.compute_windows(
total_frames: int,
window_size: int = 128,
remainder_threshold: int = 64,
) list[utils.windowing_utils.WindowFrameInfo]#

Generate windows by splitting the video into segments of the specified size.

Args:
  total_frames: Total number of frames in the video.
  window_size: The size of each window in number of frames.
  remainder_threshold: The minimum number of frames required to create a new window from the remainder.

Returns: List of WindowFrameInfo objects, one per window, each holding the start and end frame indices of that window.
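The windowing described above can be illustrated with a minimal sketch. This is not the module's implementation: `compute_windows_sketch` and the stand-in `WindowFrameInfo` dataclass are hypothetical, and the sketch assumes that end indices are inclusive and that a remainder shorter than `remainder_threshold` is dropped. The real function may treat either case differently.

```python
from dataclasses import dataclass


@dataclass
class WindowFrameInfo:
    """Stand-in for the documented class (hypothetical re-declaration)."""

    start: int
    end: int


def compute_windows_sketch(
    total_frames: int,
    window_size: int = 128,
    remainder_threshold: int = 64,
) -> list[WindowFrameInfo]:
    """Split [0, total_frames) into fixed-size windows.

    Assumptions (not confirmed by the source): end indices are inclusive,
    and a trailing remainder shorter than `remainder_threshold` is dropped.
    """
    windows: list[WindowFrameInfo] = []
    num_full = total_frames // window_size
    remainder = total_frames % window_size

    # Emit every full window.
    for i in range(num_full):
        start = i * window_size
        windows.append(WindowFrameInfo(start, start + window_size - 1))

    # Emit the remainder as its own window only if it is long enough.
    if remainder >= remainder_threshold:
        windows.append(WindowFrameInfo(num_full * window_size, total_frames - 1))

    return windows


short_tail = compute_windows_sketch(300)  # remainder of 44 frames is below 64
long_tail = compute_windows_sketch(330)   # remainder of 74 frames forms a window
```

With the defaults, a 300-frame video produces two windows in this sketch, while a 330-frame video produces three because its 74-frame remainder meets the threshold.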

utils.windowing_utils.fetch_video(
video_path: str,
sampling_fps: float = 2.0,
window_range: list[utils.windowing_utils.WindowFrameInfo] | None = None,
*,
do_preprocess: bool = False,
preprocess_dtype: str = 'float32',
num_frames_to_use: int = 0,
flip_input: bool = False,
) tuple[torch.Tensor, list[int]]#

Load and preprocess video frames from a file.

Args:
  video_path: Path to the video file.
  sampling_fps: Target frames per second for sampling.
  window_range: List of frame windows to extract.
  do_preprocess: Whether to preprocess the frames.
  preprocess_dtype: Data type for preprocessing.
  num_frames_to_use: Number of frames to extract (0 for all).
  flip_input: Whether to flip frames horizontally.

Returns: Tuple of (processed frames tensor, frame indices).

utils.windowing_utils.floor_by_factor(number: float, factor: int) int#

Return the largest integer less than or equal to ‘number’ that is divisible by ‘factor’.

utils.windowing_utils.read_video_cpu(
video_path: str,
fps: float,
num_frames_to_use: int,
window_range: list[utils.windowing_utils.WindowFrameInfo],
) tuple[torch.Tensor, list[int]]#

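Read video using PyAV.

Args:
  video_path: Path to the video. Supports “file://”, “http://”, and “https://” URLs as well as local paths.
  fps: Target frames per second.
  num_frames_to_use: Number of frames to use.
  window_range: List of frame windows to read.

Returns: Tuple of (video tensor with shape (T, C, H, W), list of integers). Per the signature, the second element is a list[int]; fetch_video describes the corresponding value as frame indices.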

utils.windowing_utils.round_by_factor(number: float, factor: int) int#

Return the closest integer to ‘number’ that is divisible by ‘factor’.
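The three rounding helpers (ceil_by_factor, floor_by_factor, and round_by_factor) can be sketched as follows. This is a minimal illustration derived from their one-line descriptions, not the module's code. Note that Python's built-in `round` rounds exact ties to the nearest even integer, so tie behavior here is an assumption.

```python
import math


def round_by_factor(number: float, factor: int) -> int:
    """Closest multiple of `factor` to `number` (ties follow Python's round)."""
    return round(number / factor) * factor


def ceil_by_factor(number: float, factor: int) -> int:
    """Smallest multiple of `factor` that is >= `number`."""
    return math.ceil(number / factor) * factor


def floor_by_factor(number: float, factor: int) -> int:
    """Largest multiple of `factor` that is <= `number`."""
    return math.floor(number / factor) * factor


# With factor 28 (the IMAGE_FACTOR value listed above):
print(round_by_factor(100, 28))  # 112
print(ceil_by_factor(100, 28))   # 112
print(floor_by_factor(100, 28))  # 84
```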

utils.windowing_utils.smart_nframes(fps: float, total_frames: int, video_fps: float) int#

Calculate the number of video frames to use as model input.
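One plausible way to compute such a frame count is sketched below. It is an assumption, not the module's confirmed logic: the sketch scales the video duration by the target `fps`, clamps the result between FPS_MIN_FRAMES and FPS_MAX_FRAMES (also capped by the frames actually available), and rounds down to a multiple of FRAME_FACTOR. The constants reuse the values listed on this page; `smart_nframes_sketch` is a hypothetical name.

```python
import math

# Values copied from the Data section of this page.
FPS_MIN_FRAMES = 4
FPS_MAX_FRAMES = 768
FRAME_FACTOR = 2


def ceil_by_factor(number: float, factor: int) -> int:
    return math.ceil(number / factor) * factor


def floor_by_factor(number: float, factor: int) -> int:
    return math.floor(number / factor) * factor


def smart_nframes_sketch(fps: float, total_frames: int, video_fps: float) -> int:
    """Assumed logic: sample at `fps`, clamp, and align to FRAME_FACTOR."""
    min_frames = ceil_by_factor(FPS_MIN_FRAMES, FRAME_FACTOR)
    # Never request more frames than the video contains.
    max_frames = floor_by_factor(min(FPS_MAX_FRAMES, total_frames), FRAME_FACTOR)
    # Duration in seconds times the target sampling rate.
    nframes = total_frames / video_fps * fps
    nframes = min(max(nframes, min_frames), max_frames)
    return floor_by_factor(nframes, FRAME_FACTOR)


# A 10-second clip at 30 fps, sampled at 2 fps.
print(smart_nframes_sketch(2.0, 300, 30.0))  # 20
```

Under these assumptions, a very long video is capped at 768 frames regardless of its duration.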

utils.windowing_utils.smart_resize(
height: int,
width: int,
factor: int = IMAGE_FACTOR,
min_pixels: int = MIN_PIXELS,
max_pixels: int = MAX_PIXELS,
) tuple[int, int]#

Rescales the image so that the following conditions are met.

  1. Both dimensions (height and width) are divisible by ‘factor’.

  2. The total number of pixels is within the range [‘min_pixels’, ‘max_pixels’].

  3. The aspect ratio of the image is maintained as closely as possible.
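The three conditions above can be met with a sketch like the following. It is an illustration under stated assumptions rather than the module's code: `smart_resize_sketch` is a hypothetical name, the pixel bounds are passed explicitly because MIN_PIXELS and MAX_PIXELS are not shown on this page, and rejecting aspect ratios above MAX_RATIO with a `ValueError` is assumed.

```python
import math

MAX_RATIO = 200  # value listed in the Data section


def smart_resize_sketch(
    height: int, width: int, factor: int, min_pixels: int, max_pixels: int
) -> tuple[int, int]:
    """Return (height, width) divisible by `factor` within the pixel budget."""
    if max(height, width) / min(height, width) > MAX_RATIO:
        raise ValueError("aspect ratio exceeds MAX_RATIO")

    # Condition 1: snap each side to the nearest multiple of `factor`.
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)

    # Conditions 2 and 3: scale both sides by the same ratio so the pixel
    # count lands inside the budget while the aspect ratio is preserved.
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


# Hypothetical pixel bounds chosen only for this example.
new_h, new_w = smart_resize_sketch(
    1080, 1920, factor=28, min_pixels=56 * 56, max_pixels=1280 * 28 * 28
)
```

Rounding down when shrinking and up when enlarging keeps the result on the correct side of each bound, which plain rounding would not guarantee.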

utils.windowing_utils.split_video_into_windows(
mp4_bytes: bytes,
window_size: int = 256,
remainder_threshold: int = 128,
sampling_fps: float = 2.0,
*,
model_does_preprocess: bool = False,
preprocess_dtype: str = 'uint8',
flip_input: bool = False,
num_frames_to_use: int = 0,
return_bytes: bool = False,
return_video_frames: bool = True,
num_threads: int = 1,
) tuple[list[bytes], list[torch.Tensor | None], list[utils.windowing_utils.WindowFrameInfo]]#

Calculate windows and return video inputs for language model from input clips.

Processes the video to determine the windows for a clip, decodes it in one shot, and returns the processed frames for each window in a format suitable for consumption by the Qwen model.

Args:
  mp4_bytes: Input video as MP4 bytes.
  window_size: Size of each window in number of frames.
  remainder_threshold: Minimum number of frames required to create a new window from the remainder.
  sampling_fps: Frames per second at which to sample the input video.
  model_does_preprocess: Whether the model performs its own preprocessing.
  preprocess_dtype: Data type to use for preprocessing the video/image inputs.
  flip_input: Whether to flip the input video/image horizontally.
  num_frames_to_use: Number of frames to extract from the video. If 0, uses all frames.
  return_bytes: Whether to extract MP4 bytes for each window for use by PreviewStage.
  return_video_frames: Whether to return decoded video frames.
  num_threads: Number of threads to use.

Returns: Tuple containing:
  - “window_mp4_bytes”: MP4 bytes corresponding to each window; only used when the Preview stage is enabled.
  - “window_frames”: Decoded, per-window processed frames ready for use by the Qwen model.
  - “window info”: Start and end frame indices for each window in a clip.