`utils.windowing_utils`#

Module Contents#

Classes#

`WindowFrameInfo`	Container for frame window information, storing start and end frame indices.

Functions#

`ceil_by_factor`	Return the smallest integer greater than or equal to ‘number’ that is divisible by ‘factor’.
`compute_windows`	Generate windows by splitting the video into segments of the specified size.
`fetch_video`	Load and preprocess video frames from a file.
`floor_by_factor`	Return the largest integer less than or equal to ‘number’ that is divisible by ‘factor’.
`read_video_cpu`	Read video using PyAV.
`round_by_factor`	Return the closest integer to ‘number’ that is divisible by ‘factor’.
`smart_nframes`	Calculate the number of frames for video used for model inputs.
`smart_resize`	Rescales the image so that the following conditions are met.
`split_video_into_windows`	Calculate windows and return video inputs for language model from input clips.

Data#

`FPS`
`FPS_MAX_FRAMES`
`FPS_MIN_FRAMES`
`FRAME_FACTOR`
`IMAGE_FACTOR`
`MAX_PIXELS`
`MAX_RATIO`
`MIN_PIXELS`
`OPENAI_CLIP_MEAN`
`OPENAI_CLIP_STD`
`VIDEO_MAX_PIXELS`
`VIDEO_MIN_PIXELS`
`VIDEO_TOTAL_PIXELS`
`WINDOW_MIN_FRAMES`

API#

utils.windowing_utils.FPS#: 2.0

utils.windowing_utils.FPS_MAX_FRAMES#: 768

utils.windowing_utils.FPS_MIN_FRAMES#: 4

utils.windowing_utils.FRAME_FACTOR#: 2

utils.windowing_utils.IMAGE_FACTOR#: 28

utils.windowing_utils.MAX_PIXELS#: None

utils.windowing_utils.MAX_RATIO#: 200

utils.windowing_utils.MIN_PIXELS#: None

utils.windowing_utils.OPENAI_CLIP_MEAN#: [0.48145466, 0.4578275, 0.40821073]

utils.windowing_utils.OPENAI_CLIP_STD#: [0.26862954, 0.26130258, 0.27577711]

utils.windowing_utils.VIDEO_MAX_PIXELS#: None

utils.windowing_utils.VIDEO_MIN_PIXELS#: None

utils.windowing_utils.VIDEO_TOTAL_PIXELS#: None

utils.windowing_utils.WINDOW_MIN_FRAMES#: 4
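
The reference does not say how `OPENAI_CLIP_MEAN` and `OPENAI_CLIP_STD` are consumed. The sketch below assumes the common convention of per-channel normalization on RGB values scaled to [0, 1]; `normalize_pixel` is a hypothetical helper, not part of this module.

```python
# Values copied from the constants listed above.
OPENAI_CLIP_MEAN = [0.48145466, 0.4578275, 0.40821073]
OPENAI_CLIP_STD = [0.26862954, 0.26130258, 0.27577711]


def normalize_pixel(rgb: list[float]) -> list[float]:
    """Hypothetical helper: normalize one RGB pixel given in the 0..255 range.

    Assumption: channels are scaled to [0, 1] first, then shifted by the
    per-channel mean and divided by the per-channel standard deviation.
    """
    return [
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, OPENAI_CLIP_MEAN, OPENAI_CLIP_STD)
    ]


if __name__ == "__main__":
    print(normalize_pixel([255.0, 255.0, 255.0]))
```

Under this assumption, a pixel whose channels equal the means maps to approximately zero in every channel.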

class utils.windowing_utils.WindowFrameInfo#

Container for frame window information, storing start and end frame indices.

This class represents a window of frames in a video, defined by its start and end frame positions.

end: int#: None

start: int#: None

utils.windowing_utils.ceil_by_factor(number: float, factor: int) → int#: Return the smallest integer greater than or equal to ‘number’ that is divisible by ‘factor’.

utils.windowing_utils.compute_windows( total_frames: int, window_size: int = 128, remainder_threshold: int = 64, ) → list[utils.windowing_utils.WindowFrameInfo]#

Generate windows by splitting the video into segments of the specified size.

Args:
    total_frames: Total number of frames in the video.
    window_size: The size of each window in number of frames.
    remainder_threshold: The minimum number of frames required to create a new window from the remainder.

Returns: List of WindowFrameInfo objects, one per window, each storing the start and end frame indices of that window.
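
The windowing rule described above can be sketched as follows. This is an illustration under stated assumptions, not the module's code: `Window` and `sketch_compute_windows` are hypothetical names, end indices are assumed to be inclusive, and a remainder shorter than `remainder_threshold` is assumed to be merged into the last full window.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Hypothetical stand-in for WindowFrameInfo."""

    start: int
    end: int  # assumed inclusive


def sketch_compute_windows(
    total_frames: int,
    window_size: int = 128,
    remainder_threshold: int = 64,
) -> list[Window]:
    if total_frames <= 0 or window_size <= 0:
        return []

    windows: list[Window] = []
    start = 0
    # Emit as many full windows as fit.
    while total_frames - start >= window_size:
        windows.append(Window(start, start + window_size - 1))
        start += window_size

    remainder = total_frames - start
    if remainder == 0:
        return windows

    if remainder >= remainder_threshold or not windows:
        # Long enough for its own window, or the video is shorter
        # than a single window (assumption: keep it as one window).
        windows.append(Window(start, total_frames - 1))
    else:
        # Assumption: a short remainder extends the last window.
        windows[-1].end = total_frames - 1
    return windows


if __name__ == "__main__":
    for w in sketch_compute_windows(300):
        print(w)
```

With the default arguments, a 300-frame video leaves a 44-frame remainder, which falls below the threshold and is therefore folded into the second window in this sketch.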

utils.windowing_utils.fetch_video( video_path: str, sampling_fps: float = 2.0, window_range: list[utils.windowing_utils.WindowFrameInfo] | None = None, *, do_preprocess: bool = False, preprocess_dtype: str = 'float32', num_frames_to_use: int = 0, flip_input: bool = False, ) → tuple[torch.Tensor, list[int]]#

Load and preprocess video frames from a file.

Args:
    video_path: Path to the video file.
    sampling_fps: Target frames per second for sampling.
    window_range: List of frame windows to extract.
    do_preprocess: Whether to preprocess the frames.
    preprocess_dtype: Data type for preprocessing.
    num_frames_to_use: Number of frames to extract (0 for all).
    flip_input: Whether to flip frames horizontally.

Returns: Tuple of (processed frames tensor, frame indices).

utils.windowing_utils.floor_by_factor(number: float, factor: int) → int#: Return the largest integer less than or equal to ‘number’ that is divisible by ‘factor’.

utils.windowing_utils.read_video_cpu( video_path: str, fps: float, num_frames_to_use: int, window_range: list[utils.windowing_utils.WindowFrameInfo], ) → tuple[torch.Tensor, list[int]]#

Read video using PyAV.

Args:
    video_path: Path to the video. Supports “file://”, “http://”, and “https://” URLs as well as local paths.
    fps: Frames per second.
    num_frames_to_use: Number of frames to use.
    window_range: List of frame windows to extract.

Returns: Tuple of (video tensor with shape (T, C, H, W), list of int).

utils.windowing_utils.round_by_factor(number: float, factor: int) → int#: Return the closest integer to ‘number’ that is divisible by ‘factor’.
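
The three `*_by_factor` helpers (`ceil_by_factor`, `floor_by_factor`, `round_by_factor`) can be illustrated with the minimal sketch below. It follows the one-line descriptions in this reference and is not copied from the module; tie-breaking in `round_by_factor` is an assumption here, since Python's built-in `round` rounds halves to the nearest even integer.

```python
import math


def round_by_factor(number: float, factor: int) -> int:
    """Closest multiple of `factor` to `number` (ties follow Python's round)."""
    return round(number / factor) * factor


def ceil_by_factor(number: float, factor: int) -> int:
    """Smallest multiple of `factor` that is >= `number`."""
    return math.ceil(number / factor) * factor


def floor_by_factor(number: float, factor: int) -> int:
    """Largest multiple of `factor` that is <= `number`."""
    return math.floor(number / factor) * factor


if __name__ == "__main__":
    # 30 lies between the multiples 28 and 56 of the factor 28.
    print(round_by_factor(30, 28), ceil_by_factor(30, 28), floor_by_factor(30, 28))
```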

utils.windowing_utils.smart_nframes(fps: float, total_frames: int, video_fps: float) → int#: Calculate the number of frames for video used for model inputs.
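
The reference does not spell out how `smart_nframes` arrives at its result. One plausible reading, given the `FPS_MIN_FRAMES`, `FPS_MAX_FRAMES`, and `FRAME_FACTOR` constants listed above, is sketched below: scale the frame count by the ratio of sampling rate to source rate, clamp it, and round it to a multiple of the frame factor. `sketch_smart_nframes` is a hypothetical name, and the clamping order is an assumption.

```python
import math

# Values copied from the constants listed in this reference.
FRAME_FACTOR = 2
FPS_MIN_FRAMES = 4
FPS_MAX_FRAMES = 768


def sketch_smart_nframes(fps: float, total_frames: int, video_fps: float) -> int:
    """Hypothetical sketch: number of frames to sample from a video."""
    # Bounds, each aligned to a multiple of FRAME_FACTOR.
    min_frames = math.ceil(FPS_MIN_FRAMES / FRAME_FACTOR) * FRAME_FACTOR
    max_frames = (
        math.floor(min(FPS_MAX_FRAMES, total_frames) / FRAME_FACTOR) * FRAME_FACTOR
    )
    # Duration in seconds times the target sampling rate.
    nframes = total_frames / video_fps * fps
    # Assumption: clamp to [min_frames, max_frames], upper bound applied last.
    nframes = min(max(nframes, min_frames), max_frames)
    return int(round(nframes / FRAME_FACTOR) * FRAME_FACTOR)


if __name__ == "__main__":
    # 300 frames recorded at 30 fps, sampled at 2 fps.
    print(sketch_smart_nframes(fps=2.0, total_frames=300, video_fps=30.0))
```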

utils.windowing_utils.smart_resize( height: int, width: int, factor: int = IMAGE_FACTOR, min_pixels: int = MIN_PIXELS, max_pixels: int = MAX_PIXELS, ) → tuple[int, int]#

Rescales the image so that the following conditions are met:

1. Both dimensions (height and width) are divisible by ‘factor’.
2. The total number of pixels is within the range [‘min_pixels’, ‘max_pixels’].
3. The aspect ratio of the image is maintained as closely as possible.
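
One way to satisfy the three conditions listed for `smart_resize` is sketched below: round both sides to multiples of the factor, then rescale by a common ratio if the pixel count falls outside the allowed range. This is an illustration, not the module's implementation. `sketch_smart_resize` is a hypothetical name, and the pixel bounds in the usage example are made-up values, since this reference does not show the defaults of `MIN_PIXELS` and `MAX_PIXELS`.

```python
import math


def sketch_smart_resize(
    height: int,
    width: int,
    factor: int,
    min_pixels: int,
    max_pixels: int,
) -> tuple[int, int]:
    """Hypothetical sketch: target (height, width) divisible by `factor`."""
    # Start from the nearest multiples of `factor`, at least one factor each.
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)

    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: grow both sides by the same ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


if __name__ == "__main__":
    # Illustrative bounds only; the module's real defaults are not shown here.
    print(sketch_smart_resize(1080, 1920, factor=28,
                              min_pixels=4 * 28 * 28, max_pixels=768 * 28 * 28))
```

Rounding down when shrinking and up when growing keeps the result inside the pixel range. The sketch does not guard against extreme aspect ratios, where rounding down could reduce a side to zero.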

utils.windowing_utils.split_video_into_windows( mp4_bytes: bytes, window_size: int = 256, remainder_threshold: int = 128, sampling_fps: float = 2.0, *, model_does_preprocess: bool = False, preprocess_dtype: str = 'uint8', flip_input: bool = False, num_frames_to_use: int = 0, return_bytes: bool = False, return_video_frames: bool = True, num_threads: int = 1, ) → tuple[list[bytes], list[torch.Tensor | None], list[utils.windowing_utils.WindowFrameInfo]]#

Calculate windows and return video inputs for language model from input clips.

Processes the video to determine the windows for a clip, decodes it in one shot, and returns the processed frames for each window in a format suitable for consumption by the Qwen model.

Args:
    mp4_bytes: Input video as bytes.
    window_size: The size of each window in number of frames.
    remainder_threshold: The minimum number of frames required to create a new window from the remainder.
    sampling_fps: Target frames per second for sampling.
    model_does_preprocess: Whether the model does its own preprocessing.
    preprocess_dtype: Data type to use for preprocessing the video/image inputs.
    flip_input: Whether to flip the input video/image horizontally.
    num_frames_to_use: Number of frames to extract from the video. If 0, uses all frames.
    return_bytes: Whether to extract mp4 bytes for each window for use by PreviewStage.
    return_video_frames: Whether to return video frames.
    num_threads: Number of threads to use.

Returns: Tuple containing:
    - “window_mp4_bytes”: mp4 bytes corresponding to each window; only used when the Preview stage is enabled.
    - “window_frames”: Decoded and per-window processed frames ready for use by the Qwen model.
    - “window info”: Start and end frame indices for each window in a clip.
