utils.windowing_utils
Module Contents
Classes
WindowFrameInfo | Container for frame window information, storing start and end frame indices.
Functions
ceil_by_factor | Return the smallest integer greater than or equal to ‘number’ that is divisible by ‘factor’.
compute_windows | Generate windows by splitting the video into segments of the specified size.
fetch_video | Load and preprocess video frames from a file.
floor_by_factor | Return the largest integer less than or equal to ‘number’ that is divisible by ‘factor’.
read_video_cpu | Read video using PyAV.
round_by_factor | Return the closest integer to ‘number’ that is divisible by ‘factor’.
smart_nframes | Calculate the number of video frames to use as model input.
smart_resize | Rescale the image so that both dimensions are divisible by ‘factor’, the pixel count lies within [‘min_pixels’, ‘max_pixels’], and the aspect ratio is preserved as closely as possible.
split_video_into_windows | Calculate windows and return video inputs for the language model from input clips.
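Data
FPS, FPS_MAX_FRAMES, FPS_MIN_FRAMES, FRAME_FACTOR, IMAGE_FACTOR, MAX_PIXELS, MAX_RATIO, MIN_PIXELS, OPENAI_CLIP_MEAN, OPENAI_CLIP_STD, VIDEO_MAX_PIXELS, VIDEO_MIN_PIXELS, VIDEO_TOTAL_PIXELS, WINDOW_MIN_FRAMES
API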
- utils.windowing_utils.FPS = 2.0
- utils.windowing_utils.FPS_MAX_FRAMES = 768
- utils.windowing_utils.FPS_MIN_FRAMES = 4
- utils.windowing_utils.FRAME_FACTOR = 2
- utils.windowing_utils.IMAGE_FACTOR = 28
- utils.windowing_utils.MAX_PIXELS = None
- utils.windowing_utils.MAX_RATIO = 200
- utils.windowing_utils.MIN_PIXELS = None
- utils.windowing_utils.OPENAI_CLIP_MEAN = [0.48145466, 0.4578275, 0.40821073]
- utils.windowing_utils.OPENAI_CLIP_STD = [0.26862954, 0.26130258, 0.27577711]
- utils.windowing_utils.VIDEO_MAX_PIXELS = None
- utils.windowing_utils.VIDEO_MIN_PIXELS = None
- utils.windowing_utils.VIDEO_TOTAL_PIXELS = None
- utils.windowing_utils.WINDOW_MIN_FRAMES = 4
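OPENAI_CLIP_MEAN and OPENAI_CLIP_STD are the standard per-channel RGB normalization statistics used by OpenAI CLIP. This page does not document where the module applies them. The sketch below shows how such constants are typically used. It assumes RGB channel order, a (T, C, H, W) uint8 layout, and a hypothetical helper name `normalize_clip`.

```python
import numpy as np

# Values copied from the documented module constants.
OPENAI_CLIP_MEAN = [0.48145466, 0.4578275, 0.40821073]
OPENAI_CLIP_STD = [0.26862954, 0.26130258, 0.27577711]


def normalize_clip(frames_uint8: np.ndarray) -> np.ndarray:
    """Hypothetical helper: scale (T, C, H, W) uint8 frames to [0, 1], then standardize per channel."""
    # Reshape to (1, 3, 1, 1) so the per-channel stats broadcast over T, H and W.
    mean = np.asarray(OPENAI_CLIP_MEAN, dtype=np.float32).reshape(1, 3, 1, 1)
    std = np.asarray(OPENAI_CLIP_STD, dtype=np.float32).reshape(1, 3, 1, 1)
    return (frames_uint8.astype(np.float32) / 255.0 - mean) / std


out = normalize_clip(np.zeros((2, 3, 4, 5), dtype=np.uint8))
print(out.shape, out.dtype)
```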
- class utils.windowing_utils.WindowFrameInfo
Container for frame window information, storing start and end frame indices.
This class represents a window of frames in a video, defined by its start and end frame positions.
  - end: int (end frame index of the window)
  - start: int (start frame index of the window)
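compute_windows yields plain (start_frame, end_frame) tuples, while fetch_video and read_video_cpu take a list of WindowFrameInfo. The sketch below assumes WindowFrameInfo is a plain dataclass with the two documented fields. That assumption, and the field order, are not confirmed by this page. Keyword arguments avoid depending on the field order.

```python
from dataclasses import dataclass


@dataclass
class WindowFrameInfo:
    """Sketch of the documented record: one window of frames in a video."""

    start: int  # start frame index of the window
    end: int  # end frame index of the window


# Wrap (start, end) pairs, such as those yielded by compute_windows, into records.
pairs = [(0, 128), (128, 256)]
windows = [WindowFrameInfo(start=s, end=e) for s, e in pairs]
print(windows[1])  # WindowFrameInfo(start=128, end=256)
```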
- utils.windowing_utils.ceil_by_factor(number: float, factor: int) → int
Return the smallest integer greater than or equal to ‘number’ that is divisible by ‘factor’.
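A minimal sketch of this behavior, assuming the straightforward divide, ceil, multiply approach. It is written from the docstring and is not the module's source.

```python
import math


def ceil_by_factor(number: float, factor: int) -> int:
    # Sketch: the smallest multiple of `factor` that is >= `number`.
    return math.ceil(number / factor) * factor


print(ceil_by_factor(30, 28))  # 56
print(ceil_by_factor(56, 28))  # 56 (already a multiple, so unchanged)
```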
- utils.windowing_utils.compute_windows(
    total_frames: int,
    window_size: int = 128,
    remainder_threshold: int = 64,
  )
Generate windows by splitting the video into segments of the specified size.
Args:
  total_frames: Total number of frames in the video.
  window_size: Size of each window, in frames.
  remainder_threshold: Minimum number of leftover frames required to create a new window from the remainder.
Yields:
  Tuple of (start_frame, end_frame) representing each window.
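The docstring leaves three things unspecified: whether end_frame is inclusive, what happens to a remainder shorter than remainder_threshold, and how a clip shorter than one window is handled. The sketch below illustrates one plausible reading under explicit assumptions:

- Windows are half-open (start, end).
- A short remainder is merged into the last full window.
- A clip shorter than one window becomes a single window.

`compute_windows_sketch` is a hypothetical name, not the module's implementation.

```python
from collections.abc import Iterator


def compute_windows_sketch(
    total_frames: int,
    window_size: int = 128,
    remainder_threshold: int = 64,
) -> Iterator[tuple[int, int]]:
    """Hypothetical sketch: yield half-open (start, end) windows covering total_frames."""
    num_full, remainder = divmod(total_frames, window_size)
    for i in range(num_full):
        start = i * window_size
        end = start + window_size
        # ASSUMPTION: a remainder below the threshold is absorbed into the last full window.
        if i == num_full - 1 and 0 < remainder < remainder_threshold:
            end = total_frames
        yield start, end
    # A long enough remainder becomes its own window, as does a clip shorter than one window.
    if remainder > 0 and (remainder >= remainder_threshold or num_full == 0):
        yield num_full * window_size, total_frames


print(list(compute_windows_sketch(300)))  # [(0, 128), (128, 300)]
print(list(compute_windows_sketch(330)))  # [(0, 128), (128, 256), (256, 330)]
```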
- utils.windowing_utils.fetch_video(
    video_path: str,
    sampling_fps: float = 2.0,
    window_range: list[utils.windowing_utils.WindowFrameInfo] | None = None,
    *,
    do_preprocess: bool = False,
    preprocess_dtype: str = 'float32',
    num_frames_to_use: int = 0,
    flip_input: bool = False,
  )
Load and preprocess video frames from a file.
Args:
  video_path: Path to the video file.
  sampling_fps: Target frames per second for sampling.
  window_range: List of frame windows to extract.
  do_preprocess: Whether to preprocess the frames.
  preprocess_dtype: Data type for preprocessing.
  num_frames_to_use: Number of frames to extract (0 for all).
  flip_input: Whether to flip frames horizontally.
Returns:
  Tuple of (processed frames tensor, frame indices).
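The docstring does not say how sampling_fps translates into frame indices within a window. As an illustration only, the hypothetical helper `sample_window_indices` below picks evenly spaced indices from a half-open window at roughly the target rate. The spacing rule and the at-least-one-frame floor are assumptions. The module may instead rely on smart_nframes or a different rule.

```python
def sample_window_indices(
    start: int, end: int, video_fps: float, sampling_fps: float
) -> list[int]:
    """Hypothetical helper: evenly spaced frame indices in [start, end) at about sampling_fps."""
    num_frames = end - start
    if num_frames <= 0:
        return []
    # Samples the window's duration calls for at the target rate (at least one).
    n = max(1, round(num_frames / video_fps * sampling_fps))
    n = min(n, num_frames)  # cannot sample more frames than the window contains
    step = num_frames / n
    return [start + int(i * step) for i in range(n)]


print(sample_window_indices(0, 60, video_fps=30.0, sampling_fps=2.0))  # [0, 15, 30, 45]
```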
- utils.windowing_utils.floor_by_factor(number: float, factor: int) → int
Return the largest integer less than or equal to ‘number’ that is divisible by ‘factor’.
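The counterpart sketch for rounding down, again assuming a divide, floor, multiply approach rather than quoting the module's source.

```python
import math


def floor_by_factor(number: float, factor: int) -> int:
    # Sketch: the largest multiple of `factor` that is <= `number`.
    return math.floor(number / factor) * factor


print(floor_by_factor(55, 28))  # 28
print(floor_by_factor(767.9, 2))  # 766
```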
- utils.windowing_utils.read_video_cpu(
    video_path: str,
    fps: float,
    num_frames_to_use: int,
    window_range: list[utils.windowing_utils.WindowFrameInfo],
  )
Read video using PyAV.
Args:
  video_path: Path to the video; supports “file://”, “http://”, and “https://” URLs as well as local paths.
  fps: Frames per second.
  num_frames_to_use: Number of frames to use.
  window_range: Frame windows to read.
Returns:
  torch.Tensor: The video tensor with shape (T, C, H, W).
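PyAV decodes each frame as an (H, W, C) array, so producing the documented (T, C, H, W) layout needs a stack followed by an axis transpose. The sketch below shows that step with numpy. Its assumptions:

- `decode_rgb_frames` and `frames_to_tchw` are hypothetical helpers.
- Window selection and fps-based sampling are omitted.
- PyAV is imported lazily so the layout helper runs without it installed.

`torch.from_numpy` can turn the result into a tensor.

```python
import numpy as np


def decode_rgb_frames(video_path: str) -> list[np.ndarray]:
    """Hypothetical helper: decode every frame of the first video stream as RGB (H, W, 3)."""
    import av  # PyAV, imported lazily

    with av.open(video_path) as container:
        return [frame.to_ndarray(format="rgb24") for frame in container.decode(video=0)]


def frames_to_tchw(frames: list[np.ndarray]) -> np.ndarray:
    """Stack (H, W, C) frames into (T, H, W, C), then move channels to axis 1."""
    return np.stack(frames).transpose(0, 3, 1, 2)


video = frames_to_tchw([np.zeros((4, 6, 3), dtype=np.uint8) for _ in range(5)])
print(video.shape)  # (5, 3, 4, 6)
```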
- utils.windowing_utils.round_by_factor(number: float, factor: int) → int
Return the closest integer to ‘number’ that is divisible by ‘factor’.
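A minimal sketch, assuming the function is built on Python's `round()`; the module's source is not shown here. If that assumption holds, exact ties resolve to the even quotient (banker's rounding) rather than always rounding up. Small inputs can also round to 0.

```python
def round_by_factor(number: float, factor: int) -> int:
    # Sketch: nearest multiple of `factor`; built-in round() sends exact halves to the even integer.
    return round(number / factor) * factor


print(round_by_factor(1080, 28))  # 1092 (1080 / 28 is about 38.57, rounded to 39)
print(round_by_factor(42, 28))  # 56 (1.5 rounds to 2)
print(round_by_factor(14, 28))  # 0 (0.5 rounds to 0, the even neighbor)
```

A result of 0 for small inputs is why a caller may clamp with `max(factor, ...)` when a positive size is required.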
- utils.windowing_utils.smart_nframes(fps: float, total_frames: int, video_fps: float) → int
Calculate the number of video frames to use as model input.
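This module's constants (FPS_MIN_FRAMES = 4, FPS_MAX_FRAMES = 768, FRAME_FACTOR = 2) appear to mirror those of Qwen2-VL's reference video preprocessing. The sketch below follows that scheme:

1. Compute the clip duration multiplied by the target fps.
2. Clamp the result between the minimum and maximum frame counts.
3. Round to an even number.

This is an assumption about the approach, not the module's source. `smart_nframes_sketch` is a hypothetical name.

```python
import math

# Documented module constants.
FPS_MIN_FRAMES = 4
FPS_MAX_FRAMES = 768
FRAME_FACTOR = 2


def round_by_factor(number: float, factor: int) -> int:
    return round(number / factor) * factor


def ceil_by_factor(number: float, factor: int) -> int:
    return math.ceil(number / factor) * factor


def floor_by_factor(number: float, factor: int) -> int:
    return math.floor(number / factor) * factor


def smart_nframes_sketch(fps: float, total_frames: int, video_fps: float) -> int:
    """Sketch: frames to sample from a clip at `fps`, clamped and made a multiple of FRAME_FACTOR."""
    min_frames = ceil_by_factor(FPS_MIN_FRAMES, FRAME_FACTOR)
    max_frames = floor_by_factor(min(FPS_MAX_FRAMES, total_frames), FRAME_FACTOR)
    # Clip duration in seconds (total_frames / video_fps) times the target rate.
    nframes = total_frames / video_fps * fps
    nframes = min(max(nframes, min_frames), max_frames)
    nframes = round_by_factor(nframes, FRAME_FACTOR)
    if not FRAME_FACTOR <= nframes <= total_frames:
        raise ValueError(f"nframes={nframes} outside [{FRAME_FACTOR}, {total_frames}]")
    return nframes


print(smart_nframes_sketch(2.0, 300, 30.0))  # 20: a 10 s clip sampled at 2 fps
print(smart_nframes_sketch(2.0, 10, 30.0))  # 4: raised to FPS_MIN_FRAMES
print(smart_nframes_sketch(2.0, 100_000, 30.0))  # 768: capped at FPS_MAX_FRAMES
```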
- utils.windowing_utils.smart_resize(
    height: int,
    width: int,
    factor: int = IMAGE_FACTOR,
    min_pixels: int = MIN_PIXELS,
    max_pixels: int = MAX_PIXELS,
  )
Rescale the image so that the following conditions are met:
  - Both dimensions (height and width) are divisible by ‘factor’.
  - The total number of pixels is within the range [‘min_pixels’, ‘max_pixels’].
  - The aspect ratio of the image is maintained as closely as possible.
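The three conditions above can be met as follows:

1. Snap each side to a multiple of ‘factor’.
2. If the pixel count falls outside the budget, rescale both sides by the same factor, √(current/target).
3. Round down when shrinking and up when growing, so the budget holds.

The sketch below follows the approach of Qwen2-VL's reference preprocessing, whose IMAGE_FACTOR (28) and MAX_RATIO (200) values this module shares. It is an assumption about the method, not this module's source. This page renders MIN_PIXELS and MAX_PIXELS as None, so the sketch takes both bounds explicitly. The numbers in the usage line (4·28·28 and 1280·28·28) are illustrative only.

```python
import math

IMAGE_FACTOR = 28  # documented module constant
MAX_RATIO = 200  # documented module constant


def round_by_factor(number: float, factor: int) -> int:
    return round(number / factor) * factor


def ceil_by_factor(number: float, factor: int) -> int:
    return math.ceil(number / factor) * factor


def floor_by_factor(number: float, factor: int) -> int:
    return math.floor(number / factor) * factor


def smart_resize_sketch(
    height: int, width: int, factor: int = IMAGE_FACTOR, *, min_pixels: int, max_pixels: int
) -> tuple[int, int]:
    """Sketch: (height, width) divisible by factor with pixel count in [min_pixels, max_pixels]."""
    if max(height, width) / min(height, width) > MAX_RATIO:
        raise ValueError(f"aspect ratio must not exceed {MAX_RATIO}")
    # Snap each side to the nearest multiple of factor, never below factor itself.
    h_bar = max(factor, round_by_factor(height, factor))
    w_bar = max(factor, round_by_factor(width, factor))
    if h_bar * w_bar > max_pixels:
        # Shrink both sides by one scale and round down so the pixel budget holds.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = max(factor, floor_by_factor(height / beta, factor))
        w_bar = max(factor, floor_by_factor(width / beta, factor))
    elif h_bar * w_bar < min_pixels:
        # Grow both sides by one scale and round up so the minimum is reached.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = ceil_by_factor(height * beta, factor)
        w_bar = ceil_by_factor(width * beta, factor)
    return h_bar, w_bar


print(smart_resize_sketch(1080, 1920, min_pixels=4 * 28 * 28, max_pixels=1280 * 28 * 28))  # (728, 1316)
```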
- utils.windowing_utils.split_video_into_windows(
    mp4_bytes: bytes,
    window_size: int = 256,
    remainder_threshold: int = 128,
    sampling_fps: float = 2.0,
    *,
    model_does_preprocess: bool = False,
    preprocess_dtype: str = 'uint8',
    flip_input: bool = False,
    num_frames_to_use: int = 0,
    return_bytes: bool = False,
    return_video_frames: bool = True,
    num_threads: int = 1,
  )
Calculate windows and return video inputs for the language model from input clips.
Processes the video to determine the windows for a clip, decodes it in one shot, and returns processed frames for each window in a format suitable for consumption by the Qwen model.
Args:
  mp4_bytes: Input video as MP4 bytes.
  window_size: Size of each window, in frames.
  remainder_threshold: Minimum number of leftover frames required to create a new window from the remainder.
  sampling_fps: Target frames per second for sampling.
  model_does_preprocess: Whether the model performs preprocessing.
  preprocess_dtype: Data type to use for preprocessing the video/image inputs.
  flip_input: Whether to flip the input video/image horizontally.
  num_frames_to_use: Number of frames to extract from the video. If 0, uses all frames.
  return_bytes: Whether to extract MP4 bytes for each window, for use by PreviewStage.
  return_video_frames: Whether to return the decoded video frames.
  num_threads: Number of threads to use.
Returns:
  Tuple containing:
  - “window_mp4_bytes”: MP4 bytes for each window; only used when the Preview stage is enabled.
  - “window_frames”: Decoded and per-window processed frames, ready for use by the Qwen model.
  - “window info”: Start and end frame indices for each window in a clip.
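The docstring says the clip is decoded once and then split per window. The sketch below shows that splitting step with numpy, including an optional horizontal flip that reverses the width axis. It assumes half-open (start, end) windows and a (T, C, H, W) layout. `slice_windows` is a hypothetical name. Preprocessing, per-window MP4 re-encoding (return_bytes), and threading are omitted.

```python
import numpy as np


def slice_windows(
    frames: np.ndarray,
    windows: list[tuple[int, int]],
    flip_input: bool = False,
) -> list[np.ndarray]:
    """Hypothetical helper: cut one decoded (T, C, H, W) array into per-window clips."""
    if flip_input:
        # Horizontal flip reverses the width axis (the last axis in T, C, H, W).
        frames = frames[..., ::-1]
    # Slicing returns views, so no frame data is copied per window.
    return [frames[start:end] for start, end in windows]


frames = np.zeros((300, 3, 8, 8), dtype=np.uint8)
clips = slice_windows(frames, [(0, 128), (128, 256), (256, 300)])
print([clip.shape[0] for clip in clips])  # [128, 128, 44]
```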