nemo_curator.utils.decoder_utils

Module Contents

Classes

Name	Description
`FrameExtractionPolicy`	Policy for extracting frames from video content.
`FrameExtractionSignature`	Configuration for frame extraction parameters.
`FramePurpose`	Purpose for extracting frames from video content.
`Resolution`	Container for video frame dimensions.
`VideoMetadata`	Metadata for video content including dimensions, timing, and codec information.

Functions

Name	Description
`_make_video_stream`	Convert various input types into a binary stream for video processing.
`decode_video_cpu`	Decode video frames from a binary stream using PyAV with configurable frame rate sampling.
`decode_video_cpu_frame_ids`	Decode video using PyAV frame ids.
`extract_frames`	Extract frames from a video into a numpy array.
`extract_video_metadata`	Extract metadata from a video file using ffprobe.
`find_closest_indices`	Find the closest indices in src to each element in dst.
`get_avg_frame_rate`	Get the average frame rate of a video.
`get_frame_count`	Get the total number of frames in a video file or stream.
`get_video_timestamps`	Get timestamps for all frames in a video stream.
`sample_closest`	Sample `src` at `sample_rate` rate and return the closest indices.
`save_stream_position`	Context manager that saves and restores stream position.

API

class nemo_curator.utils.decoder_utils.FrameExtractionPolicy

Bases: enum.Enum

Policy for extracting frames from video content.

This enum defines different strategies for selecting frames from a video, including first frame, middle frame, last frame, or a sequence of frames.

first

= 0

last

= 2

middle

= 1

sequence

= 3
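
As an illustration of how such a policy could translate into frame indices, here is a minimal sketch. The enum below is a local stand-in that mirrors the member values listed above, and `select_frame_indices` is a hypothetical helper, not part of `nemo_curator`; the library's actual selection logic may differ.

```python
from enum import Enum


class FrameExtractionPolicy(Enum):
    """Local stand-in mirroring the documented member values."""

    first = 0
    middle = 1
    last = 2
    sequence = 3


def select_frame_indices(policy: FrameExtractionPolicy, num_frames: int) -> list[int]:
    """Hypothetical helper: map a policy to the frame indices it would pick."""
    if num_frames <= 0:
        return []
    if policy is FrameExtractionPolicy.first:
        return [0]
    if policy is FrameExtractionPolicy.middle:
        return [num_frames // 2]
    if policy is FrameExtractionPolicy.last:
        return [num_frames - 1]
    # sequence: every frame, in order
    return list(range(num_frames))


print(select_frame_indices(FrameExtractionPolicy.middle, 10))  # [5]
```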

class nemo_curator.utils.decoder_utils.FrameExtractionSignature(
    extraction_policy: nemo_curator.utils.decoder_utils.FrameExtractionPolicy,
    target_fps: float
)

Dataclass

Configuration for frame extraction parameters.

This class combines extraction policy and target frame rate into a single signature that can be used to identify and reproduce frame extraction settings.

extraction_policy

FrameExtractionPolicy

target_fps

float

nemo_curator.utils.decoder_utils.FrameExtractionSignature.to_str() -> str

Convert frame extraction signature to string format.

Returns: str

String representation of extraction policy and target FPS.

class nemo_curator.utils.decoder_utils.FramePurpose

Bases: enum.Enum

Purpose for extracting frames from video content.

This enum defines different purposes for extracting frames from a video, including aesthetics and embeddings.

AESTHETICS

= 1

EMBEDDINGS

= 2

class nemo_curator.utils.decoder_utils.Resolution()

Bases: NamedTuple

Container for video frame dimensions.

This class stores the height and width of video frames as a named tuple.

height

int

width

int

class nemo_curator.utils.decoder_utils.VideoMetadata(
    height: int = None,
    width: int = None,
    fps: float = None,
    num_frames: int = None,
    video_codec: str = None,
    pixel_format: str = None,
    video_duration: float = None,
    audio_codec: str = None,
    bit_rate_k: int = None
)

Dataclass

Metadata for video content including dimensions, timing, and codec information.

This class stores essential video properties such as resolution, frame rate, duration, and encoding details.

audio_codec

str = None

bit_rate_k

int = None

fps

float = None

height

int = None

num_frames

int = None

pixel_format

str = None

video_codec

str = None

video_duration

float = None

width

int = None

nemo_curator.utils.decoder_utils._make_video_stream(
    data: pathlib.Path | str | typing.BinaryIO | bytes | io.BytesIO | io.BufferedReader
) -> typing.BinaryIO

Convert various input types into a binary stream for video processing.

This function handles different input types that could represent video data and converts them into a consistent BinaryIO interface that can be used for video processing operations.

Parameters:

data

The input video data, which can be one of:

Path: A path to a video file
bytes: Raw video data in bytes
io.BytesIO: An in-memory binary stream
io.BufferedReader: A buffered binary file reader
BinaryIO: Any binary stream

Returns: BinaryIO

A binary stream containing the video data

Raises:

ValueError: If the input type is not one of the supported types
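
The type dispatch described above can be sketched roughly as follows. This is an illustrative stand-in using only the standard library, not the library's implementation; the name `make_stream_sketch` and the duck-typed `read` check are assumptions.

```python
import io
from pathlib import Path
from typing import BinaryIO


def make_stream_sketch(data) -> BinaryIO:
    """Illustrative stand-in: normalize several input types to a binary stream."""
    if isinstance(data, (str, Path)):
        # Treat strings and Path objects as file paths; the caller closes the file.
        return open(data, "rb")
    if isinstance(data, bytes):
        # Wrap raw bytes in an in-memory stream.
        return io.BytesIO(data)
    if hasattr(data, "read"):
        # io.BytesIO, io.BufferedReader, and other binary streams pass through.
        return data
    raise ValueError(f"Unsupported input type: {type(data).__name__}")


stream = make_stream_sketch(b"\x00\x01\x02")
print(stream.read())  # b'\x00\x01\x02'
```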

nemo_curator.utils.decoder_utils.decode_video_cpu(
    data: pathlib.Path | str | typing.BinaryIO | bytes,
    sample_rate_fps: float,
    timestamps: numpy.typing.NDArray[numpy.float32] | None = None,
    start: float | None = None,
    stop: float | None = None,
    endpoint: bool = True,
    stream_idx: int = 0,
    video_format: str | None = None,
    num_threads: int = 1
) -> numpy.typing.NDArray[numpy.uint8]

Decode video frames from a binary stream using PyAV with configurable frame rate sampling.

This function decodes video frames from a binary stream at a specified frame rate. The frame rate does not need to match the input video’s frame rate. It is possible to supersample a video as well as undersample.

Parameters:

data

Path | str | BinaryIO | bytes

An open file, io.BytesIO, or bytes object with the video data.

sample_rate_fps

float

Frame rate for sampling the video

timestamps

npt.NDArray[np.float32] | None, defaults to None

Optional array of presentation timestamps for each frame in the video. If supplied, this array must be monotonically increasing. If not supplied, timestamps will be extracted from the video stream.

start

float | None, defaults to None

Optional start timestamp for frame extraction. If None, the first frame timestamp is used.

stop

float | None, defaults to None

Optional end timestamp for frame extraction. If None, the last frame timestamp is used.

endpoint

bool, defaults to True

If True, stop is the last sample. Otherwise, it is not included. Default is True.

stream_idx

int, defaults to 0

PyAV index of the video stream to decode, usually 0.

video_format

str | None, defaults to None

Format of the video stream, like “mp4”, “mkv”, etc. Leaving this as None is usually best.

num_threads

int, defaults to 1

Number of threads to use for decoding.

Returns: npt.NDArray[np.uint8]

A numpy array of shape (num_frames, height, width, channels) containing the decoded frames.

Raises:

ValueError: If the sampled timestamps differ from source timestamps by more than the specified tolerance

nemo_curator.utils.decoder_utils.decode_video_cpu_frame_ids(
    data: pathlib.Path | str | typing.BinaryIO | bytes,
    frame_ids: numpy.typing.NDArray[numpy.int32],
    counts: numpy.typing.NDArray[numpy.int32] | None = None,
    stream_idx: int = 0,
    video_format: str | None = None,
    num_threads: int = 1
) -> numpy.typing.NDArray[numpy.uint8]

Decode video using PyAV frame ids.

It is not recommended to use this function directly. Instead, use decode_video_cpu, which is timestamp-based. Timestamps are necessary for synchronizing sensors, like multiple cameras, or synchronizing video with GPS and LIDAR.

Parameters:

data

Path | str | BinaryIO | bytes

An open file, io.BytesIO, or bytes object with the video data.

frame_ids

npt.NDArray[np.int32]

List of frame ids to decode.

counts

npt.NDArray[np.int32] | None, defaults to None

List of counts for each frame id. It is possible that a frame id is repeated during supersampling, which can happen in videos with frame drops, or just due to clock drift between sensors.

stream_idx

int, defaults to 0

PyAV index of the video stream to decode, usually 0.

video_format

str | None, defaults to None

Format of the video stream, like “mp4”, “mkv”, etc. Leaving this as None is usually best.

num_threads

int, defaults to 1

Number of threads to use for decoding.

Returns: npt.NDArray[np.uint8]

A numpy array of shape (frame_count, height, width, channels) containing the decoded frames.

nemo_curator.utils.decoder_utils.extract_frames(
    video: pathlib.Path | str | typing.BinaryIO | bytes,
    extraction_policy: nemo_curator.utils.decoder_utils.FrameExtractionPolicy,
    sample_rate_fps: float = 1.0,
    target_res: tuple[int, int] = (-1, -1),
    num_threads: int = 1,
    stream_idx: int = 0,
    video_format: str | None = None
) -> numpy.typing.NDArray[numpy.uint8]

Extract frames from a video into a numpy array.

Parameters:

video

Path | str | BinaryIO | bytes

An open file, io.BytesIO, or bytes object with the video data.

extraction_policy

FrameExtractionPolicy

The policy for extracting frames.

sample_rate_fps

float, defaults to 1.0

Frame rate for sampling the video

target_res

tuple[int, int], defaults to (-1, -1)

The target resolution for the frames.

stream_idx

int, defaults to 0

PyAV index of the video stream to decode, usually 0.

video_format

str | None, defaults to None

Format of the video stream, like “mp4”, “mkv”, etc. Leaving this as None is usually best.

num_threads

int, defaults to 1

Number of threads to use for decoding.

Returns: npt.NDArray[np.uint8]

A numpy array of shape (num_frames, height, width, 3) containing the decoded frames.

nemo_curator.utils.decoder_utils.extract_video_metadata(
    video: str | bytes
) -> nemo_curator.utils.decoder_utils.VideoMetadata

Extract metadata from a video file using ffprobe.

Parameters:

video

str | bytes

Path to video file or video data as bytes.

Returns: VideoMetadata

VideoMetadata object containing video properties.
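
To make the ffprobe-based extraction more concrete, the sketch below parses ffprobe's JSON output (as produced with `-print_format json -show_streams -show_format`) into the fields that `VideoMetadata` exposes. The sample JSON is invented for illustration, `parse_ffprobe` is a hypothetical helper, and the field mapping is an assumption, in particular that `bit_rate_k` holds the container bit rate in kilobits per second.

```python
import json
from fractions import Fraction

# Invented sample in the shape of ffprobe's JSON output.
SAMPLE_FFPROBE_JSON = """
{
  "streams": [
    {"codec_type": "video", "codec_name": "h264", "width": 1920, "height": 1080,
     "pix_fmt": "yuv420p", "avg_frame_rate": "30000/1001", "nb_frames": "300"},
    {"codec_type": "audio", "codec_name": "aac"}
  ],
  "format": {"duration": "10.010000", "bit_rate": "4000000"}
}
"""


def parse_ffprobe(text: str) -> dict:
    """Hypothetical helper: map ffprobe JSON to VideoMetadata-style fields."""
    info = json.loads(text)
    video = next(s for s in info["streams"] if s["codec_type"] == "video")
    audio = next((s for s in info["streams"] if s["codec_type"] == "audio"), None)
    return {
        "height": int(video["height"]),
        "width": int(video["width"]),
        # ffprobe reports frame rates as rational strings such as "30000/1001".
        "fps": float(Fraction(video["avg_frame_rate"])),
        "num_frames": int(video["nb_frames"]),
        "video_codec": video["codec_name"],
        "pixel_format": video["pix_fmt"],
        "video_duration": float(info["format"]["duration"]),
        "audio_codec": audio["codec_name"] if audio else None,
        # Assumption: bit_rate_k is the container bit rate in kbit/s.
        "bit_rate_k": int(info["format"]["bit_rate"]) // 1000,
    }


meta = parse_ffprobe(SAMPLE_FFPROBE_JSON)
print(meta["width"], meta["height"], round(meta["fps"], 2))
```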

nemo_curator.utils.decoder_utils.find_closest_indices(
    src: numpy.typing.NDArray[numpy.float32],
    dst: numpy.typing.NDArray[numpy.float32]
) -> numpy.typing.NDArray[numpy.int32]

Find the closest indices in src to each element in dst.

If an element in dst is equidistant from two elements in src, the left index in src is used.

Parameters:

src

npt.NDArray[np.float32]

Monotonically increasing array of numbers to match dst against

dst

npt.NDArray[np.float32]

Monotonically increasing array of numbers to search for in src

Returns: npt.NDArray[np.int32]

Array of closest indices in src for each element in dst
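
The closest-index search with its left-biased tie-breaking can be sketched in pure Python as below. This is not the library's implementation (which operates on numpy arrays); `find_closest_indices_sketch` is an illustrative name, and the sketch assumes `src` is non-empty and sorted.

```python
from bisect import bisect_left


def find_closest_indices_sketch(src: list[float], dst: list[float]) -> list[int]:
    """Illustrative stand-in: index of the closest src element for each dst value."""
    result = []
    for value in dst:
        right = bisect_left(src, value)  # first index with src[right] >= value
        if right == 0:
            result.append(0)
        elif right == len(src):
            result.append(len(src) - 1)
        else:
            left = right - 1
            # "<=" keeps the left index when value is equidistant from both neighbors.
            if value - src[left] <= src[right] - value:
                result.append(left)
            else:
                result.append(right)
    return result


print(find_closest_indices_sketch([0.0, 1.0, 2.0], [0.4, 0.5, 1.6, 5.0]))  # [0, 0, 2, 2]
```

In the example, 0.5 is equidistant from 0.0 and 1.0, so the left index 0 is returned; 5.0 lies past the end of `src`, so it maps to the last index.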

nemo_curator.utils.decoder_utils.get_avg_frame_rate(
    data: pathlib.Path | str | typing.BinaryIO | bytes,
    stream_idx: int = 0,
    video_format: str | None = None
) -> float

Get the average frame rate of a video.

Parameters:

data

Path | str | BinaryIO | bytes

An open file, io.BytesIO, or bytes object with the video data.

stream_idx

int, defaults to 0

Index of the video stream to decode, usually 0.

video_format

str | None, defaults to None

Format of the video stream, like “mp4”, “mkv”, etc. Leaving this as None is usually best.

Returns: float

The average frame rate of the video.

nemo_curator.utils.decoder_utils.get_frame_count(
    data: pathlib.Path | str | typing.BinaryIO | bytes,
    stream_idx: int = 0,
    video_format: str | None = None
) -> int

Get the total number of frames in a video file or stream.

Parameters:

data

Path | str | BinaryIO | bytes

An open file, io.BytesIO, or bytes object with the video data.

stream_idx

int, defaults to 0

Index of the video stream to read from. Defaults to 0, which is typically the main video stream.

video_format

str | None, defaults to None

Format of the video stream, like “mp4”, “mkv”, etc. Leaving this as None is usually best.

Returns: int

The total number of frames in the video stream.

nemo_curator.utils.decoder_utils.get_video_timestamps(
    data: pathlib.Path | str | typing.BinaryIO | bytes,
    stream_idx: int = 0,
    video_format: str | None = None
) -> numpy.typing.NDArray[numpy.float32]

Get timestamps for all frames in a video stream.

The file position will be moved as needed to get the timestamps.

Note: the order in which frames appear in a video stream is not necessarily the order in which they are displayed, so timestamps are not guaranteed to be monotonically increasing within a stream. This can happen when B-frames are present.

This function will return presentation timestamps in monotonically increasing order.

Parameters:

data

Path | str | BinaryIO | bytes

An open file, io.BytesIO, or bytes object with the video data.

stream_idx

int, defaults to 0

PyAV index of the video stream to decode, usually 0.

video_format

str | None, defaults to None

Format of the video stream, like “mp4”, “mkv”, etc. Leaving this as None is usually best.

Returns: npt.NDArray[np.float32]

A numpy array of monotonically increasing timestamps.
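
The difference between decode order and presentation order can be illustrated with a small sketch. The PTS values and time base below are invented for illustration; the example only shows why sorting is needed and does not reproduce how the library reads packets.

```python
from fractions import Fraction

# Hypothetical presentation timestamps (in ticks) as they appear in decode order
# for a stream with B-frames: I, P, B, B. The P-frame is stored before the
# B-frames that are displayed ahead of it.
decode_order_pts = [0, 3003, 1001, 2002]
time_base = Fraction(1, 30000)  # seconds per tick, illustrative value

# Convert ticks to seconds and sort into presentation order.
timestamps = sorted(float(pts * time_base) for pts in decode_order_pts)
print(timestamps)  # roughly [0.0, 0.0334, 0.0667, 0.1001]
```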

nemo_curator.utils.decoder_utils.sample_closest(
    src: numpy.typing.NDArray[numpy.float32],
    sample_rate: float,
    start: float | None = None,
    stop: float | None = None,
    endpoint: bool = True,
    dedup: bool = True
) -> tuple[numpy.typing.NDArray[numpy.int32], numpy.typing.NDArray[numpy.int32], numpy.typing.NDArray[numpy.float32]]

Sample src at sample_rate rate and return the closest indices.

This function is meant to be used for sampling monotonically increasing numbers, like timestamps. This function can be used for synchronizing sensors, like multiple cameras, or synchronizing video with GPS and LIDAR.

The first element sampled will be either src[0] or the element closest to start.

The last element sampled will be either src[-1] or the element closest to stop. The last element is only included if it fits into the sampling rate and endpoint=True.

This function intentionally has no policy about distance from the closest elements in src to the sample elements. It will return the index of the closest element to the sample. It is up to the caller to define policy, which is why sample_elements is returned.

Parameters:

src

npt.NDArray[np.float32]

Monotonically increasing array of elements

sample_rate

float

Sampling rate

start

float | None, defaults to None

Start element (defaults to first element)

stop

float | None, defaults to None

End element (defaults to last element)

endpoint

bool, defaults to True

If True, stop can be the last sample, if it fits into the sample rate. If False, stop is not included in the output.

dedup

bool, defaults to True

Whether to deduplicate indices. Repeated indices will be reflected in the returned counts array.

Returns: tuple[npt.NDArray[np.int32], npt.NDArray[np.int32], npt.NDArray[np.float32]]

Tuple of (indices, counts, sample_elements), where counts[i] is the number of times indices[i] was sampled.
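
A pure-Python sketch of this kind of sampling is shown below, under stated assumptions: sample points are spaced 1 / sample_rate apart starting at start, a small epsilon guards against floating-point rounding, and with dedup=False every count is 1. The names `closest_index` and `sample_closest_sketch` are illustrative; the library's numpy implementation may differ in these details.

```python
import math
from bisect import bisect_left


def closest_index(src, value):
    """Index of the src element closest to value; ties go to the left."""
    right = bisect_left(src, value)
    if right == 0:
        return 0
    if right == len(src):
        return len(src) - 1
    left = right - 1
    return left if value - src[left] <= src[right] - value else right


def sample_closest_sketch(src, sample_rate, start=None, stop=None,
                          endpoint=True, dedup=True):
    start = src[0] if start is None else start
    stop = src[-1] if stop is None else stop
    # Number of sample points that fit in [start, stop].
    n = math.floor((stop - start) * sample_rate + 1e-9) + 1
    samples = [start + i / sample_rate for i in range(n)]
    # With endpoint=False, drop a final sample that lands on stop.
    if not endpoint and samples and math.isclose(samples[-1], stop):
        samples.pop()
    raw = [closest_index(src, s) for s in samples]
    if not dedup:
        return raw, [1] * len(raw), samples
    # Collapse runs of repeated indices into (index, count) pairs.
    indices, counts = [], []
    for idx in raw:
        if indices and indices[-1] == idx:
            counts[-1] += 1
        else:
            indices.append(idx)
            counts.append(1)
    return indices, counts, samples


# Supersample a 2 fps timeline at 4 samples per second.
indices, counts, samples = sample_closest_sketch([0.0, 0.5, 1.0], 4.0)
print(indices, counts)  # [0, 1, 2] [2, 2, 1]
```

Because the sample points are returned alongside the indices, a caller can compare each sample against `src[index]` and apply its own tolerance policy.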

nemo_curator.utils.decoder_utils.save_stream_position(
    stream: typing.BinaryIO
) -> collections.abc.Generator[typing.BinaryIO, None, None]

Context manager that saves and restores stream position.
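
A context manager of this kind can be sketched with `contextlib.contextmanager`, as below. This is a minimal illustration using the standard library, with the name `save_position_sketch` chosen for the example; it is not copied from the library.

```python
import io
from collections.abc import Generator
from contextlib import contextmanager
from typing import BinaryIO


@contextmanager
def save_position_sketch(stream: BinaryIO) -> Generator[BinaryIO, None, None]:
    """Illustrative stand-in: restore the stream position on exit."""
    pos = stream.tell()  # remember where the caller was
    try:
        yield stream
    finally:
        stream.seek(pos)  # restore even if the body raised


buf = io.BytesIO(b"0123456789")
buf.seek(4)
with save_position_sketch(buf) as s:
    s.read()  # moves the position to the end of the buffer
print(buf.tell())  # 4
```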