nemo_curator.utils.decoder_utils

Module Contents

Classes

Name | Description
--- | ---
FrameExtractionPolicy | Policy for extracting frames from video content.
FrameExtractionSignature | Configuration for frame extraction parameters.
FramePurpose | Purpose for extracting frames from video content.
Resolution | Container for video frame dimensions.
VideoMetadata | Metadata for video content including dimensions, timing, and codec information.

Functions

Name | Description
--- | ---
_make_video_stream | Convert various input types into a binary stream for video processing.
decode_video_cpu | Decode video frames from a binary stream using PyAV with configurable frame rate sampling.
decode_video_cpu_frame_ids | Decode video using PyAV frame ids.
extract_frames | Extract frames from a video into a numpy array.
extract_video_metadata | Extract metadata from a video file using ffprobe.
find_closest_indices | Find the closest indices in src to each element in dst.
get_avg_frame_rate | Get the average frame rate of a video.
get_frame_count | Get the total number of frames in a video file or stream.
get_video_timestamps | Get timestamps for all frames in a video stream.
sample_closest | Sample src at sample_rate rate and return the closest indices.
save_stream_position | Context manager that saves and restores stream position.

API

class nemo_curator.utils.decoder_utils.FrameExtractionPolicy

Bases: enum.Enum

Policy for extracting frames from video content.

This enum defines different strategies for selecting frames from a video, including first frame, middle frame, last frame, or a sequence of frames.

first
= 0
last
= 2
middle
= 1
sequence
= 3
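
To make the four policies concrete, here is a minimal sketch of how a policy could map to frame indices. The enum is a local mirror of the documented values so the snippet runs on its own. The helper `select_frame_indices` and its index choices (for example, `num_frames // 2` for the middle frame) are assumptions for illustration, not the library's implementation.

```python
from enum import Enum


class FrameExtractionPolicy(Enum):
    """Local mirror of the documented enum, so the sketch runs standalone."""

    first = 0
    middle = 1
    last = 2
    sequence = 3


def select_frame_indices(policy: FrameExtractionPolicy, num_frames: int) -> list[int]:
    """Hypothetical helper: map a policy to frame indices (assumed semantics)."""
    if num_frames <= 0:
        return []
    if policy is FrameExtractionPolicy.first:
        return [0]
    if policy is FrameExtractionPolicy.middle:
        return [num_frames // 2]
    if policy is FrameExtractionPolicy.last:
        return [num_frames - 1]
    # sequence: every frame, in order
    return list(range(num_frames))


print(select_frame_indices(FrameExtractionPolicy.middle, 10))  # [5]
```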
class nemo_curator.utils.decoder_utils.FrameExtractionSignature(
extraction_policy: nemo_curator.utils.decoder_utils.FrameExtractionPolicy,
target_fps: float
)
Dataclass

Configuration for frame extraction parameters.

This class combines extraction policy and target frame rate into a single signature that can be used to identify and reproduce frame extraction settings.

extraction_policy
FrameExtractionPolicy
target_fps
float
nemo_curator.utils.decoder_utils.FrameExtractionSignature.to_str() -> str

Convert frame extraction signature to string format.

Returns: str

String representation of extraction policy and target FPS.

class nemo_curator.utils.decoder_utils.FramePurpose

Bases: enum.Enum

Purpose for extracting frames from video content.

This enum defines different purposes for extracting frames from a video, including aesthetics and embeddings.

AESTHETICS
= 1
EMBEDDINGS
= 2
class nemo_curator.utils.decoder_utils.Resolution()

Bases: NamedTuple

Container for video frame dimensions.

This class stores the height and width of video frames as a named tuple.

height
int
width
int
class nemo_curator.utils.decoder_utils.VideoMetadata(
height: int = None,
width: int = None,
fps: float = None,
num_frames: int = None,
video_codec: str = None,
pixel_format: str = None,
video_duration: float = None,
audio_codec: str = None,
bit_rate_k: int = None
)
Dataclass

Metadata for video content including dimensions, timing, and codec information.

This class stores essential video properties such as resolution, frame rate, duration, and encoding details.

audio_codec
str = None
bit_rate_k
int = None
fps
float = None
height
int = None
num_frames
int = None
pixel_format
str = None
video_codec
str = None
video_duration
float = None
width
int = None
nemo_curator.utils.decoder_utils._make_video_stream(
data: pathlib.Path | str | typing.BinaryIO | bytes | io.BytesIO | io.BufferedReader
) -> typing.BinaryIO

Convert various input types into a binary stream for video processing.

This function handles different input types that could represent video data and converts them into a consistent BinaryIO interface that can be used for video processing operations.

Parameters:

data
Path | str | BinaryIO | bytes | io.BytesIO | io.BufferedReader

The input video data, which can be one of:

  • Path: A path to a video file
  • bytes: Raw video data in bytes
  • io.BytesIO: An in-memory binary stream
  • io.BufferedReader: A buffered binary file reader
  • BinaryIO: Any binary stream

Returns: BinaryIO

A binary stream containing the video data

Raises:

  • ValueError: If the input type is not one of the supported types
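
The normalization described above can be sketched with the standard library alone. `to_binary_stream` is a hypothetical stand-in, not the private function itself, and it assumes that a `str` is treated as a filesystem path, which the parameter list above does not state explicitly.

```python
import io
from pathlib import Path
from typing import BinaryIO


def to_binary_stream(data) -> BinaryIO:
    """Hypothetical stand-in: turn several input types into one binary stream."""
    if isinstance(data, (str, Path)):
        # Assumption: a str is interpreted as a path to a video file.
        return Path(data).open("rb")
    if isinstance(data, bytes):
        # Raw bytes are wrapped in an in-memory stream.
        return io.BytesIO(data)
    if isinstance(data, io.IOBase):
        # io.BytesIO, io.BufferedReader and other streams pass through unchanged.
        return data
    raise ValueError(f"Unsupported input type: {type(data).__name__}")


stream = to_binary_stream(b"\x00\x01\x02")
print(stream.read())  # b'\x00\x01\x02'
```

Returning existing streams unchanged means the caller keeps ownership of them, while streams opened from a path must be closed by whoever called the helper.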
nemo_curator.utils.decoder_utils.decode_video_cpu(
data: pathlib.Path | str | typing.BinaryIO | bytes,
sample_rate_fps: float,
timestamps: numpy.typing.NDArray[numpy.float32] | None = None,
start: float | None = None,
stop: float | None = None,
endpoint: bool = True,
stream_idx: int = 0,
video_format: str | None = None,
num_threads: int = 1
) -> numpy.typing.NDArray[numpy.uint8]

Decode video frames from a binary stream using PyAV with configurable frame rate sampling.

This function decodes video frames from a binary stream at a specified frame rate. The frame rate does not need to match the input video’s frame rate. It is possible to supersample a video as well as undersample.

Parameters:

data
Path | str | BinaryIO | bytes

An open file, io.BytesIO, or bytes object with the video data.

sample_rate_fps
float

Frame rate, in frames per second, at which to sample the video.

timestamps
npt.NDArray[np.float32] | None, defaults to None

Optional array of presentation timestamps for each frame in the video. If supplied, this array must be monotonically increasing. If not supplied, timestamps will be extracted from the video stream.

start
float | None, defaults to None

Optional start timestamp for frame extraction. If None, the first frame timestamp is used.

stop
float | None, defaults to None

Optional end timestamp for frame extraction. If None, the last frame timestamp is used.

endpoint
bool, defaults to True

If True, stop is the last sample. Otherwise, it is not included.

stream_idx
int, defaults to 0

PyAV index of the video stream to decode, usually 0.

video_format
str | None, defaults to None

Format of the video stream, such as “mp4” or “mkv”. Leaving this as None is usually best.

num_threads
int, defaults to 1

Number of threads to use for decoding.

Returns: npt.NDArray[np.uint8]

A numpy array of shape (num_frames, height, width, channels) containing the decoded frames.

Raises:

  • ValueError: If the sampled timestamps differ from source timestamps by more than the specified tolerance
nemo_curator.utils.decoder_utils.decode_video_cpu_frame_ids(
data: pathlib.Path | str | typing.BinaryIO | bytes,
frame_ids: numpy.typing.NDArray[numpy.int32],
counts: numpy.typing.NDArray[numpy.int32] | None = None,
stream_idx: int = 0,
video_format: str | None = None,
num_threads: int = 1
) -> numpy.typing.NDArray[numpy.uint8]

Decode video using PyAV frame ids.

It is not recommended to use this function directly. Instead, use decode_video_cpu, which is timestamp-based. Timestamps are necessary for synchronizing sensors, like multiple cameras, or synchronizing video with GPS and LIDAR.

Parameters:

data
Path | str | BinaryIO | bytes

An open file, io.BytesIO, or bytes object with the video data.

frame_ids
npt.NDArray[np.int32]

List of frame ids to decode.

counts
npt.NDArray[np.int32] | None, defaults to None

List of counts for each frame id. It is possible that a frame id is repeated during supersampling, which can happen in videos with frame drops, or just due to clock drift between sensors.

stream_idx
int, defaults to 0

PyAV index of the video stream to decode, usually 0.

video_format
str | None, defaults to None

Format of the video stream, such as “mp4” or “mkv”. Leaving this as None is usually best.

num_threads
int, defaults to 1

Number of threads to use for decoding.

Returns: npt.NDArray[np.uint8]

A numpy array of shape (frame_count, height, width, channels) containing the decoded frames.
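
The relationship between `frame_ids` and `counts` can be illustrated with plain lists. This sketch assumes that a frame with count `n` appears `n` times in a row in the decoded output; `expand_frame_ids` is a hypothetical helper, not part of the module.

```python
def expand_frame_ids(frame_ids: list[int], counts: list[int]) -> list[int]:
    """Repeat each frame id by its count, preserving order (assumed semantics)."""
    if len(frame_ids) != len(counts):
        raise ValueError("frame_ids and counts must have the same length")
    return [fid for fid, n in zip(frame_ids, counts) for _ in range(n)]


# Frame 1 was selected twice, for example because the source dropped a frame.
print(expand_frame_ids([0, 1, 2], [1, 2, 1]))  # [0, 1, 1, 2]
```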

nemo_curator.utils.decoder_utils.extract_frames(
video: pathlib.Path | str | typing.BinaryIO | bytes,
extraction_policy: nemo_curator.utils.decoder_utils.FrameExtractionPolicy,
sample_rate_fps: float = 1.0,
target_res: tuple[int, int] = (-1, -1),
num_threads: int = 1,
stream_idx: int = 0,
video_format: str | None = None
) -> numpy.typing.NDArray[numpy.uint8]

Extract frames from a video into a numpy array.

Parameters:

video
Path | str | BinaryIO | bytes

An open file, io.BytesIO, or bytes object with the video data.

extraction_policy
FrameExtractionPolicy

The policy for extracting frames.

sample_rate_fps
float, defaults to 1.0

Frame rate, in frames per second, at which to sample the video.

target_res
tuple[int, int], defaults to (-1, -1)

The target resolution for the frames.

stream_idx
int, defaults to 0

PyAV index of the video stream to decode, usually 0.

video_format
str | None, defaults to None

Format of the video stream, such as “mp4” or “mkv”. Leaving this as None is usually best.

num_threads
int, defaults to 1

Number of threads to use for decoding.

Returns: npt.NDArray[np.uint8]

A numpy array of shape (num_frames, height, width, 3) containing the decoded frames.

nemo_curator.utils.decoder_utils.extract_video_metadata(
video: str | bytes
) -> nemo_curator.utils.decoder_utils.VideoMetadata

Extract metadata from a video file using ffprobe.

Parameters:

video
str | bytes

Path to video file or video data as bytes.

Returns: VideoMetadata

VideoMetadata object containing video properties.
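
As a rough illustration of what an ffprobe-based extraction involves, the sketch below parses JSON of the kind produced by `ffprobe -v error -print_format json -show_format -show_streams <file>`. The sample JSON is hand-written for illustration, and `VideoMetadataSketch` and `parse_ffprobe_json` are hypothetical names. The mapping of ffprobe fields to metadata fields, including `bit_rate_k` as the container bit rate divided by 1000, is an assumption rather than the library's documented behavior.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class VideoMetadataSketch:
    """Hypothetical stand-in mirroring the documented VideoMetadata fields."""

    height: Optional[int] = None
    width: Optional[int] = None
    fps: Optional[float] = None
    num_frames: Optional[int] = None
    video_codec: Optional[str] = None
    pixel_format: Optional[str] = None
    video_duration: Optional[float] = None
    audio_codec: Optional[str] = None
    bit_rate_k: Optional[int] = None


def parse_ffprobe_json(text: str) -> VideoMetadataSketch:
    info = json.loads(text)
    streams = info.get("streams", [])
    video = next((s for s in streams if s.get("codec_type") == "video"), {})
    audio = next((s for s in streams if s.get("codec_type") == "audio"), {})
    meta = VideoMetadataSketch(
        height=video.get("height"),
        width=video.get("width"),
        video_codec=video.get("codec_name"),
        pixel_format=video.get("pix_fmt"),
        audio_codec=audio.get("codec_name"),
    )
    # ffprobe reports rates as fractions such as "30000/1001".
    num, _, den = video.get("avg_frame_rate", "").partition("/")
    if num and den and int(den) != 0:
        meta.fps = int(num) / int(den)
    # Most numeric values arrive as strings and need explicit conversion.
    if "nb_frames" in video:
        meta.num_frames = int(video["nb_frames"])
    if "duration" in video:
        meta.video_duration = float(video["duration"])
    bit_rate = info.get("format", {}).get("bit_rate")
    if bit_rate is not None:
        meta.bit_rate_k = int(bit_rate) // 1000  # assumption: kilobits per second
    return meta


SAMPLE = """
{
  "streams": [
    {"codec_type": "video", "codec_name": "h264", "width": 1280, "height": 720,
     "pix_fmt": "yuv420p", "avg_frame_rate": "30/1", "nb_frames": "90",
     "duration": "3.000000"},
    {"codec_type": "audio", "codec_name": "aac"}
  ],
  "format": {"bit_rate": "1200000"}
}
"""

print(parse_ffprobe_json(SAMPLE))
```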

nemo_curator.utils.decoder_utils.find_closest_indices(
src: numpy.typing.NDArray[numpy.float32],
dst: numpy.typing.NDArray[numpy.float32]
) -> numpy.typing.NDArray[numpy.int32]

Find the closest indices in src to each element in dst.

If an element in dst is equidistant from two elements in src, the left index in src is used.

Parameters:

src
npt.NDArray[np.float32]

Monotonically increasing array of numbers to match dst against

dst
npt.NDArray[np.float32]

Monotonically increasing array of numbers to search for in src

Returns: npt.NDArray[np.int32]

Array of closest indices in src for each element in dst
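
The matching rule, including the tie-break toward the left index, can be sketched in pure Python with `bisect`. `closest_indices` is a hypothetical re-implementation for illustration only; it works on lists rather than numpy arrays and is not the library's code.

```python
from bisect import bisect_left


def closest_indices(src: list[float], dst: list[float]) -> list[int]:
    """For each value in dst, return the index of the nearest value in sorted src."""
    if not src:
        raise ValueError("src must not be empty")
    result = []
    for value in dst:
        right = bisect_left(src, value)  # first index with src[right] >= value
        if right == 0:
            result.append(0)
        elif right == len(src):
            result.append(len(src) - 1)
        else:
            left = right - 1
            # "<=" sends equidistant values to the left neighbour.
            result.append(left if value - src[left] <= src[right] - value else right)
    return result


print(closest_indices([0.0, 1.0, 2.0], [0.4, 0.5, 1.6, 5.0]))  # [0, 0, 2, 2]
```

The value 0.5 is equidistant from 0.0 and 1.0 and resolves to index 0, while 5.0 lies past the end of src and clamps to the last index.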

nemo_curator.utils.decoder_utils.get_avg_frame_rate(
data: pathlib.Path | str | typing.BinaryIO | bytes,
stream_idx: int = 0,
video_format: str | None = None
) -> float

Get the average frame rate of a video.

Parameters:

data
Path | str | BinaryIO | bytes

An open file, io.BytesIO, or bytes object with the video data.

stream_idx
int, defaults to 0

Index of the video stream to decode, usually 0.

video_format
str | None, defaults to None

Format of the video stream, such as “mp4” or “mkv”. Leaving this as None is usually best.

Returns: float

The average frame rate of the video.

nemo_curator.utils.decoder_utils.get_frame_count(
data: pathlib.Path | str | typing.BinaryIO | bytes,
stream_idx: int = 0,
video_format: str | None = None
) -> int

Get the total number of frames in a video file or stream.

Parameters:

data
Path | str | BinaryIO | bytes

An open file, io.BytesIO, or bytes object with the video data.

stream_idx
int, defaults to 0

Index of the video stream to read from. Defaults to 0, which is typically the main video stream.

video_format
str | None, defaults to None

Format of the video stream, such as “mp4” or “mkv”. Leaving this as None is usually best.

Returns: int

The total number of frames in the video stream.

nemo_curator.utils.decoder_utils.get_video_timestamps(
data: pathlib.Path | str | typing.BinaryIO | bytes,
stream_idx: int = 0,
video_format: str | None = None
) -> numpy.typing.NDArray[numpy.float32]

Get timestamps for all frames in a video stream.

The file position will be moved as needed to get the timestamps.

Note: the order in which frames appear in a video stream is not necessarily the order in which they are displayed, so timestamps are not necessarily monotonically increasing within a video stream. This can happen when B-frames are present.

This function will return presentation timestamps in monotonically increasing order.

Parameters:

data
Path | str | BinaryIO | bytes

An open file, io.BytesIO, or bytes object with the video data.

stream_idx
int, defaults to 0

PyAV index of the video stream to decode, usually 0.

video_format
str | None, defaults to None

Format of the video stream, such as “mp4” or “mkv”. Leaving this as None is usually best.

Returns: npt.NDArray[np.float32]

A numpy array of monotonically increasing timestamps.
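
The difference between stream order and presentation order can be shown with a toy example. The frame types and timestamps below are invented for illustration; sorting is one simple way to obtain monotonically increasing presentation timestamps and is not necessarily how the function does it.

```python
# Illustrative frames in decode (stream) order: (frame type, presentation time in seconds).
# The P-frame is stored before the two B-frames that reference it.
decode_order = [("I", 0.0), ("P", 0.75), ("B", 0.25), ("B", 0.5)]


def is_monotonic(values: list[float]) -> bool:
    """True if every value is strictly greater than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


stream_pts = [pts for _, pts in decode_order]
presentation_pts = sorted(stream_pts)

print(stream_pts, is_monotonic(stream_pts))  # [0.0, 0.75, 0.25, 0.5] False
print(presentation_pts, is_monotonic(presentation_pts))  # [0.0, 0.25, 0.5, 0.75] True
```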

nemo_curator.utils.decoder_utils.sample_closest(
src: numpy.typing.NDArray[numpy.float32],
sample_rate: float,
start: float | None = None,
stop: float | None = None,
endpoint: bool = True,
dedup: bool = True
) -> tuple[numpy.typing.NDArray[numpy.int32], numpy.typing.NDArray[numpy.int32], numpy.typing.NDArray[numpy.float32]]

Sample src at sample_rate rate and return the closest indices.

This function is meant to be used for sampling monotonically increasing numbers, like timestamps. This function can be used for synchronizing sensors, like multiple cameras, or synchronizing video with GPS and LIDAR.

The first element sampled will be either src[0] or the element closest to start.

The last element sampled will be either src[-1] or the element closest to stop. The last element is only included if it fits into the sampling rate and endpoint=True.

This function intentionally has no policy about distance from the closest elements in src to the sample elements. It will return the index of the closest element to the sample. It is up to the caller to define policy, which is why sample_elements is returned.

Parameters:

src
npt.NDArray[np.float32]

Monotonically increasing array of elements

sample_rate
float

Sampling rate

start
float | None, defaults to None

Start element (defaults to first element)

stop
float | None, defaults to None

End element (defaults to last element)

endpoint
bool, defaults to True

If True, stop can be the last sample, if it fits into the sample rate. If False, stop is not included in the output.

dedup
bool, defaults to True

Whether to deduplicate indices. Repeated indices will be reflected in the returned counts array.

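Returns: tuple[npt.NDArray[np.int32], npt.NDArray[np.int32], npt.NDArray[np.float32]]

Tuple of (indices, counts, sample_elements), where counts[i] is the number of times indices[i] was sampled and sample_elements holds the sampled values.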
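
The sampling behavior described above can be approximated in pure Python. `sample_closest_sketch` and `_nearest` are hypothetical re-implementations on lists, and several details are assumptions: samples are spaced `1 / sample_rate` apart beginning at `start`, a small epsilon guards the step count against rounding, and `dedup` merges consecutive repeated indices into a count.

```python
import math
from bisect import bisect_left


def _nearest(src: list[float], value: float) -> int:
    """Index of the element of sorted src closest to value (ties go left)."""
    right = bisect_left(src, value)
    if right == 0:
        return 0
    if right == len(src):
        return len(src) - 1
    left = right - 1
    return left if value - src[left] <= src[right] - value else right


def sample_closest_sketch(src, sample_rate, start=None, stop=None, endpoint=True, dedup=True):
    """Hypothetical sketch; returns (indices, counts, sample_elements)."""
    start = src[0] if start is None else start
    stop = src[-1] if stop is None else stop
    # Number of whole sampling steps that fit between start and stop.
    steps = math.floor((stop - start) * sample_rate + 1e-9)
    samples = [start + i / sample_rate for i in range(steps + 1)]
    # Drop the final sample when it lands on stop and endpoint is False.
    if not endpoint and samples and math.isclose(samples[-1], stop):
        samples.pop()
    indices = [_nearest(src, s) for s in samples]
    if not dedup:
        return indices, [1] * len(indices), samples
    unique, counts = [], []
    for idx in indices:
        if unique and unique[-1] == idx:
            counts[-1] += 1  # same source element sampled again
        else:
            unique.append(idx)
            counts.append(1)
    return unique, counts, samples


# Supersampling: timestamps of a 2 fps source sampled at 4 fps.
indices, counts, samples = sample_closest_sketch([0.0, 0.5, 1.0], sample_rate=4.0)
print(indices, counts)  # [0, 1, 2] [2, 2, 1]
print(samples)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Because the sketch also returns the sample positions, a caller can compare them with `src` and reject matches that are too far apart, which is the kind of policy the function leaves to the caller.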

nemo_curator.utils.decoder_utils.save_stream_position(
stream: typing.BinaryIO
) -> collections.abc.Generator[typing.BinaryIO, None, None]

Context manager that saves and restores stream position.
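
A context manager of this kind can be sketched in a few lines. `keep_position` is a hypothetical equivalent written for illustration; it assumes the position is recorded with `tell()` on entry and restored with `seek()` on exit, even when the body raises.

```python
import io
from collections.abc import Generator
from contextlib import contextmanager
from typing import BinaryIO


@contextmanager
def keep_position(stream: BinaryIO) -> Generator[BinaryIO, None, None]:
    """Hypothetical equivalent: remember tell() on entry, seek() back on exit."""
    position = stream.tell()
    try:
        yield stream
    finally:
        # Restore the position even if the body raised an exception.
        stream.seek(position)


buf = io.BytesIO(b"0123456789")
buf.seek(3)
with keep_position(buf) as s:
    s.read(5)  # moves the position to 8 inside the block
print(buf.tell())  # 3
```

This pattern lets helper functions read from a shared stream without disturbing the position that the caller expects.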