nemo_curator.stages.audio.alm.alm_data_overlap

ALM Data Overlap Stage - Native NeMo Curator Implementation.

Filters overlapping windows based on threshold. Follows the exact pattern from NeMo Curator: https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/audio/common.py

Produces output identical to the SDP implementation.

Module Contents

Classes

| Name | Description |
| --- | --- |
| ALMDataOverlapStage | Filter overlapping ALM windows. |

Functions

| Name | Description |
| --- | --- |
| _calculate_duration_list | Calculate list of durations from windows data. |
| _calculate_timestamps | Calculate (end, start) timestamp pairs from windows data. |
| _calculate_total_dur | Calculate total duration from windows data. |
| _filter_segments | Filter out segments that have overlap greater than threshold. |
| _get_filepath_from_stats | - |
| _get_filtered_windows | Get complete window objects that correspond to filtered timestamps. |
| _overlap_ratio | Calculate overlap ratio between two segments (stored as (end, start) tuples). |
| _process_filtered_dur | Get total duration of qualified segments. |
| _process_filtered_dur_list | Get duration list of qualified segments. |

Data

MAX_OVERLAP_PERCENTAGE

API

class nemo_curator.stages.audio.alm.alm_data_overlap.ALMDataOverlapStage(
name: str = 'alm_data_overlap',
overlap_percentage: int = 0,
target_duration: float = 120.0
)
Dataclass

Bases: ProcessingStage[AudioTask, AudioTask]

Filter overlapping ALM windows.

Removes windows with overlap exceeding the threshold, keeping windows closest to target duration.

name: str = 'alm_data_overlap'
overlap_percentage: int = 0
target_duration: float = 120.0

nemo_curator.stages.audio.alm.alm_data_overlap.ALMDataOverlapStage.__post_init__() -> None

Validate parameters.
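
The specific checks are not listed on this page. As a minimal sketch, assuming the stage rejects an overlap_percentage outside the range 0 to MAX_OVERLAP_PERCENTAGE and a non-positive target_duration, dataclass validation could look like the following. OverlapConfigSketch is a hypothetical stand-in, not the real class, and both rules are assumptions.

```python
from dataclasses import dataclass

# Same value as the module-level constant documented under Data.
MAX_OVERLAP_PERCENTAGE = 100


@dataclass
class OverlapConfigSketch:
    """Hypothetical stand-in for the stage's parameters (not the real class)."""

    overlap_percentage: int = 0
    target_duration: float = 120.0

    def __post_init__(self) -> None:
        # Assumed rule: the percentage must lie in [0, MAX_OVERLAP_PERCENTAGE].
        if not 0 <= self.overlap_percentage <= MAX_OVERLAP_PERCENTAGE:
            raise ValueError(
                f"overlap_percentage must be between 0 and {MAX_OVERLAP_PERCENTAGE}, "
                f"got {self.overlap_percentage}"
            )
        # Assumed rule: the target duration must be strictly positive.
        if self.target_duration <= 0:
            raise ValueError(
                f"target_duration must be positive, got {self.target_duration}"
            )
```

Because a dataclass calls `__post_init__` at the end of the generated `__init__`, an invalid configuration fails at construction time rather than partway through processing.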

nemo_curator.stages.audio.alm.alm_data_overlap.ALMDataOverlapStage._filter_overlaps(
entry: dict[str, typing.Any]
) -> dict[str, typing.Any]

Filter overlapping windows from entry.

nemo_curator.stages.audio.alm.alm_data_overlap.ALMDataOverlapStage.inputs() -> tuple[list[str], list[str]]

nemo_curator.stages.audio.alm.alm_data_overlap.ALMDataOverlapStage.process(
task: nemo_curator.tasks.AudioTask
) -> nemo_curator.tasks.AudioTask

nemo_curator.stages.audio.alm.alm_data_overlap._calculate_duration_list(
windows: list[dict[str, typing.Any]]
) -> list[float]

Calculate list of durations from windows data.

nemo_curator.stages.audio.alm.alm_data_overlap._calculate_timestamps(
windows: list[dict[str, typing.Any]]
) -> list[tuple[float, float]]

Calculate (end, start) timestamp pairs from windows data.

nemo_curator.stages.audio.alm.alm_data_overlap._calculate_total_dur(
windows: list[dict[str, typing.Any]]
) -> float

Calculate total duration from windows data.

nemo_curator.stages.audio.alm.alm_data_overlap._filter_segments(
segments: list[tuple[float, float]],
threshold: float,
target_duration: float
) -> list[tuple[float, float]]

Filter out segments that have overlap greater than threshold.
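
One plausible reading of this filter, combined with the class description ("keeping windows closest to target duration"), is a greedy pass: rank segments by how close their duration is to the target, then keep each one only if it does not overlap an already kept segment by more than the threshold. The sketch below illustrates that idea. The function names, the treatment of threshold as a fraction between 0 and 1, and the overlap definition (intersection divided by the shorter segment) are all assumptions, not the library's confirmed behavior.

```python
Segment = tuple[float, float]  # stored as (end, start), as on this page


def _ratio_sketch(a: Segment, b: Segment) -> float:
    """Assumed overlap measure: intersection length / shorter segment length."""
    intersection = max(0.0, min(a[0], b[0]) - max(a[1], b[1]))
    shorter = min(a[0] - a[1], b[0] - b[1])
    return intersection / shorter if shorter > 0 else 0.0


def filter_segments_sketch(
    segments: list[Segment], threshold: float, target_duration: float
) -> list[Segment]:
    """Greedy filter: prefer segments whose duration is nearest the target."""
    # Rank candidates by distance between their duration and the target.
    ranked = sorted(segments, key=lambda s: abs((s[0] - s[1]) - target_duration))
    kept: list[Segment] = []
    for seg in ranked:
        # Keep a segment only if it overlaps no kept segment beyond the threshold.
        if all(_ratio_sketch(seg, other) <= threshold for other in kept):
            kept.append(seg)
    # Return the survivors in chronological order (by start time).
    return sorted(kept, key=lambda s: s[1])


segments = [(120.0, 0.0), (170.0, 60.0), (250.0, 130.0)]
print(filter_segments_sketch(segments, threshold=0.0, target_duration=120.0))
# [(120.0, 0.0), (250.0, 130.0)]
```

With a threshold of 0.0, the 110-second segment starting at 60.0 is dropped because it overlaps the first segment, which matches the target duration exactly and is therefore ranked ahead of it.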

nemo_curator.stages.audio.alm.alm_data_overlap._get_filepath_from_stats(
stats: dict[str, typing.Any] | None,
key: str
) -> str | None

nemo_curator.stages.audio.alm.alm_data_overlap._get_filtered_windows(
windows: list[dict[str, typing.Any]],
filtered_timestamps: list[tuple[float, float]]
) -> list[dict[str, typing.Any]]

Get complete window objects that correspond to filtered timestamps.
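
The window schema is not documented on this page. Purely for illustration, the sketch below assumes each window is a dict with top-level "start" and "end" keys (a hypothetical layout) and shows how (end, start) pairs could be derived and then used to select the matching window objects. The real windows may be structured differently.

```python
from typing import Any

Segment = tuple[float, float]  # (end, start)


def timestamps_sketch(windows: list[dict[str, Any]]) -> list[Segment]:
    """Derive (end, start) pairs; the 'start'/'end' keys are hypothetical."""
    return [(float(w["end"]), float(w["start"])) for w in windows]


def filtered_windows_sketch(
    windows: list[dict[str, Any]], filtered_timestamps: list[Segment]
) -> list[dict[str, Any]]:
    """Return the full window dicts whose timestamps survived filtering."""
    wanted = set(filtered_timestamps)
    # Pair each window with its timestamp and keep those that are still wanted.
    return [
        window
        for window, ts in zip(windows, timestamps_sketch(windows))
        if ts in wanted
    ]


windows = [
    {"id": "a", "start": 0.0, "end": 120.0},
    {"id": "b", "start": 60.0, "end": 170.0},
]
print(filtered_windows_sketch(windows, [(120.0, 0.0)]))
# [{'id': 'a', 'start': 0.0, 'end': 120.0}]
```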

nemo_curator.stages.audio.alm.alm_data_overlap._overlap_ratio(
seg1: tuple[float, float],
seg2: tuple[float, float]
) -> float

Calculate overlap ratio between two segments (stored as (end, start) tuples).
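
Because segments are stored as (end, start) rather than the more common (start, end), the tuple order is easy to get wrong. The sketch below unpacks in that order and computes an overlap ratio, assuming the ratio is the intersection length divided by the duration of the shorter segment. The normalization actually used by the library is not stated here, so treat the denominator as an assumption.

```python
Segment = tuple[float, float]  # (end, start)


def overlap_ratio_sketch(seg1: Segment, seg2: Segment) -> float:
    """Intersection length over the shorter segment's duration (assumed)."""
    end1, start1 = seg1  # note the (end, start) order
    end2, start2 = seg2
    # The intersection runs from the later start to the earlier end.
    intersection = max(0.0, min(end1, end2) - max(start1, start2))
    shorter = min(end1 - start1, end2 - start2)
    if shorter <= 0:
        # Degenerate (zero-length or inverted) segments cannot overlap.
        return 0.0
    return intersection / shorter


print(overlap_ratio_sketch((10.0, 0.0), (15.0, 5.0)))   # 0.5
print(overlap_ratio_sketch((10.0, 0.0), (30.0, 20.0)))  # 0.0
```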

nemo_curator.stages.audio.alm.alm_data_overlap._process_filtered_dur(
timestamps: list[tuple[float, float]]
) -> float

Get total duration of qualified segments.

nemo_curator.stages.audio.alm.alm_data_overlap._process_filtered_dur_list(
timestamps: list[tuple[float, float]]
) -> list[float]

Get duration list of qualified segments.
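
Given the (end, start) tuple layout, both duration helpers reduce to simple arithmetic. A minimal sketch with hypothetical function names, assuming each duration is end minus start:

```python
Segment = tuple[float, float]  # (end, start)


def duration_list_sketch(timestamps: list[Segment]) -> list[float]:
    """Per-segment durations, computed as end - start."""
    return [end - start for end, start in timestamps]


def total_duration_sketch(timestamps: list[Segment]) -> float:
    """Sum of all segment durations."""
    return sum(duration_list_sketch(timestamps))


kept = [(120.0, 0.0), (250.0, 130.0)]
print(duration_list_sketch(kept))   # [120.0, 120.0]
print(total_duration_sketch(kept))  # 240.0
```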

nemo_curator.stages.audio.alm.alm_data_overlap.MAX_OVERLAP_PERCENTAGE = 100