> ALMDataOverlapStage reference for removing overlapping training windows based on configurable thresholds

# ALM Overlap Filtering

`ALMDataOverlapStage` removes redundant training windows that share too much audio content. When two windows overlap beyond a configurable threshold, the stage keeps the window whose duration is closest to the target and discards the other.

## How it Works

The stage processes each `AudioTask` independently:

1. Extracts the `windows` list produced by `ALMDataBuilderStage`
2. Sorts windows by start time
3. For each window, compares it against every later window that starts before it ends (all pairs that overlap in time, not only adjacent ones) and calculates the overlap ratio: the overlap duration divided by the duration of the shorter window
4. When the overlap ratio meets the threshold, greedily removes the window whose duration is further from `target_duration`
5. Writes filtered results back to the task
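
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the stage's actual implementation: each window is assumed to be a dict with hypothetical `start` and `end` keys in seconds, the helper names `overlap_ratio` and `filter_overlaps` are invented for this example, and a tie in distance from `target_duration` is assumed to favor the earlier window.

```python
def overlap_ratio(a, b):
    """Overlap duration divided by the duration of the shorter window."""
    overlap = min(a["end"], b["end"]) - max(a["start"], b["start"])
    if overlap <= 0:
        return 0.0
    shorter = min(a["end"] - a["start"], b["end"] - b["start"])
    return overlap / shorter


def filter_overlaps(windows, overlap_percentage=0, target_duration=120.0):
    """Greedy overlap filtering over windows sorted by start time."""
    threshold = overlap_percentage / 100.0
    ordered = sorted(windows, key=lambda w: w["start"])
    removed = set()  # indices into `ordered` that have been discarded

    for i, a in enumerate(ordered):
        if i in removed:
            continue
        for j in range(i + 1, len(ordered)):
            b = ordered[j]
            if b["start"] >= a["end"]:
                break  # sorted by start, so no later window overlaps `a`
            if j in removed:
                continue
            if overlap_ratio(a, b) >= threshold:
                dist_a = abs((a["end"] - a["start"]) - target_duration)
                dist_b = abs((b["end"] - b["start"]) - target_duration)
                if dist_a <= dist_b:
                    removed.add(j)  # `a` is at least as close to the target
                else:
                    removed.add(i)  # `b` is closer; stop comparing against `a`
                    break

    return [w for k, w in enumerate(ordered) if k not in removed]


# Illustrative input: the first two windows share 30 seconds of audio.
windows = [
    {"start": 0.0, "end": 120.0},
    {"start": 90.0, "end": 200.0},
    {"start": 300.0, "end": 420.0},
]
print(len(filter_overlaps(windows, overlap_percentage=0)))   # 2
print(len(filter_overlaps(windows, overlap_percentage=50)))  # 3
```

With `overlap_percentage=0`, the 110-second window is dropped because the 120-second window matches the target exactly. With `overlap_percentage=50`, the pair's ratio (30 / 110, about 0.27) is below the threshold, so all three windows are kept.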

## Parameters

| Parameter            | Type  | Default | Description                                                                               |
| -------------------- | ----- | ------- | ----------------------------------------------------------------------------------------- |
| `overlap_percentage` | int   | 0       | Overlap threshold from 0 to 100. Lower values remove more windows.                        |
| `target_duration`    | float | 120.0   | Preferred window duration in seconds; decides which window of an overlapping pair is kept |

### Overlap Percentage Behavior

| Value | Behavior                                                         | Typical Use Case                       |
| ----- | ---------------------------------------------------------------- | -------------------------------------- |
| 0     | Remove any overlapping windows                                   | Maximum deduplication, smallest output |
| 50    | Remove windows with 50% or more overlap                          | Balanced yield and diversity           |
| 100   | Keep all windows except fully contained duplicates (ratio = 1.0) | Minimum filtering, largest output      |
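
As a worked example of the table, consider a pair of 120-second windows that share 60 seconds of audio, which gives an overlap ratio of 0.5. The sketch below assumes the comparison is "ratio greater than or equal to `overlap_percentage / 100`", consistent with the "50% or more" wording above; `meets_threshold` is a hypothetical helper, not part of the library.

```python
def meets_threshold(overlap_seconds, shorter_duration, overlap_percentage):
    """Return True when a pair of windows overlaps enough to trigger removal."""
    ratio = overlap_seconds / shorter_duration
    return ratio >= overlap_percentage / 100.0


# Two 120-second windows sharing 60 seconds: ratio = 0.5
for pct in (0, 50, 100):
    print(pct, meets_threshold(60.0, 120.0, pct))
# 0 True
# 50 True
# 100 False

# A window fully contained in another has ratio = 1.0 and is caught even at 100.
print(meets_threshold(120.0, 120.0, 100))  # True
```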

## Basic Usage

```python
from nemo_curator.stages.audio.alm import ALMDataOverlapStage

# Remove windows with any overlap
overlap_filter = ALMDataOverlapStage(
    overlap_percentage=0,
    target_duration=120.0,
)
```

## Advanced Configuration

### Moderate Filtering

```python
# Remove windows that overlap by 50% or more
overlap_filter = ALMDataOverlapStage(
    overlap_percentage=50,
    target_duration=120.0,
)
```

### Short-Window Pipeline

When the pipeline uses shorter target windows, set `target_duration` to the same target used by `ALMDataBuilderStage`:

```python
overlap_filter = ALMDataOverlapStage(
    overlap_percentage=30,
    target_duration=30.0,  # Match ALMDataBuilderStage target
)
```

## Output Fields

The stage adds the following user-facing fields to each `AudioTask`:

| Field               | Type  | Description                                                 |
| ------------------- | ----- | ----------------------------------------------------------- |
| `filtered_windows`  | list  | Windows that passed overlap filtering                       |
| `filtered_dur`      | float | Total duration of filtered windows in seconds               |
| `filtered_dur_list` | list  | Duration of each individual filtered window                 |
| `total_dur_window`  | float | Total duration of all input windows before filtering        |
| `manifest_filepath` | str   | Source manifest path carried through from the builder stage |

The stage also writes several intermediate fields (`total_dur_list_window`, `total_dur_list_window_timestamps`, `filtered`, `swift_filepath`) that are primarily used for internal bookkeeping. The original `windows` list produced by `ALMDataBuilderStage` is preserved so downstream consumers can compare pre- and post-filter results.

## Tuning the Overlap Threshold

The right threshold depends on your training requirements:

* **For diverse training data**, use a low `overlap_percentage` (0 to 30) to maximize the variety of audio content in the training set
* **For maximum training volume**, use a higher `overlap_percentage` (70 to 100) to retain more windows at the cost of some redundancy
* **For balanced results**, use `overlap_percentage=50` as a starting point and adjust based on the ratio of `filtered_windows` to input `windows`

Monitor the yield by comparing `filtered_dur` to `total_dur_window` in the output.
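
A yield check along these lines can be sketched as follows. It assumes the output fields of one task are available as a plain dict (how you read them from an `AudioTask` depends on your pipeline), and `yield_summary` is a hypothetical helper. The sample values are illustrative, not real results.

```python
def yield_summary(entry):
    """Summarize how much survived overlap filtering for one task's output fields."""
    total = entry["total_dur_window"]
    kept = entry["filtered_dur"]
    return {
        "windows_in": len(entry["windows"]),
        "windows_kept": len(entry["filtered_windows"]),
        # Guard against tasks that produced no windows at all.
        "duration_yield": kept / total if total else 0.0,
    }


# Illustrative values only; window contents are placeholders.
entry = {
    "windows": ["w0", "w1", "w2", "w3"],
    "filtered_windows": ["w0", "w2", "w3"],
    "total_dur_window": 480.0,
    "filtered_dur": 360.0,
}
print(yield_summary(entry))
# {'windows_in': 4, 'windows_kept': 3, 'duration_yield': 0.75}
```

A duration yield that is much lower than expected suggests raising `overlap_percentage`; a yield near 1.0 with visibly redundant windows suggests lowering it.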

## Related Topics

* **[ALM Data Builder](/curate-audio/process-data/alm/data-builder)**: Previous stage in the ALM pipeline
* **[ALM Pipeline Concepts](/about/concepts/audio/alm-pipeline)**: Architectural overview
* **[ALM Tutorial](/curate-audio/tutorials/alm)**: End-to-end walkthrough with sample data