***

description: >-
Calculate precise audio duration using soundfile library for quality
assessment and metadata generation
categories:

* processors
  tags:
* audio-analysis
* duration
* soundfile
* metadata
* quality-control
  personas:
* data-scientist-focused
* mle-focused
  difficulty: intermediate
  content\_type: how-to
  modality: audio-only

***

# Duration Calculation

Calculate precise audio duration using the `soundfile` library for quality assessment and metadata generation in audio curation pipelines.

## Overview

The `GetAudioDurationStage` extracts precise timing information from audio files using the `soundfile` library. This information is essential for quality filtering, dataset analysis, and ensuring consistent audio lengths for training.

## Key Features

* **High Precision**: Uses `soundfile` for frame-accurate duration calculation
* **Format Support**: Works with all audio formats supported by `soundfile` (WAV, FLAC, OGG, and so on)
* **Error Handling**: Returns -1.0 for corrupted or unreadable files
* **Pipeline Integration**: Designed for use in NeMo Curator processing pipelines

## How It Works

The duration calculation stage reads audio samples and sample rate to determine exact duration:

```python
from nemo_curator.stages.audio.common import GetAudioDurationStage
from nemo_curator.tasks import AudioBatch

# Initialize duration calculator
duration_stage = GetAudioDurationStage(
    audio_filepath_key="audio_filepath",
    duration_key="duration"
)

# Process audio data
audio_data = {"audio_filepath": "/path/to/audio.wav", "text": "transcription"}
audio_batch = AudioBatch(data=[audio_data])
result_batch = duration_stage.process(audio_batch)

# Access duration information
duration = result_batch[0].data[0]["duration"]
print(f"Audio duration: {duration:.3f} seconds")
```

### Duration Calculation Process

1. **File Reading**: Uses `soundfile` to read audio samples and sample rate
2. **Frame Counting**: Counts total audio frames from the loaded samples
3. **Duration Calculation**: Computes duration as `frames ÷ sample_rate`
4. **Error Handling**: Sets duration to -1.0 for corrupted files

## Configuration

### Basic Configuration

```python
from nemo_curator.stages.audio.common import GetAudioDurationStage

# Configure duration calculation
duration_stage = GetAudioDurationStage(
    audio_filepath_key="audio_filepath",  # Field containing audio file paths
    duration_key="duration"               # Output field for duration values
)
```

### Custom Field Names

```python
# Use custom field names for your data format
duration_stage = GetAudioDurationStage(
    audio_filepath_key="wav_file_path",   # Custom input field
    duration_key="audio_length_seconds"   # Custom output field
)
```

## Usage Examples

### Basic Duration Calculation

```python
from nemo_curator.stages.audio.common import GetAudioDurationStage
from nemo_curator.tasks import AudioBatch

# Sample audio data
audio_samples = [
    {"audio_filepath": "/path/to/sample1.wav", "text": "Hello world"},
    {"audio_filepath": "/path/to/sample2.wav", "text": "How are you"},
    {"audio_filepath": "/path/to/sample3.wav", "text": "Good morning"}
]

# Create duration calculation stage
duration_stage = GetAudioDurationStage(
    audio_filepath_key="audio_filepath",
    duration_key="duration"
)

# Process each sample
for sample in audio_samples:
    audio_batch = AudioBatch(data=[sample])
    result_batch = duration_stage.process(audio_batch)

    processed_sample = result_batch[0].data[0]
    print(f"File: {processed_sample['audio_filepath']}")
    print(f"Duration: {processed_sample['duration']:.3f} seconds")
```

### Pipeline Integration

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.common import GetAudioDurationStage, PreserveByValueStage

# Create audio processing pipeline
pipeline = Pipeline(name="audio_duration_pipeline")

# Add duration calculation stage
pipeline.add_stage(GetAudioDurationStage(
    audio_filepath_key="audio_filepath",
    duration_key="duration"
))

# Add duration-based filtering (1-30 seconds)
pipeline.add_stage(PreserveByValueStage(
    input_value_key="duration",
    target_value=1.0,
    operator="ge"  # greater than or equal
))

pipeline.add_stage(PreserveByValueStage(
    input_value_key="duration",
    target_value=30.0,
    operator="le"  # less than or equal
))
```

### Batch Processing

```python
from nemo_curator.stages.audio.common import GetAudioDurationStage
from nemo_curator.tasks import AudioBatch

# Process multiple samples in a batch
audio_data_list = [
    {"audio_filepath": "/path/to/file1.wav", "text": "Sample 1"},
    {"audio_filepath": "/path/to/file2.wav", "text": "Sample 2"},
    {"audio_filepath": "/path/to/file3.wav", "text": "Sample 3"}
]

# Create batch
audio_batch = AudioBatch(data=audio_data_list)

# Process entire batch
duration_stage = GetAudioDurationStage(
    audio_filepath_key="audio_filepath",
    duration_key="duration"
)

# Process returns list of AudioBatch objects
result_batches = duration_stage.process(audio_batch)

# Extract processed data
for batch in result_batches:
    for sample in batch.data:
        print(f"File: {sample['audio_filepath']}")
        print(f"Duration: {sample['duration']:.3f} seconds")
```

## Output Format

The stage adds duration information to each audio sample's metadata:

```json
{
  "audio_filepath": "/path/to/audio.wav",
  "text": "Sample transcription text",
  "duration": 12.345
}
```

For corrupted or unreadable files:

```json
{
  "audio_filepath": "/path/to/corrupted.wav",
  "text": "Sample transcription text",
  "duration": -1.0
}
```

## Error Handling

The stage handles various error conditions:

### File Not Found

```python
# Non-existent files result in duration = -1.0
sample = {"audio_filepath": "/nonexistent/file.wav", "text": "test"}
audio_batch = AudioBatch(data=[sample])
result = duration_stage.process(audio_batch)
# result[0].data[0]["duration"] == -1.0
```

### Corrupted Audio Files

```python
# Corrupted files are logged and marked with duration = -1.0
# Check logs for specific error messages
import logging
logging.basicConfig(level=logging.WARNING)

# Process will continue with other files
result = duration_stage.process(audio_batch)
```

### Filtering Error Files

```python
from nemo_curator.stages.audio.common import PreserveByValueStage

# Filter out files with calculation errors
error_filter = PreserveByValueStage(
    input_value_key="duration",
    target_value=0.0,
    operator="gt"  # greater than (excludes -1.0 error values)
)
```

## Integration with Quality Assessment

Duration calculation is typically the first step in quality assessment workflows:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.common import GetAudioDurationStage, PreserveByValueStage

# Create comprehensive quality pipeline
pipeline = Pipeline(name="audio_quality_assessment")

# Step 1: Calculate durations
pipeline.add_stage(GetAudioDurationStage(
    audio_filepath_key="audio_filepath",
    duration_key="duration"
))

# Step 2: Filter by duration range (optimal for ASR training)
pipeline.add_stage(PreserveByValueStage(
    input_value_key="duration",
    target_value=1.0,      # Minimum 1 second
    operator="ge"
))

pipeline.add_stage(PreserveByValueStage(
    input_value_key="duration",
    target_value=15.0,     # Maximum 15 seconds
    operator="le"
))

# Step 3: Remove error files
pipeline.add_stage(PreserveByValueStage(
    input_value_key="duration",
    target_value=0.0,      # Exclude -1.0 error values
    operator="gt"
))
```

## Performance Considerations

### Memory Usage

* The stage reads audio samples to compute frames
* Memory usage scales with file duration, channels, and data type
* Reduce batch size when processing large files or large batches of files
* For a custom alternative that avoids loading samples, use `soundfile.info` to get `frames` and `samplerate`

### Processing Speed

* Duration calculation is I/O bound and scales with file size
* Network-mounted files can be slower than local storage
* Consider parallel processing for large datasets using Ray

### File System Optimization

For better performance with large datasets:

* Use local storage when possible
* Ensure sufficient I/O bandwidth
* Consider file system caching

## Troubleshooting

### Common Issues

#### Unsupported Audio Formats

```python
# Check supported formats
import soundfile as sf
print("Supported formats:", sf.available_formats())

# Common supported formats: WAV, FLAC, OGG, AIFF
# MP3 support depends on your system's libsndfile build
```
