
AudioTask Data Structure


This guide covers the AudioTask data structure, which serves as the core container for audio data throughout NeMo Curator’s audio processing pipeline.

Overview

AudioTask is a specialized data structure that extends NeMo Curator’s base Task class to handle audio-specific processing requirements. Each AudioTask holds a single manifest entry, matching the convention used by VideoTask and FileGroupTask:

  • Single-Entry Model: One manifest entry per task (Task[dict]), enabling straightforward per-sample processing
  • File Path Management: Automatically validates audio file existence and accessibility
  • Metadata Handling: Preserves audio characteristics and processing results throughout pipeline stages

Structure and Components

Basic Structure

from nemo_curator.tasks import AudioTask

# Create AudioTask with a single audio file
audio_task = AudioTask(
    data={
        "audio_filepath": "/path/to/audio.wav",
        "text": "ground truth transcription",
        "duration": 3.2,
        "language": "en",
    },
    filepath_key="audio_filepath",
    task_id="audio_task_001",
    dataset_name="my_speech_dataset",
)

Key Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| data | dict | Audio manifest entry (single dict, exposed as _AttrDict for attribute-style access) |
| filepath_key | str \| None | Key name for audio file paths in data (optional) |
| task_id | str | Unique identifier for the task |
| dataset_name | str | Name of the source dataset |
| num_items | int | Always returns 1 (read-only property) |

Attribute-Style Access

AudioTask.data is exposed as an _AttrDict (a dict subclass), so you can read fields with either dict-style or attribute-style syntax:

audio_task = AudioTask(data={"audio_filepath": "/path/to/audio.wav", "duration": 3.2})

# Both access styles work
audio_task.data["audio_filepath"]  # dict-style
audio_task.data.audio_filepath     # attribute-style

Data Validation

Automatic Validation

AudioTask provides built-in validation for audio data integrity. The _AttrDict data type enables hasattr-based validation, matching the pattern used by all other modalities.

Metadata Management

Standard Metadata Fields

Common fields stored in AudioTask data:

audio_sample = {
    # Core fields (user-provided)
    "audio_filepath": "/path/to/audio.wav",
    "text": "transcription text",

    # Fields added by processing stages
    "pred_text": "asr prediction",  # Added by ASR inference stages
    "wer": 12.5,                    # Added by GetPairwiseWerStage
    "duration": 3.2,                # Added by GetAudioDurationStage

    # Optional user-provided metadata
    "language": "en_us",
    "speaker_id": "speaker_001",

    # Custom fields (examples)
    "domain": "conversational",
    "noise_level": "low",
}

Character error rate (CER) is available as a utility function and typically requires a custom stage to compute and store it.
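For illustration, CER is the character-level Levenshtein distance divided by the reference length. A self-contained sketch of the math (independent of NeMo Curator's own utility) might look like:

```python
# Illustrative CER computation; NeMo Curator ships its own utility, so this
# standalone function only demonstrates the underlying edit-distance math.
def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level Levenshtein distance divided by len(reference).

    Assumes a non-empty reference string.
    """
    # Classic dynamic-programming edit distance over character sequences.
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / len(reference)


char_error_rate("hello", "hxllo")  # 0.2 (one substitution out of five characters)
```

A custom stage would compute this from `text` and `pred_text` and store the result under a new key, mirroring how GetPairwiseWerStage stores `wer`.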

Error Handling

Graceful Failure Modes

AudioTask handles various error conditions:

# Missing files
audio_task = AudioTask(data={
    "audio_filepath": "/missing/file.wav",
    "text": "sample",
})
# Validation fails, but processing continues with warnings

# Corrupted audio files
corrupted_sample = {
    "audio_filepath": "/corrupted/audio.wav",
    "text": "sample text",
}
# Duration calculation returns -1.0 for corrupted files

# Invalid metadata
invalid_sample = {
    "audio_filepath": "/valid/audio.wav",
    # Missing "text" field - needed for WER calculation but not enforced by AudioTask
}
# AudioTask does not enforce metadata field requirements. Add a validation stage if required.
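One way to add such a validation stage is a small batched filter. The class name and required-field list below are illustrative assumptions, not part of NeMo Curator:

```python
# Hypothetical validation stage sketch; NeMo Curator does not ship this class,
# so its name and required-field list are illustrative assumptions. It operates
# on plain manifest dicts to stay self-contained.
REQUIRED_FIELDS = ("audio_filepath", "text")


class ValidateManifestFields:
    """Drop manifest entries that are missing required fields."""

    def process_batch(self, entries: list[dict]) -> list[dict]:
        # Keep only entries that carry every required field.
        return [e for e in entries if all(f in e for f in REQUIRED_FIELDS)]


stage = ValidateManifestFields()
stage.process_batch([
    {"audio_filepath": "/valid/audio.wav", "text": "ok"},
    {"audio_filepath": "/valid/other.wav"},  # missing "text", so it is dropped
])
```

Placing a filter like this early in the pipeline keeps later stages (WER computation, ASR inference) from failing on incomplete entries.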

Performance Characteristics

Memory Usage

AudioTask memory footprint is minimal since each task holds a single manifest entry. Memory scales with the number of metadata fields per entry and the total number of tasks processed in the pipeline.

Processing Patterns

Audio stages follow two processing patterns:

| Pattern | Stages | Method |
| --- | --- | --- |
| Per-task | CPU stages (GetAudioDurationStage, GetPairwiseWerStage) | process(task) → AudioTask; mutates task.data in place |
| Batched | GPU stages (InferenceAsrNemoStage), IO stages (AudioToDocumentStage), filtering (PreserveByValueStage) | process_batch(tasks) → list[AudioTask] |
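The two patterns can be sketched with a plain stand-in class rather than the real AudioTask and ProcessingStage (the duration value and the filter threshold below are illustrative):

```python
# Simplified sketch of the two processing patterns. FakeAudioTask stands in for
# AudioTask; real stages subclass ProcessingStage[AudioTask, AudioTask].
from dataclasses import dataclass, field


@dataclass
class FakeAudioTask:
    data: dict = field(default_factory=dict)


# Per-task pattern: mutate task.data in place and return the same task.
def process(task: FakeAudioTask) -> FakeAudioTask:
    task.data["duration"] = 3.2  # stands in for reading the audio file
    return task


# Batched pattern: receive a list of tasks, return a (possibly filtered) list.
def process_batch(tasks: list[FakeAudioTask]) -> list[FakeAudioTask]:
    # e.g. drop entries whose duration could not be computed (-1.0 sentinel)
    return [t for t in tasks if t.data.get("duration", -1.0) > 0.0]
```

Per-task stages keep the single-entry contract simple, while batched stages let GPU and IO work amortize fixed costs across many entries.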

Integration with Processing Stages

Stage Input/Output

AudioTask serves as input and output for audio processing stages. All audio stages subclass ProcessingStage[AudioTask, AudioTask] directly:

# CPU stage: mutates task in-place and returns it
def process(self, task: AudioTask) -> AudioTask:
    duration = get_duration(task.data["audio_filepath"])
    task.data["duration"] = duration
    return task

Chaining Stages

AudioTask flows through multiple processing stages, with each stage adding new metadata fields: