
AudioTask Data Structure


This guide covers the AudioTask data structure, which serves as the core container for audio data throughout NeMo Curator’s audio processing pipeline.

Overview

AudioTask is a specialized data structure that extends NeMo Curator’s base Task class to handle audio-specific processing requirements. Each AudioTask holds a single manifest entry, matching the convention used by VideoTask and FileGroupTask:

  • Single-Entry Model: One manifest entry per task (Task[dict]), enabling straightforward per-sample processing
  • File Path Management: Automatically validates audio file existence and accessibility
  • Metadata Handling: Preserves audio characteristics and processing results throughout pipeline stages

Structure and Components

Basic Structure

from nemo_curator.tasks import AudioTask

# Create AudioTask with a single audio file
audio_task = AudioTask(
    data={
        "audio_filepath": "/path/to/audio.wav",
        "text": "ground truth transcription",
        "duration": 3.2,
        "language": "en",
    },
    filepath_key="audio_filepath",
    task_id="audio_task_001",
    dataset_name="my_speech_dataset",
)

Key Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| data | dict | Audio manifest entry (single dict, exposed as _AttrDict for attribute-style access) |
| filepath_key | str \| None | Key name for audio file paths in data (optional) |
| task_id | str | Unique identifier for the task |
| dataset_name | str | Name of the source dataset |
| num_items | int | Always returns 1 (read-only property) |

Attribute-Style Access

AudioTask.data is exposed as an _AttrDict (a dict subclass), so you can read fields with either dict-style or attribute-style syntax:

audio_task = AudioTask(data={"audio_filepath": "/path/to/audio.wav", "duration": 3.2})

# Both access styles work
audio_task.data["audio_filepath"]  # dict-style
audio_task.data.audio_filepath     # attribute-style

Data Validation

Automatic Validation

AudioTask provides built-in validation for audio data integrity. The _AttrDict data type enables hasattr-based validation, matching the pattern used by all other modalities.

Metadata Management

Standard Metadata Fields

Common fields stored in AudioTask data:

audio_sample = {
    # Core fields (user-provided)
    "audio_filepath": "/path/to/audio.wav",
    "text": "transcription text",

    # Fields added by processing stages
    "pred_text": "asr prediction",  # Added by ASR inference stages
    "wer": 12.5,                    # Added by GetPairwiseWerStage
    "duration": 3.2,                # Added by GetAudioDurationStage

    # Optional user-provided metadata
    "language": "en_us",
    "speaker_id": "speaker_001",

    # Custom fields (examples)
    "domain": "conversational",
    "noise_level": "low",
}

Character error rate (CER) is available as a utility function and typically requires a custom stage to compute and store it.
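For illustration, CER is the character-level Levenshtein distance divided by the reference length. A self-contained sketch of the math (independent of NeMo Curator's own utility) might look like:

```python
# Illustrative CER computation; NeMo Curator ships its own utility, so this
# standalone function only demonstrates the underlying edit-distance math.
def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level Levenshtein distance divided by len(reference).

    Assumes a non-empty reference string.
    """
    # Classic dynamic-programming edit distance over character sequences.
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / len(reference)


char_error_rate("hello", "hxllo")  # 0.2 (one substitution out of five characters)
```

A custom stage would compute this from `text` and `pred_text` and store the result under a new key, mirroring how GetPairwiseWerStage stores `wer`.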

Error Handling

Graceful Failure Modes

AudioTask handles various error conditions:

# Missing files
audio_task = AudioTask(data={
    "audio_filepath": "/missing/file.wav",
    "text": "sample",
})
# Validation fails, but processing continues with warnings

# Corrupted audio files
corrupted_sample = {
    "audio_filepath": "/corrupted/audio.wav",
    "text": "sample text",
}
# Duration calculation returns -1.0 for corrupted files

# Invalid metadata
invalid_sample = {
    "audio_filepath": "/valid/audio.wav",
    # Missing "text" field - needed for WER calculation but not enforced by AudioTask
}
# AudioTask does not enforce metadata field requirements. Add a validation stage if required.
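One way to add such a validation stage is a small batched filter. The class name and required-field list below are illustrative assumptions, not part of NeMo Curator:

```python
# Hypothetical validation stage sketch; NeMo Curator does not ship this class,
# so its name and required-field list are illustrative assumptions. It operates
# on plain manifest dicts to stay self-contained.
REQUIRED_FIELDS = ("audio_filepath", "text")


class ValidateManifestFields:
    """Drop manifest entries that are missing required fields."""

    def process_batch(self, entries: list[dict]) -> list[dict]:
        # Keep only entries that carry every required field.
        return [e for e in entries if all(f in e for f in REQUIRED_FIELDS)]


stage = ValidateManifestFields()
stage.process_batch([
    {"audio_filepath": "/valid/audio.wav", "text": "ok"},
    {"audio_filepath": "/valid/other.wav"},  # missing "text", so it is dropped
])
```

Placing a filter like this early in the pipeline keeps later stages (WER computation, ASR inference) from failing on incomplete entries.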

Performance Characteristics

Memory Usage

AudioTask memory footprint is minimal since each task holds a single manifest entry. Memory scales with the number of metadata fields per entry and the total number of tasks processed in the pipeline.

Processing Patterns

Audio stages follow two processing patterns:

| Pattern | Stages | Method |
| --- | --- | --- |
| Per-task | CPU stages (GetAudioDurationStage, GetPairwiseWerStage) | process(task) → AudioTask; mutates task.data in place |
| Batched | GPU stages (InferenceAsrNemoStage), IO stages (AudioToDocumentStage), filtering (PreserveByValueStage) | process_batch(tasks) → list[AudioTask] |
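The two patterns can be sketched with a plain stand-in class rather than the real AudioTask and ProcessingStage (the duration value and the filter threshold below are illustrative):

```python
# Simplified sketch of the two processing patterns. FakeAudioTask stands in for
# AudioTask; real stages subclass ProcessingStage[AudioTask, AudioTask].
from dataclasses import dataclass, field


@dataclass
class FakeAudioTask:
    data: dict = field(default_factory=dict)


# Per-task pattern: mutate task.data in place and return the same task.
def process(task: FakeAudioTask) -> FakeAudioTask:
    task.data["duration"] = 3.2  # stands in for reading the audio file
    return task


# Batched pattern: receive a list of tasks, return a (possibly filtered) list.
def process_batch(tasks: list[FakeAudioTask]) -> list[FakeAudioTask]:
    # e.g. drop entries whose duration could not be computed (-1.0 sentinel)
    return [t for t in tasks if t.data.get("duration", -1.0) > 0.0]
```

Per-task stages keep the single-entry contract simple, while batched stages let GPU and IO work amortize fixed costs across many entries.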

Integration with Processing Stages

Stage Input/Output

AudioTask serves as input and output for audio processing stages. All audio stages subclass ProcessingStage[AudioTask, AudioTask] directly:

# CPU stage: mutates task in-place and returns it
def process(self, task: AudioTask) -> AudioTask:
    duration = get_duration(task.data["audio_filepath"])
    task.data["duration"] = duration
    return task

Chaining Stages

AudioTask flows through multiple processing stages, with each stage adding new metadata fields: