
AudioBatch Data Structure


This guide covers the AudioBatch data structure, which serves as the core container for audio data throughout NeMo Curator’s audio processing pipeline.

Overview

AudioBatch is a specialized data structure that extends NeMo Curator’s base Task class to handle audio-specific processing requirements:

  • File Path Management: Automatically validates audio file existence and accessibility
  • Batch Processing: Groups multiple audio samples for efficient parallel processing
  • Metadata Handling: Preserves audio characteristics and processing results throughout pipeline stages

Structure and Components

Basic Structure

from nemo_curator.tasks import AudioBatch

# Create AudioBatch with a single audio sample
audio_batch = AudioBatch(
    data={
        "audio_filepath": "/path/to/audio.wav",
        "text": "ground truth transcription",
        "duration": 3.2,
        "language": "en",
    },
    filepath_key="audio_filepath",
    task_id="audio_task_001",
    dataset_name="my_speech_dataset",
)

# Create AudioBatch with multiple audio samples
audio_batch = AudioBatch(
    data=[
        {
            "audio_filepath": "/path/to/audio1.wav",
            "text": "first transcription",
            "duration": 2.1,
        },
        {
            "audio_filepath": "/path/to/audio2.wav",
            "text": "second transcription",
            "duration": 3.5,
        },
    ],
    filepath_key="audio_filepath",
    task_id="audio_task_001",
    dataset_name="my_speech_dataset",
)

Key Attributes

  • data (dict | list[dict] | None): Audio sample data, stored internally as list[dict]
  • filepath_key (str | None): Key name for audio file paths in data (optional)
  • task_id (str): Unique identifier for the batch
  • dataset_name (str): Name of the source dataset
  • num_items (int): Number of audio samples in the batch (read-only property)

Data Validation

Automatic Validation

AudioBatch provides built-in validation for audio data integrity: each sample's file path (under filepath_key) is checked for existence and accessibility, and samples that fail the check produce warnings rather than stopping the pipeline.
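Conceptually, the built-in check resembles the sketch below. Note that check_audio_items is an illustrative helper written for this guide, not part of the NeMo Curator API:

```python
import os
import tempfile

def check_audio_items(items: list[dict], filepath_key: str = "audio_filepath") -> list[bool]:
    """Per-sample flag: True if the item carries filepath_key and the file exists."""
    flags = []
    for item in items:
        path = item.get(filepath_key)
        flags.append(path is not None and os.path.exists(path))
    return flags

# Demo: one real (temporary) file, one missing file
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    real_path = tmp.name

items = [
    {"audio_filepath": real_path, "text": "present"},
    {"audio_filepath": "/missing/file.wav", "text": "absent"},
]
print(check_audio_items(items))  # → [True, False]
os.remove(real_path)
```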

Metadata Management

Standard Metadata Fields

Common fields stored in AudioBatch data:

audio_sample = {
    # Core fields (user-provided)
    "audio_filepath": "/path/to/audio.wav",
    "text": "transcription text",

    # Fields added by processing stages
    "pred_text": "asr prediction",  # Added by ASR inference stages
    "wer": 12.5,                    # Added by GetPairwiseWerStage
    "duration": 3.2,                # Added by GetAudioDurationStage

    # Optional user-provided metadata
    "language": "en_us",
    "speaker_id": "speaker_001",

    # Custom fields (examples)
    "domain": "conversational",
    "noise_level": "low",
}

Character error rate (CER) is available as a utility function, but computing and storing it per sample typically requires a custom stage.
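As a hedged illustration of what such a custom stage would compute, CER can be derived from a plain Levenshtein distance over characters. The helpers below are illustrative, not the NeMo Curator utility itself:

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance between two character sequences (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate as a percentage of the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 100.0
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)

sample = {"text": "hello world", "pred_text": "hello wurld"}
sample["cer"] = cer(sample["text"], sample["pred_text"])
print(round(sample["cer"], 2))  # one substitution over 11 characters → 9.09
```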

Error Handling

Graceful Failure Modes

AudioBatch handles various error conditions:

# Missing files
audio_batch = AudioBatch(data=[
    {"audio_filepath": "/missing/file.wav", "text": "sample"}
])
# Validation fails, but processing continues with warnings

# Corrupted audio files
corrupted_sample = {
    "audio_filepath": "/corrupted/audio.wav",
    "text": "sample text",
}
# Duration calculation returns -1.0 for corrupted files

# Invalid metadata
invalid_sample = {
    "audio_filepath": "/valid/audio.wav",
    # Missing "text" field - needed for WER calculation but not enforced by AudioBatch
}
# AudioBatch does not enforce metadata field requirements. Add a validation stage if required.
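The -1.0 fallback for unreadable files can be sketched with the standard-library wave module. Real pipelines read audio with a dedicated library, and get_duration here is an illustrative helper rather than a NeMo Curator function:

```python
import wave

def get_duration(filepath: str) -> float:
    """Return duration in seconds, or -1.0 if the file cannot be read."""
    try:
        with wave.open(filepath, "rb") as wf:
            rate = wf.getframerate()
            return wf.getnframes() / rate if rate else -1.0
    except (OSError, wave.Error):
        return -1.0

print(get_duration("/corrupted/audio.wav"))  # → -1.0
```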

Error Recovery Strategies

import logging
import os

from nemo_curator.tasks import AudioBatch

logger = logging.getLogger(__name__)

def robust_audiobatch_creation(raw_data: list[dict]) -> AudioBatch:
    """Create an AudioBatch, skipping items that fail validation."""
    valid_data = []
    error_count = 0

    for item in raw_data:
        try:
            # Validate required fields
            if "audio_filepath" not in item or "text" not in item:
                error_count += 1
                continue

            # Validate file existence
            if not os.path.exists(item["audio_filepath"]):
                error_count += 1
                continue

            valid_data.append(item)

        except Exception as e:
            logger.warning(f"Error processing item: {e}")
            error_count += 1

    logger.info(f"Created AudioBatch with {len(valid_data)} valid items, {error_count} errors")

    return AudioBatch(
        data=valid_data,
        filepath_key="audio_filepath",
    )

Performance Characteristics

Memory Usage

AudioBatch memory footprint depends on these factors:

  • Number of samples: Memory usage scales linearly with batch size
  • Metadata complexity: Additional metadata fields increase memory consumption
  • File path lengths: Longer file paths consume more memory
  • Audio file loading: Audio files are loaded on-demand and not cached in the batch

Processing Efficiency

Batch Size Impact:

Small batches:

  • Lower memory usage
  • Higher overhead per sample
  • Better for memory-constrained environments

Medium batches:

  • Balanced memory and performance
  • Good for most use cases
  • Optimal for CPU processing

Large batches:

  • Higher memory usage
  • Better GPU utilization
  • Optimal for GPU processing with sufficient VRAM
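To make the batch-size trade-off concrete, a flat sample list is typically chunked into fixed-size groups before being wrapped in AudioBatch instances. chunk_samples below is an illustrative helper, not a NeMo Curator API:

```python
def chunk_samples(samples: list[dict], batch_size: int) -> list[list[dict]]:
    """Split a flat list of samples into batches of at most batch_size items."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

samples = [{"audio_filepath": f"/data/utt_{i}.wav", "text": f"utterance {i}"} for i in range(10)]
batches = chunk_samples(samples, batch_size=4)
print([len(b) for b in batches])  # → [4, 4, 2]
```

Smaller batch_size values trade per-sample overhead for a lower peak memory footprint, as described above.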

Integration with Processing Stages

Stage Input/Output

AudioBatch serves as input and output for audio processing stages:

# Stage processing signature
def process(self, task: AudioBatch) -> AudioBatch:
    # Process audio data
    processed_data = []

    for item in task.data:
        # Apply processing logic
        processed_item = self.process_audio_item(item)
        processed_data.append(processed_item)

    # Return a new AudioBatch with the processed data
    return AudioBatch(
        data=processed_data,
        filepath_key=task.filepath_key,
        task_id=f"processed_{task.task_id}",
        dataset_name=task.dataset_name,
    )

Chaining Stages

AudioBatch flows through multiple processing stages, with each stage adding new metadata fields:
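A minimal sketch of this flow with plain dicts is shown below. The stage functions are illustrative stand-ins, not NeMo Curator's actual stage classes; each one reads existing fields and adds its own:

```python
def asr_stage(items: list[dict]) -> list[dict]:
    """Illustrative ASR stage: adds a 'pred_text' field to each sample."""
    return [{**item, "pred_text": item["text"]} for item in items]

def wer_stage(items: list[dict]) -> list[dict]:
    """Illustrative WER stage: adds a 'wer' field from an exact-match comparison."""
    return [{**item, "wer": 0.0 if item["pred_text"] == item["text"] else 100.0}
            for item in items]

data = [{"audio_filepath": "/data/utt_0.wav", "text": "hello world"}]
for stage in (asr_stage, wer_stage):
    data = stage(data)

print(sorted(data[0]))  # → ['audio_filepath', 'pred_text', 'text', 'wer']
```

Because each stage returns new dicts rather than mutating its input, fields accumulate without stages needing to know about one another.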