
AudioBatch Data Structure


This guide covers the AudioBatch data structure, which serves as the core container for audio data throughout NeMo Curator’s audio processing pipeline.

Overview

AudioBatch is a specialized data structure that extends NeMo Curator’s base Task class to handle audio-specific processing requirements:

  • File Path Management: Automatically validates audio file existence and accessibility
  • Batch Processing: Groups multiple audio samples for efficient parallel processing
  • Metadata Handling: Preserves audio characteristics and processing results throughout pipeline stages

Structure and Components

Basic Structure

from nemo_curator.tasks import AudioBatch

# Create AudioBatch with a single audio sample
audio_batch = AudioBatch(
    data={
        "audio_filepath": "/path/to/audio.wav",
        "text": "ground truth transcription",
        "duration": 3.2,
        "language": "en",
    },
    filepath_key="audio_filepath",
    task_id="audio_task_001",
    dataset_name="my_speech_dataset",
)

# Create AudioBatch with multiple audio samples
audio_batch = AudioBatch(
    data=[
        {
            "audio_filepath": "/path/to/audio1.wav",
            "text": "first transcription",
            "duration": 2.1,
        },
        {
            "audio_filepath": "/path/to/audio2.wav",
            "text": "second transcription",
            "duration": 3.5,
        },
    ],
    filepath_key="audio_filepath",
    task_id="audio_task_001",
    dataset_name="my_speech_dataset",
)

Key Attributes

  • data (dict | list[dict] | None): Audio sample data, stored internally as list[dict]
  • filepath_key (str | None): Key name for audio file paths in data (optional)
  • task_id (str): Unique identifier for the batch
  • dataset_name (str): Name of the source dataset
  • num_items (int): Number of audio samples in the batch (read-only property)

Data Validation

Automatic Validation

AudioBatch provides built-in validation for audio data integrity: each sample's file path (under filepath_key) is checked for existence and accessibility, and samples that fail the check produce warnings rather than stopping the pipeline.
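Conceptually, the built-in check resembles the sketch below. Note that check_audio_items is an illustrative helper written for this guide, not part of the NeMo Curator API:

```python
import os
import tempfile

def check_audio_items(items: list[dict], filepath_key: str = "audio_filepath") -> list[bool]:
    """Per-sample flag: True if the item carries filepath_key and the file exists."""
    flags = []
    for item in items:
        path = item.get(filepath_key)
        flags.append(path is not None and os.path.exists(path))
    return flags

# Demo: one real (temporary) file, one missing file
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    real_path = tmp.name

items = [
    {"audio_filepath": real_path, "text": "present"},
    {"audio_filepath": "/missing/file.wav", "text": "absent"},
]
print(check_audio_items(items))  # → [True, False]
os.remove(real_path)
```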

Metadata Management

Standard Metadata Fields

Common fields stored in AudioBatch data:

audio_sample = {
    # Core fields (user-provided)
    "audio_filepath": "/path/to/audio.wav",
    "text": "transcription text",

    # Fields added by processing stages
    "pred_text": "asr prediction",  # Added by ASR inference stages
    "wer": 12.5,                    # Added by GetPairwiseWerStage
    "duration": 3.2,                # Added by GetAudioDurationStage

    # Optional user-provided metadata
    "language": "en_us",
    "speaker_id": "speaker_001",

    # Custom fields (examples)
    "domain": "conversational",
    "noise_level": "low",
}

Character error rate (CER) is available as a utility function, but computing and storing it per sample typically requires a custom stage.
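As a hedged illustration of what such a custom stage would compute, CER can be derived from a plain Levenshtein distance over characters. The helpers below are illustrative, not the NeMo Curator utility itself:

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance between two character sequences (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate as a percentage of the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 100.0
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)

sample = {"text": "hello world", "pred_text": "hello wurld"}
sample["cer"] = cer(sample["text"], sample["pred_text"])
print(round(sample["cer"], 2))  # one substitution over 11 characters → 9.09
```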

Error Handling

Graceful Failure Modes

AudioBatch handles various error conditions:

# Missing files
audio_batch = AudioBatch(data=[
    {"audio_filepath": "/missing/file.wav", "text": "sample"}
])
# Validation fails, but processing continues with warnings

# Corrupted audio files
corrupted_sample = {
    "audio_filepath": "/corrupted/audio.wav",
    "text": "sample text",
}
# Duration calculation returns -1.0 for corrupted files

# Invalid metadata
invalid_sample = {
    "audio_filepath": "/valid/audio.wav",
    # Missing "text" field - needed for WER calculation but not enforced by AudioBatch
}
# AudioBatch does not enforce metadata field requirements. Add a validation stage if required.
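The -1.0 fallback for unreadable files can be sketched with the standard-library wave module. Real pipelines read audio with a dedicated library, and get_duration here is an illustrative helper rather than a NeMo Curator function:

```python
import wave

def get_duration(filepath: str) -> float:
    """Return duration in seconds, or -1.0 if the file cannot be read."""
    try:
        with wave.open(filepath, "rb") as wf:
            rate = wf.getframerate()
            return wf.getnframes() / rate if rate else -1.0
    except (OSError, wave.Error):
        return -1.0

print(get_duration("/corrupted/audio.wav"))  # → -1.0
```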

Error Recovery Strategies

import logging
import os

from nemo_curator.tasks import AudioBatch

logger = logging.getLogger(__name__)

def robust_audiobatch_creation(raw_data: list[dict]) -> AudioBatch:
    """Create an AudioBatch, skipping items that fail validation."""
    valid_data = []
    error_count = 0

    for item in raw_data:
        try:
            # Validate required fields
            if "audio_filepath" not in item or "text" not in item:
                error_count += 1
                continue

            # Validate file existence
            if not os.path.exists(item["audio_filepath"]):
                error_count += 1
                continue

            valid_data.append(item)

        except Exception as e:
            logger.warning(f"Error processing item: {e}")
            error_count += 1

    logger.info(f"Created AudioBatch with {len(valid_data)} valid items, {error_count} errors")

    return AudioBatch(
        data=valid_data,
        filepath_key="audio_filepath",
    )

Performance Characteristics

Memory Usage

AudioBatch memory footprint depends on these factors:

  • Number of samples: Memory usage scales linearly with batch size
  • Metadata complexity: Additional metadata fields increase memory consumption
  • File path lengths: Longer file paths consume more memory
  • Audio file loading: Audio files are loaded on-demand and not cached in the batch

Processing Efficiency

Batch Size Impact:

Small batches:

  • Lower memory usage
  • Higher overhead per sample
  • Better for memory-constrained environments

Medium batches:

  • Balanced memory and performance
  • Good for most use cases
  • Optimal for CPU processing

Large batches:

  • Higher memory usage
  • Better GPU utilization
  • Optimal for GPU processing with sufficient VRAM
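To make the batch-size trade-off concrete, a flat sample list is typically chunked into fixed-size groups before being wrapped in AudioBatch instances. chunk_samples below is an illustrative helper, not a NeMo Curator API:

```python
def chunk_samples(samples: list[dict], batch_size: int) -> list[list[dict]]:
    """Split a flat list of samples into batches of at most batch_size items."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

samples = [{"audio_filepath": f"/data/utt_{i}.wav", "text": f"utterance {i}"} for i in range(10)]
batches = chunk_samples(samples, batch_size=4)
print([len(b) for b in batches])  # → [4, 4, 2]
```

Smaller batch_size values trade per-sample overhead for a lower peak memory footprint, as described above.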

Integration with Processing Stages

Stage Input/Output

AudioBatch serves as input and output for audio processing stages:

# Stage processing signature
def process(self, task: AudioBatch) -> AudioBatch:
    # Process audio data
    processed_data = []

    for item in task.data:
        # Apply processing logic
        processed_item = self.process_audio_item(item)
        processed_data.append(processed_item)

    # Return a new AudioBatch with the processed data
    return AudioBatch(
        data=processed_data,
        filepath_key=task.filepath_key,
        task_id=f"processed_{task.task_id}",
        dataset_name=task.dataset_name,
    )

Chaining Stages

AudioBatch flows through multiple processing stages, with each stage adding new metadata fields:
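A minimal sketch of this flow with plain dicts is shown below. The stage functions are illustrative stand-ins, not NeMo Curator's actual stage classes; each one reads existing fields and adds its own:

```python
def asr_stage(items: list[dict]) -> list[dict]:
    """Illustrative ASR stage: adds a 'pred_text' field to each sample."""
    return [{**item, "pred_text": item["text"]} for item in items]

def wer_stage(items: list[dict]) -> list[dict]:
    """Illustrative WER stage: adds a 'wer' field from an exact-match comparison."""
    return [{**item, "wer": 0.0 if item["pred_text"] == item["text"] else 100.0}
            for item in items]

data = [{"audio_filepath": "/data/utt_0.wav", "text": "hello world"}]
for stage in (asr_stage, wer_stage):
    data = stage(data)

print(sorted(data[0]))  # → ['audio_filepath', 'pred_text', 'text', 'wer']
```

Because each stage returns new dicts rather than mutating its input, fields accumulate without stages needing to know about one another.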