AudioTask Data Structure
This guide covers the AudioTask data structure, which serves as the core container for audio data throughout NeMo Curator’s audio processing pipeline.
Overview
AudioTask is a specialized data structure that extends NeMo Curator’s base Task class to handle audio-specific processing requirements. Each AudioTask holds a single manifest entry, matching the convention used by VideoTask and FileGroupTask:
- Single-Entry Model: One manifest entry per task (Task[dict]), enabling straightforward per-sample processing
- File Path Management: Automatically validates audio file existence and accessibility
- Metadata Handling: Preserves audio characteristics and processing results throughout pipeline stages
Structure and Components
Basic Structure
Key Attributes
Attribute-Style Access
AudioTask.data is an _AttrDict subclass, so you can access fields as attributes:
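As a rough illustration of attribute-style access, the sketch below uses a hypothetical stand-in class named AttrDict rather than NeMo Curator's own _AttrDict, whose implementation may differ. The field names (audio_filepath, duration, text, pred_text) are assumptions chosen for the example.

```python
class AttrDict(dict):
    """Hypothetical stand-in for _AttrDict: a dict whose keys are also attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError:
            # Raising AttributeError keeps hasattr() and getattr() defaults working.
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        # Attribute assignment writes through to the underlying dict.
        self[name] = value


# Illustrative manifest entry; the field names are assumptions for this sketch.
entry = AttrDict(audio_filepath="/data/sample.wav", duration=3.2)

same_value = entry.duration == entry["duration"]  # both forms read the same key
entry.text = "hello world"                        # stored as entry["text"]
has_pred = hasattr(entry, "pred_text")            # key was never set
```

Because a missing key surfaces as AttributeError, the same object works with both dict-style and attribute-style checks.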
Data Validation
Automatic Validation
AudioTask provides built-in validation for audio data integrity. The _AttrDict data type enables hasattr-based validation, matching the pattern used by all other modalities.
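One way hasattr-based validation could look is sketched below. This is not NeMo Curator's validation code: AttrDict and validate_entry are hypothetical names, and the audio_filepath field is an assumption for the example.

```python
import os
import tempfile


class AttrDict(dict):
    """Hypothetical stand-in for _AttrDict: dict keys readable as attributes."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


def validate_entry(data, extra_required=()):
    """Return True when required fields exist and the audio file is readable."""
    required = ("audio_filepath", *extra_required)
    # hasattr works here because missing keys raise AttributeError in __getattr__.
    if not all(hasattr(data, name) for name in required):
        return False
    path = data.audio_filepath
    # Check both existence and read permission.
    return os.path.isfile(path) and os.access(path, os.R_OK)


# A real (empty) temporary file passes the existence check.
with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
    ok = validate_entry(AttrDict(audio_filepath=tmp.name))

no_file = validate_entry(AttrDict(audio_filepath="/nonexistent/sample.wav"))
no_field = validate_entry(AttrDict(duration=1.0))
```

Returning False instead of raising lets a caller decide whether to drop the entry or stop, which is a design choice of this sketch rather than documented behavior.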
Metadata Management
Standard Metadata Fields
Common fields stored in AudioTask data:
Character error rate (CER) is available as a utility function and typically requires a custom stage to compute and store it.
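For orientation, character error rate is commonly defined as the character-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below implements that definition from scratch; char_error_rate is a hypothetical helper, not NeMo Curator's utility function, and the text, pred_text, and cer keys are assumed names.

```python
def char_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over characters, divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    # prev[j] holds the edit distance between the processed prefix of
    # `reference` and the first j characters of `hypothesis`.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


# Store the result on an illustrative manifest entry (key names are assumptions).
entry = {"text": "kitten", "pred_text": "sitting"}
entry["cer"] = char_error_rate(entry["text"], entry["pred_text"])
```

This sketch returns a fraction rather than a percentage; a custom stage built around it should use whichever scale the rest of the pipeline expects.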
Error Handling
Graceful Failure Modes
AudioTask handles various error conditions:
Performance Characteristics
Memory Usage
AudioTask memory footprint is minimal since each task holds a single manifest entry. Memory scales with the number of metadata fields per entry and the total number of tasks processed in the pipeline.
Processing Patterns
Audio stages follow two processing patterns:
Integration with Processing Stages
Stage Input/Output
AudioTask serves as input and output for audio processing stages. All audio stages subclass ProcessingStage[AudioTask, AudioTask] directly:
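The subclassing pattern can be sketched with minimal stand-ins. The AudioTask and ProcessingStage classes below are simplified, hypothetical substitutes defined only for this example, and the real NeMo Curator base classes have more members. WordCountStage and the word_count field are likewise illustrative.

```python
from dataclasses import dataclass, field
from typing import Generic, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


@dataclass
class AudioTask:
    """Stand-in task holding a single manifest entry (not the real class)."""
    task_id: str
    data: dict = field(default_factory=dict)


class ProcessingStage(Generic[In, Out]):
    """Stand-in base class with a single process() hook (an assumption)."""

    def process(self, task: In) -> Out:
        raise NotImplementedError


class WordCountStage(ProcessingStage[AudioTask, AudioTask]):
    """Hypothetical stage: reads the transcript and adds a word_count field."""

    def process(self, task: AudioTask) -> AudioTask:
        text = task.data.get("text", "")
        task.data["word_count"] = len(text.split())
        return task


task = AudioTask(task_id="sample-0", data={"text": "hello audio world"})
result = WordCountStage().process(task)
```

The stage takes one AudioTask and returns one AudioTask, which is what keeps per-sample stages easy to compose.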
Chaining Stages
AudioTask flows through multiple processing stages, with each stage adding new metadata fields:
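A minimal sketch of this accumulation is shown below, using a plain loop in place of the framework's pipeline executor. The stage classes, the run_chain helper, and all field names are hypothetical and exist only to show how each step adds a field to the same entry.

```python
class AddShortFlag:
    """Hypothetical stage: flags entries shorter than a duration threshold."""

    def __init__(self, max_seconds: float = 5.0):
        self.max_seconds = max_seconds

    def process(self, entry: dict) -> dict:
        entry["is_short"] = entry["duration"] < self.max_seconds
        return entry


class AddCharCount:
    """Hypothetical stage: records the transcript length in characters."""

    def process(self, entry: dict) -> dict:
        entry["char_count"] = len(entry["text"])
        return entry


def run_chain(entry: dict, stages: list) -> dict:
    # Each stage receives the output of the previous one.
    for stage in stages:
        entry = stage.process(entry)
    return entry


entry = {"audio_filepath": "/data/sample.wav", "duration": 3.2, "text": "hello world"}
result = run_chain(entry, [AddShortFlag(), AddCharCount()])
```

After the chain runs, the entry keeps its original fields and also carries is_short and char_count.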