This guide covers the AudioTask data structure, which serves as the core container for audio data throughout NeMo Curator’s audio processing pipeline.
AudioTask is a specialized data structure that extends NeMo Curator’s base Task class to handle audio-specific processing requirements. Each AudioTask holds a single manifest entry, matching the convention used by VideoTask and FileGroupTask:
Each task's data is a single dict (Task[dict]), enabling straightforward per-sample processing.
AudioTask.data is an _AttrDict subclass, so you can access fields as attributes:
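As a rough illustration of attribute-style access, the sketch below defines a small stand-in class. `AttrDict` here is a hypothetical approximation written for this example, not NeMo Curator's actual `_AttrDict`, and the field names (`audio_filepath`, `text`, `duration`) are assumed for illustration only.

```python
class AttrDict(dict):
    """Hypothetical stand-in for _AttrDict: dict keys are also readable as attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods keep working.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        # Attribute assignment writes through to the underlying dict.
        self[name] = value


# Field names below are illustrative assumptions.
entry = AttrDict({"audio_filepath": "/data/sample.wav", "text": "hello world"})

filepath = entry.audio_filepath           # attribute-style access
same_filepath = entry["audio_filepath"]   # regular dict access still works
entry.duration = 2.5                      # stored as entry["duration"]
```

Raising `AttributeError` for missing keys is what lets `hasattr` and `getattr(..., default)` behave as expected on such an object.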
AudioTask provides built-in validation for audio data integrity. The _AttrDict data type enables hasattr-based validation, matching the pattern used by all other modalities.
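A minimal sketch of what hasattr-based validation can look like, under stated assumptions: `AttrDict` is a hypothetical stand-in for `_AttrDict`, `missing_fields` is a helper invented for this example, and the required field names are assumed rather than taken from the library.

```python
class AttrDict(dict):
    """Hypothetical stand-in for _AttrDict (not the NeMo Curator class)."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


def missing_fields(data, required):
    """Return the required field names that hasattr() cannot find on data."""
    return [name for name in required if not hasattr(data, name)]


# Assumed required fields, for illustration only.
REQUIRED = ["audio_filepath", "text"]

good = AttrDict(audio_filepath="/data/a.wav", text="hello")
bad = AttrDict(text="hello")

good_missing = missing_fields(good, REQUIRED)
bad_missing = missing_fields(bad, REQUIRED)
```

One caveat with this pattern on a dict subclass: `hasattr` also reports true for dict method names such as `keys` or `items`, so a field with one of those names cannot be validated this way.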
Common fields stored in AudioTask data:
Character error rate (CER) is available as a utility function and typically requires a custom stage to compute and store it.
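To make the CER computation concrete, here is a self-contained sketch based on Levenshtein edit distance over characters. It is not the library's utility function. The helper names, the `text`, `pred_text`, and `cer` field names, and the choice to report CER as a fraction rather than a percentage are all assumptions made for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER as a fraction: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def add_cer(entry):
    """Hypothetical per-entry step: compute CER and store it under an assumed 'cer' key."""
    entry["cer"] = character_error_rate(entry["text"], entry["pred_text"])
    return entry


entry = add_cer({"text": "hello", "pred_text": "hallo"})
cer_value = entry["cer"]  # one substitution over five reference characters
```

The sketch compares raw strings. In practice, casing and punctuation are usually normalized before scoring, which this example leaves out.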
AudioTask handles various error conditions:
AudioTask memory footprint is minimal since each task holds a single manifest entry. Memory scales with the number of metadata fields per entry and the total number of tasks processed in the pipeline.
Audio stages follow two processing patterns:
AudioTask serves as input and output for audio processing stages. All audio stages subclass ProcessingStage[AudioTask, AudioTask] directly:
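The shape of such a stage can be sketched with stand-in classes. Nothing below imports NeMo Curator: `AudioTask` and `ProcessingStage` are simplified stand-ins defined for this example, and the single `process` method, the `TextLengthStage` class, and the `num_chars` field are assumptions rather than the library's actual interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Generic, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


@dataclass
class AudioTask:
    """Stand-in task holding one manifest entry (not the NeMo Curator class)."""
    data: dict = field(default_factory=dict)


class ProcessingStage(ABC, Generic[In, Out]):
    """Stand-in base class with an assumed single-method interface."""

    @abstractmethod
    def process(self, task: In) -> Out:
        ...


class TextLengthStage(ProcessingStage[AudioTask, AudioTask]):
    """Hypothetical stage: records the transcript length in characters."""

    def process(self, task: AudioTask) -> AudioTask:
        task.data["num_chars"] = len(task.data.get("text", ""))
        return task


task = TextLengthStage().process(AudioTask(data={"text": "hello world"}))
```

Because the input and output types are both `AudioTask`, stages written this way can be chained in any order that satisfies their field dependencies.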
AudioTask flows through multiple processing stages, with each stage adding new metadata fields:
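The accumulation of fields can be illustrated with plain dictionaries and functions. The three stage functions and every field name below (`num_samples`, `sample_rate`, `duration`, `num_words`, `words_per_second`) are hypothetical and chosen only to show the pattern, not to mirror the stages shipped with NeMo Curator.

```python
def add_duration(entry):
    """Derive the duration in seconds from sample count and sample rate."""
    return {**entry, "duration": entry["num_samples"] / entry["sample_rate"]}


def add_word_count(entry):
    """Count whitespace-separated words in the transcript."""
    return {**entry, "num_words": len(entry["text"].split())}


def add_speaking_rate(entry):
    """Depends on fields written by the two earlier stages."""
    return {**entry, "words_per_second": entry["num_words"] / entry["duration"]}


entry = {"text": "hello world again", "num_samples": 48000, "sample_rate": 16000}

# Record which fields exist after each stage to show the entry growing.
history = [sorted(entry)]
for stage in (add_duration, add_word_count, add_speaking_rate):
    entry = stage(entry)
    history.append(sorted(entry))
```

Each function returns a new dict instead of mutating its input, which keeps the snapshots in `history` independent. A stage that mutates the task in place would work equally well for the pipeline itself.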