Audio Curation Pipeline (Overview)
This guide provides an overview of the end-to-end audio curation workflow in NVIDIA NeMo Curator. It covers data ingestion and validation, optional ASR inference, quality assessment, filtering, and export or conversion. For detailed ASR pipeline information, refer to ASR Pipeline.
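High-Level Flow
Data ingestion and validation → optional ASR inference → quality assessment → filtering → export or conversion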
Core Components
Data Ingestion and Validation:
- AudioBatch file existence checks using validate() and validate_item()
- Manifest format validation and metadata consistency
- Recommended JSONL manifest format
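The checks above can be sketched in plain Python to show what manifest validation involves. This is a minimal illustration, not the AudioBatch implementation: the helper names `check_entry` and `load_manifest` are hypothetical, and the field names `audio_filepath` and `text` are assumed here as a typical manifest layout.

```python
import json
import os

# Assumed field names for this sketch; adjust to the manifest actually in use.
REQUIRED_KEYS = ("audio_filepath", "text")


def check_entry(item: dict) -> bool:
    """Return True if the entry has the required keys and its audio file exists."""
    if not all(key in item for key in REQUIRED_KEYS):
        return False
    return os.path.isfile(item["audio_filepath"])


def load_manifest(path: str) -> tuple[list[dict], list[int]]:
    """Parse a JSONL manifest.

    Returns the valid entries and the 1-based line numbers that were rejected
    (malformed JSON, missing keys, or a missing audio file).
    """
    valid, rejected = [], []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                item = json.loads(line)
            except json.JSONDecodeError:
                rejected.append(lineno)
                continue
            if isinstance(item, dict) and check_entry(item):
                valid.append(item)
            else:
                rejected.append(lineno)
    return valid, rejected
```

Collecting rejected line numbers instead of raising on the first error makes it easier to report every manifest problem in a single pass.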
Optional ASR Inference:
- InferenceAsrNemoStage for automatic speech recognition
- Configurable batch processing with batch_size and resources parameters
- Support for multiple NeMo ASR models
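To make the role of batch_size concrete, the sketch below groups manifest entries into fixed-size batches and passes each batch to a transcriber. It does not call InferenceAsrNemoStage: `batched` and `run_asr` are hypothetical helpers, `transcribe` stands for any callable that maps a list of file paths to a list of strings, and the output key `pred_text` is an assumption.

```python
from typing import Callable, Iterator


def batched(items: list[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most batch_size entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_asr(
    entries: list[dict],
    transcribe: Callable[[list[str]], list[str]],
    batch_size: int = 16,
    pred_key: str = "pred_text",  # assumed output field name
) -> list[dict]:
    """Attach a predicted transcription to every entry, one batch at a time."""
    results = []
    for batch in batched(entries, batch_size):
        paths = [entry["audio_filepath"] for entry in batch]
        texts = transcribe(paths)
        if len(texts) != len(batch):
            raise ValueError("transcriber returned a different number of results")
        for entry, text in zip(batch, texts):
            # Copy the entry so the input list is left unmodified.
            results.append({**entry, pred_key: text})
    return results
```

Larger batches usually improve accelerator utilization at the cost of memory, which is why the batch size is exposed as a tunable parameter.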
Quality Assessment:
- Audio duration analysis with GetAudioDurationStage
- Word Error Rate (WER) and Character Error Rate (CER) calculation
- Speech rate metrics including words per second and characters per second
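The metrics listed above have simple definitions, which the following self-contained sketch spells out. WER and CER are computed as Levenshtein edit distance divided by reference length, and the result is expressed as a percentage here (a choice made for this sketch). The function names are illustrative and are not taken from NeMo Curator.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate in percent, relative to the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate in percent, relative to the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


def words_per_second(text: str, duration: float) -> float:
    """Speech rate as whitespace-separated words per second of audio."""
    return len(text.split()) / duration


def chars_per_second(text: str, duration: float) -> float:
    """Speech rate as characters (including spaces) per second of audio."""
    return len(text) / duration
```

Because WER is normalized by the reference length, it can exceed 100% when the hypothesis contains many inserted words.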
Filtering and Quality Control:
- Threshold-based filtering using PreserveByValueStage
- Configurable quality thresholds for WER, duration, and speech rate
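Threshold-based filtering amounts to comparing one field of each entry against a target value. The sketch below illustrates the idea with a hypothetical `preserve_by_value` function. It is a stand-in written for this guide, not the PreserveByValueStage class, and the operator names are assumptions.

```python
import operator

# Comparison names accepted by this sketch (assumed, not the library's list).
OPERATORS = {
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
    "ne": operator.ne,
    "ge": operator.ge,
    "gt": operator.gt,
}


def preserve_by_value(entries: list[dict], key: str, target, op: str = "eq") -> list[dict]:
    """Keep only the entries for which `entry[key] <op> target` holds."""
    if op not in OPERATORS:
        raise ValueError(f"unsupported operator: {op!r}")
    compare = OPERATORS[op]
    return [entry for entry in entries if compare(entry[key], target)]


# Example: keep utterances with WER of at most 75% and at least 1 second long.
entries = [
    {"audio_filepath": "a.wav", "wer": 12.5, "duration": 4.0},
    {"audio_filepath": "b.wav", "wer": 80.0, "duration": 3.0},
    {"audio_filepath": "c.wav", "wer": 5.0, "duration": 0.2},
]
kept = preserve_by_value(entries, "wer", 75.0, "le")
kept = preserve_by_value(kept, "duration", 1.0, "ge")
```

Chaining one filter per metric keeps each threshold independently configurable.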
Export and Format Conversion:
- Audio-to-text conversion with AudioToDocumentStage
- Integration with text processing workflows
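Conceptually, the conversion step turns per-utterance entries into uniform tabular rows that text tooling can consume. The sketch below shows that idea using only the standard library. `to_document_rows` and `to_jsonl` are hypothetical helpers, and the sketch does not reproduce what AudioToDocumentStage does internally.

```python
import json


def to_document_rows(entries: list[dict], columns: list[str]) -> list[dict]:
    """Project each entry onto a fixed set of columns (missing values become None)."""
    return [{col: entry.get(col) for col in columns} for entry in entries]


def to_jsonl(rows: list[dict]) -> str:
    """Serialize rows as JSON Lines, one object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


# Example: export only the path and predicted text for downstream text processing.
rows = to_document_rows(
    [{"audio_filepath": "a.wav", "pred_text": "hello world", "wer": 0.0}],
    columns=["audio_filepath", "pred_text"],
)
jsonl_text = to_jsonl(rows)
```

Fixing the column set up front gives every row the same schema, which most text processing stages expect.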
Common Workflows
ASR-First Workflow (Most Common):
- Load audio files into AudioBatch format
- Apply ASR inference to generate transcriptions
- Calculate quality metrics (WER, duration, speech rate)
- Apply threshold-based filtering
- Convert to DocumentBatch for text processing integration
- Export filtered, high-quality audio-text pairs
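The ordering of the steps above can be illustrated by chaining small stage functions. This is a sketch of the composition pattern only: every name (`asr_stage`, `metric_stage`, `filter_stage`, `run_pipeline`) is hypothetical, the transcriber is a lookup table standing in for a real ASR model, and the thresholds are arbitrary example values.

```python
from typing import Callable

Entry = dict
Stage = Callable[[list[Entry]], list[Entry]]


def asr_stage(transcribe: Callable[[str], str]) -> Stage:
    """Build a stage that adds a predicted transcription to each entry."""
    def stage(entries: list[Entry]) -> list[Entry]:
        return [{**e, "pred_text": transcribe(e["audio_filepath"])} for e in entries]
    return stage


def metric_stage(name: str, fn: Callable[[Entry], float]) -> Stage:
    """Build a stage that stores fn(entry) under the given field name."""
    def stage(entries: list[Entry]) -> list[Entry]:
        return [{**e, name: fn(e)} for e in entries]
    return stage


def filter_stage(key: str, low: float, high: float) -> Stage:
    """Build a stage that keeps entries whose value lies in [low, high]."""
    def stage(entries: list[Entry]) -> list[Entry]:
        return [e for e in entries if low <= e[key] <= high]
    return stage


def run_pipeline(entries: list[Entry], stages: list[Stage]) -> list[Entry]:
    """Apply the stages in order, feeding each stage the previous output."""
    for stage in stages:
        entries = stage(entries)
    return entries


# Stand-in transcriber: a lookup table instead of a real ASR model.
fake_transcripts = {"a.wav": "hello world this is a test", "b.wav": "hi"}

pipeline = [
    asr_stage(lambda path: fake_transcripts[path]),
    metric_stage("words_per_second",
                 lambda e: len(e["pred_text"].split()) / e["duration"]),
    filter_stage("duration", 1.0, 30.0),         # example thresholds
    filter_stage("words_per_second", 1.0, 5.0),  # example thresholds
]
result = run_pipeline(
    [{"audio_filepath": "a.wav", "duration": 3.0},
     {"audio_filepath": "b.wav", "duration": 0.4}],
    pipeline,
)
```

A WER metric would slot in as another `metric_stage` followed by its own `filter_stage`, provided a reference transcription is available on each entry.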
Quality-First Workflow (No ASR Required):
- Load audio files with existing transcriptions
- Extract audio characteristics (duration, format, sample rate)
- Apply basic quality filters
- Export validated audio dataset
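For the quality-first path, audio characteristics can be read directly from the file header. The sketch below uses Python's standard `wave` module, so it handles uncompressed PCM WAV files only. Other formats would need a different reader. `wav_info`, `passes_basic_filters`, and the default thresholds are illustrative assumptions.

```python
import wave


def wav_info(path: str) -> dict:
    """Read sample rate, channel count, and duration from a PCM WAV file."""
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "audio_filepath": path,
            "sample_rate": rate,
            "channels": w.getnchannels(),
            "duration": frames / rate,  # seconds
        }


def passes_basic_filters(
    info: dict,
    min_duration: float = 1.0,     # example threshold, seconds
    max_duration: float = 30.0,    # example threshold, seconds
    min_sample_rate: int = 16000,  # example threshold, Hz
) -> bool:
    """Return True if the duration and sample rate fall within the given limits."""
    return (
        min_duration <= info["duration"] <= max_duration
        and info["sample_rate"] >= min_sample_rate
    )
```

Since no transcription quality is measured in this workflow, these header-level checks are the main guard against truncated or low-fidelity recordings.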