Audio Curation Pipeline (Overview)#
This guide provides an overview of the end-to-end audio curation workflow in NVIDIA NeMo Curator. It covers data ingestion and validation, optional ASR inference, quality assessment, filtering, and export or conversion. For detailed ASR pipeline information, refer to ASR Pipeline Architecture.
High-Level Flow#
graph TD
    A[Audio Files] --> B[Ingest & Validation]
    B --> C[Optional ASR Inference]
    C --> D[Quality Metrics]
    B --> D
    D --> E[Filtering]
    E --> F[Export & Conversion]
Core Components#
Data Ingestion and Validation:
AudioBatch file existence checks using validate() and validate_item()
Manifest format validation and metadata consistency
Recommended JSONL manifest format
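The ingestion checks above can be sketched in plain Python. This is a minimal illustration, not NeMo Curator's implementation. It assumes a manifest with one JSON object per line, such as {"audio_filepath": "/path/to/sample.wav", "text": "hello world"}, and the field name audio_filepath and the helpers validate_item and load_manifest are hypothetical stand-ins for the behavior described.

```python
import json
import os

# Assumed required field; adjust to the keys your manifest actually uses.
REQUIRED_KEYS = ("audio_filepath",)


def validate_item(item: dict) -> bool:
    """Return True if the entry has the required keys and its audio file exists."""
    if not all(key in item for key in REQUIRED_KEYS):
        return False
    return os.path.isfile(item["audio_filepath"])


def load_manifest(path: str) -> tuple[list[dict], list[dict]]:
    """Read a JSONL manifest and split its entries into (valid, invalid)."""
    valid, invalid = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between entries
            item = json.loads(line)
            (valid if validate_item(item) else invalid).append(item)
    return valid, invalid
```

Returning the rejected entries alongside the accepted ones makes it easier to report which files were missing instead of dropping them silently.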
Optional ASR Inference:
InferenceAsrNemoStage for automatic speech recognition
Configurable batch processing with batch_size and resources parameters
Support for multiple NeMo ASR models
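To illustrate what a batch_size setting controls, the sketch below groups audio paths into fixed-size batches before they would be handed to a model. The helper iter_batches is hypothetical and says nothing about how InferenceAsrNemoStage batches internally.

```python
from typing import Iterator, Sequence


def iter_batches(paths: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most batch_size paths."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(paths), batch_size):
        # The final batch may be shorter than batch_size.
        yield list(paths[start:start + batch_size])
```

Larger batches generally improve accelerator utilization at the cost of memory, so the right value depends on the model and the hardware assigned to the stage.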
Quality Assessment:
Audio duration analysis with GetAudioDurationStage
Word Error Rate (WER) and Character Error Rate (CER) calculation
Speech rate metrics including words per second and characters per second
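The metrics listed above have standard definitions, sketched here in plain Python. WER and CER are the edit distance between reference and hypothesis divided by the reference length, at word and character level respectively. Two choices here are assumptions rather than confirmed NeMo Curator behavior: the rates are expressed as percentages, and characters are counted including internal spaces.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate as a percentage of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate as a percentage of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


def speech_rate(text: str, duration_s: float) -> tuple[float, float]:
    """Return (words per second, characters per second) for a transcript."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    text = text.strip()
    return len(text.split()) / duration_s, len(text) / duration_s
```

Because the denominator is the reference length, WER and CER can exceed 100 when the hypothesis contains many insertions.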
Filtering and Quality Control:
Threshold-based filtering using PreserveByValueStage
Configurable quality thresholds for WER, duration, and speech rate
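Threshold-based filtering amounts to comparing one field of each entry against a target value. The sketch below shows the idea with a hypothetical helper, preserve_by_value. Its name, parameters, and operator strings are illustrative and are not the signature of PreserveByValueStage, and the thresholds in the usage example are arbitrary.

```python
import operator

# Map short operator names to comparison functions.
_OPERATORS = {
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
    "ne": operator.ne,
    "ge": operator.ge,
    "gt": operator.gt,
}


def preserve_by_value(entries: list[dict], key: str, target: float, op: str = "le") -> list[dict]:
    """Keep entries where entry[key] <op> target holds; drop entries missing the key."""
    if op not in _OPERATORS:
        raise ValueError(f"unsupported operator: {op!r}")
    compare = _OPERATORS[op]
    return [e for e in entries if key in e and compare(e[key], target)]


# Example: keep samples with WER at most 75 and duration of at least 1 second.
samples = [
    {"id": "a", "wer": 10.0, "duration": 3.2},
    {"id": "b", "wer": 80.0, "duration": 4.0},
    {"id": "c", "wer": 20.0, "duration": 0.4},
]
kept = preserve_by_value(samples, "wer", 75.0, "le")
kept = preserve_by_value(kept, "duration", 1.0, "ge")
```

Chaining one filter per metric keeps each threshold independently configurable.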
Export and Format Conversion:
Audio-to-text conversion with AudioToDocumentStage
Integration with text processing workflows
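Conceptually, handing audio results to text processing means turning per-file entries into tabular records. The sketch below shows one plain-Python way to do that, under the assumption that downstream text stages consume column-oriented data or JSONL. The helpers to_columns and to_jsonl are hypothetical and do not describe how AudioToDocumentStage is implemented.

```python
import json


def to_columns(entries: list[dict]) -> dict[str, list]:
    """Pivot a list of entries into a column-oriented table, filling gaps with None."""
    keys = sorted({key for entry in entries for key in entry})
    return {key: [entry.get(key) for entry in entries] for key in keys}


def to_jsonl(entries: list[dict]) -> str:
    """Serialize entries as JSONL, one JSON object per line."""
    return "".join(json.dumps(entry, ensure_ascii=False) + "\n" for entry in entries)
```

A column-oriented dictionary of this shape can be passed directly to a DataFrame constructor if the text pipeline works on tables.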
Common Workflows#
ASR-First Workflow (Most Common):
Load audio files into AudioBatch format
Apply ASR inference to generate transcriptions
Calculate quality metrics (WER, duration, speech rate)
Apply threshold-based filtering
Convert to DocumentBatch for text processing integration
Export filtered, high-quality audio-text pairs
Quality-First Workflow (No ASR Required):
Load audio files with existing transcriptions
Extract audio characteristics (duration, format, sample rate)
Apply basic quality filters
Export validated audio dataset
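For the quality-first workflow, basic audio characteristics can be read without running any model. The sketch below uses Python's standard wave module, so it handles uncompressed WAV files only. The function names and the default thresholds are assumptions chosen for illustration.

```python
import wave


def wav_characteristics(path: str) -> dict:
    """Read sample rate, channel count, sample width, and duration from a WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration": frames / rate,  # seconds
        }


def passes_basic_filters(
    info: dict,
    min_duration: float = 1.0,     # hypothetical lower bound, in seconds
    max_duration: float = 30.0,    # hypothetical upper bound, in seconds
    min_sample_rate: int = 16000,  # hypothetical minimum, in Hz
) -> bool:
    """Apply simple duration and sample-rate checks to extracted characteristics."""
    in_range = min_duration <= info["duration"] <= max_duration
    return in_range and info["sample_rate"] >= min_sample_rate
```

Other formats such as FLAC or MP3 would need a third-party audio library in place of wave.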