Audio Curation Pipeline (Overview)
This guide provides an overview of the end-to-end audio curation workflow in NVIDIA NeMo Curator. It covers data ingestion and validation, optional ASR inference, quality assessment, filtering, and export or conversion. For detailed ASR pipeline information, refer to ASR Pipeline Architecture.
High-Level Flow
graph TD
A[Audio Files] --> B[Ingest & Validation]
B --> C[Optional ASR Inference]
C --> D[Quality Metrics]
B --> D
D --> E[Filtering]
E --> F[Export & Conversion]
Core Components
Data Ingestion and Validation:
AudioBatch file existence checks using validate() and validate_item()
Manifest format validation and metadata consistency
Recommended JSONL manifest format
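The ingestion checks above can be illustrated with a minimal sketch in plain Python. It does not use the AudioBatch API; the helper name `validate_manifest_line` and the `audio_filepath` key are assumptions made for illustration, so adjust them to the keys your manifest actually contains.

```python
import json
import os

# Assumed manifest keys (hypothetical); adjust to your own manifest schema.
REQUIRED_KEYS = ("audio_filepath",)


def validate_manifest_line(line, check_exists=True):
    """Return a list of problems found in one JSONL manifest line (empty if valid)."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(entry, dict):
        return ["entry is not a JSON object"]

    # Format check: every required key must be present.
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in entry]

    # Existence check: the referenced audio file must be on disk.
    path = entry.get("audio_filepath")
    if check_exists and isinstance(path, str) and not os.path.exists(path):
        problems.append(f"file not found: {path}")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easy to report every issue in a large manifest in a single pass.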
Optional ASR Inference:
InferenceAsrNemoStage for automatic speech recognition
Configurable batch processing with batch_size and resources parameters
Support for multiple NeMo ASR models
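To show what a `batch_size` setting controls, here is a small sketch of batched inference in plain Python. It is not the InferenceAsrNemoStage implementation: `transcribe_all` and the `transcribe_batch` callable are hypothetical stand-ins for a model call that accepts a list of audio paths and returns one transcription per path.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def transcribe_all(paths, transcribe_batch, batch_size=16):
    """Run a (hypothetical) batch transcription callable over all paths, in order."""
    texts = []
    for batch in batched(paths, batch_size):
        # One model call per batch; larger batches mean fewer calls but more memory.
        texts.extend(transcribe_batch(batch))
    return texts
```

A larger batch size generally improves throughput at the cost of memory, which is why it is usually tuned together with the resources assigned to the stage.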
Quality Assessment:
Audio duration analysis with GetAudioDurationStage
Word Error Rate (WER) and Character Error Rate (CER) calculation
Speech rate metrics including words per second and characters per second
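The metrics listed above can be sketched in plain Python. This is an illustrative implementation based on Levenshtein edit distance, not NeMo Curator's own code; the function names are hypothetical, and the error rates are expressed here as fractions of the reference length.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; spaces count as characters in this sketch."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def speech_rate(text, duration_s):
    """Return (words per second, characters per second) for a transcription."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return len(text.split()) / duration_s, len(text) / duration_s
```

For example, `wer("the cat sat", "the bat sat")` counts one substitution against three reference words, giving one third.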
Filtering and Quality Control:
Threshold-based filtering using PreserveByValueStage
Configurable quality thresholds for WER, duration, and speech rate
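Threshold-based filtering can be pictured as keeping only the entries whose metric passes a comparison. The sketch below is a plain-Python stand-in, not the PreserveByValueStage signature; the function name `preserve_by_value`, its parameters, and the sample thresholds are all assumptions chosen for illustration.

```python
import operator

# Comparison names mapped to standard-library comparison functions.
OPERATORS = {
    "lt": operator.lt, "le": operator.le, "eq": operator.eq,
    "ne": operator.ne, "ge": operator.ge, "gt": operator.gt,
}


def preserve_by_value(entries, key, threshold, op="le"):
    """Keep entries for which `entry[key] <op> threshold` holds."""
    compare = OPERATORS[op]
    return [entry for entry in entries if compare(entry[key], threshold)]


# Illustrative data only; WER is a fraction, duration is in seconds.
entries = [
    {"id": "a", "wer": 0.12, "duration": 4.0},
    {"id": "b", "wer": 0.80, "duration": 6.5},
    {"id": "c", "wer": 0.05, "duration": 0.3},
]

# Chain two filters: WER at most 0.5, then duration at least 1 second.
kept = preserve_by_value(entries, "wer", 0.5, op="le")
kept = preserve_by_value(kept, "duration", 1.0, op="ge")
```

Applying one threshold per step keeps each criterion independently configurable, which mirrors how separate filter stages would be chained in a pipeline.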
Export and Format Conversion:
Audio-to-text conversion with AudioToDocumentStage
Integration with text processing workflows
Common Workflows
ASR-First Workflow (Most Common):
Load audio files into AudioBatch format
Apply ASR inference to generate transcriptions
Calculate quality metrics (WER, duration, speech rate)
Apply threshold-based filtering
Convert to DocumentBatch for text processing integration
Export filtered, high-quality audio-text pairs
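The final export step of this workflow might look like the following in plain Python. This is a sketch under assumptions: it writes JSONL rather than using any NeMo Curator writer, and the helper name `export_jsonl` and the default key names (`audio_filepath`, `text`, `pred_text`, `wer`) are hypothetical.

```python
import json


def export_jsonl(entries, path, keys=("audio_filepath", "text", "pred_text", "wer")):
    """Write one JSON object per line, keeping only the selected keys that are present."""
    with open(path, "w", encoding="utf-8") as out:
        for entry in entries:
            row = {key: entry[key] for key in keys if key in entry}
            # ensure_ascii=False keeps non-ASCII transcriptions human-readable.
            out.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path
```

Restricting the output to an explicit set of keys keeps intermediate fields out of the exported audio-text pairs.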
Quality-First Workflow (No ASR Required):
Load audio files with existing transcriptions
Extract audio characteristics (duration, format, sample rate)
Apply basic quality filters
Export validated audio dataset
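For the characteristic-extraction step above, a minimal sketch using only Python's standard `wave` module is shown below. It handles uncompressed PCM WAV files only and is not how GetAudioDurationStage is implemented; the helper name `wav_characteristics` and the returned key names are assumptions.

```python
import wave


def wav_characteristics(path):
    """Read duration (seconds), sample rate (Hz), and channel count from a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "duration": frames / rate,
            "sample_rate": rate,
            "channels": wav.getnchannels(),
        }
```

The resulting dictionary can then be checked against basic limits, for example dropping clips shorter than a minimum duration or with an unexpected sample rate, before the dataset is exported.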