Beginner Audio Processing Tutorial
Learn the basics of audio processing with NeMo Curator using the FLEURS multilingual speech dataset. This tutorial walks you through a complete audio processing pipeline from data loading to quality assessment and filtering.
Overview
This tutorial demonstrates the core audio curation workflow:
- Load Dataset: Download and prepare the FLEURS dataset
- ASR Inference: Transcribe audio using NeMo ASR models
- Quality Assessment: Calculate Word Error Rate (WER)
- Duration Analysis: Extract audio file durations
- Filtering: Keep only high-quality samples
- Export: Save processed results
What you’ll learn:
- How to build an end-to-end audio curation pipeline
- Loading multilingual speech datasets (FLEURS)
- Running ASR inference with NeMo models
- Calculating quality metrics (WER, duration)
- Filtering audio by quality thresholds
- Exporting curated results in JSONL format
Time to complete: Approximately 15-30 minutes (depending on dataset size and GPU availability)
Working Example Location
The complete working code for this tutorial is located at:
Accessing the code:
Prerequisites
- NeMo Curator installed (see Installation Guide)
- NVIDIA GPU (required for ASR inference, minimum 16GB VRAM recommended)
- Internet connection for dataset download
- Basic Python knowledge
- CUDA-compatible PyTorch installation
- Sufficient disk space (FLEURS dataset requires ~10-50GB depending on language and split)
If you don’t have a GPU available, you can skip the ASR inference stage and work with pre-existing transcriptions. See the Custom Manifests guide for details.
Step-by-Step Walkthrough
Step 1: Import Required Modules
Import all necessary stages and components for the audio curation pipeline:
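The import paths below are an illustrative sketch: the class names come from this tutorial, but their module locations are assumptions and may differ across NeMo Curator versions, so verify them against your installation.

```python
# Assumed module paths -- check your installed NeMo Curator version
# for the canonical import locations of these classes.
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.datasets.fleurs import CreateInitialManifestFleursStage
from nemo_curator.stages.audio.inference import InferenceAsrNemoStage
from nemo_curator.stages.audio.metrics import GetPairwiseWerStage
from nemo_curator.stages.audio.common import (
    AudioToDocumentStage,
    GetAudioDurationStage,
    PreserveByValueStage,
)
from nemo_curator.stages.io.writer import JsonlWriter
```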
Key components:
- `Pipeline`: Container for organizing and executing processing stages
- `XennaExecutor`: Backend executor for running the pipeline
- `CreateInitialManifestFleursStage`: Downloads and prepares the FLEURS dataset
- `InferenceAsrNemoStage`: Runs ASR inference with NeMo models
- `GetPairwiseWerStage`: Calculates Word Error Rate
- `PreserveByValueStage`: Filters data based on threshold values
- `JsonlWriter`: Exports results in JSONL format
Step 2: Create the Pipeline
Build the audio curation pipeline by adding stages in sequence:
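A sketch of the assembly, assuming the stage classes listed under Step 1 are available; the constructor arguments shown here (keys, paths, model name) are illustrative placeholders rather than the tutorial's exact values.

```python
# Illustrative pipeline assembly; argument names may differ in your
# NeMo Curator version -- consult the class docstrings.
pipeline = Pipeline(name="fleurs_audio_curation")

pipeline.add_stage(
    CreateInitialManifestFleursStage(lang="hy_am", split="dev",
                                     raw_data_dir="/data/fleurs")
)
pipeline.add_stage(InferenceAsrNemoStage(model_name="nvidia/stt_hy_fastconformer_hybrid_large_pc"))
pipeline.add_stage(GetPairwiseWerStage())          # adds a "wer" field per sample
pipeline.add_stage(GetAudioDurationStage())        # adds a "duration" field per sample
pipeline.add_stage(PreserveByValueStage(input_value_key="wer",
                                        target_value=75.0,
                                        operator="le"))  # keep WER <= threshold
pipeline.add_stage(AudioToDocumentStage())         # AudioBatch -> DocumentBatch
pipeline.add_stage(JsonlWriter(path="/data/fleurs/result"))
```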
Stage explanations:
- CreateInitialManifestFleursStage: Downloads FLEURS dataset from Hugging Face and creates audio manifest
- InferenceAsrNemoStage: Loads NeMo ASR model and generates transcriptions (requires GPU)
- GetPairwiseWerStage: Compares ground truth and predictions to calculate WER
- GetAudioDurationStage: Reads audio files to extract duration metadata
- PreserveByValueStage: Filters samples, keeping only those with WER ≤ threshold
- AudioToDocumentStage: Converts AudioBatch to DocumentBatch format for export
- JsonlWriter: Saves filtered results as JSONL manifest
Step 3: Run the Pipeline
Configure pipeline parameters and execute:
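A minimal sketch of configuration and execution, assuming the pipeline built in Step 2; the values shown are examples, not required settings.

```python
# Illustrative configuration; adjust to your environment.
lang = "hy_am"                # FLEURS language code
split = "dev"                 # dataset split to process
raw_data_dir = "/data/fleurs" # download and output location
wer_threshold = 75.0          # keep samples with WER <= this value

# Execute the pipeline with the Xenna backend executor.
executor = XennaExecutor()
pipeline.run(executor)
```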
Configuration parameters:
- `lang`: Language code from the FLEURS dataset (for example, `en_us`, `ko_kr`, `es_419`)
- `split`: Dataset split to process (`dev`, `train`, or `test`)
- `raw_data_dir`: Directory for downloading and storing FLEURS data
- `model_name`: NeMo ASR model identifier from NGC or Hugging Face
- `wer_threshold`: Maximum acceptable WER percentage (samples above this value are filtered out)
The first run will download the FLEURS dataset for your selected language, which may take several minutes depending on network speed.
Running the Complete Example
To run the working tutorial:
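An invocation along these lines (the script name is a placeholder; substitute the tutorial's actual entry point):

```shell
# Hypothetical entry-point name; use the script shipped with the tutorial.
python audio_curation_tutorial.py \
    --raw_data_dir /data/fleurs \
    --lang hy_am \
    --split dev \
    --wer_threshold 75.0
```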
Command-line options:
- `--raw_data_dir`: Output directory for the dataset and results (required)
- `--lang`: FLEURS language code (default: `hy_am`)
- `--split`: Dataset split to process (default: `dev`)
- `--model_name`: NeMo ASR model name (default: matches the language)
- `--wer_threshold`: Maximum WER for filtering (default: `75.0`)
Expected execution time:
- Dataset download (first run): 5-15 minutes
- ASR inference: 1-5 minutes for dev split (~100 samples) with GPU
- Quality assessment and export: < 1 minute
System requirements during execution:
- GPU memory: 8-16GB (depending on model size)
- Disk space: 10-50GB (dataset + results)
- RAM: 8GB minimum
Understanding the Results
After running the pipeline, you’ll find:
- Downloaded data: FLEURS audio files and transcriptions in `<raw_data_dir>/downloads/`
- Processed manifest: JSONL file(s) with ASR predictions and quality metrics in `<raw_data_dir>/result/`
- Filtered results: Only samples meeting the WER threshold
Example output entry:
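An entry looks roughly like the following (the path and values are illustrative; field names match the descriptions below):

```json
{
  "audio_filepath": "/data/fleurs/downloads/hy_am/dev/sample_0001.wav",
  "text": "ground truth transcription",
  "pred_text": "ground truth transcription",
  "wer": 0.0,
  "duration": 12.4
}
```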
Field descriptions:
- `audio_filepath`: Absolute path to the audio file
- `text`: Ground-truth transcription from FLEURS
- `pred_text`: ASR model prediction
- `wer`: Word Error Rate percentage (0.0 = perfect match)
- `duration`: Audio duration in seconds
Analyzing results:
Using Python:
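A minimal sketch using only the standard library: it summarizes sample count, average WER, and total audio hours over a JSONL manifest. The synthetic two-line manifest written here is a stand-in; in practice, point `summarize_manifest` at the files under `<raw_data_dir>/result/`.

```python
import json

def summarize_manifest(path):
    """Compute simple quality statistics over a JSONL manifest."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    n = len(records)
    return {
        "samples": n,
        "avg_wer": sum(r["wer"] for r in records) / n,
        "total_hours": sum(r["duration"] for r in records) / 3600.0,
    }

# Synthetic example manifest; replace with your pipeline's output file.
with open("manifest.jsonl", "w") as f:
    f.write(json.dumps({"audio_filepath": "a.wav", "text": "hello world",
                        "pred_text": "hello world", "wer": 0.0, "duration": 3.2}) + "\n")
    f.write(json.dumps({"audio_filepath": "b.wav", "text": "good morning",
                        "pred_text": "good morn", "wer": 50.0, "duration": 2.8}) + "\n")

print(summarize_manifest("manifest.jsonl"))
```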
Troubleshooting
Common Issues
GPU out of memory:
Solution: Reduce batch size or use a smaller ASR model:
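For example, a sketch of swapping in a smaller checkpoint (the model name is an example from the NGC catalog, not a prescription, and the constructor arguments may differ in your version):

```python
# Illustrative: a smaller ASR model typically needs less GPU memory.
pipeline.add_stage(
    InferenceAsrNemoStage(model_name="nvidia/stt_en_conformer_ctc_small")
)
```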
Dataset download fails:
Solution: Check internet connection and retry. The stage will resume from where it left off.
No GPU available:
Solution: Ensure CUDA is installed and GPU is accessible:
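For example, you can check both the driver and the PyTorch CUDA build:

```shell
# Verify the driver can see the GPU
nvidia-smi

# Verify PyTorch was built with CUDA support and can reach the device
python -c "import torch; print(torch.cuda.is_available())"
```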
Model download fails:
Solution: Verify model name is correct and you have internet access. Check available models at NGC Catalog.
Performance Optimization
- Increase batch size for faster processing (if GPU memory allows)
- Use multiple GPUs by setting `Resources(gpus=2.0)` or higher
- Process a subset of data by using `split="dev"` (smaller than `"train"`)
- Skip ASR inference if you already have both predicted and target transcriptions (remove `InferenceAsrNemoStage`)
Next Steps
After completing this tutorial, explore:
- Custom Manifests: Process your own audio datasets
- WER Filtering: Advanced quality filtering techniques
- Duration Filtering: Filter by audio length and speech rate
- NeMo ASR Models: Explore available ASR models for different languages
Best Practices
- Start with dev split: Test your pipeline on the smaller development split before processing the full training set
- Adjust WER thresholds by language: Some languages may require more lenient thresholds (e.g., 75-80% for low-resource languages)
- Monitor GPU usage: Use `nvidia-smi` to track GPU memory and utilization during processing
- Validate results: Always inspect a sample of output records to verify quality
- Document parameters: Keep track of configuration values (thresholds, models) for reproducibility
Related Topics
- Audio Curation Quickstart: Quick introduction to audio curation
- FLEURS Dataset: Detailed FLEURS dataset documentation
- Quality Assessment: Comprehensive quality metrics guide
- Save & Export: Advanced export options and formats