Beginner Audio Processing Tutorial

Learn the basics of audio processing with NeMo Curator using the FLEURS multilingual speech dataset. This tutorial walks you through a complete audio processing pipeline from data loading to quality assessment and filtering.

Overview

This tutorial demonstrates the core audio curation workflow:

  1. Load Dataset: Download and prepare the FLEURS dataset
  2. ASR Inference: Transcribe audio using NeMo ASR models
  3. Quality Assessment: Calculate Word Error Rate (WER)
  4. Duration Analysis: Extract audio file durations
  5. Filtering: Keep only high-quality samples
  6. Export: Save processed results

What you’ll learn:

  • How to build an end-to-end audio curation pipeline
  • Loading multilingual speech datasets (FLEURS)
  • Running ASR inference with NeMo models
  • Calculating quality metrics (WER, duration)
  • Filtering audio by quality thresholds
  • Exporting curated results in JSONL format

Time to complete: Approximately 15-30 minutes (depending on dataset size and GPU availability)

Working Example Location

The complete working code for this tutorial is located at:

<nemo_curator_repository>/tutorials/audio/fleurs/
├── README.md # Tutorial documentation
├── pipeline.py # Main tutorial script
├── pipeline.yaml # Configuration file for run.py
└── run.py # Same as pipeline.py, but builds the pipeline from the YAML configuration file instead

Accessing the code:

$ # Clone NeMo Curator repository
$ git clone https://github.com/NVIDIA/NeMo-Curator.git
$ cd NeMo-Curator/tutorials/audio/fleurs/

Prerequisites

  • NeMo Curator installed (see Installation Guide)
  • NVIDIA GPU (required for ASR inference, minimum 16GB VRAM recommended)
  • Internet connection for dataset download
  • Basic Python knowledge
  • CUDA-compatible PyTorch installation
  • Sufficient disk space (FLEURS dataset requires ~10-50GB depending on language and split)

If you don’t have a GPU available, you can skip the ASR inference stage and work with pre-existing transcriptions. See the Custom Manifests guide for details.

Step-by-Step Walkthrough

Step 1: Import Required Modules

Import all necessary stages and components for the audio curation pipeline:

from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import CreateInitialManifestFleursStage
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage
from nemo_curator.stages.audio.common import GetAudioDurationStage, PreserveByValueStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.resources import Resources

Key components:

  • Pipeline: Container for organizing and executing processing stages
  • XennaExecutor: Backend executor for running the pipeline
  • CreateInitialManifestFleursStage: Downloads and prepares FLEURS dataset
  • InferenceAsrNemoStage: Runs ASR inference with NeMo models
  • GetPairwiseWerStage: Calculates Word Error Rate
  • PreserveByValueStage: Filters data based on threshold values
  • JsonlWriter: Exports results in JSONL format

Step 2: Create the Pipeline

Build the audio curation pipeline by adding stages in sequence:

def create_audio_pipeline(args):
    """Create audio curation pipeline."""

    pipeline = Pipeline(name="audio_inference", description="Process FLEURS dataset with ASR")

    # Stage 1: Load FLEURS dataset
    pipeline.add_stage(
        CreateInitialManifestFleursStage(
            lang=args.lang,  # e.g., "hy_am" for Armenian
            split=args.split,  # "dev", "train", or "test"
            raw_data_dir=args.raw_data_dir
        ).with_(batch_size=4)  # Process 4 samples per batch
    )

    # Stage 2: ASR inference
    pipeline.add_stage(
        InferenceAsrNemoStage(
            model_name=args.model_name,  # e.g., "nvidia/stt_hy_fastconformer_hybrid_large_pc"
            pred_text_key="pred_text"  # Field name for ASR predictions
        ).with_(resources=Resources(gpus=1.0))  # Allocate 1 GPU
    )

    # Stage 3: Calculate WER
    pipeline.add_stage(
        GetPairwiseWerStage(
            text_key="text",  # Ground truth field
            pred_text_key="pred_text",  # ASR prediction field
            wer_key="wer"  # Output WER field
        )
    )

    # Stage 4: Extract duration
    pipeline.add_stage(
        GetAudioDurationStage(
            audio_filepath_key="audio_filepath",
            duration_key="duration"
        )
    )

    # Stage 5: Filter by WER threshold
    pipeline.add_stage(
        PreserveByValueStage(
            input_value_key="wer",
            target_value=args.wer_threshold,  # e.g., 75.0
            operator="le"  # less than or equal
        )
    )

    # Stage 6: Convert to DocumentBatch for export
    pipeline.add_stage(AudioToDocumentStage())

    # Stage 7: Export results
    result_dir = f"{args.raw_data_dir}/result"
    pipeline.add_stage(
        JsonlWriter(
            path=result_dir,
            write_kwargs={"force_ascii": False}
        )
    )

    return pipeline

Stage explanations:

  1. CreateInitialManifestFleursStage: Downloads FLEURS dataset from Hugging Face and creates audio manifest
  2. InferenceAsrNemoStage: Loads NeMo ASR model and generates transcriptions (requires GPU)
  3. GetPairwiseWerStage: Compares ground truth and predictions to calculate WER
  4. GetAudioDurationStage: Reads audio files to extract duration metadata
  5. PreserveByValueStage: Filters samples, keeping only those with WER ≤ threshold
  6. AudioToDocumentStage: Converts AudioBatch to DocumentBatch format for export
  7. JsonlWriter: Saves filtered results as JSONL manifest

Step 3: Run the Pipeline

Configure pipeline parameters and execute:

def main():
    # Configuration
    class Args:
        lang = "hy_am"  # Armenian language
        split = "dev"  # Development split
        raw_data_dir = "/data/fleurs_output"
        model_name = "nvidia/stt_hy_fastconformer_hybrid_large_pc"
        wer_threshold = 75.0

    args = Args()

    # Create pipeline
    pipeline = create_audio_pipeline(args)

    # Create executor
    executor = XennaExecutor()

    # Run pipeline
    pipeline.run(executor)

    print("Pipeline completed!")
    print(f"Results saved to: {args.raw_data_dir}/result/")

if __name__ == "__main__":
    main()

Configuration parameters:

  • lang: Language code from FLEURS dataset (e.g., “en_us”, “ko_kr”, “es_419”)
  • split: Dataset split to process (“dev”, “train”, or “test”)
  • raw_data_dir: Directory for downloading and storing FLEURS data
  • model_name: NeMo ASR model identifier from NGC or Hugging Face
  • wer_threshold: Maximum acceptable WER percentage (samples above this are filtered out)

The first run will download the FLEURS dataset for your selected language, which may take several minutes depending on network speed.
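In the repository scripts, these parameters come from the command line rather than a hard-coded class. A minimal argparse sketch of that configuration is below; the flag names mirror the options documented in this tutorial, but the actual parser in pipeline.py may differ in details:

```python
import argparse

# Sketch of a CLI for the configuration parameters above.
# Flag names and defaults are assumptions based on this tutorial,
# not copied from pipeline.py.
def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="FLEURS audio curation")
    parser.add_argument("--raw_data_dir", required=True, help="Dataset and output directory")
    parser.add_argument("--lang", default="hy_am", help="FLEURS language code")
    parser.add_argument("--split", default="dev", choices=["dev", "train", "test"])
    parser.add_argument("--model_name", default="nvidia/stt_hy_fastconformer_hybrid_large_pc")
    parser.add_argument("--wer_threshold", type=float, default=75.0)
    return parser.parse_args(argv)

args = parse_args(["--raw_data_dir", "/data/fleurs_output"])
print(args.lang, args.split, args.wer_threshold)
```

The resulting `args` object exposes the same attributes (`args.lang`, `args.wer_threshold`, and so on) that `create_audio_pipeline(args)` reads.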

Running the Complete Example

To run the working tutorial:

$ cd tutorials/audio/fleurs/
$ python pipeline.py \
>     --raw_data_dir ./example_audio/fleurs \
>     --model_name nvidia/stt_hy_fastconformer_hybrid_large_pc \
>     --lang hy_am \
>     --split dev \
>     --wer_threshold 75 \
>     --gpus 1 \
>     --clean \
>     --verbose

Command-line options:

  • --raw_data_dir: Output directory for dataset and results (required)
  • --lang: FLEURS language code (default: “hy_am”)
  • --split: Dataset split to process (default: “dev”)
  • --model_name: NeMo ASR model name (default: matches language)
  • --wer_threshold: Maximum WER for filtering (default: 75.0)

Expected execution time:

  • Dataset download (first run): 5-15 minutes
  • ASR inference: 1-5 minutes for dev split (~100 samples) with GPU
  • Quality assessment and export: < 1 minute

System requirements during execution:

  • GPU memory: 8-16GB (depending on model size)
  • Disk space: 10-50GB (dataset + results)
  • RAM: 8GB minimum

Understanding the Results

After running the pipeline, you’ll find:

  • Downloaded data: FLEURS audio files and transcriptions in <raw_data_dir>/downloads/
  • Processed manifest: JSONL file(s) with ASR predictions and quality metrics in <raw_data_dir>/result/
  • Filtered results: Only samples meeting the WER threshold

Example output entry:

{
  "audio_filepath": "/data/fleurs_output/dev/sample.wav",
  "text": "բարև աշխարհ",
  "pred_text": "բարև աշխարհ",
  "wer": 0.0,
  "duration": 2.3
}

Field descriptions:

  • audio_filepath: Absolute path to audio file
  • text: Ground truth transcription from FLEURS
  • pred_text: ASR model prediction
  • wer: Word Error Rate percentage (0.0 = perfect match)
  • duration: Audio duration in seconds
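For intuition, WER is the word-level edit distance between the prediction and the reference, divided by the number of reference words and expressed as a percentage. The pure-Python sketch below illustrates the metric; GetPairwiseWerStage's actual implementation may differ (for example, in text normalization):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance over reference length, as a percentage."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if words match)
            ))
        prev = curr
    return 100.0 * prev[-1] / max(len(ref), 1)

print(word_error_rate("բարև աշխարհ", "բարև աշխարհ"))  # 0.0 for a perfect match
```

A perfect transcription yields 0.0, and WER can exceed 100% when the hypothesis contains many insertions, which is why lenient thresholds such as 75.0 are still meaningful filters.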

Analyzing results:

$ # Count filtered samples
$ cat /data/fleurs_output/result/*.jsonl | wc -l

$ # View first 5 samples
$ head -n 5 /data/fleurs_output/result/*.jsonl | jq .

$ # Calculate average WER
$ cat /data/fleurs_output/result/*.jsonl | jq -r '.wer' | awk '{sum+=$1; count+=1} END {print "Average WER:", sum/count "%"}'

Using Python:

import json
import pandas as pd
from pathlib import Path

# Load results
result_files = list(Path("/data/fleurs_output/result").glob("*.jsonl"))
data = []
for file in result_files:
    with open(file, "r") as f:
        for line in f:
            data.append(json.loads(line))

df = pd.DataFrame(data)

# Summary statistics
print(f"Total samples: {len(df)}")
print(f"Average WER: {df['wer'].mean():.2f}%")
print(f"Average duration: {df['duration'].mean():.2f}s")
print(f"WER range: {df['wer'].min():.2f}% - {df['wer'].max():.2f}%")
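Before re-running the pipeline with a different wer_threshold, it can help to sweep candidate thresholds over the WER column and see how many samples each would keep. A sketch, using a hypothetical list of WER values standing in for df["wer"]:

```python
# Hypothetical WER values standing in for df["wer"] from the analysis above.
wers = [0.0, 12.5, 33.3, 50.0, 80.0, 100.0]

def kept_fraction(values, threshold):
    """Fraction of samples that would survive PreserveByValueStage with operator='le'."""
    return sum(v <= threshold for v in values) / len(values)

for threshold in (25.0, 50.0, 75.0):
    print(f"WER <= {threshold}: {kept_fraction(wers, threshold):.0%} of samples kept")
```

This mirrors the filter stage's comparison, so the fractions printed here predict the yield of a re-run at each threshold.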

Troubleshooting

Common Issues

GPU out of memory:

RuntimeError: CUDA out of memory

Solution: Reduce batch size or use a smaller ASR model:

pipeline.add_stage(
    CreateInitialManifestFleursStage(...).with_(batch_size=2)  # Reduce from 4 to 2
)

Dataset download fails:

ConnectionError: Failed to download FLEURS dataset

Solution: Check internet connection and retry. The stage will resume from where it left off.

No GPU available:

RuntimeError: No CUDA GPUs are available

Solution: Ensure CUDA is installed and GPU is accessible:

$ nvidia-smi  # Check GPU availability
$ python -c "import torch; print(torch.cuda.is_available())"

Model download fails:

OSError: Model 'nvidia/stt_...' not found

Solution: Verify model name is correct and you have internet access. Check available models at NGC Catalog.

Performance Optimization

  • Increase batch size for faster processing (if GPU memory allows)
  • Use multiple GPUs by setting Resources(gpus=2.0) or higher
  • Process subset of data by using split="dev" (smaller than “train”)
  • Skip ASR inference (remove InferenceAsrNemoStage) if you already have both predicted and ground-truth transcriptions

Next Steps

After completing this tutorial, explore the best practices below and the Custom Manifests guide mentioned earlier.

Best Practices

  • Start with dev split: Test your pipeline on the smaller development split before processing the full training set
  • Adjust WER thresholds by language: Some languages may require more lenient thresholds (e.g., 75-80% for low-resource languages)
  • Monitor GPU usage: Use nvidia-smi to track GPU memory and utilization during processing
  • Validate results: Always inspect a sample of output records to verify quality
  • Document parameters: Keep track of configuration values (thresholds, models) for reproducibility
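The "validate results" practice above can be partly automated with a small schema check. A sketch, using a hypothetical record shaped like the example output entry shown earlier:

```python
import json

# Hypothetical JSONL line shaped like the example output entry in this tutorial.
line = '{"audio_filepath": "/data/a.wav", "text": "hi", "pred_text": "hi", "wer": 0.0, "duration": 2.3}'

record = json.loads(line)
required = {"audio_filepath", "text", "pred_text", "wer", "duration"}

# Every output record should carry all expected fields with sane values.
missing = required - record.keys()
assert not missing, f"missing fields: {missing}"
assert record["wer"] >= 0.0, "WER should be non-negative"
assert record["duration"] > 0.0, "duration should be positive"
print("record passes basic validation")
```

Running a check like this over a sample of each output file catches truncated lines or dropped fields before the manifest is used downstream.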