Learn the basics of audio processing with NeMo Curator using the FLEURS multilingual speech dataset. This tutorial walks you through a complete audio processing pipeline from data loading to quality assessment and filtering.
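This tutorial demonstrates the core audio curation workflow: downloading and preparing the FLEURS dataset, running ASR inference with a NeMo model, calculating Word Error Rate (WER), filtering samples by a WER threshold, and exporting the results in JSONL format.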
What you’ll learn:
Time to complete: Approximately 15-30 minutes (depending on dataset size and GPU availability)
The complete working code for this tutorial is located at:
Accessing the code:
If you don’t have a GPU available, you can skip the ASR inference stage and work with pre-existing transcriptions. See the Custom Manifests guide for details.
Import all necessary stages and components for the audio curation pipeline:
Key components:
Pipeline: Container for organizing and executing processing stages
XennaExecutor: Backend executor for running the pipeline
CreateInitialManifestFleursStage: Downloads and prepares FLEURS dataset
InferenceAsrNemoStage: Runs ASR inference with NeMo models
GetPairwiseWerStage: Calculates Word Error Rate
PreserveByValueStage: Filters data based on threshold values
JsonlWriter: Exports results in JSONL format
Build the audio curation pipeline by adding stages in sequence:
Stage explanations:
Configure pipeline parameters and execute:
Configuration parameters:
lang: Language code from FLEURS dataset (e.g., “en_us”, “ko_kr”, “es_419”)
split: Dataset split to process (“dev”, “train”, or “test”)
raw_data_dir: Directory for downloading and storing FLEURS data
model_name: NeMo ASR model identifier from NGC or Hugging Face
wer_threshold: Maximum acceptable WER percentage (samples above this are filtered out)
The first run will download the FLEURS dataset for your selected language, which may take several minutes depending on network speed.
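To make the wer_threshold parameter concrete, the following is a minimal sketch of a word-level WER percentage and a threshold filter in plain Python. It is an illustration only, not the implementation used by GetPairwiseWerStage or PreserveByValueStage. The helper names word_error_rate and keep_below_threshold are hypothetical, and keeping entries whose WER equals the threshold is an assumption of this sketch.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length, as a percentage."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        # Convention for this sketch: an empty reference scores 0% only
        # when the hypothesis is empty as well.
        return 0.0 if not hyp else 100.0

    # prev[j] holds the edit distance between the reference words seen so
    # far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def keep_below_threshold(entries, wer_threshold=75.0):
    """Keep entries whose "wer" field does not exceed the threshold.

    Assumption: the boundary value itself is kept (<=).
    """
    return [e for e in entries if e["wer"] <= wer_threshold]
```

Because WER counts insertions as errors, it can exceed 100% when the prediction is much longer than the reference, so a threshold such as 75.0 is a percentage cutoff rather than a fraction.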
To run the working tutorial:
Command-line options:
--raw_data_dir: Output directory for dataset and results (required)
--lang: FLEURS language code (default: “hy_am”)
--split: Dataset split to process (default: “dev”)
--model_name: NeMo ASR model name (default: matches language)
--wer_threshold: Maximum WER for filtering (default: 75.0)
Expected execution time:
System requirements during execution:
After running the pipeline, you’ll find:
<raw_data_dir>/downloads/
<raw_data_dir>/result/
Example output entry:
Field descriptions:
audio_filepath: Absolute path to audio file
text: Ground truth transcription from FLEURS
pred_text: ASR model prediction
wer: Word Error Rate percentage (0.0 = perfect match)
duration: Audio duration in seconds
Analyzing results:
Using Python:
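A minimal sketch of one way to inspect the output with the standard library, assuming the result directory contains one or more *.jsonl files whose lines carry the wer and duration fields described above. The file pattern and the helper names load_entries and summarize are assumptions for illustration, not part of NeMo Curator.

```python
import json
from pathlib import Path


def load_entries(result_dir):
    """Read every non-empty JSON line from the *.jsonl files in result_dir."""
    entries = []
    for path in sorted(Path(result_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    entries.append(json.loads(line))
    return entries


def summarize(entries):
    """Return the sample count, mean WER, and total audio hours."""
    if not entries:
        return {"count": 0, "mean_wer": None, "total_hours": 0.0}
    wers = [e["wer"] for e in entries]
    total_seconds = sum(e["duration"] for e in entries)
    return {
        "count": len(entries),
        "mean_wer": sum(wers) / len(wers),
        "total_hours": total_seconds / 3600.0,
    }


# Usage (replace the placeholder with the raw_data_dir you passed in):
#   stats = summarize(load_entries("<raw_data_dir>/result"))
#   print(stats)
```

Comparing the mean WER and total hours across different wer_threshold values gives a quick sense of how much audio each threshold retains.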
GPU out of memory:
Solution: Reduce batch size or use a smaller ASR model:
Dataset download fails:
Solution: Check internet connection and retry. The stage will resume from where it left off.
No GPU available:
Solution: Ensure CUDA is installed and the GPU is accessible:
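One way to check GPU visibility from Python is sketched below. It assumes the NVIDIA driver's nvidia-smi tool is the right probe for your environment. The helper name gpu_visible is hypothetical, and the check only confirms that the driver lists a GPU, not that NeMo or PyTorch can use it.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is on PATH and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        # "nvidia-smi -L" prints one line per detected GPU.
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If PyTorch is installed, torch.cuda.is_available() is a complementary check that the framework itself can reach the device.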
Model download fails:
Solution: Verify model name is correct and you have internet access. Check available models at NGC Catalog.
Resources(gpus=2.0) or higher
split="dev" (smaller than “train”)
nvidia-smi to track GPU memory and utilization during processing
After completing this tutorial, explore: