Beginner Audio Processing Tutorial
Learn the basics of audio processing with NeMo Curator using the FLEURS multilingual speech dataset. This tutorial walks you through a complete audio processing pipeline, from data loading to quality assessment and filtering.
Overview
This tutorial demonstrates the core audio processing workflow:
Load Dataset: Download and prepare the FLEURS dataset
ASR Inference: Transcribe audio using NeMo ASR models
Quality Assessment: Calculate Word Error Rate (WER)
Duration Analysis: Extract audio file durations
Filtering: Keep only high-quality samples
Export: Save processed results
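Each bullet corresponds to one pipeline stage, and every stage consumes the output of the previous one, so the pattern is always the same: construct a Pipeline, add stages in execution order, and run it. A minimal skeleton (the stage classes are imported in Step 1 below):

from nemo_curator.pipeline import Pipeline

pipeline = Pipeline(name="audio_inference")
# Add stages in the order they should execute;
# each stage consumes the previous stage's output.
# pipeline.add_stage(<stage>)
# pipeline.run()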
Working Example Location
The complete working code for this tutorial is located at:
tutorials/audio/fleurs/
Prerequisites
NeMo Curator installed
NVIDIA GPU (recommended for ASR inference; see the optional check after this list)
Internet connection for dataset download
Basic Python knowledge
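ASR inference is far faster on a GPU, though it falls back to CPU if none is available. As an optional pre-flight check before downloading anything, the snippet below (assuming PyTorch is installed, as it is in NeMo environments) reports whether a CUDA device is visible:

import torch

# Optional sanity check: ASR inference runs on CPU when no GPU is visible.
if torch.cuda.is_available():
    print(f"GPU detected: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device found; ASR inference will run slowly on CPU.")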
Step-by-Step Walkthrough
Step 1: Import Required Modules
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import CreateInitialManifestFleursStage
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage
from nemo_curator.stages.audio.common import GetAudioDurationStage, PreserveByValueStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter
Step 2: Create the Pipeline
def create_audio_pipeline(args):
    """Create audio processing pipeline."""
    pipeline = Pipeline(name="audio_inference", description="Process FLEURS dataset with ASR")

    # Stage 1: Load FLEURS dataset
    pipeline.add_stage(
        CreateInitialManifestFleursStage(
            lang=args.lang,              # e.g., "hy_am" for Armenian
            split=args.split,            # "dev", "train", or "test"
            raw_data_dir=args.raw_data_dir,
        )
    )

    # Stage 2: ASR inference
    pipeline.add_stage(
        InferenceAsrNemoStage(
            model_name=args.model_name   # e.g., "nvidia/stt_hy_fastconformer_hybrid_large_pc"
        )
    )

    # Stage 3: Calculate WER
    pipeline.add_stage(
        GetPairwiseWerStage(
            text_key="text",             # Ground-truth transcript field
            pred_text_key="pred_text",   # ASR prediction field
            wer_key="wer",               # Output WER field
        )
    )

    # Stage 4: Extract audio duration
    pipeline.add_stage(
        GetAudioDurationStage(
            audio_filepath_key="audio_filepath",
            duration_key="duration",
        )
    )

    # Stage 5: Filter by WER threshold
    pipeline.add_stage(
        PreserveByValueStage(
            input_value_key="wer",
            target_value=args.wer_threshold,  # e.g., 75.0
            operator="le",                    # keep samples with WER <= threshold
        )
    )

    # Stage 6: Convert to DocumentBatch for export
    pipeline.add_stage(AudioToDocumentStage())

    # Stage 7: Export results as JSONL
    result_dir = f"{args.raw_data_dir}/result"
    pipeline.add_stage(
        JsonlWriter(
            path=result_dir,
            write_kwargs={"force_ascii": False},  # preserve non-ASCII text (e.g., Armenian)
        )
    )

    return pipeline
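A note on Stage 3: word error rate counts the word substitutions, deletions, and insertions needed to turn the ASR hypothesis into the reference transcript, divided by the number of reference words, and is treated here as a percentage (hence a threshold like 75.0 in Stage 5). The sketch below is a minimal, self-contained illustration of the metric, not NeMo Curator's implementation:

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Illustrative WER: word-level edit distance over reference length, as a percentage."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sit"))  # 33.3: 1 substitution over 3 words

Stage 5 then keeps only entries whose wer field satisfies wer <= wer_threshold, so utterances with poor transcription quality drop out of the manifest.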
Step 3: Run the Pipeline
def main():
    # Configuration
    class Args:
        lang = "hy_am"                  # Armenian
        split = "dev"                   # development split
        raw_data_dir = "/data/fleurs_output"
        model_name = "nvidia/stt_hy_fastconformer_hybrid_large_pc"
        wer_threshold = 75.0

    args = Args()

    # Create and run the pipeline
    pipeline = create_audio_pipeline(args)
    pipeline.run()
    print("Pipeline completed!")


if __name__ == "__main__":
    main()
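The bundled tutorials/audio/fleurs/run.py exposes the same settings as command-line flags (shown in the next section). If you want to replace the hard-coded Args class, a minimal argparse equivalent might look like this sketch; the flag names mirror the commands below, while the defaults and help strings are assumptions:

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="FLEURS audio curation tutorial")
    parser.add_argument("--raw_data_dir", required=True, help="Download and output directory")
    parser.add_argument("--lang", default="hy_am", help="FLEURS language code, e.g. ko_kr")
    parser.add_argument("--split", default="dev", choices=["train", "dev", "test"])
    parser.add_argument("--model_name", default="nvidia/stt_hy_fastconformer_hybrid_large_pc")
    parser.add_argument("--wer_threshold", type=float, default=75.0)
    return parser.parse_args()

# pipeline = create_audio_pipeline(parse_args())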
Running the Complete Example
To run the working tutorial:
cd tutorials/audio/fleurs/
# Basic run with default settings
python run.py --raw_data_dir /data/fleurs_output
# Customize parameters
python run.py \
    --raw_data_dir /data/fleurs_output \
    --lang ko_kr \
    --split train \
    --model_name nvidia/stt_ko_fastconformer_hybrid_large_pc \
    --wer_threshold 50.0
Understanding the Results
After running the pipeline, you’ll find:
Downloaded data: FLEURS audio files and transcriptions
Processed manifest: JSONL file with ASR predictions and quality metrics
Filtered results: Only samples meeting the WER threshold
Example output entry:
{
  "audio_filepath": "/data/fleurs_output/dev/sample.wav",
  "text": "բարև աշխարհ",
  "pred_text": "բարև աշխարհ",
  "wer": 0.0,
  "duration": 2.3
}
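To spot-check the output, load the manifest back in and summarize it. A small sketch, assuming the writer emits *.jsonl files under the result directory built in Stage 7:

import glob
import json

records = []
for path in glob.glob("/data/fleurs_output/result/*.jsonl"):  # output dir from Stage 7
    with open(path, encoding="utf-8") as f:
        records.extend(json.loads(line) for line in f if line.strip())

total_hours = sum(r["duration"] for r in records) / 3600
mean_wer = sum(r["wer"] for r in records) / max(len(records), 1)
print(f"{len(records)} samples kept, {total_hours:.2f} h of audio, mean WER {mean_wer:.1f}%")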