Format Validation | NeMo Curator

NeMo Curator audio processing stages use the soundfile library for audio file handling. Built-in error handling surfaces unreadable or unsupported files during duration calculation.

Supported Formats

Audio stages support formats compatible with the soundfile library (backed by libsndfile):

WAV: Uncompressed audio (recommended for high quality)
FLAC: Lossless compression with metadata support
OGG: Open-source compressed format
MP3: Compressed format (availability depends on your system’s libsndfile build)
AIFF: Apple uncompressed format

Note: AAC/M4A is not supported by default by soundfile/libsndfile. Prefer WAV or FLAC for consistent cross-platform behavior.
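As a rough pre-screen before any file is opened, the list above can be encoded as a lookup on file extensions. This is only a minimal sketch and not part of NeMo Curator or soundfile: the `classify_extension` helper and its labels are hypothetical, and an extension match says nothing about whether the file contents are actually readable.

```python
from pathlib import Path

# Extensions taken from the format list above. The helper and its labels are
# illustrative assumptions, not NeMo Curator or soundfile APIs.
EXPECTED = {".wav", ".flac", ".ogg", ".aiff", ".aif"}  # .aif is a common short form of .aiff
BUILD_DEPENDENT = {".mp3"}  # availability depends on the local libsndfile build
NOT_SUPPORTED_BY_DEFAULT = {".aac", ".m4a"}


def classify_extension(path: str) -> str:
    """Return 'expected', 'build-dependent', 'unsupported', or 'unknown'."""
    suffix = Path(path).suffix.lower()
    if suffix in EXPECTED:
        return "expected"
    if suffix in BUILD_DEPENDENT:
        return "build-dependent"
    if suffix in NOT_SUPPORTED_BY_DEFAULT:
        return "unsupported"
    return "unknown"


for name in ["speech.WAV", "clip.mp3", "voice.m4a", "notes.txt"]:
    print(name, "->", classify_extension(name))
```

A check like this can flag files worth converting to WAV or FLAC ahead of time. The duration-based check described below remains the actual test of readability.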

Built-in Error Handling

Duration Calculation with Error Handling

The GetAudioDurationStage automatically handles corrupted or unreadable files:

from nemo_curator.stages.audio.common import GetAudioDurationStage

# Calculate duration with built-in error handling
duration_stage = GetAudioDurationStage(
    audio_filepath_key="audio_filepath",
    duration_key="duration"
)
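The general pattern here is to catch the read error and report a sentinel duration instead of raising. The stage uses -1.0 for this, as described under Error Handling Behavior. The sketch below illustrates that pattern with Python's standard-library `wave` module as a stand-in for soundfile. It is not the stage's implementation: the `wav_duration_or_sentinel` helper is hypothetical, and it handles only WAV files.

```python
import os
import tempfile
import wave


def wav_duration_or_sentinel(path: str) -> float:
    """Return the duration in seconds, or -1.0 if the file cannot be read."""
    try:
        with wave.open(path, "rb") as wav_file:
            return wav_file.getnframes() / wav_file.getframerate()
    except (wave.Error, EOFError, OSError, ZeroDivisionError):
        # Unreadable, truncated, or missing file: report the sentinel value
        return -1.0


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")

    # One second of 16-bit mono silence at 16 kHz
    with wave.open(good, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(16000)
        writer.writeframes(b"\x00\x00" * 16000)

    # A file with a .wav name but no valid WAV header
    with open(bad, "wb") as handle:
        handle.write(b"not audio")

    good_duration = wav_duration_or_sentinel(good)
    bad_duration = wav_duration_or_sentinel(bad)

print(good_duration, bad_duration)
```

Returning a sentinel keeps a single bad file from aborting a batch, at the cost of requiring a later filtering step to remove the flagged entries.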

Error Handling Behavior

When soundfile/libsndfile cannot read an audio file, the stages behave as follows:

Duration Calculation: Returns -1.0 for corrupted or unreadable files
ASR Inference: Fails with clear error messages for unsupported formats
File Validation: Treat a duration of -1.0 as an indicator of a file issue

from nemo_curator.stages.audio.common import PreserveByValueStage

# Filter out corrupted files (duration = -1.0)
valid_files_filter = PreserveByValueStage(
    input_value_key="duration",
    target_value=0.0,
    operator="gt"  # greater than 0
)
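To make the effect of this configuration concrete, here is a plain-Python sketch of the same comparison applied to a few manifest entries. It is not the stage's implementation: the entries and the `preserve_by_value` helper are hypothetical, and the mapping from the string "gt" to a comparison function is an assumption for illustration.

```python
import operator

# Hypothetical manifest entries; -1.0 marks a file that could not be read
entries = [
    {"audio_filepath": "a.wav", "duration": 3.2},
    {"audio_filepath": "b.wav", "duration": -1.0},
    {"audio_filepath": "c.wav", "duration": 0.0},
]


def preserve_by_value(entries, key, target, op_name):
    """Keep only the entries where entry[key] <op> target is true."""
    compare = getattr(operator, op_name)  # "gt" maps to operator.gt
    return [entry for entry in entries if compare(entry[key], target)]


valid = preserve_by_value(entries, "duration", 0.0, "gt")
print([entry["audio_filepath"] for entry in valid])
```

Because the comparison is strictly greater than 0.0, it drops zero-length entries as well as the -1.0 sentinel, which is usually what you want before ASR inference.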

Working Example

Here is a complete pipeline that handles format validation through built-in error handling:

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.common import GetAudioDurationStage, PreserveByValueStage
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage

# Create pipeline with built-in error handling
pipeline = Pipeline(name="audio_validation")

# 1. Calculate duration (automatically handles format validation)
pipeline.add_stage(GetAudioDurationStage(
    audio_filepath_key="audio_filepath",
    duration_key="duration"
))

# 2. Filter out corrupted files (duration = -1.0 indicates issues)
pipeline.add_stage(PreserveByValueStage(
    input_value_key="duration",
    target_value=0.0,
    operator="gt"
))

# 3. Proceed with ASR inference on valid files only
pipeline.add_stage(InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"
))

Format Support Check

To check supported formats on your system:

import soundfile as sf

# Check available formats
print("Supported formats:")
for format_name, format_info in sf.available_formats().items():
    print(f"  {format_name}: {format_info}")

# Check specific file
try:
    info = sf.info("your_audio_file.wav")
    print(f"File info: {info}")
except Exception as e:
    print(f"File validation failed: {e}")

This approach relies on the built-in error handling of NeMo Curator’s audio stages, so no separate format validation step is required.