Load Local Audio Files

Load audio files from local directories by creating custom manifests that reference your audio files. This guide covers supported formats and basic approaches for organizing local audio data for NeMo Curator processing.

Overview

To process local audio files with NeMo Curator, you need a manifest file that lists your audio files and their metadata. NeMo Curator does not discover audio files automatically, so you must create a JSONL manifest first.
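
To make the format concrete, here is a minimal sketch of a manifest round trip: each line is one standalone JSON object. The paths are placeholders, the helper names `write_manifest` and `read_manifest` are hypothetical, and the keys `audio_filepath` and `text` follow the examples used in this guide.

```python
import json

# Placeholder entries; the paths are illustrative and do not need to exist here.
entries = [
    {"audio_filepath": "/data/my_speech/sample_001.wav", "text": ""},
    {"audio_filepath": "/data/my_speech/sample_002.wav", "text": "hello world"},
]


def write_manifest(entries, manifest_path):
    """Write one JSON object per line (JSONL). Hypothetical helper."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")


def read_manifest(manifest_path):
    """Read a JSONL manifest into a list of dicts, skipping blank lines."""
    with open(manifest_path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Keeping one object per line means a manifest can be appended to, split, or streamed without parsing the whole file.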

Supported Audio Formats

NeMo Curator supports audio formats compatible with the soundfile library:

| Format | Extension | Description | Recommended Use |
| --- | --- | --- | --- |
| WAV | .wav | Uncompressed, high quality | ASR training, high-quality datasets |
| FLAC | .flac | Lossless compression | Archival, high-quality with compression |
| MP3 | .mp3 | Compressed format | Web content, podcasts |
| OGG | .ogg | Open-source compression | General purpose |

MP3 (.mp3) support depends on your system’s libsndfile build. For the most reliable behavior across environments, prefer WAV (.wav) or FLAC (.flac) formats.
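
Because MP3 availability varies with the libsndfile build, it can help to check what your environment reports before building a dataset around a format. The sketch below uses `soundfile.available_formats()`; the wrapper name `available_audio_formats` is hypothetical, and it returns an empty set when soundfile or libsndfile is not installed.

```python
def available_audio_formats():
    """Return the format names soundfile reports, or an empty set if unavailable."""
    try:
        import soundfile as sf
    except (ImportError, OSError):
        # ImportError: soundfile is not installed.
        # OSError: the libsndfile shared library could not be loaded.
        return set()
    return set(sf.available_formats().keys())


formats = available_audio_formats()
for name in ("WAV", "FLAC", "OGG", "MP3"):
    status = "available" if name in formats else "not reported"
    print(f"{name}: {status}")
```

If `MP3` is not reported, converting those files to WAV or FLAC before creating the manifest is the safer route.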

Creating Manifests for Local Files

Basic Manifest Creation

Create a JSONL manifest file that lists your local audio files:

import os
import json

def create_audio_manifest(audio_dir: str, manifest_path: str):
    """Create a basic manifest for local audio files."""

    manifest_entries = []

    # Find all audio files in directory (sorted for a reproducible manifest order)
    for filename in sorted(os.listdir(audio_dir)):
        # Compare in lowercase so extensions like .WAV are also matched
        if filename.lower().endswith(('.wav', '.flac', '.mp3', '.ogg')):
            audio_path = os.path.abspath(os.path.join(audio_dir, filename))

            # Basic entry - ASR will generate transcriptions
            entry = {
                "audio_filepath": audio_path,
                "text": ""  # Empty placeholder; ASR inference stores its transcription under the key named by pred_text_key
            }
            manifest_entries.append(entry)

    # Write manifest file, one JSON object per line
    with open(manifest_path, 'w', encoding='utf-8') as f:
        for entry in manifest_entries:
            f.write(json.dumps(entry) + '\n')

    print(f"Created manifest with {len(manifest_entries)} entries: {manifest_path}")

# Usage
create_audio_manifest(
    audio_dir="/path/to/your/audio/files",
    manifest_path="local_audio_manifest.jsonl"
)
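
The function above only scans a single directory level. If your audio sits in nested subdirectories, a recursive variant may be more convenient. The following is a sketch using `pathlib`, not part of NeMo Curator; the function name `create_recursive_manifest` and the constant `AUDIO_EXTENSIONS` are hypothetical.

```python
import json
from pathlib import Path

# Extensions taken from the supported formats listed in this guide.
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3", ".ogg"}


def create_recursive_manifest(audio_root: str, manifest_path: str) -> int:
    """Write a manifest for every audio file under audio_root; return the entry count."""
    root = Path(audio_root)

    # rglob("*") walks all subdirectories; suffixes are compared case-insensitively.
    audio_files = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )

    with open(manifest_path, "w", encoding="utf-8") as f:
        for path in audio_files:
            entry = {"audio_filepath": str(path.resolve()), "text": ""}
            f.write(json.dumps(entry) + "\n")

    return len(audio_files)
```

Sorting the paths keeps the manifest order stable between runs, which makes diffs and reruns easier to compare.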

Directory Organization Examples

Paired Audio-Text Files:

/data/my_speech/
├── sample_001.wav
├── sample_001.txt
├── sample_002.wav
├── sample_002.txt
└── ...

Separated Directories:

/data/my_speech/
├── audio/
│   ├── sample_001.wav
│   ├── sample_002.wav
│   └── ...
└── transcripts/
    ├── sample_001.txt
    ├── sample_002.txt
    └── ...

Processing Local Audio with Manifest

After creating your manifest, process it with NeMo Curator:

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.common import GetAudioDurationStage, PreserveByValueStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter

def process_local_audio_manifest(manifest_path: str, output_dir: str):
    """Process local audio files using a manifest."""

    pipeline = Pipeline(name="local_audio_processing")

    # Load manifest
    pipeline.add_stage(JsonlReader(file_paths=manifest_path))

    # ASR processing
    pipeline.add_stage(InferenceAsrNemoStage(
        model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"
    ))

    # Calculate duration and filter
    pipeline.add_stage(GetAudioDurationStage(
        audio_filepath_key="audio_filepath",
        duration_key="duration"
    ))

    # Keep files between 1-30 seconds
    pipeline.add_stage(PreserveByValueStage(
        input_value_key="duration",
        target_value=1.0,
        operator="ge"
    ))
    pipeline.add_stage(PreserveByValueStage(
        input_value_key="duration",
        target_value=30.0,
        operator="le"
    ))

    # Export results
    pipeline.add_stage(AudioToDocumentStage())
    pipeline.add_stage(JsonlWriter(path=output_dir))

    # Execute pipeline
    pipeline.run()

# Usage: First create manifest, then process
create_audio_manifest("/path/to/audio/files", "local_manifest.jsonl")
process_local_audio_manifest("local_manifest.jsonl", "/output/processed_audio")
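
Once the pipeline has run, a quick summary of the output can confirm that the duration filter behaved as expected. The sketch below relies only on the standard library and makes two assumptions that you should verify against your own output: that the writer leaves one or more `.jsonl` files in the output directory, and that each record carries the `duration` key set above plus a `pred_text` key for the ASR transcription. The function name `summarize_output` is hypothetical.

```python
import glob
import json
import os


def summarize_output(output_dir: str) -> dict:
    """Count records and total audio hours across all .jsonl files in output_dir."""
    records = []
    for path in sorted(glob.glob(os.path.join(output_dir, "*.jsonl"))):
        with open(path, "r", encoding="utf-8") as f:
            records.extend(json.loads(line) for line in f if line.strip())

    # Assumed keys: "duration" (seconds) and "pred_text" (ASR output).
    total_seconds = sum(float(r.get("duration", 0.0)) for r in records)
    with_text = sum(1 for r in records if r.get("pred_text"))

    return {
        "records": len(records),
        "total_hours": total_seconds / 3600.0,
        "with_pred_text": with_text,
    }
```

Calling `summarize_output("/output/processed_audio")` returns a small dictionary you can print or log after each run.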

Manifest with Existing Transcriptions

If you have existing transcription files, include them in your manifest:

import os
import json

def create_manifest_with_transcripts(audio_dir: str, transcript_dir: str, manifest_path: str):
    """Create manifest pairing audio files with existing transcriptions."""

    manifest_entries = []

    # Sorted for a reproducible manifest order
    for filename in sorted(os.listdir(audio_dir)):
        # Compare in lowercase so extensions like .WAV are also matched
        if filename.lower().endswith(('.wav', '.flac', '.mp3', '.ogg')):
            # Find matching transcript file (same base name, .txt extension)
            base_name = os.path.splitext(filename)[0]
            transcript_file = os.path.join(transcript_dir, f"{base_name}.txt")

            audio_path = os.path.abspath(os.path.join(audio_dir, filename))

            # Read transcript if it exists; otherwise leave the text empty
            text = ""
            if os.path.exists(transcript_file):
                with open(transcript_file, 'r', encoding='utf-8') as f:
                    text = f.read().strip()

            entry = {
                "audio_filepath": audio_path,
                "text": text
            }
            manifest_entries.append(entry)

    # Write manifest, one JSON object per line
    with open(manifest_path, 'w', encoding='utf-8') as f:
        for entry in manifest_entries:
            f.write(json.dumps(entry) + '\n')

    print(f"Created manifest with {len(manifest_entries)} entries: {manifest_path}")

# Usage
create_manifest_with_transcripts(
    audio_dir="/path/to/audio/files",
    transcript_dir="/path/to/transcripts",
    manifest_path="paired_manifest.jsonl"
)

Best Practices

Organize Your Files

Structure your audio files for easy manifest creation:

/data/my_speech_project/
├── audio/
│   ├── sample_001.wav
│   ├── sample_002.wav
│   └── ...
├── transcripts/ (optional)
│   ├── sample_001.txt
│   ├── sample_002.txt
│   └── ...
└── manifest.jsonl (generated)

Validate File Paths

Ensure all audio files exist before processing:

import json
import os

def validate_manifest(manifest_path: str) -> bool:
    """Check that all audio files in manifest exist."""

    missing_files = []
    valid_count = 0

    with open(manifest_path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            if not line.strip():
                continue  # Skip blank lines instead of failing in json.loads
            entry = json.loads(line)
            audio_path = entry.get("audio_filepath", "")

            if not os.path.exists(audio_path):
                missing_files.append(f"Line {line_num}: {audio_path}")
            else:
                valid_count += 1

    if missing_files:
        print(f"Warning: {len(missing_files)} missing files:")
        for missing in missing_files[:5]:  # Show first 5
            print(f"  {missing}")

    print(f"Validation complete: {valid_count} valid files")
    return len(missing_files) == 0

# Validate before processing (process_local_audio_manifest is defined in the processing example)
if validate_manifest("local_manifest.jsonl"):
    process_local_audio_manifest("local_manifest.jsonl", "/output")
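
An existence check does not catch files that are present but truncated or corrupt. As a further step, the sketch below tries to open each `.wav` entry with Python's standard `wave` module and flags those it cannot read. This is an illustration, not a NeMo Curator feature: the helper names are hypothetical, and `wave` only understands PCM WAV, so a file flagged here may still be readable by soundfile (for example, floating-point WAV).

```python
import json
import wave


def wav_duration_seconds(path: str):
    """Return the duration of a PCM WAV file in seconds, or None if unreadable."""
    try:
        with wave.open(path, "rb") as wav:
            frame_rate = wav.getframerate()
            if frame_rate <= 0:
                return None
            return wav.getnframes() / frame_rate
    except (wave.Error, EOFError, OSError):
        # wave.Error: bad header or non-PCM data; EOFError: truncated file;
        # OSError: missing file or permission problem.
        return None


def find_unreadable_wavs(manifest_path: str) -> list:
    """List .wav paths from the manifest that the wave module cannot open."""
    unreadable = []
    with open(manifest_path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            audio_path = json.loads(line).get("audio_filepath", "")
            if audio_path.lower().endswith(".wav") and wav_duration_seconds(audio_path) is None:
                unreadable.append(audio_path)
    return unreadable
```

Entries in other formats are skipped here; checking them would require a decoder such as soundfile.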