
Create and Load Custom Audio Manifests


Create and load custom audio manifests in JSONL format for your speech datasets. This guide covers the required manifest format and how to load manifests into NeMo Curator pipelines.

Manifest Format

NeMo Curator uses JSONL (JSON Lines) format for audio manifests, with one JSON object per line:

```jsonl
{"audio_filepath": "/data/audio/sample_001.wav", "text": "hello world", "duration": 2.1}
{"audio_filepath": "/data/audio/sample_002.wav", "text": "good morning", "duration": 1.8}
{"audio_filepath": "/data/audio/sample_003.wav", "text": "how are you", "duration": 2.3}
```

NeMo Curator does not provide a generic TSV reader stage. You must convert your data to JSONL format before loading, or use dataset-specific importers like the FLEURS manifest creator.
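If your data is in TSV form, the conversion is a short script. The sketch below assumes a two-column TSV (audio path, then transcript) with no header row; adjust the column layout to match your files:

```python
import csv
import json

def tsv_to_jsonl(tsv_path: str, jsonl_path: str) -> int:
    """Convert a two-column TSV (audio path, transcript) to a JSONL manifest.

    Assumes no header row and exactly two tab-separated columns per line.
    Returns the number of entries written.
    """
    count = 0
    with open(tsv_path, newline="") as src, open(jsonl_path, "w") as dst:
        for audio_filepath, text in csv.reader(src, delimiter="\t"):
            dst.write(json.dumps({"audio_filepath": audio_filepath, "text": text}) + "\n")
            count += 1
    return count
```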

Required Fields

Every audio manifest entry must include:

| Field | Type | Description | Example |
|-------|------|-------------|---------|
| audio_filepath | string | Absolute or relative path to the audio file | /data/audio/sample.wav |
| text | string | Ground-truth transcription | "hello world" |
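A quick pre-flight check catches entries missing a required field before a pipeline run fails on them. This helper is a sketch, not a NeMo Curator API:

```python
import json

REQUIRED_FIELDS = ("audio_filepath", "text")

def find_invalid_entries(manifest_path: str) -> list[int]:
    """Return 1-based line numbers of manifest entries missing a required field."""
    bad_lines = []
    with open(manifest_path) as f:
        for lineno, line in enumerate(f, start=1):
            entry = json.loads(line)
            if any(field not in entry for field in REQUIRED_FIELDS):
                bad_lines.append(lineno)
    return bad_lines
```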

Optional Fields

Additional fields that can enhance processing:

| Field | Type | Description | Example |
|-------|------|-------------|---------|
| duration | float | Audio duration in seconds | 2.1 |
| language | string | Language identifier | "en_us" |
| speaker_id | string | Speaker identifier | "speaker_001" |
| sample_rate | int | Audio sample rate in Hz | 16000 |
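The duration and sample_rate fields can be filled in directly from the audio headers. For uncompressed PCM WAV files the Python standard library is enough; this sketch assumes WAV input (compressed formats need a library such as soundfile):

```python
import wave

def probe_wav(path: str) -> dict:
    """Read duration (seconds) and sample rate (Hz) from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        sample_rate = wav.getframerate()
    return {"duration": round(frames / sample_rate, 2), "sample_rate": sample_rate}
```

Merge the returned dict into each manifest entry before writing it out.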

Creating Custom Manifests

You’ll need to create your own manifest files using your preferred tools. Here’s a simple Python example:

```python
import json

# Example: create a manifest from a list of audio files
audio_data = [
    {"audio_filepath": "/data/audio/sample1.wav", "text": "hello world"},
    {"audio_filepath": "/data/audio/sample2.wav", "text": "good morning"},
    {"audio_filepath": "/data/audio/sample3.wav", "text": "how are you"},
]

# Write the JSONL manifest: one JSON object per line
with open("my_audio_manifest.jsonl", "w") as f:
    for entry in audio_data:
        f.write(json.dumps(entry) + "\n")
```
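To double-check the result, read the manifest back one line at a time; every line must parse as a standalone JSON object:

```python
import json

def load_manifest(manifest_path: str) -> list[dict]:
    """Load a JSONL manifest into a list of dicts, one per non-empty line."""
    with open(manifest_path) as f:
        return [json.loads(line) for line in f if line.strip()]
```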

Loading Manifests in Pipelines

Using JsonlReader

Load your custom manifest using the built-in JsonlReader:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage

# Create pipeline
pipeline = Pipeline(name="custom_audio_processing")

# Load custom manifest (produces DocumentBatch)
pipeline.add_stage(
    JsonlReader(
        file_paths="my_audio_manifest.jsonl",
        fields=["audio_filepath", "text"],
    )
)

# ASR inference (consumes DocumentBatch, produces AudioBatch)
pipeline.add_stage(
    InferenceAsrNemoStage(
        model_name="nvidia/stt_en_fastconformer_hybrid_large_pc",
        filepath_key="audio_filepath",
        pred_text_key="pred_text",
    )
)

# Calculate WER between ground truth and prediction
pipeline.add_stage(
    GetPairwiseWerStage(
        text_key="text",
        pred_text_key="pred_text",
        wer_key="wer",
    )
)
```
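Word error rate is the word-level edit distance between the reference transcription and the ASR hypothesis, divided by the reference length. GetPairwiseWerStage computes this per utterance; a minimal reference implementation (a sketch for illustration, not NeMo Curator's code) looks like:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER as a percentage: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] holds the edit distance between the processed prefix of ref and hyp[:j]
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # deletion
                dp[j - 1] + 1,                      # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution (or match)
            )
            prev = cur
    return 100.0 * dp[-1] / max(len(ref), 1)
```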

Validation

Audio file validation happens automatically during pipeline processing:

```python
# Validation occurs when stages process AudioBatch objects:
# files are checked for existence during ASR inference, and
# invalid files generate warnings but don't stop processing.

# Given an AudioBatch produced by a stage, you can also check explicitly:
if audio_batch.validate():
    print("All audio files exist")
else:
    print("Some audio files are missing")
```
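If you prefer to vet a manifest before launching a pipeline, a standalone check of the referenced paths is easy to write. This is a sketch, not a NeMo Curator API:

```python
import json
import os

def missing_audio_files(manifest_path: str) -> list[str]:
    """Return audio_filepath values from the manifest that do not exist on disk."""
    missing = []
    with open(manifest_path) as f:
        for line in f:
            path = json.loads(line)["audio_filepath"]
            if not os.path.isfile(path):
                missing.append(path)
    return missing
```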

Example: Complete Workflow

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage
from nemo_curator.stages.audio.common import GetAudioDurationStage, PreserveByValueStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter


def create_custom_audio_pipeline(manifest_path: str, output_path: str) -> Pipeline:
    """Create a pipeline for custom audio manifest processing."""
    pipeline = Pipeline(name="custom_audio_processing")

    # Load custom manifest
    pipeline.add_stage(JsonlReader(
        file_paths=manifest_path,
        fields=["audio_filepath", "text"],
    ))

    # ASR processing
    pipeline.add_stage(InferenceAsrNemoStage(
        model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"
    ))

    # Quality assessment
    pipeline.add_stage(GetPairwiseWerStage())
    pipeline.add_stage(GetAudioDurationStage(
        audio_filepath_key="audio_filepath",
        duration_key="duration",
    ))

    # Filter by quality (keep WER <= 40%)
    pipeline.add_stage(PreserveByValueStage(
        input_value_key="wer",
        target_value=40.0,
        operator="le",
    ))

    # Export results
    pipeline.add_stage(AudioToDocumentStage())
    pipeline.add_stage(JsonlWriter(path=output_path))

    return pipeline


# Usage
pipeline = create_custom_audio_pipeline(
    manifest_path="my_audio_manifest.jsonl",
    output_path="processed_results",
)
```