Save & Export Audio Data#
Export processed audio data and transcriptions in formats optimized for ASR model training, speech-to-text applications, and downstream analysis workflows.
Overview#
After processing your audio data through NeMo Curator’s pipeline, export the results in standardized formats suitable for:
ASR Model Training: JSONL manifests with audio file paths and transcriptions for NeMo ASR training
Quality Analysis: Datasets with WER, duration, and other metrics for evaluation
Dataset Distribution: Curated audio datasets with metadata for sharing or archiving
Downstream Processing: Structured data for integration with other tools and workflows
Output Formats#
NeMo Curator’s audio curation pipeline supports JSONL (JSON Lines) format, the standard for NeMo ASR training and audio dataset distribution.
JSONL Manifests#
The primary output format for audio curation is JSONL (JSON Lines), where each line represents one audio sample:
{"audio_filepath": "/data/audio/sample_001.wav", "text": "hello world", "pred_text": "hello world", "wer": 0.0, "duration": 2.1}
{"audio_filepath": "/data/audio/sample_002.wav", "text": "good morning", "pred_text": "good morning", "wer": 0.0, "duration": 1.8}
Format characteristics:
One JSON object per line (newline-delimited)
Human-readable and machine-parseable
Compatible with NeMo ASR training pipelines
Easy to process with standard tools (jq, pandas, etc.)
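For a quick look at a manifest, you can load it with pandas (a generic sketch; the file name below is a placeholder, since output files are hash-named as described under Directory Structure):
import pandas as pd

# Load a JSONL manifest: one JSON object per line
df = pd.read_json("/output/audio_manifests/manifest.jsonl", lines=True)

# Summarize quality metrics across the dataset
print(df[["wer", "duration"]].describe())
print(f"Total audio: {df['duration'].sum() / 3600:.2f} hours")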
Metadata Fields#
Standard fields included in audio manifests:
| Field | Type | Description |
|---|---|---|
| audio_filepath | string | Absolute path to the audio file |
| text | string | Ground truth transcription |
| pred_text | string | ASR model prediction |
| wer | float | Word Error Rate percentage |
| duration | float | Audio duration in seconds |
| lang | string | Language identifier (optional) |
Note
Fields marked as “optional” depend on which processing stages you included in your pipeline. At minimum, manifests require audio_filepath and either text (for ground truth) or pred_text (for ASR predictions).
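For illustration, here is how a minimal valid record could be written by hand with the standard library (the path and text are placeholder values; in practice, JsonlWriter produces these files for you):
import json

# Minimal manifest record: audio path plus ground-truth transcription
record = {
    "audio_filepath": "/data/audio/sample_001.wav",  # absolute path recommended
    "text": "hello world",                           # ground truth transcription
}

with open("manifest.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")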
Export Configuration#
To export audio curation results, you must first convert AudioBatch to DocumentBatch format, then use JsonlWriter:
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage

# Convert AudioBatch to DocumentBatch for the text writer
pipeline.add_stage(AudioToDocumentStage())

# Configure JSONL export
pipeline.add_stage(
    JsonlWriter(
        path="/output/audio_manifests",
        write_kwargs={"force_ascii": False}  # Support Unicode characters
    )
)
Parameters:
path: Output directory path (absolute or relative)
write_kwargs: Optional dictionary passed to the pandas to_json() method
force_ascii=False: Preserve Unicode characters (recommended for non-English languages)
orient="records": Record-per-line format (default for JSONL)
lines=True: Write as JSONL (default)
Note: AudioToDocumentStage() is required before JsonlWriter because the writer operates on DocumentBatch objects, not AudioBatch objects.
Advanced Export Options
Customize the export behavior with additional parameters:
# Example: Custom JSON formatting
pipeline.add_stage(
    JsonlWriter(
        path="/output/audio_manifests",
        write_kwargs={
            "force_ascii": False,  # Preserve Unicode (non-ASCII) characters
            "indent": None         # No indentation (compact records)
        }
    )
)
Directory Structure#
Standard Output Layout#
The JsonlWriter creates output files in the specified directory:
/output/audio_manifests/
├── <hash>.jsonl # Deterministic hash if metadata.source_files present, else UUID
├── <hash>.jsonl
└── ...
File naming:
Files are named using deterministic hashes based on partition metadata when available
File names are generated automatically; you cannot specify individual file names
Multiple output files may be created depending on data partitioning
File content: Each JSONL file contains one or more audio records, with one JSON object per line.
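Because names are generated automatically, downstream tools usually glob the output directory rather than reference individual shards. A minimal sketch that concatenates all shards into a single manifest (the combined output path is a placeholder):
import glob

shard_paths = sorted(glob.glob("/output/audio_manifests/*.jsonl"))

# JSONL shards can be merged by simple line-wise concatenation
with open("/output/combined_manifest.jsonl", "w", encoding="utf-8") as out:
    for shard_path in shard_paths:
        with open(shard_path, encoding="utf-8") as shard:
            for line in shard:
                out.write(line)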
Quality Control#
Pre-Export Validation#
Apply quality filters before exporting to ensure your output dataset meets minimum standards:
from nemo_curator.stages.audio.common import PreserveByValueStage

# Filter by quality thresholds
quality_filters = [
    # Keep samples with WER <= 30%
    PreserveByValueStage(
        input_value_key="wer",
        target_value=30.0,
        operator="le"
    ),
    # Keep samples with duration 0.1-20.0 seconds
    PreserveByValueStage(
        input_value_key="duration",
        target_value=0.1,
        operator="ge"
    ),
    PreserveByValueStage(
        input_value_key="duration",
        target_value=20.0,
        operator="le"
    )
]

# Add quality filters before conversion and export
for filter_stage in quality_filters:
    pipeline.add_stage(filter_stage)

# Then convert and export
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="/output/high_quality_audio"))
Recommended validation steps:
WER filtering: Remove samples with poor transcription accuracy
Duration filtering: Exclude samples that are too short or too long
Completeness check: Ensure required fields (audio_filepath, text) are present
Path validation: Verify audio file paths are accessible for training (see the sketch below)
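The completeness and path checks can be scripted outside the pipeline with the standard library (the manifest path is a placeholder; adjust to your output directory):
import json
import os

missing_files = []
with open("/output/high_quality_audio/manifest.jsonl", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        record = json.loads(line)
        # Completeness check: required fields must be present
        assert "audio_filepath" in record, f"line {line_number}: missing audio_filepath"
        assert "text" in record or "pred_text" in record, (
            f"line {line_number}: missing transcription"
        )
        # Path validation: audio files must be accessible
        if not os.path.isfile(record["audio_filepath"]):
            missing_files.append(record["audio_filepath"])

print(f"{len(missing_files)} audio files are missing or inaccessible")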
Complete Export Example#
Here’s a complete pipeline demonstrating audio processing and export:
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import CreateInitialManifestFleursStage
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage
from nemo_curator.stages.audio.common import GetAudioDurationStage, PreserveByValueStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.resources import Resources

# Create audio curation pipeline with export
pipeline = Pipeline(name="audio_curation_with_export")

# 1. Load audio data
pipeline.add_stage(CreateInitialManifestFleursStage(
    lang="en_us",
    split="validation",
    raw_data_dir="./audio_data"
).with_(batch_size=8))

# 2. Run ASR inference
pipeline.add_stage(InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc",
    pred_text_key="pred_text"
).with_(resources=Resources(gpus=1.0)))

# 3. Calculate quality metrics
pipeline.add_stage(GetPairwiseWerStage(
    text_key="text",
    pred_text_key="pred_text",
    wer_key="wer"
))
pipeline.add_stage(GetAudioDurationStage(
    audio_filepath_key="audio_filepath",
    duration_key="duration"
))

# 4. Apply quality filters
pipeline.add_stage(PreserveByValueStage(
    input_value_key="wer",
    target_value=30.0,
    operator="le"  # Keep WER <= 30%
))
pipeline.add_stage(PreserveByValueStage(
    input_value_key="duration",
    target_value=0.1,
    operator="ge"  # Keep duration >= 0.1s
))
pipeline.add_stage(PreserveByValueStage(
    input_value_key="duration",
    target_value=20.0,
    operator="le"  # Keep duration <= 20s
))

# 5. Convert to DocumentBatch and export
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(
    path="./curated_audio_dataset",
    write_kwargs={"force_ascii": False}
))

# Execute pipeline
executor = XennaExecutor()
pipeline.run(executor)

print("Audio curation complete. Results saved to ./curated_audio_dataset/")
Expected output:
JSONL files in the ./curated_audio_dataset/ directory
Each file contains filtered, high-quality audio samples
All samples have WER ≤ 30% and duration between 0.1 and 20.0 seconds
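A quick post-run check (not part of the pipeline) that the export respects the configured thresholds might look like this:
import glob
import json

for path in glob.glob("./curated_audio_dataset/*.jsonl"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # Thresholds must match the PreserveByValueStage settings above
            assert record["wer"] <= 30.0
            assert 0.1 <= record["duration"] <= 20.0

print("All exported samples satisfy the WER and duration thresholds.")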
Best Practices#
Use absolute paths: For audio_filepath, use absolute paths to ensure audio files are accessible during training
Validate before export: Apply quality filters before conversion to reduce output size and improve dataset quality
Set appropriate thresholds: Adjust WER and duration thresholds based on your specific use case and domain
Preserve metadata: Include all relevant fields (WER, duration, language) for future analysis and filtering
Test with small batches: Run pipeline on a small subset first to verify output format and quality
Document your filters: Keep track of quality thresholds used for reproducibility
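One lightweight convention for documenting filters (a suggestion, not a NeMo Curator feature) is to write the thresholds next to the dataset:
import json

# Record the quality thresholds and model used for this export run
filter_config = {
    "wer_max_percent": 30.0,
    "duration_min_seconds": 0.1,
    "duration_max_seconds": 20.0,
    "asr_model": "nvidia/stt_en_fastconformer_hybrid_large_pc",
}

with open("./curated_audio_dataset/filter_config.json", "w") as f:
    json.dump(filter_config, f, indent=2)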