Dataset Manifests and Ingest#
This guide covers the core concepts for ingesting audio data into NeMo Curator using consistent manifests and validation workflows.
Manifest Structure#
Audio manifests in NeMo Curator follow a standardized format for consistent data processing:
Required Fields:
audio_filepath: Path to the audio file (absolute or relative)
Common Optional Fields:
text: Ground truth transcription or existing transcription
duration: Audio length in seconds
language: Language code (such as “en”, “es”, “fr”)
speaker_id: Speaker identifier for multi-speaker datasets
Custom metadata fields for domain-specific information
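To make the field rules concrete, the following is a minimal sketch of parsing JSONL manifest lines and flagging entries that lack the required audio_filepath field. It uses only the Python standard library; the helper name parse_manifest_lines and the REQUIRED_FIELDS constant are illustrative assumptions, not part of NeMo Curator.

```python
import json

# Assumption for illustration: only audio_filepath is treated as required.
REQUIRED_FIELDS = ("audio_filepath",)


def parse_manifest_lines(lines):
    """Parse JSONL manifest lines; return (valid entries, problems).

    Each problem is a (line_number, missing_field_names) tuple.
    """
    entries, problems = [], []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in entry]
        if missing:
            problems.append((line_number, missing))
        else:
            entries.append(entry)
    return entries, problems


sample_lines = [
    '{"audio_filepath": "/path/to/audio1.wav", "text": "Hello world", "duration": 1.5, "language": "en"}',
    '{"text": "Entry without a file path"}',
]
entries, problems = parse_manifest_lines(sample_lines)
print(f"{len(entries)} valid entries, problems: {problems}")
```

Optional fields pass through untouched, so the same check works for manifests that carry custom metadata.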
Creation Methods:
Programmatic Generation: Use dataset-specific stages like CreateInitialManifestFleursStage
Custom Scripts: Generate JSONL files with consistent field naming
Manual Creation: Create JSONL manifests for small datasets or specialized use cases
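As an illustration of the custom-script route, the sketch below scans a directory for WAV files and writes one JSONL entry per file, reading the duration with the standard-library wave module. The function names build_manifest and wav_duration_seconds are hypothetical, and the approach assumes uncompressed PCM WAV input; other formats would need a different reader.

```python
import json
import tempfile
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def build_manifest(audio_dir, manifest_path, language=None):
    """Write one JSONL entry per .wav file in audio_dir; return the entry count."""
    count = 0
    with open(manifest_path, "w", encoding="utf-8") as manifest_file:
        for wav_path in sorted(Path(audio_dir).glob("*.wav")):
            entry = {
                "audio_filepath": str(wav_path.resolve()),
                "duration": round(wav_duration_seconds(wav_path), 3),
            }
            if language is not None:
                entry["language"] = language
            manifest_file.write(json.dumps(entry) + "\n")
            count += 1
    return count


# Demo: a temporary directory holding one second of 16 kHz mono silence.
with tempfile.TemporaryDirectory() as tmp_dir:
    with wave.open(str(Path(tmp_dir) / "sample.wav"), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 16-bit samples
        wav_file.setframerate(16000)
        wav_file.writeframes(b"\x00\x00" * 16000)
    manifest_path = Path(tmp_dir) / "manifest.jsonl"
    entry_count = build_manifest(tmp_dir, manifest_path, language="en")
    manifest_entries = [
        json.loads(line) for line in manifest_path.read_text().splitlines()
    ]
```

Sorting the file list keeps the manifest order stable between runs, which makes diffs between manifest versions easier to read.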
Data Ingestion and Validation#
NeMo Curator checks audio data at ingestion time through the following mechanisms:
File Existence Validation:
AudioBatch automatically validates file paths during creation
Use validate() for batch-level validation
Use validate_item() for individual file validation
Missing files generate warnings but do not stop processing
Validation Strategy:
Check file existence at the start of the pipeline
Add metadata fields (duration, format) in downstream processing stages
Use non-blocking validation to maintain processing throughput
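The non-blocking strategy above can be sketched in plain Python as follows. This is not how AudioBatch is implemented; it is a standalone illustration, with the hypothetical helper split_by_file_existence, of logging a warning for each missing file while letting the remaining entries continue through the pipeline.

```python
import logging
import os
import tempfile

logger = logging.getLogger("manifest_validation")


def split_by_file_existence(entries, filepath_key="audio_filepath"):
    """Separate entries whose audio file exists from those whose file is missing.

    Missing files are logged as warnings instead of raising, so one bad path
    does not stop the rest of the batch.
    """
    found, missing = [], []
    for entry in entries:
        path = entry.get(filepath_key)
        if path and os.path.exists(path):
            found.append(entry)
        else:
            logger.warning("Missing audio file: %s", path)
            missing.append(entry)
    return found, missing


# Demo: one path that exists (a temporary file) and one that does not.
with tempfile.NamedTemporaryFile(suffix=".wav") as existing_file:
    entries = [
        {"audio_filepath": existing_file.name},
        {"audio_filepath": "/nonexistent/audio.wav"},
    ]
    found, missing = split_by_file_existence(entries)
```

Returning the missing entries, rather than discarding them, leaves room to report or repair them later without rerunning the check.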
Field Recommendations#
Essential for All Workflows:
audio_filepath: File path validation and processing
Recommended for ASR Workflows:
text: Ground truth for WER calculation and quality assessment
language: Language-specific model selection and validation
Recommended for Quality Assessment:
duration: Duration-based filtering and speech rate analysis
speaker_id: Speaker consistency and diversity analysis
Domain-Specific Fields:
Recording quality indicators (studio, phone, outdoor)
Content type tags (conversational, broadcast, lecture)
Noise level indicators for quality assessment
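A brief sketch of how domain-specific fields might appear in manifest entries and be used for selection. The field names recording_condition, content_type, and noise_level are hypothetical examples chosen for this illustration; NeMo Curator does not prescribe them.

```python
# Manifest entries carrying custom metadata next to the standard fields.
# All three custom field names are assumptions made for this example.
entries = [
    {
        "audio_filepath": "/path/to/audio1.wav",
        "recording_condition": "studio",
        "content_type": "lecture",
        "noise_level": "low",
    },
    {
        "audio_filepath": "/path/to/audio2.wav",
        "recording_condition": "phone",
        "content_type": "conversational",
        "noise_level": "high",
    },
]


def select_by_field(entries, field, allowed_values):
    """Keep entries whose value for `field` is one of `allowed_values`."""
    allowed = set(allowed_values)
    return [entry for entry in entries if entry.get(field) in allowed]


clean_recordings = select_by_field(entries, "noise_level", {"low"})
```

Entries that lack the field are excluded, so a custom field can be introduced gradually without breaking older manifests.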
Implementation Examples#
Basic Manifest Creation:
import json

# Create simple manifest
manifest_data = [
    {
        "audio_filepath": "/path/to/audio1.wav",
        "text": "Hello world",
        "duration": 1.5,
        "language": "en"
    },
    {
        "audio_filepath": "/path/to/audio2.wav",
        "text": "Good morning",
        "duration": 2.1,
        "language": "en"
    }
]

# Save as JSONL (one JSON object per line)
with open("manifest.jsonl", "w") as f:
    for item in manifest_data:
        f.write(json.dumps(item) + "\n")
AudioBatch Validation:
from nemo_curator.tasks import AudioBatch

# Create AudioBatch with validation
audio_batch = AudioBatch(
    data=manifest_data,
    filepath_key="audio_filepath"
)

# Validate file existence
is_valid = audio_batch.validate()
print(f"Batch validation: {is_valid}")
Pipeline Integration#
ASR Workflow Preparation:
Ensure audio_filepath points to valid audio files
ASR stages automatically add a pred_text field with predictions
Include a text field for WER calculation and quality assessment
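To show how the text and pred_text fields relate, here is a plain-Python sketch of word error rate computed as word-level edit distance divided by the number of reference words. It illustrates the metric only; the WER computation inside NeMo Curator may differ, for example in text normalization or in reporting percentages, and the function name word_error_rate is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """Return word-level edit distance divided by the reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One substitution ("everyone" -> "every") plus one insertion ("one")
# against three reference words.
entry = {"text": "good morning everyone", "pred_text": "good morning every one"}
wer = word_error_rate(entry["text"], entry["pred_text"])
print(f"WER: {wer:.3f}")
```

Because the denominator is the reference length, WER can exceed 1.0 when the prediction contains many inserted words.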
Quality Assessment Preparation:
Use GetAudioDurationStage to add duration information
Include existing transcriptions for WER-based filtering
Add metadata fields for comprehensive quality analysis
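Once duration values are present in the manifest, duration-based filtering and a simple speech rate estimate take only a few lines. The sketch below is standalone Python rather than a NeMo Curator stage; the helper names and the 1.0 to 20.0 second bounds are hypothetical defaults chosen for illustration.

```python
def filter_by_duration(entries, min_seconds=1.0, max_seconds=20.0):
    """Keep entries whose duration lies within [min_seconds, max_seconds].

    Entries without a duration field are dropped.
    """
    return [
        entry
        for entry in entries
        if entry.get("duration") is not None
        and min_seconds <= entry["duration"] <= max_seconds
    ]


def words_per_second(entry):
    """Return the speech rate, or None if text or duration is unavailable."""
    duration = entry.get("duration")
    text = entry.get("text")
    if not duration or text is None:
        return None
    return len(text.split()) / duration


entries = [
    {"audio_filepath": "/path/to/audio1.wav", "text": "Hello world", "duration": 1.5},
    {"audio_filepath": "/path/to/audio2.wav", "text": "Good morning", "duration": 0.4},
    {"audio_filepath": "/path/to/audio3.wav", "text": "No duration yet"},
]
kept = filter_by_duration(entries)
rates = [words_per_second(entry) for entry in kept]
```

An unusually high or low words-per-second value often points to a transcript that does not match its audio, which makes it a cheap first-pass quality signal.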
Format Conversion Readiness:
Standardize field names across different data sources
Ensure consistent audio file formats and sample rates
Validate encoding and accessibility of all audio files
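Standardizing field names across sources can be sketched as a small alias table applied to each raw entry. The alias lists below (such as path, sentence, and client_id) are hypothetical examples of names other datasets might use; adjust them to the sources at hand.

```python
# Canonical manifest field -> names it may appear under in other sources.
# The alias names are assumptions made for this example.
FIELD_ALIASES = {
    "audio_filepath": ("audio_filepath", "path", "audio_path"),
    "text": ("text", "transcript", "sentence"),
    "duration": ("duration", "length_seconds"),
    "speaker_id": ("speaker_id", "client_id", "speaker"),
}


def normalize_entry(raw_entry, aliases=FIELD_ALIASES):
    """Rename known aliases to canonical names; keep all other fields as-is."""
    normalized = {}
    consumed = set()
    for canonical, candidates in aliases.items():
        for name in candidates:
            if name in raw_entry:
                normalized[canonical] = raw_entry[name]
                consumed.add(name)
                break  # first matching alias wins
    # Carry over fields that were not used as an alias source.
    for key, value in raw_entry.items():
        if key not in consumed and key not in normalized:
            normalized[key] = value
    return normalized


raw = {"path": "/a.wav", "sentence": "hi", "client_id": "s1", "accent": "us"}
normalized = normalize_entry(raw)
```

Unrecognized fields such as accent are preserved, so source-specific metadata survives normalization as custom manifest fields.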