Dataset Manifests and Ingest
This guide covers the core concepts for ingesting audio data into NeMo Curator using consistent manifests and validation workflows.
Manifest Structure
Audio manifests in NeMo Curator are JSONL files, with one JSON object per audio file, that use a standardized set of fields:
Required Fields:
- audio_filepath: Path to the audio file (absolute or relative)
Common Optional Fields:
- text: Ground truth transcription or existing transcription
- duration: Audio length in seconds
- language: Language code (such as “en”, “es”, “fr”)
- speaker_id: Speaker identifier for multi-speaker datasets
- Custom metadata fields for domain-specific information
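As an illustration of the fields listed above, the following sketch parses JSONL manifest text and checks that every entry carries the required audio_filepath field. The helper name parse_manifest and the sample paths are hypothetical; only the field names come from this guide.

```python
import json

REQUIRED_FIELDS = ("audio_filepath",)  # the only required field per this guide


def parse_manifest(jsonl_text):
    """Parse JSONL manifest text into a list of dicts (hypothetical helper).

    Raises ValueError if an entry lacks a required field.
    """
    entries = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in entry]
        if missing:
            raise ValueError(f"line {line_no}: missing required field(s) {missing}")
        entries.append(entry)
    return entries


# Illustrative manifest: one JSON object per line; the paths are made up.
sample = "\n".join([
    json.dumps({"audio_filepath": "clips/a.wav", "text": "hello world",
                "duration": 1.5, "language": "en", "speaker_id": "spk1"}),
    json.dumps({"audio_filepath": "clips/b.wav"}),  # optional fields omitted
])
entries = parse_manifest(sample)
```

The second entry shows that a manifest line stays valid with only audio_filepath present; the optional fields can be added by later stages.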
Creation Methods:
- Programmatic Generation: Use dataset-specific stages like CreateInitialManifestFleursStage
- Custom Scripts: Generate JSONL files with consistent field naming
- Manual Creation: Create JSONL manifests for small datasets or specialized use cases
Data Ingestion and Validation
NeMo Curator validates audio file paths as data is ingested:
File Existence Validation:
- AudioBatch automatically validates file paths during creation
- Use validate() for batch-level validation
- Use validate_item() for individual file validation
- Missing files generate warnings but do not stop processing
Validation Strategy:
- Check file existence at the start of the pipeline
- Add metadata fields (duration, format) in downstream processing stages
- Use non-blocking validation to maintain processing throughput
Field Recommendations
Essential for All Workflows:
- audio_filepath: File path validation and processing
Recommended for ASR Workflows:
- text: Ground truth for WER calculation and quality assessment
- language: Language-specific model selection and validation
Recommended for Quality Assessment:
- duration: Duration-based filtering and speech rate analysis
- speaker_id: Speaker consistency and diversity analysis
Domain-Specific Fields:
- Recording quality indicators (studio, phone, outdoor)
- Content type tags (conversational, broadcast, lecture)
- Noise level indicators for quality assessment
Implementation Examples
Basic Manifest Creation:
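A minimal sketch of manifest creation with a custom script, assuming only the Python standard library. The helper write_manifest, the sample entries, and the output location are hypothetical and chosen for illustration.

```python
import json
import os
import tempfile


def write_manifest(entries, manifest_path):
    """Write one JSON object per line (JSONL). Hypothetical helper."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for entry in entries:
            if "audio_filepath" not in entry:
                raise ValueError("every entry needs 'audio_filepath'")
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")


# Illustrative entries; the paths do not need to exist to write the manifest.
entries = [
    {"audio_filepath": "audio/sample_001.wav", "text": "hello world", "language": "en"},
    {"audio_filepath": "audio/sample_002.wav", "text": "hola mundo", "language": "es"},
]

manifest_path = os.path.join(tempfile.mkdtemp(), "manifest.jsonl")
write_manifest(entries, manifest_path)

# Read the file back to confirm that each entry occupies exactly one line.
with open(manifest_path, encoding="utf-8") as f:
    written_lines = f.read().splitlines()
```

Passing ensure_ascii=False keeps non-English transcriptions readable in the file instead of escaping them.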
AudioBatch Validation:
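The sketch below mimics the validation behavior described in this guide (validation on creation, validate() for the batch, validate_item() for one entry, warnings instead of errors). SimpleAudioBatch is a hypothetical stand-in written for illustration, not the NeMo Curator AudioBatch class, whose constructor arguments and return values may differ.

```python
import logging
import os
import tempfile
from dataclasses import dataclass, field

logger = logging.getLogger("manifest_validation")


@dataclass
class SimpleAudioBatch:
    """Hypothetical stand-in for illustration; NOT the NeMo Curator AudioBatch."""

    data: list = field(default_factory=list)
    filepath_key: str = "audio_filepath"

    def __post_init__(self):
        # Validate on creation; missing files only produce warnings.
        self.validate()

    def validate_item(self, item):
        """Return True if the item's audio file exists, else warn and return False."""
        path = item.get(self.filepath_key)
        if path is None or not os.path.exists(path):
            logger.warning("File %s does not exist", path)
            return False
        return True

    def validate(self):
        """Return True only if every item passes validate_item."""
        # A list (not a generator) so that every missing file is reported.
        return all([self.validate_item(item) for item in self.data])


# Create one real (empty) file so the example has a passing and a failing item.
tmp_dir = tempfile.mkdtemp()
existing = os.path.join(tmp_dir, "present.wav")
open(existing, "wb").close()

batch = SimpleAudioBatch(data=[
    {"audio_filepath": existing},
    {"audio_filepath": os.path.join(tmp_dir, "missing.wav")},
])
batch_ok = batch.validate()                    # False: one file is missing
first_ok = batch.validate_item(batch.data[0])  # True: this file exists
```

Because a missing file only logs a warning, the batch is still constructed and later stages can decide whether to drop or repair the failing entries.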
Pipeline Integration
ASR Workflow Preparation:
- Ensure audio_filepath points to valid audio files
- ASR stages automatically add a pred_text field with predictions
- Include a text field for WER calculation and quality assessment
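To show why both text and pred_text are needed, here is a minimal word error rate sketch: word-level edit distance divided by the number of reference words. The function word_error_rate and the sample entry are hypothetical; NeMo Curator's own WER computation may apply text normalization or other details not shown here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


entry = {
    "audio_filepath": "clips/a.wav",     # illustrative path
    "text": "the quick brown fox",       # ground truth
    "pred_text": "the quick brown box",  # field added by an ASR stage
}
wer = word_error_rate(entry["text"], entry["pred_text"])  # one substitution in four words
```

Entries without a text field cannot be scored this way, which is why the field is recommended for ASR workflows.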
Quality Assessment Preparation:
- Use GetAudioDurationStage to add duration information
- Include existing transcriptions for WER-based filtering
- Add metadata fields for comprehensive quality analysis
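As a rough picture of what adding duration information to an entry involves, the sketch below computes the duration of a PCM WAV file with the standard library wave module. It is a stand-in for illustration and does not use GetAudioDurationStage; the helper wav_duration_seconds is hypothetical and handles only WAV input.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file: frame count divided by sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# Write one second of 16 kHz mono 16-bit silence so the example is self-contained.
wav_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(wav_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # bytes per sample
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

entry = {"audio_filepath": wav_path, "text": "existing transcription"}
entry["duration"] = wav_duration_seconds(entry["audio_filepath"])
```

Once duration is present on every entry, duration-based filtering reduces to a comparison on that field.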
Format Conversion Readiness:
- Standardize field names across different data sources
- Ensure consistent audio file formats and sample rates
- Validate encoding and accessibility of all audio files
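One way to standardize field names across sources is a small alias table applied to every entry before it enters the pipeline. The alias names below (path, transcript, lang, and so on) are hypothetical examples of source-specific naming; only the target names come from this guide.

```python
# Hypothetical source-specific names mapped to the field names used in this guide.
FIELD_ALIASES = {
    "path": "audio_filepath",
    "wav": "audio_filepath",
    "transcript": "text",
    "sentence": "text",
    "lang": "language",
    "speaker": "speaker_id",
}


def standardize_fields(entry):
    """Return a copy of entry with known aliases renamed; other keys pass through."""
    result = {}
    for key, value in entry.items():
        target = FIELD_ALIASES.get(key, key)
        if target in result:
            raise ValueError(f"two source fields map to '{target}'")
        result[target] = value
    return result


source_a = {"path": "a/001.wav", "transcript": "hello", "lang": "en"}
source_b = {"audio_filepath": "b/001.wav", "sentence": "hola", "speaker": "s7"}
unified = [standardize_fields(e) for e in (source_a, source_b)]
```

Raising on a collision, rather than silently overwriting, surfaces entries where a source supplies the same information under two names.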