---
description: >-
  Concepts for constructing manifests and ingesting audio datasets in NeMo
  Curator
categories:
  - concepts-architecture
tags:
  - manifests
  - ingest
  - datasets
  - audio
personas:
  - data-scientist-focused
  - mle-focused
difficulty: beginner
content_type: concept
modality: audio-only
---

# Dataset Manifests and Ingest

This guide covers the core concepts for ingesting audio data into NeMo Curator using consistent manifests and validation workflows.

## Manifest Structure

Audio manifests in NeMo Curator follow a standardized format for consistent data processing:

**Required Fields**:

* `audio_filepath`: Path to the audio file (absolute or relative)

**Common Optional Fields**:

* `text`: Ground truth transcription or existing transcription
* `duration`: Audio length in seconds
* `language`: Language code (such as "en", "es", "fr")
* `speaker_id`: Speaker identifier for multi-speaker datasets
* Custom metadata fields for domain-specific information

**Creation Methods**:

* **Programmatic Generation**: Use dataset-specific stages like `CreateInitialManifestFleursStage`
* **Custom Scripts**: Generate JSONL files with consistent field naming
* **Manual Creation**: Create JSONL manifests for small datasets or specialized use cases

## Data Ingestion and Validation

NeMo Curator provides the following validation mechanisms for audio data ingestion:

**File Existence Validation**:

* `AudioBatch` automatically validates file paths during creation
* Use `validate()` for batch-level validation
* Use `validate_item()` for individual file validation
* Missing files generate warnings but do not stop processing

**Validation Strategy**:

* Check file existence at the start of the pipeline
* Add metadata fields (duration, format) in downstream processing stages
* Use non-blocking validation to maintain processing throughput

## Field Recommendations

**Essential for All Workflows**:

* `audio_filepath`: File path validation and processing

**Recommended for ASR Workflows**:

* `text`: Ground truth for WER calculation and quality assessment
* `language`: Language-specific model selection and validation

**Recommended for Quality Assessment**:

* `duration`: Duration-based filtering and speech rate analysis
* `speaker_id`: Speaker consistency and diversity analysis

**Domain-Specific Fields**:

* Recording quality indicators (studio, phone, outdoor)
* Content type tags (conversational, broadcast, lecture)
* Noise level indicators for quality assessment

## Implementation Examples

**Basic Manifest Creation**:

```python
import json

# Create a simple manifest: one dictionary per audio file
manifest_data = [
    {
        "audio_filepath": "/path/to/audio1.wav",
        "text": "Hello world",
        "duration": 1.5,
        "language": "en"
    },
    {
        "audio_filepath": "/path/to/audio2.wav",
        "text": "Good morning",
        "duration": 2.1,
        "language": "en"
    }
]

# Save as JSONL: one JSON object per line
with open("manifest.jsonl", "w") as f:
    for item in manifest_data:
        f.write(json.dumps(item) + "\n")
```

**AudioBatch Validation**:

```python
from nemo_curator.tasks import AudioBatch

# Create AudioBatch with validation
# (reuses manifest_data from the previous example)
audio_batch = AudioBatch(
    data=manifest_data,
    filepath_key="audio_filepath"
)

# Validate file existence
is_valid = audio_batch.validate()
print(f"Batch validation: {is_valid}")
```

## Pipeline Integration

**ASR Workflow Preparation**:

* Ensure `audio_filepath` points to valid audio files
* ASR stages automatically add a `pred_text` field with predictions
* Include the `text` field for WER calculation and quality assessment

**Quality Assessment Preparation**:

* Use `GetAudioDurationStage` to add duration information
* Include existing transcriptions for WER-based filtering
* Add metadata fields for comprehensive quality analysis

**Format Conversion Readiness**:

* Standardize field names across different data sources
* Ensure consistent audio file formats and sample rates
* Validate encoding and accessibility of all audio files
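
The required-field and file-accessibility checks listed above can also be run on a JSONL manifest before it reaches a pipeline. The following is a minimal sketch that uses only the Python standard library; `check_manifest` and `REQUIRED_FIELDS` are hypothetical names introduced for illustration and are not part of the NeMo Curator API. It follows the non-blocking strategy described earlier: problems are collected and reported rather than raised.

```python
import json
import os
import tempfile

# Assumption: only audio_filepath is required, as stated in this guide.
REQUIRED_FIELDS = ("audio_filepath",)


def check_manifest(manifest_path):
    """Return (valid_entries, problems) for a JSONL manifest.

    problems is a list of (line_number, message) tuples. Nothing is raised
    for bad entries, so one broken line does not stop the scan.
    """
    valid, problems = [], []
    with open(manifest_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                entry = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_no, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(entry, dict):
                problems.append((line_no, "entry is not a JSON object"))
                continue
            missing = [key for key in REQUIRED_FIELDS if key not in entry]
            if missing:
                problems.append((line_no, f"missing fields: {missing}"))
                continue
            if not os.path.exists(entry["audio_filepath"]):
                problems.append((line_no, f"file not found: {entry['audio_filepath']}"))
                continue
            valid.append(entry)
    return valid, problems


# Demo with a temporary directory: one usable entry, one missing file,
# and one entry without the required field.
with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "audio1.wav")
    open(existing, "wb").close()  # empty placeholder, not real audio

    rows = [
        {"audio_filepath": existing, "text": "Hello world"},
        {"audio_filepath": os.path.join(tmp, "missing.wav"), "text": "Good morning"},
        {"text": "entry without a path"},
    ]
    manifest_path = os.path.join(tmp, "manifest.jsonl")
    with open(manifest_path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

    valid, problems = check_manifest(manifest_path)

print(f"{len(valid)} valid entries, {len(problems)} problems")
for line_no, message in problems:
    print(f"  line {line_no}: {message}")
```

Because the check only tests that each path exists, it says nothing about audio format, sample rate, or encoding; those still need a decoding step such as the duration stage mentioned above.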