Dataset Manifests and Ingest

This guide covers the core concepts for ingesting audio data into NeMo Curator using consistent manifests and validation workflows.

Manifest Structure

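Audio manifests in NeMo Curator are JSONL files: each line is one JSON object describing a single audio sample. Using the same field names across datasets keeps downstream processing consistent: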

Required Fields:

  • audio_filepath: Path to the audio file (absolute or relative)

Common Optional Fields:

  • text: Ground-truth or previously existing transcription
  • duration: Audio length in seconds
  • language: Language code (such as “en”, “es”, “fr”)
  • speaker_id: Speaker identifier for multi-speaker datasets
  • Custom metadata fields for domain-specific information
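To make the field list concrete, the sketch below reads a JSONL manifest and separates entries that carry the required audio_filepath field from those that do not. The helper name load_manifest and the REQUIRED_FIELDS constant are hypothetical and not part of NeMo Curator; only the Python standard library is used.

```python
import json
from pathlib import Path

# The only field the manifest format requires.
REQUIRED_FIELDS = ("audio_filepath",)


def load_manifest(path):
    """Read a JSONL manifest and return (entries, problems).

    Hypothetical helper for illustration; not a NeMo Curator API.
    `problems` is a list of (line_number, missing_field_names) tuples.
    """
    entries, problems = [], []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            missing = [key for key in REQUIRED_FIELDS if key not in entry]
            if missing:
                problems.append((line_no, missing))
            else:
                entries.append(entry)
    return entries, problems
```

Optional and custom fields pass through untouched, so the same loader works for manifests with or without text, duration, language, or speaker_id.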

Creation Methods:

  • Programmatic Generation: Use dataset-specific stages like CreateInitialManifestFleursStage
  • Custom Scripts: Generate JSONL files with consistent field naming
  • Manual Creation: Create JSONL manifests for small datasets or specialized use cases
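As an illustration of the custom-script route, the following sketch scans a directory for audio files and writes one manifest line per file. The function name build_manifest and the default "*.wav" pattern are assumptions made for this example; optional fields such as text or duration would be added by your own logic or by later pipeline stages.

```python
import json
from pathlib import Path


def build_manifest(audio_dir, manifest_path, pattern="*.wav"):
    """Write one JSONL line per matching audio file; return the number written.

    Hypothetical helper for illustration; not a NeMo Curator API.
    """
    audio_dir = Path(audio_dir)
    count = 0
    with open(manifest_path, "w", encoding="utf-8") as f:
        # Sort so that repeated runs produce the same manifest order.
        for audio_path in sorted(audio_dir.rglob(pattern)):
            entry = {"audio_filepath": str(audio_path.resolve())}
            f.write(json.dumps(entry) + "\n")
            count += 1
    return count
```

Absolute paths are written here so the manifest stays valid regardless of the working directory; relative paths are equally acceptable if the pipeline always runs from a known location.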

Data Ingestion and Validation

NeMo Curator checks audio data at ingestion with the following mechanisms:

File Existence Validation:

  • AudioBatch automatically validates file paths during creation
  • Use validate() for batch-level validation
  • Use validate_item() for individual file validation
  • Missing files generate warnings but do not stop processing

Validation Strategy:

  • Check file existence at the start of the pipeline
  • Add metadata fields (duration, format) in downstream processing stages
  • Use non-blocking validation to maintain processing throughput
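The strategy above can be sketched in plain Python: a non-blocking existence check that only warns, followed by a separate step that adds duration. This is not how AudioBatch or any NeMo Curator stage is implemented; check_exists and add_wav_duration are hypothetical helpers, and the duration step assumes uncompressed WAV files readable by the standard-library wave module.

```python
import os
import warnings
import wave


def check_exists(entries, filepath_key="audio_filepath"):
    """Warn about missing files but keep every entry (non-blocking).

    Returns True only if every referenced file exists.
    """
    all_found = True
    for entry in entries:
        if not os.path.exists(entry[filepath_key]):
            warnings.warn(f"File not found: {entry[filepath_key]}")
            all_found = False
    return all_found


def add_wav_duration(entry, filepath_key="audio_filepath"):
    """Downstream step: return a copy of the entry with 'duration' in seconds.

    Assumes an uncompressed WAV file; other formats need a different reader.
    """
    with wave.open(entry[filepath_key], "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return {**entry, "duration": duration}
```

Keeping the two steps separate means the cheap existence check can run over the whole manifest first, while the more expensive file reads happen later and only for entries that are still in the pipeline.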

Field Recommendations

Essential for All Workflows:

  • audio_filepath: File path validation and processing

Recommended for ASR Workflows:

  • text: Ground truth for WER calculation and quality assessment
  • language: Language-specific model selection and validation

Recommended for Quality Assessment:

  • duration: Duration-based filtering and speech rate analysis
  • speaker_id: Speaker consistency and diversity analysis

Domain-Specific Fields:

  • Recording quality indicators (studio, phone, outdoor)
  • Content type tags (conversational, broadcast, lecture)
  • Noise level indicators for quality assessment
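To show how these fields are typically consumed, here is a minimal sketch of duration-based filtering and a words-per-second speech rate computed from text and duration. The function names and the default thresholds are placeholders chosen for illustration, not values recommended by NeMo Curator.

```python
def speech_rate(entry):
    """Words per second, or None when text or duration is missing or zero."""
    text = entry.get("text")
    duration = entry.get("duration")
    if not text or not duration:
        return None
    return len(text.split()) / duration


def filter_by_duration(entries, min_seconds=1.0, max_seconds=30.0):
    """Keep entries whose duration lies within [min_seconds, max_seconds].

    Entries without a 'duration' field are dropped.
    """
    return [
        entry for entry in entries
        if "duration" in entry and min_seconds <= entry["duration"] <= max_seconds
    ]
```

An unusually high or low speech rate often points to a transcript that does not match its audio, which is why text and duration are useful together.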

Implementation Examples

Basic Manifest Creation:

import json

# Create a simple manifest: one dict per audio sample
manifest_data = [
    {
        "audio_filepath": "/path/to/audio1.wav",
        "text": "Hello world",
        "duration": 1.5,
        "language": "en"
    },
    {
        "audio_filepath": "/path/to/audio2.wav",
        "text": "Good morning",
        "duration": 2.1,
        "language": "en"
    }
]

# Save as JSONL: one JSON object per line
with open("manifest.jsonl", "w") as f:
    for item in manifest_data:
        f.write(json.dumps(item) + "\n")

AudioBatch Validation:

from nemo_curator.tasks import AudioBatch

# manifest_data is a list of manifest entries (dicts), each containing
# at least the "audio_filepath" key.

# Create AudioBatch with validation
audio_batch = AudioBatch(
    data=manifest_data,
    filepath_key="audio_filepath"
)

# Validate file existence
is_valid = audio_batch.validate()
print(f"Batch validation: {is_valid}")

Pipeline Integration

ASR Workflow Preparation:

  • Ensure audio_filepath points to valid audio files
  • ASR stages automatically add a pred_text field containing the predicted transcription
  • Include text field for WER calculation and quality assessment
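To illustrate why both text and pred_text matter, the sketch below computes word error rate (WER) as the word-level edit distance divided by the number of reference words. It is a plain-Python illustration rather than NeMo Curator's own WER stage: the function name word_error_rate and the "wer" key are assumptions, no text normalization (case, punctuation) is applied, and the result is a fraction rather than a percentage.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Example entry as it might look after an ASR stage has added pred_text.
entry = {"text": "good morning everyone", "pred_text": "good morning every one"}
entry["wer"] = word_error_rate(entry["text"], entry["pred_text"])
```

Here one substitution and one insertion against three reference words give a WER of 2/3. Without a text field there is no reference to compare against, so WER-based filtering is not possible.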

Quality Assessment Preparation:

  • Use GetAudioDurationStage to add duration information
  • Include existing transcriptions for WER-based filtering
  • Add metadata fields for comprehensive quality analysis

Format Conversion Readiness:

  • Standardize field names across different data sources
  • Ensure consistent audio file formats and sample rates
  • Validate encoding and accessibility of all audio files
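Field-name standardization can be sketched as a simple key-renaming step applied to each entry before it enters the pipeline. The alias table below is purely hypothetical; the source-specific names on the left are examples and should be replaced with whatever your datasets actually use.

```python
# Hypothetical source-specific names mapped to the manifest's field names.
FIELD_ALIASES = {
    "path": "audio_filepath",
    "wav": "audio_filepath",
    "transcript": "text",
    "sentence": "text",
    "lang": "language",
    "speaker": "speaker_id",
}


def standardize_fields(entry, aliases=FIELD_ALIASES):
    """Return a copy of entry with aliased keys renamed; other keys pass through."""
    result = {}
    for key, value in entry.items():
        target = aliases.get(key, key)
        if target in result:
            # Two source fields would collide on the same manifest field.
            raise ValueError(f"Multiple source fields map to {target!r}")
        result[target] = value
    return result
```

Raising on collisions, rather than silently overwriting, makes it obvious when a source dataset carries two competing values for the same manifest field.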