Dataset Manifests and Ingest

This guide covers the core concepts for ingesting audio data into NeMo Curator using consistent manifests and validation workflows.

Manifest Structure

Audio manifests in NeMo Curator are JSONL files: one JSON object per line, each describing one audio file with a consistent set of field names:

Required Fields:

  • audio_filepath: Path to the audio file (absolute or relative)

Common Optional Fields:

  • text: Ground-truth or pre-existing transcription
  • duration: Audio length in seconds
  • language: Language code (such as “en”, “es”, “fr”)
  • speaker_id: Speaker identifier for multi-speaker datasets
  • Custom metadata fields for domain-specific information
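As an illustration of the structure above, here is a minimal sketch that parses one manifest line and checks for the required field. It uses only the Python standard library; the helper name parse_manifest_line and the sample values are hypothetical and not part of NeMo Curator.

```python
import json

REQUIRED_FIELDS = ("audio_filepath",)  # the only required field listed in this guide


def parse_manifest_line(line: str) -> dict:
    """Parse one JSONL manifest line and check that required fields are present."""
    entry = json.loads(line)
    if not isinstance(entry, dict):
        raise ValueError("each manifest line must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in entry]
    if missing:
        raise ValueError(f"manifest entry is missing required fields: {missing}")
    return entry


# Illustrative line: the required field plus the common optional fields.
line = (
    '{"audio_filepath": "audio/sample1.wav", "text": "hello world", '
    '"duration": 1.5, "language": "en", "speaker_id": "spk_01"}'
)
entry = parse_manifest_line(line)
print(entry["audio_filepath"])
```

Optional and custom fields pass through untouched, so the same check works for manifests that carry extra domain-specific metadata.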

Creation Methods:

  • Programmatic Generation: Use dataset-specific stages like CreateInitialManifestFleursStage
  • Custom Scripts: Generate JSONL files with consistent field naming
  • Manual Creation: Create JSONL manifests for small datasets or specialized use cases
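For the custom-script route, a manifest can be built by walking a directory of audio files. The sketch below is one possible approach, not a NeMo Curator utility: it assumes uncompressed PCM WAV files (the standard-library wave module cannot read other formats), and the function names build_manifest and wav_duration_seconds are hypothetical.

```python
import json
import wave
from pathlib import Path


def wav_duration_seconds(path: Path) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def build_manifest(audio_dir: Path, manifest_path: Path, language: str) -> int:
    """Write one JSONL entry per .wav file under audio_dir; return the entry count."""
    count = 0
    with manifest_path.open("w", encoding="utf-8") as out:
        # Sort for a reproducible manifest order across runs.
        for wav_path in sorted(audio_dir.rglob("*.wav")):
            entry = {
                "audio_filepath": str(wav_path.resolve()),
                "duration": round(wav_duration_seconds(wav_path), 3),
                "language": language,
            }
            out.write(json.dumps(entry, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Writing absolute paths keeps the manifest usable regardless of the working directory the pipeline is later run from.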

Data Ingestion and Validation

NeMo Curator validates audio data at ingestion, primarily by checking that each referenced file exists:

File Existence Validation:

  • AudioTask automatically validates file paths during creation
  • Use validate() to check whether a task's audio file exists on disk
  • Use validate_item() for individual file validation
  • Missing files generate warnings but do not stop processing

Validation Strategy:

  • Check file existence at the start of the pipeline
  • Add metadata fields (duration, format) in downstream processing stages
  • Use non-blocking validation to maintain processing throughput
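The non-blocking strategy above can be sketched without NeMo Curator at all. The following standard-library example is a stand-in for the idea, not the library's implementation: it warns about missing files and continues, and the file_exists flag it adds is a hypothetical field name chosen for illustration.

```python
import logging
import os

logger = logging.getLogger("manifest_validation")


def flag_missing_files(entries: list[dict], filepath_key: str = "audio_filepath") -> list[dict]:
    """Annotate each entry with a 'file_exists' flag; warn on missing files but keep going."""
    checked = []
    for entry in entries:
        path = entry.get(filepath_key)
        exists = bool(path) and os.path.exists(path)
        if not exists:
            # Log and continue instead of raising, so one bad path does not stop the run.
            logger.warning("Audio file not found: %r", path)
        checked.append({**entry, "file_exists": exists})
    return checked
```

Downstream steps can then decide whether to drop flagged entries or report them, which keeps the existence check itself cheap and side-effect free.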

Field Recommendations

Essential for All Workflows:

  • audio_filepath: File path validation and processing

Recommended for ASR Workflows:

  • text: Ground truth for WER calculation and quality assessment
  • language: Language-specific model selection and validation

Recommended for Quality Assessment:

  • duration: Duration-based filtering and speech rate analysis
  • speaker_id: Speaker consistency and diversity analysis
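To make the role of duration concrete, here is a small sketch of duration-based filtering and a words-per-second speech rate computed from text and duration. The helper names and the 1 to 30 second thresholds are illustrative assumptions, not NeMo Curator defaults.

```python
from typing import Optional


def words_per_second(entry: dict) -> Optional[float]:
    """Return words per second, or None if text or a positive duration is missing."""
    text = entry.get("text")
    duration = entry.get("duration")
    if not text or not duration or duration <= 0:
        return None
    return len(text.split()) / duration


def filter_by_duration(entries, min_seconds=1.0, max_seconds=30.0):
    """Keep entries whose duration lies within [min_seconds, max_seconds]."""
    return [
        e for e in entries
        if e.get("duration") is not None and min_seconds <= e["duration"] <= max_seconds
    ]
```

An unusually high or low speech rate often points at a transcript that does not match its audio, which is why the two fields are useful together.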

Domain-Specific Fields:

  • Recording quality indicators (studio, phone, outdoor)
  • Content type tags (conversational, broadcast, lecture)
  • Noise level indicators for quality assessment

Implementation Examples

Basic Manifest Creation:

import json

# Create simple manifest
manifest_data = [
    {
        "audio_filepath": "/path/to/audio1.wav",
        "text": "Hello world",
        "duration": 1.5,
        "language": "en"
    },
    {
        "audio_filepath": "/path/to/audio2.wav",
        "text": "Good morning",
        "duration": 2.1,
        "language": "en"
    }
]

# Save as JSONL
with open("manifest.jsonl", "w") as f:
    for item in manifest_data:
        f.write(json.dumps(item) + "\n")

AudioTask Validation:

from nemo_curator.tasks import AudioTask

# manifest_data is the list of entries built in the previous example.
# Create one AudioTask per manifest entry and validate
for entry in manifest_data:
    audio_task = AudioTask(data=entry, filepath_key="audio_filepath")
    is_valid = audio_task.validate()
    print(f"Task validation: {is_valid}")

Pipeline Integration

ASR Workflow Preparation:

  • Ensure audio_filepath points to valid audio files
  • ASR stages automatically add a pred_text field containing the predicted transcription
  • Include a text field with the reference transcription for WER calculation and quality assessment
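To show why both text and pred_text are needed, the sketch below computes word error rate (WER) as the word-level edit distance divided by the number of reference words. It is a plain-Python illustration of the standard definition, not NeMo Curator's own WER code, and it applies no text normalization (case and punctuation differences count as errors).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the reference prefix seen so far and hyp_words[:j]
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# Illustrative entry: "text" is the reference, "pred_text" is the ASR output.
entry = {"text": "good morning everyone", "pred_text": "good evening everyone"}
print(word_error_rate(entry["text"], entry["pred_text"]))
```

Because the denominator is the reference length, WER can exceed 1.0 when the prediction contains many inserted words.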

Quality Assessment Preparation:

  • Use GetAudioDurationStage to add duration information
  • Include existing transcriptions for WER-based filtering
  • Add metadata fields for comprehensive quality analysis

Format Conversion Readiness:

  • Standardize field names across different data sources
  • Ensure consistent audio file formats and sample rates
  • Validate encoding and accessibility of all audio files
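Standardizing field names across sources can be as simple as a rename table applied to every entry. In this sketch the alias names on the left (path, transcript, lang, and so on) are hypothetical examples of what other datasets might use; only the names on the right come from this guide.

```python
# Hypothetical source-specific field names mapped to the manifest names used in this guide.
FIELD_ALIASES = {
    "path": "audio_filepath",
    "wav_path": "audio_filepath",
    "transcript": "text",
    "sentence": "text",
    "lang": "language",
    "speaker": "speaker_id",
}


def standardize_entry(raw: dict) -> dict:
    """Rename known aliases to standard field names; pass other fields through unchanged."""
    standardized = {}
    for key, value in raw.items():
        name = FIELD_ALIASES.get(key, key)
        # Two source fields mapping to the same name must agree, otherwise fail loudly.
        if name in standardized and standardized[name] != value:
            raise ValueError(f"conflicting values for field {name!r}")
        standardized[name] = value
    if "audio_filepath" not in standardized:
        raise ValueError("entry has no audio file path under any known alias")
    return standardized
```

Failing on conflicts rather than silently picking one value makes mismatched source data visible before it reaches later processing stages.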