> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nemo/curator/llms.txt.
> For full documentation content, see https://docs.nvidia.com/nemo/curator/llms-full.txt.

> Concepts for constructing manifests and ingesting audio datasets in NeMo Curator

# Dataset Manifests and Ingest

This guide covers the core concepts for ingesting audio data into NeMo Curator using consistent manifests and validation workflows.

## Manifest Structure

Audio manifests in NeMo Curator follow a standardized format for consistent data processing:

**Required Fields**:

* `audio_filepath`: Path to the audio file (absolute or relative)

**Common Optional Fields**:

* `text`: Ground truth transcription or existing transcription
* `duration`: Audio length in seconds
* `language`: Language code (such as "en", "es", "fr")
* `speaker_id`: Speaker identifier for multi-speaker datasets
* Custom metadata fields for domain-specific information

**Creation Methods**:

* **Programmatic Generation**: Use dataset-specific stages like `CreateInitialManifestFleursStage`
* **Custom Scripts**: Generate JSONL files with consistent field naming
* **Manual Creation**: Create JSONL manifests for small datasets or specialized use cases

## Data Ingestion and Validation

NeMo Curator provides robust validation mechanisms for audio data ingestion:

**File Existence Validation**:

* `AudioTask` automatically validates file paths during creation
* Use `validate()` to check whether the audio file for this task exists on disk
* Use `validate_item()` for individual file validation
* Missing files generate warnings but do not stop processing

**Validation Strategy**:

* Check file existence at the start of the pipeline
* Add metadata fields (duration, format) in downstream processing stages
* Use non-blocking validation to maintain processing throughput

## Field Recommendations

**Essential for All Workflows**:

* `audio_filepath`: File path validation and processing

**Recommended for ASR Workflows**:

* `text`: Ground truth for WER calculation and quality assessment
* `language`: Language-specific model selection and validation

**Recommended for Quality Assessment**:

* `duration`: Duration-based filtering and speech rate analysis
* `speaker_id`: Speaker consistency and diversity analysis

**Domain-Specific Fields**:

* Recording quality indicators (studio, phone, outdoor)
* Content type tags (conversational, broadcast, lecture)
* Noise level indicators for quality assessment

## Implementation Examples

**Basic Manifest Creation**:

```python
import json

# Create simple manifest
manifest_data = [
    {
        "audio_filepath": "/path/to/audio1.wav",
        "text": "Hello world",
        "duration": 1.5,
        "language": "en"
    },
    {
        "audio_filepath": "/path/to/audio2.wav", 
        "text": "Good morning",
        "duration": 2.1,
        "language": "en"
    }
]

# Save as JSONL
with open("manifest.jsonl", "w") as f:
    for item in manifest_data:
        f.write(json.dumps(item) + "\n")
```

**AudioTask Validation**:

```python
from nemo_curator.tasks import AudioTask

# Create one AudioTask per manifest entry and validate
for entry in manifest_data:
    audio_task = AudioTask(data=entry, filepath_key="audio_filepath")
    is_valid = audio_task.validate()
    print(f"Task validation: {is_valid}")
```

## Pipeline Integration

**ASR Workflow Preparation**:

* Ensure `audio_filepath` points to valid audio files
* ASR stages automatically add `pred_text` field with predictions
* Include `text` field for WER calculation and quality assessment

**Quality Assessment Preparation**:

* Use `GetAudioDurationStage` to add duration information
* Include existing transcriptions for WER-based filtering
* Add metadata fields for comprehensive quality analysis

**Format Conversion Readiness**:

* Standardize field names across different data sources
* Ensure consistent audio file formats and sample rates
* Validate encoding and accessibility of all audio files