***
description: >-
Export processed audio data and transcriptions in formats optimized for ASR
training and multimodal applications
categories:
* data-export
tags:
* output-formats
* manifests
* jsonl
* metadata
* asr-training
personas:
* data-scientist-focused
* mle-focused
difficulty: beginner
content\_type: how-to
modality: audio-only
***
# Save & Export Audio Data
Export processed audio data and transcriptions in formats optimized for ASR model training, audio-and-text applications, and downstream analysis workflows.
## Output Formats
NeMo Curator's audio curation pipeline supports several output formats tailored for different use cases:
### JSONL Manifests
The primary output format for audio curation is JSONL (JSON Lines):
```json
{"audio_filepath": "/data/audio/sample_001.wav", "text": "hello world", "pred_text": "hello world", "wer": 0.0, "duration": 2.1}
{"audio_filepath": "/data/audio/sample_002.wav", "text": "good morning", "pred_text": "good morning", "wer": 0.0, "duration": 1.8}
```
### Metadata Fields
Standard fields included in audio manifests:
| Field | Type | Description |
| ---------------- | ------ | ------------------------------ |
| `audio_filepath` | string | Absolute path to audio file |
| `text` | string | Ground truth transcription |
| `pred_text` | string | ASR model prediction |
| `wer` | float | Word Error Rate percentage |
| `duration` | float | Audio duration in seconds |
| `language` | string | Language identifier (optional) |
## Export Configuration
```python
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
# Convert AudioBatch to DocumentBatch for text writer
pipeline.add_stage(AudioToDocumentStage())
# Configure JSONL export
pipeline.add_stage(
JsonlWriter(
path="/output/audio_manifests",
write_kwargs={"force_ascii": False} # Support Unicode characters
)
)
```
## Directory Structure
### Standard Output Layout
When `source_files` metadata exists, the writer generates deterministic hashed file names. Otherwise, it generates UUID-based names.
```text
/output/audio_manifests/
├── .jsonl # Deterministic hash if metadata.source_files present, else UUID
├── .jsonl
└── ...
```
## Quality Control
### Validation Checks
Before export, check your processed data:
```python
from nemo_curator.stages.audio.common import PreserveByValueStage
# Filter by quality thresholds
quality_filters = [
# Keep samples with WER <= 50%
PreserveByValueStage(
input_value_key="wer",
target_value=50.0,
operator="le"
),
# Keep samples with duration 1-30 seconds
PreserveByValueStage(
input_value_key="duration",
target_value=1.0,
operator="ge"
),
PreserveByValueStage(
input_value_key="duration",
target_value=30.0,
operator="le"
)
]
for filter_stage in quality_filters:
pipeline.add_stage(filter_stage)
```