---
description: Convert processed audio data to DocumentBatch format for downstream processing
categories:
  - audio-processing
tags:
  - format-conversion
  - audio-to-text
  - documentbatch
personas:
  - data-scientist-focused
  - mle-focused
difficulty: beginner
content_type: how-to
modality: audio-text
---
# Text Integration for Audio Data
Convert processed audio data from `AudioBatch` to `DocumentBatch` format using the built-in `AudioToDocumentStage`. This enables you to export audio processing results or integrate with custom text processing workflows.
## How It Works
The `AudioToDocumentStage` provides straightforward format conversion between NeMo Curator's audio and text data structures:
1. **Format Conversion**: Transform `AudioBatch` objects to `DocumentBatch` format
2. **Metadata Preservation**: All fields from the audio data are preserved in the conversion
3. **Export Ready**: Convert audio processing results to pandas DataFrame format for analysis or export
**Common use cases:**
* Export ASR results and quality metrics for analysis
* Save filtered audio datasets with transcriptions
* Integrate audio processing outputs with downstream text workflows
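Conceptually, the conversion wraps the list of per-utterance dicts held by an `AudioBatch` in a pandas DataFrame. A minimal standalone sketch of that mapping (plain pandas, no NeMo Curator imports; the file paths and values are illustrative):

```python
import pandas as pd

# Each entry mirrors one utterance's fields after audio processing.
records = [
    {"audio_filepath": "/data/audio/a.wav", "pred_text": "hello world", "wer": 0.0, "duration": 1.5},
    {"audio_filepath": "/data/audio/b.wav", "pred_text": "test odio", "wer": 50.0, "duration": 2.1},
]

# Every dict key becomes a DataFrame column; one row per audio record.
df = pd.DataFrame(records)
print(df.columns.tolist())  # ['audio_filepath', 'pred_text', 'wer', 'duration']
print(len(df))              # 2
```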
## Basic Conversion
### AudioBatch to DocumentBatch
Use `AudioToDocumentStage` to convert audio processing results to document format:
```python
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.tasks import AudioBatch
# Convert audio data to DocumentBatch format
converter = AudioToDocumentStage()
# Input: AudioBatch with audio processing results
audio_batch = AudioBatch(data=[
    {
        "audio_filepath": "/data/audio/sample.wav",
        "text": "ground truth text",
        "pred_text": "asr predicted text",
        "wer": 12.5,
        "duration": 3.2
    }
])
# Output: DocumentBatch with pandas DataFrame
document_batches = converter.process(audio_batch)
document_batch = document_batches[0]
# Access the converted data
print(f"Converted {len(document_batch.data)} audio records to DocumentBatch")
```
**Parameters:**
* `AudioToDocumentStage()` has no configuration parameters; it performs direct format conversion
**Returns:**
* List of `DocumentBatch` objects containing a pandas DataFrame with all original audio fields
### What Gets Preserved
The conversion preserves all fields from your audio processing pipeline:
```python
# All audio processing results are maintained:
# - audio_filepath: Original audio file reference
# - text: Ground truth transcription (if available)
# - pred_text: ASR prediction
# - wer: Word Error Rate (if calculated)
# - duration: Audio duration (if calculated)
# - Any other metadata fields you've added
```
Field names and values are preserved exactly as they appear in the `AudioBatch`. No data transformation or cleaning is performed during conversion.
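As a quick standalone check of this behavior (plain pandas; the field names are the ones used throughout this page, plus a hypothetical `custom_field`), round-tripping a record through a DataFrame leaves keys and values untouched:

```python
import pandas as pd

record = {
    "audio_filepath": "/data/audio/sample.wav",
    "pred_text": "asr predicted text",
    "wer": 12.5,
    "custom_field": "any metadata you added",  # arbitrary extra fields survive too
}

df = pd.DataFrame([record])

# Keys become columns in order; values come back unchanged.
assert list(df.columns) == list(record.keys())
assert df.iloc[0].to_dict() == record
```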
## Integration in Pipelines
### Complete Audio Processing with Export
The most common use case is adding `AudioToDocumentStage` at the end of your audio pipeline to enable result export:
```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import CreateInitialManifestFleursStage
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage
from nemo_curator.stages.audio.common import GetAudioDurationStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.resources import Resources
# Create pipeline that processes audio and exports results
pipeline = Pipeline(name="audio_processing_with_export")
# 1. Load audio data
pipeline.add_stage(CreateInitialManifestFleursStage(
    lang="en_us",
    split="test",
    raw_data_dir="./audio_data"
).with_(batch_size=8))
# 2. Run ASR inference
pipeline.add_stage(InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc",
    pred_text_key="pred_text"
).with_(resources=Resources(gpus=1.0)))
# 3. Calculate quality metrics
pipeline.add_stage(GetPairwiseWerStage(
    text_key="text",
    pred_text_key="pred_text",
    wer_key="wer"
))
pipeline.add_stage(GetAudioDurationStage(
    audio_filepath_key="audio_filepath",
    duration_key="duration"
))
# 4. Convert to DocumentBatch for export
pipeline.add_stage(AudioToDocumentStage())
# 5. Export to JSONL format
pipeline.add_stage(JsonlWriter(path="/output/processed_audio_results"))
# Execute pipeline
executor = XennaExecutor()
pipeline.run(executor)
```
**Output format:** The `JsonlWriter` creates a JSONL file where each line contains one audio sample with all fields:
```json
{"audio_filepath": "/data/audio/sample1.wav", "text": "hello world", "pred_text": "hello world", "wer": 0.0, "duration": 1.5}
{"audio_filepath": "/data/audio/sample2.wav", "text": "test audio", "pred_text": "test odio", "wer": 50.0, "duration": 2.1}
```
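Once exported, the JSONL file can be loaded back into pandas for analysis. A small sketch (the two lines above are inlined here so the snippet is self-contained; in practice you would pass the writer's output path to `pd.read_json(..., lines=True)`):

```python
import io
import pandas as pd

jsonl = (
    '{"audio_filepath": "/data/audio/sample1.wav", "text": "hello world", "pred_text": "hello world", "wer": 0.0, "duration": 1.5}\n'
    '{"audio_filepath": "/data/audio/sample2.wav", "text": "test audio", "pred_text": "test odio", "wer": 50.0, "duration": 2.1}\n'
)

# lines=True parses one JSON object per line (the JSONL convention)
df = pd.read_json(io.StringIO(jsonl), lines=True)

print(df["wer"].mean())      # 25.0 — average word error rate across samples
print(df["duration"].sum())  # total duration of the exported samples
```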
## Custom Integration
While `AudioToDocumentStage` converts audio data to `DocumentBatch` format, NeMo Curator's built-in text processing stages (filters, classifiers, etc.) are designed for text documents, not audio transcriptions. For audio-specific text processing, implement custom stages that operate on the converted `DocumentBatch` data.
### Example: Custom Text Processing
```python
from nemo_curator.stages.function_decorators import processing_stage
from nemo_curator.tasks import DocumentBatch
import pandas as pd

@processing_stage(name="custom_transcription_filter")
def filter_transcriptions(document_batch: DocumentBatch) -> DocumentBatch:
    """Custom filtering of ASR transcriptions."""
    # Access the pandas DataFrame
    df = document_batch.data

    # Example: Filter by transcription length
    df = df[df['pred_text'].str.len() > 10]  # Keep transcriptions longer than 10 chars

    # Example: Filter by WER if available
    if 'wer' in df.columns:
        df = df[df['wer'] < 50.0]  # Keep WER < 50%

    return DocumentBatch(
        data=df,
        task_id=document_batch.task_id,
        dataset_name=document_batch.dataset_name
    )
```
## Output Format
After conversion, your data will be in `DocumentBatch` format with a pandas DataFrame:
```python
# Example output structure
document_batch.data # pandas DataFrame with columns:
# - audio_filepath: "/path/to/audio.wav"
# - text: "ground truth transcription"
# - pred_text: "asr prediction"
# - wer: 15.2
# - duration: 3.4
# - [any other fields from your audio processing]
```
## Limitations
**Text Processing Integration**: NeMo Curator's text processing stages operate on `DocumentBatch` inputs containing text documents (articles, web pages) and are not designed for audio-derived transcriptions. Implement custom processing stages for audio-specific workflows.
**Reasons for incompatibility:**
* Text filters assume document-level content (e.g., paragraph structure, word count thresholds designed for articles)
* ASR transcriptions have different characteristics (shorter, can contain recognition errors, conversational language)
* Audio-specific metrics (WER, duration, speech rate) require custom filtering logic
**Recommendation:** Use `PreserveByValueStage` for audio quality filtering, or create custom stages for transcription-specific processing.
## Related Topics
* **[Audio Processing Overview](/curate-audio/process-data)** - Complete audio processing workflow
* **[Quality Assessment](/curate-audio/process-data/quality-assessment)** - Audio quality metrics and filtering
* **[ASR Inference](/curate-audio/process-data/asr-inference)** - Speech recognition processing