---
description: Convert processed audio data to DocumentBatch format for downstream processing
categories:
  - audio-processing
tags:
  - format-conversion
  - audio-to-text
  - documentbatch
personas:
  - data-scientist-focused
  - mle-focused
difficulty: beginner
content_type: how-to
modality: audio-text
---

# Text Integration for Audio Data

Convert processed audio data from `AudioBatch` to `DocumentBatch` format using the built-in `AudioToDocumentStage`. This enables you to export audio processing results or integrate with custom text processing workflows.

## How it Works

`AudioToDocumentStage` converts data from NeMo Curator's audio data structure to its text data structure:

1. **Format Conversion**: Transform `AudioBatch` objects to `DocumentBatch` format
2. **Metadata Preservation**: All fields from the audio data are preserved in the conversion
3. **Export Ready**: Convert audio processing results to pandas DataFrame format for analysis or export

**Common use cases:**

* Export ASR results and quality metrics for analysis
* Save filtered audio datasets with transcriptions
* Integrate audio processing outputs with downstream text workflows

## Basic Conversion

### AudioBatch to DocumentBatch

Use `AudioToDocumentStage` to convert audio processing results to document format:

```python
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.tasks import AudioBatch

# Convert audio data to DocumentBatch format
converter = AudioToDocumentStage()

# Input: AudioBatch with audio processing results
audio_batch = AudioBatch(data=[
    {
        "audio_filepath": "/data/audio/sample.wav",
        "text": "ground truth text",
        "pred_text": "asr predicted text",
        "wer": 12.5,
        "duration": 3.2
    }
])

# Output: DocumentBatch with pandas DataFrame
document_batches = converter.process(audio_batch)
document_batch = document_batches[0]

# Access the converted data
print(f"Converted {len(document_batch.data)} audio records to DocumentBatch")
```

**Parameters:**

* `AudioToDocumentStage()` has no configuration parameters; it performs direct format conversion

**Returns:**

* List of `DocumentBatch` objects containing a pandas DataFrame with all original audio fields

### What Gets Preserved

The conversion preserves all fields from your audio processing pipeline:

```python
# All audio processing results are maintained:
# - audio_filepath: Original audio file reference
# - text: Ground truth transcription (if available)
# - pred_text: ASR prediction
# - wer: Word Error Rate (if calculated)
# - duration: Audio duration (if calculated)
# - Any other metadata fields you've added
```

<Note>
  Field names and values are preserved exactly as they appear in the `AudioBatch`. No data transformation or cleaning is performed during conversion.
</Note>
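
To see what "preserved exactly" means in practice, the following minimal sketch mimics the conversion with plain pandas. It assumes the stage behaves roughly like building a DataFrame from the list of records; the `records` list and the extra `speaker_id` field are hypothetical, and the actual `AudioToDocumentStage` implementation may differ.

```python
import pandas as pd

# Hypothetical AudioBatch-style records (a list of dicts), including one custom field.
records = [
    {"audio_filepath": "/data/audio/a.wav", "pred_text": "hello world", "wer": 0.0, "speaker_id": "spk1"},
    {"audio_filepath": "/data/audio/b.wav", "pred_text": "test odio", "wer": 50.0, "speaker_id": "spk2"},
]

# Assumption: the conversion behaves like constructing a DataFrame from the records.
df = pd.DataFrame(records)

# Column names match the original keys, and values round-trip unchanged.
print(list(df.columns))
print(df.to_dict(orient="records") == records)
```

Custom fields such as `speaker_id` travel through the same way as the standard ones, so any cleanup you need has to happen in a separate stage.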

## Integration in Pipelines

### Complete Audio Processing with Export

The most common use case is adding `AudioToDocumentStage` at the end of your audio pipeline to enable result export:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import CreateInitialManifestFleursStage
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage
from nemo_curator.stages.audio.common import GetAudioDurationStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.resources import Resources

# Create pipeline that processes audio and exports results
pipeline = Pipeline(name="audio_processing_with_export")

# 1. Load audio data
pipeline.add_stage(CreateInitialManifestFleursStage(
    lang="en_us",
    split="test",
    raw_data_dir="./audio_data"
).with_(batch_size=8))

# 2. Run ASR inference
pipeline.add_stage(InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc",
    pred_text_key="pred_text"
).with_(resources=Resources(gpus=1.0)))

# 3. Calculate quality metrics
pipeline.add_stage(GetPairwiseWerStage(
    text_key="text",
    pred_text_key="pred_text",
    wer_key="wer"
))
pipeline.add_stage(GetAudioDurationStage(
    audio_filepath_key="audio_filepath",
    duration_key="duration"
))

# 4. Convert to DocumentBatch for export
pipeline.add_stage(AudioToDocumentStage())

# 5. Export to JSONL format
pipeline.add_stage(JsonlWriter(path="/output/processed_audio_results"))

# Execute pipeline
executor = XennaExecutor()
pipeline.run(executor)
```

**Output format:** The `JsonlWriter` creates a JSONL file where each line contains one audio sample with all fields:

```json
{"audio_filepath": "/data/audio/sample1.wav", "text": "hello world", "pred_text": "hello world", "wer": 0.0, "duration": 1.5}
{"audio_filepath": "/data/audio/sample2.wav", "text": "test audio", "pred_text": "test odio", "wer": 50.0, "duration": 2.1}
```
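
Once the results are exported, you can inspect them with the standard library alone. The sketch below parses the two sample lines shown above from an in-memory string and computes simple summary statistics; in practice you would open the written file instead. The exact file name and location under the output path are not assumed here.

```python
import io
import json

# Two sample lines in the format shown above; replace with open(<your file>) in practice.
sample_output = io.StringIO(
    '{"audio_filepath": "/data/audio/sample1.wav", "text": "hello world", '
    '"pred_text": "hello world", "wer": 0.0, "duration": 1.5}\n'
    '{"audio_filepath": "/data/audio/sample2.wav", "text": "test audio", '
    '"pred_text": "test odio", "wer": 50.0, "duration": 2.1}\n'
)

# Parse one JSON object per non-empty line.
rows = [json.loads(line) for line in sample_output if line.strip()]

mean_wer = sum(row["wer"] for row in rows) / len(rows)
total_duration = sum(row["duration"] for row in rows)

print(f"{len(rows)} samples, mean WER {mean_wer:.1f}%, total duration {total_duration:.1f}s")
```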

## Custom Integration

While `AudioToDocumentStage` converts audio data to `DocumentBatch` format, NeMo Curator's built-in text processing stages (filters, classifiers, etc.) are designed for text documents, not audio transcriptions. For audio-specific text processing, implement custom stages that operate on the converted `DocumentBatch` data.

### Example: Custom Text Processing

```python
from nemo_curator.stages.function_decorators import processing_stage
from nemo_curator.tasks import DocumentBatch
import pandas as pd

@processing_stage(name="custom_transcription_filter")
def filter_transcriptions(document_batch: DocumentBatch) -> DocumentBatch:
    """Custom filtering of ASR transcriptions."""

    # Access the pandas DataFrame
    df = document_batch.data

    # Example: Filter by transcription length
    df = df[df['pred_text'].str.len() > 10]  # Keep transcriptions longer than 10 characters

    # Example: Filter by WER if available
    if 'wer' in df.columns:
        df = df[df['wer'] < 50.0]  # Keep WER < 50%

    return DocumentBatch(
        data=df,
        task_id=document_batch.task_id,
        dataset_name=document_batch.dataset_name
    )
```
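
Before wiring a filter like this into a pipeline, it can help to check the thresholds on a small DataFrame. The sketch below mirrors the filtering logic above as a plain pandas function; `keep_good_transcriptions`, its parameters, and the sample rows are hypothetical names and data used only for illustration.

```python
import pandas as pd

def keep_good_transcriptions(
    df: pd.DataFrame, min_chars: int = 10, max_wer: float = 50.0
) -> pd.DataFrame:
    """Hypothetical helper mirroring the filter logic above on a plain DataFrame."""
    # Keep rows whose predicted transcription is longer than min_chars characters.
    df = df[df["pred_text"].str.len() > min_chars]
    # Apply the WER threshold only when the column exists.
    if "wer" in df.columns:
        df = df[df["wer"] < max_wer]
    return df.reset_index(drop=True)

sample = pd.DataFrame({
    "pred_text": ["hi", "a long enough transcription", "another long transcription"],
    "wer": [0.0, 12.5, 80.0],
})

filtered = keep_good_transcriptions(sample)
print(filtered)
```

Here the first row is dropped for being too short and the third for exceeding the WER threshold. Resetting the index is a choice made in this sketch; the stage above keeps the original index.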

## Output Format

After conversion, your data will be in `DocumentBatch` format with a pandas DataFrame:

```python
# Example output structure
document_batch.data  # pandas DataFrame with columns:
# - audio_filepath: "/path/to/audio.wav"
# - text: "ground truth transcription"
# - pred_text: "asr prediction"
# - wer: 15.2
# - duration: 3.4
# - [any other fields from your audio processing]
```

## Limitations

<Info>
  **Text Processing Integration**: NeMo Curator's text processing stages accept `DocumentBatch` inputs, but they are designed for text documents such as articles and web pages, not for audio-derived transcriptions. Implement custom processing stages for audio-specific workflows.

  **Reasons for incompatibility:**

  * Text filters assume document-level content (e.g., paragraph structure, word count thresholds designed for articles)
  * ASR transcriptions have different characteristics (shorter, can contain recognition errors, conversational language)
  * Audio-specific metrics (WER, duration, speech rate) require custom filtering logic

  **Recommendation:** Use `PreserveByValueStage` for audio quality filtering, or create custom stages for transcription-specific processing.
</Info>
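
As an illustration of the custom logic that audio-specific metrics call for, here is a minimal sketch of a speech-rate check on AudioBatch-style records (plain dicts). The `words_per_second` function, the key defaults, and the rate bounds are assumptions chosen for the example, not part of NeMo Curator.

```python
def words_per_second(
    record: dict, text_key: str = "pred_text", duration_key: str = "duration"
) -> float:
    """Hypothetical speech-rate metric: word count divided by duration in seconds."""
    duration = record[duration_key]
    if duration <= 0:
        # Treat missing or zero-length audio as having no measurable rate.
        return 0.0
    return len(record[text_key].split()) / duration

records = [
    {"pred_text": "hello world", "duration": 1.0},
    {"pred_text": "one two three four five six seven eight nine ten", "duration": 1.0},
]

# Illustrative bounds only; tune them for your language and domain.
MIN_RATE, MAX_RATE = 0.5, 6.0
kept = [r for r in records if MIN_RATE <= words_per_second(r) <= MAX_RATE]
print(f"Kept {len(kept)} of {len(records)} records")
```

The same function could be applied inside a custom stage, either on the `AudioBatch` records before conversion or row by row on the `DocumentBatch` DataFrame afterward.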

## Related Topics

* **[Audio Processing Overview](/curate-audio/process-data)** - Complete audio processing workflow
* **[Quality Assessment](/curate-audio/process-data/quality-assessment)** - Audio quality metrics and filtering
* **[ASR Inference](/curate-audio/process-data/asr-inference)** - Speech recognition processing
