Text Integration for Audio Data
Convert processed audio data from AudioBatch to DocumentBatch format using the built-in AudioToDocumentStage. This enables you to export audio processing results or integrate with custom text processing workflows.
How it Works
The AudioToDocumentStage provides straightforward format conversion between NeMo Curator’s audio and text data structures:
- Format Conversion: Transform AudioBatch objects to DocumentBatch format
- Metadata Preservation: All fields from the audio data are preserved in the conversion
- Export Ready: Convert audio processing results to pandas DataFrame format for analysis or export
Common use cases:
- Export ASR results and quality metrics for analysis
- Save filtered audio datasets with transcriptions
- Integrate audio processing outputs with downstream text workflows
Basic Conversion
AudioBatch to DocumentBatch
Use AudioToDocumentStage to convert audio processing results to document format.
Parameters:
AudioToDocumentStage() has no configuration parameters; it performs direct format conversion
Returns:
- List of DocumentBatch objects containing a pandas DataFrame with all original audio fields
What Gets Preserved
The conversion preserves all fields from your audio processing pipeline.
Field names and values are preserved exactly as they appear in the AudioBatch. No data transformation or cleaning is performed during conversion.
Integration in Pipelines
Complete Audio Processing with Export
The most common use case is adding AudioToDocumentStage at the end of your audio pipeline so that the results can be exported.
Output format: The JsonlWriter creates a JSONL file where each line contains one audio sample with all of its fields.
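To illustrate the one-record-per-line layout, the sketch below writes and re-reads two records using only Python's standard `json` module. It does not call JsonlWriter, and the field names (`audio_filepath`, `text`, `pred_text`, `wer`, `duration`) and their values are assumptions chosen for illustration; the actual fields depend on the stages in your pipeline.

```python
import json

# Hypothetical records; real field names depend on your pipeline's stages.
records = [
    {"audio_filepath": "audio/a.wav", "text": "hello world",
     "pred_text": "hello world", "wer": 0.0, "duration": 1.2},
    {"audio_filepath": "audio/b.wav", "text": "good morning",
     "pred_text": "good mourning", "wer": 50.0, "duration": 1.5},
]


def to_jsonl(rows):
    """Serialize each record as one JSON object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)


def from_jsonl(text):
    """Parse JSONL text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(records)
round_tripped = from_jsonl(jsonl_text)
```

Because every line is a complete JSON object, the file can be read back one sample at a time without loading the whole dataset.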
Custom Integration
While AudioToDocumentStage converts audio data to DocumentBatch format, NeMo Curator’s built-in text processing stages (filters, classifiers, etc.) are designed for text documents, not audio transcriptions. For audio-specific text processing, implement custom stages that operate on the converted DocumentBatch data.
Example: Custom Text Processing
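A minimal sketch of transcription-specific logic that a custom stage could apply to a text column of the converted data. The function name `clean_transcript` and the `FILLERS` set are hypothetical and not part of NeMo Curator; the sketch assumes English transcripts and ASCII text.

```python
import re

# Assumed filler words to drop; adjust for your language and domain.
FILLERS = {"um", "uh", "er", "hmm"}


def clean_transcript(text: str) -> str:
    """Lowercase, strip punctuation, drop filler words, collapse whitespace."""
    lowered = text.lower()
    # Replace every run of characters that is not an ASCII letter, digit,
    # apostrophe, or space with a single space.
    no_punct = re.sub(r"[^a-z0-9' ]+", " ", lowered)
    # split() with no arguments also collapses repeated whitespace.
    words = [w for w in no_punct.split() if w not in FILLERS]
    return " ".join(words)
```

With a pandas DataFrame `df`, the same function can be applied to a column with `df["pred_text"].apply(clean_transcript)`, where the `pred_text` column name is an assumption.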
Output Format
After conversion, your data will be in DocumentBatch format, which holds a pandas DataFrame.
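As a rough picture of the resulting table, the sketch below builds a pandas DataFrame directly from a list of per-sample dictionaries. It does not use AudioToDocumentStage or DocumentBatch, and the field names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical audio entries; real field names depend on your pipeline.
entries = [
    {"audio_filepath": "audio/a.wav", "text": "hello world",
     "pred_text": "hello world", "wer": 0.0, "duration": 1.2},
    {"audio_filepath": "audio/b.wav", "text": "good morning",
     "pred_text": "good mourning", "wer": 50.0, "duration": 1.5},
]

# One row per audio sample and one column per field; values are not modified.
df = pd.DataFrame(entries)

print(df.columns.tolist())
print(df.shape)
```

Each field of the audio entries becomes a column with the same name, so downstream code can select fields with ordinary DataFrame indexing such as `df["duration"]`.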
Limitations
Text Processing Integration: NeMo Curator’s text processing stages accept DocumentBatch inputs, but they are designed for text documents such as articles and web pages, not for audio-derived transcriptions. Implement custom processing stages for audio-specific workflows.
Reasons for incompatibility:
- Text filters assume document-level content (e.g., paragraph structure, word count thresholds designed for articles)
- ASR transcriptions have different characteristics (shorter length, recognition errors, conversational language)
- Audio-specific metrics (WER, duration, speech rate) require custom filtering logic
Recommendation: Use PreserveByValueStage for audio quality filtering, or create custom stages for transcription-specific processing.
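A minimal sketch of what custom filtering on audio-specific metrics might look like, in plain Python. It does not use the PreserveByValueStage API; the field names (`text`, `wer`, `duration`), the assumption that WER is expressed in percent, and the thresholds are all illustrative choices.

```python
def speech_rate(entry: dict) -> float:
    """Words per second, from the transcript and the clip duration in seconds."""
    duration = entry["duration"]
    if duration <= 0:
        # Avoid division by zero for empty or malformed clips.
        return 0.0
    return len(entry["text"].split()) / duration


def keep_entry(entry: dict, max_wer: float = 30.0,
               min_rate: float = 0.5, max_rate: float = 5.0) -> bool:
    """Keep entries with a low WER and a plausible speech rate."""
    rate = speech_rate(entry)
    return entry["wer"] <= max_wer and min_rate <= rate <= max_rate


# Hypothetical entries; WER is assumed to be in percent.
entries = [
    {"text": "hello world", "wer": 0.0, "duration": 1.0},    # 2.0 words/s, kept
    {"text": "good morning", "wer": 50.0, "duration": 1.0},  # WER above threshold
    {"text": "hi", "wer": 10.0, "duration": 10.0},           # 0.1 words/s, too slow
]

filtered = [e for e in entries if keep_entry(e)]
```

Thresholds like these depend heavily on the language, domain, and ASR model, so they are best tuned on a sample of your own data.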
Related Topics
- Audio Processing Overview - Complete audio processing workflow
- Quality Assessment - Audio quality metrics and filtering
- ASR Inference - Speech recognition processing