Audio-Text Integration Concepts
This guide covers how audio processing integrates with text curation workflows in NeMo Curator, enabling seamless multi-modal data preparation and cross-modal quality assessment.
Integration Architecture
Audio-text integration in NeMo Curator operates on several levels:
Data Structure Integration
Format Conversion: AudioBatch to DocumentBatch
- All fields remain intact without remapping (`text` and `pred_text` remain unchanged)
- Audio metadata remains available as extra fields for downstream processing
Metadata Preservation: Audio characteristics remain intact during conversion
- File paths are preserved for traceability and debugging
- Quality metrics (WER, duration) carry through for filtering operations
- Audio-specific metadata stays attached for downstream processing stages
Pipeline Integration
Sequential Processing: Audio to Text to Multi-Modal
The `AudioToDocumentStage` provides the conversion bridge between audio and text processing workflows.
Parallel Processing: Simultaneous audio and text analysis
Cross-Modal Quality Assessment
Audio-Informed Text Quality
Use audio characteristics to enhance text quality assessment:
Speech Rate Analysis: Detect unnaturally fast or slow speech patterns using the `get_wordrate()` function
Duration-Text Consistency: Ensure transcription length matches audio duration
- Short audio with long text: Potential transcription errors
- Long audio with short text: Potential missing content
- Optimal ratio: ~3-5 characters per second of audio
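The consistency check above can be sketched as a small helper. The function name and thresholds are illustrative (not part of NeMo Curator); the default band mirrors the ~3-5 characters-per-second guideline and should be tuned per language and domain:

```python
def check_duration_text_consistency(text: str, duration_s: float,
                                    min_cps: float = 3.0,
                                    max_cps: float = 5.0) -> str:
    """Flag samples whose character rate falls outside an expected band.

    Returns "ok", "too_fast" (long text for short audio: possible
    transcription errors), or "too_slow" (short text for long audio:
    possible missing content).
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    cps = len(text) / duration_s
    if cps > max_cps:
        return "too_fast"   # more text than the audio could plausibly contain
    if cps < min_cps:
        return "too_slow"   # audio likely contains speech the text is missing
    return "ok"
```

A sample flagged `"too_fast"` or `"too_slow"` is a candidate for manual review or filtering rather than automatic rejection.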
Text-Informed Audio Quality
Use text characteristics to assess audio quality:
Transcription Completeness: Detect incomplete or truncated speech
- Sentence fragments without proper endings
- Unusual punctuation patterns
- Incomplete words or phrases
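The three completeness signals above can be combined into a simple heuristic. This is a sketch with illustrative patterns, not a NeMo Curator API:

```python
import re

def looks_truncated(text: str) -> bool:
    """Heuristic completeness check for a transcription.

    Flags sentence fragments without proper endings, unusual trailing
    punctuation, and dangling (incomplete) words. Patterns are
    illustrative and should be adapted to the target language.
    """
    stripped = text.strip()
    if not stripped:
        return True
    # Sentence fragment: does not end with terminal punctuation
    if stripped[-1] not in ".!?\"'":
        return True
    # Unusual trailing punctuation: ellipsis, doubled commas, dashes
    if re.search(r"(\.\.\.|,,|--)$", stripped):
        return True
    return False
```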
Content Coherence: Assess semantic consistency
- Logical flow and coherence in transcriptions
- Domain-appropriate vocabulary usage
- Language consistency throughout sample
Workflow Patterns
Audio-First Workflows
Start with audio processing, then apply text curation:
Use Cases:
- Speech dataset curation for ASR training
- Podcast transcription and processing
- Lecture and educational content preparation
Text-First Workflows
Start with text processing, then verify with audio:
Use Cases:
- Validating existing transcriptions with audio
- Creating audio-text pairs from separate sources
- Quality control for crowd-sourced transcriptions
Data Flow Concepts
Conversion Mechanisms
`AudioBatch` to `DocumentBatch`:
NeMo Curator provides the `AudioToDocumentStage` for converting audio processing results to text processing format:
Note: A built-in DocumentBatch to AudioBatch conversion stage is not provided. Create a custom stage if you need reverse conversion.
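The pass-through semantics of the conversion can be sketched with a stand-in function. The real `AudioToDocumentStage` wraps this logic in a pipeline stage and its exact interface may differ; the field names below mirror the ones used in this guide:

```python
def audio_batch_to_documents(audio_entries: list[dict]) -> list[dict]:
    """Convert audio-batch entries to document records.

    All fields pass through unchanged (`text` and `pred_text` keep their
    names), and audio-specific metadata such as `audio_filepath`,
    `duration`, and `wer` ride along as extra fields for downstream
    filtering stages.
    """
    # Shallow copies: additive, nothing is dropped or remapped
    return [dict(entry) for entry in audio_entries]
```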
For practical usage examples and step-by-step implementation, refer to Text integration.
Metadata Flow
Additive Processing: Processing stages typically add metadata without removing existing fields
Quality Assessment Integration
Available Quality Metrics
NeMo Curator provides these audio quality assessment capabilities:
Word Error Rate (WER) Analysis:
- Lower WER indicates more accurate transcription
- Available through `GetPairwiseWerStage`
- Measures the percentage of incorrect words between ground-truth and ASR-predicted transcripts
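The metric itself is standard word-level edit distance. The standalone function below illustrates what the pairwise WER stage measures; it is not the NeMo Curator implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER (%) = (substitutions + deletions + insertions) / reference words,
    computed via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)
```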
Duration and Speech Rate Analysis:
- Duration validation using `GetAudioDurationStage`
- Speech rate calculation using the `get_wordrate()` function
- Character rate calculation using the `get_charrate()` function
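Both rates are simple ratios over the audio duration. The functions below are standalone equivalents of what `get_wordrate()` and `get_charrate()` compute; the actual NeMo Curator functions may use different signatures:

```python
def wordrate(text: str, duration_s: float) -> float:
    """Words per second of audio (speech rate)."""
    return len(text.split()) / duration_s

def charrate(text: str, duration_s: float) -> float:
    """Characters per second of audio (character rate)."""
    return len(text) / duration_s
```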
Individual Quality Dimensions:
- Technical Quality: File integrity, format compliance, duration validation
- Content Quality: Transcription accuracy via WER/CER metrics
- Speech Rate Quality: Words/characters per second analysis
Performance and Scaling
Memory Considerations
`AudioBatch` Memory Usage:
- Metadata storage scales linearly with batch size
- Audio files loaded on-demand, not cached in memory
- Large batches increase processing efficiency but consume more RAM
Conversion Overhead:
- `AudioBatch` → `DocumentBatch` conversion is lightweight
- Metadata copying has minimal performance impact
- Batch size affects conversion performance
Processing Efficiency
Sequential vs. Parallel Integration:
Sequential Processing: Audio to Text (lower memory, slower)
Parallel Processing: Audio and Text (higher memory, faster)
Scaling Strategies
Horizontal Scaling: Distribute across several workers
- Partition audio files across workers
- Independent processing with final aggregation
- Load balancing based on audio duration
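Duration-based load balancing can be sketched as a greedy longest-processing-time partition. The function name is illustrative; NeMo Curator's own scheduling may differ:

```python
import heapq

def partition_by_duration(files: list[tuple[str, float]],
                          n_workers: int) -> list[list[str]]:
    """Assign (path, duration_s) files to workers so that total audio
    seconds per worker stay balanced.

    Greedy LPT heuristic: place the longest remaining file on the
    currently least-loaded worker.
    """
    heap = [(0.0, i) for i in range(n_workers)]  # (assigned seconds, worker id)
    heapq.heapify(heap)
    shards: list[list[str]] = [[] for _ in range(n_workers)]
    for path, dur in sorted(files, key=lambda f: -f[1]):
        load, i = heapq.heappop(heap)
        shards[i].append(path)
        heapq.heappush(heap, (load + dur, i))
    return shards
```

Each shard can then be processed independently, with results aggregated at the end.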
Vertical Scaling: Optimize single-machine performance
- GPU acceleration for ASR inference
- Batch size optimization for hardware
- Memory management for large datasets
Design Principles
Modularity
Separation of Concerns: Audio and text processing remain independent
- Audio stages focus on speech-specific operations
- Text stages handle language processing
- Integration stages manage cross-modal operations
Modular Architecture: Mix and match audio and text processing stages
- Flexible pipeline construction
- Reusable stage components
- Configurable integration points
Extensibility
Custom Integration Patterns: Support for domain-specific workflows
- Custom conversion logic
- Specialized quality metrics
- Domain-specific filtering rules
Plugin Architecture: Easy addition of new integration methods
- Custom stage implementations
- External tool integration
- Specialized format support