---
description: >-
  Concepts for integrating audio processing with text curation workflows in
  multimodal applications
categories:
  - concepts-architecture
tags:
  - text-integration
  - multimodal
  - workflow-integration
  - format-conversion
  - cross-modal
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: concept
modality: audio-text
---

# Audio-Text Integration Concepts

This guide covers how audio processing integrates with text curation workflows in NeMo Curator, enabling seamless multimodal data preparation and cross-modal quality assessment.

## Integration Architecture

Audio-text integration in NeMo Curator operates on several levels:

### Data Structure Integration

**Format Conversion**: `AudioBatch` to `DocumentBatch`

* All fields remain intact without remapping (`text` and `pred_text` are unchanged)
* Audio metadata remains available as extra fields for downstream processing

**Metadata Preservation**: Audio characteristics remain intact during conversion

* File paths remain available for traceability and debugging
* Quality metrics (WER, duration) remain available for filtering operations
* Audio-specific metadata remains available for downstream processing stages

### Pipeline Integration

**Sequential Processing**: Audio to Text to Multi-Modal

```mermaid
flowchart LR
    A[Audio Files] --> B[InferenceAsrNemoStage]
    B --> C[AudioToDocumentStage]
    C --> D[ScoreFilter<br/>Text Processing]
    D --> E[Integrated Output]

    style A fill:#e1f5fe
    style C fill:#ffcc02
    style E fill:#fff3e0
```

The `AudioToDocumentStage` provides the conversion bridge between audio and text processing workflows.

**Parallel Processing**: Simultaneous audio and text analysis

```mermaid
flowchart LR
    A[Audio Files] --> B[InferenceAsrNemoStage]
    C[Text Data] --> D[ScoreFilter<br/>Text Processing]
    B --> E[Cross-Modal<br/>Quality Assessment]
    D --> E
    E --> F[Filtered Output]

    style A fill:#e1f5fe
    style C fill:#e8f5e8
    style F fill:#fff3e0
```

## Cross-Modal Quality Assessment

### Audio-Informed Text Quality

Use audio characteristics to enhance text quality assessment:

**Speech Rate Analysis**: Detect unnaturally fast or slow speech patterns using the `get_wordrate()` function.

**Duration-Text Consistency**: Ensure transcription length matches audio duration

* Short audio with long text: potential transcription errors
* Long audio with short text: potential missing content
* Optimal ratio: ~3-5 characters per second of audio

### Text-Informed Audio Quality

Use text characteristics to assess audio quality:

**Transcription Completeness**: Detect incomplete or truncated speech

* Sentence fragments without proper endings
* Unusual punctuation patterns
* Incomplete words or phrases

**Content Coherence**: Assess semantic consistency

* Logical flow and coherence in transcriptions
* Domain-appropriate vocabulary usage
* Language consistency throughout the sample

## Workflow Patterns

### Audio-First Workflows

Start with audio processing, then apply text curation:

```mermaid
flowchart TD
    A[Audio Files] --> B[InferenceAsrNemoStage<br/>ASR Transcription]
    B --> C[GetPairwiseWerStage<br/>Calculate WER Metrics]
    C --> D[ScoreFilter<br/>WER-based Filtering]
    D --> E[AudioToDocumentStage<br/>Convert to DocumentBatch]
    E --> F[ScoreFilter<br/>Text Quality Assessment]
    F --> G[Filter<br/>Metadata-based Filtering]
    G --> H[Text Enhancement Stages]
    H --> I[Processed Dataset]

    style A fill:#e1f5fe
    style E fill:#fff3e0
    style I fill:#e8f5e8

    classDef audioStage fill:#bbdefb
    classDef conversionStage fill:#ffcc02
    classDef textStage fill:#c8e6c9

    class B,C,D audioStage
    class E conversionStage
    class F,G,H textStage
```

**Use Cases**:

* Speech dataset curation for ASR training
* Podcast transcription and processing
* Lecture and educational content preparation

### Text-First Workflows

Start with text processing, then verify with audio:

```mermaid
flowchart TD
    A[Text Corpus] --> B[ScoreFilter<br/>Text Quality Assessment]
    B --> C[Filter<br/>Initial Text Filtering]
    C --> D[Audio File Matching<br/>Custom Stage]
    D --> E[InferenceAsrNemoStage<br/>ASR Validation]
    E --> F[GetPairwiseWerStage<br/>Cross-Modal Metrics]
    F --> G[ScoreFilter<br/>Consistency Filtering]
    G --> H[Validated Dataset]

    style A fill:#e8f5e8
    style D fill:#fff3e0
    style H fill:#e1f5fe

    classDef textStage fill:#c8e6c9
    classDef matchingStage fill:#ffcc02
    classDef audioStage fill:#bbdefb

    class B,C textStage
    class D matchingStage
    class E,F,G audioStage
```

**Use Cases**:

* Validating existing transcriptions with audio
* Creating audio-text pairs from separate sources
* Quality control for crowd-sourced transcriptions

## Data Flow Concepts

### Conversion Mechanisms

**AudioBatch to DocumentBatch**: NeMo Curator provides the `AudioToDocumentStage` for converting audio processing results to the text processing format:

```python
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage

# Create the conversion stage
converter = AudioToDocumentStage()

# Example fields carried by one entry of an input AudioBatch
audio_data = {
    "audio_filepath": "/audio.wav",
    "text": "ground truth",
    "pred_text": "asr prediction",
    "wer": 15.2,
    "duration": 3.4,
}

# Given an AudioBatch task (audio_batch) whose entries look like audio_data,
# the stage returns a list containing one DocumentBatch with the same fields
# preserved as a pandas DataFrame
document_batches = converter.process(audio_batch)
document_batch = document_batches[0]  # Extract the single DocumentBatch

# All fields are preserved in the DocumentBatch:
# - audio_filepath, text, pred_text, wer, duration
```

**Note**: A built-in `DocumentBatch` to `AudioBatch` conversion stage is not provided. Create a custom stage if you need the reverse conversion.

For practical usage examples and step-by-step implementation, refer to [Text integration](/curate-audio/process-data/text-integration).
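The duration-text consistency heuristic described under Cross-Modal Quality Assessment (roughly 3-5 characters per second of audio) can be sketched as a standalone check over the fields that survive conversion. This is a minimal illustration, not a NeMo Curator API; the function name and thresholds are assumptions you would tune for your data:

```python
def duration_text_consistent(
    text: str,
    duration_sec: float,
    min_cps: float = 3.0,  # illustrative lower bound, chars per second
    max_cps: float = 5.0,  # illustrative upper bound, chars per second
) -> bool:
    """Return True when the transcript length is plausible for the audio duration."""
    if duration_sec <= 0:
        return False
    chars_per_sec = len(text) / duration_sec
    return min_cps <= chars_per_sec <= max_cps

# A 3.4 s clip with a 14-character transcript sits inside the 3-5 chars/sec band
print(duration_text_consistent("asr prediction", 3.4))  # True (14 / 3.4 ≈ 4.1)
```

Because the conversion preserves `pred_text` and `duration` as DataFrame columns, a check like this could back a custom filter applied after `AudioToDocumentStage`.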
### Metadata Flow

**Additive Processing**: Processing stages typically add metadata without removing existing fields

```python
# Stage 1: Initial loading
stage1_output = {"audio_filepath": "/audio.wav", "text": "transcription"}

# Stage 2: ASR inference
stage2_output = {**stage1_output, "pred_text": "asr result"}

# Stage 3: Quality assessment
stage3_output = {**stage2_output, "wer": 15.2, "duration": 3.4}

# Stage 4: Text processing (after conversion)
stage4_output = {**stage3_output, "word_count": 6, "language": "en"}
```

## Quality Assessment Integration

### Available Quality Metrics

NeMo Curator provides these audio quality assessment capabilities:

**Word Error Rate (WER) Analysis**:

* WER correlates with transcription accuracy
* Available through `GetPairwiseWerStage`
* Measures the percentage of incorrect words between ground truth and ASR predictions

**Duration and Speech Rate Analysis**:

* Duration validation using `GetAudioDurationStage`
* Speech rate calculation using the `get_wordrate()` function
* Character rate calculation using the `get_charrate()` function

**Individual Quality Dimensions**:

* **Technical Quality**: File integrity, format compliance, duration validation
* **Content Quality**: Transcription accuracy via WER/CER metrics
* **Speech Rate Quality**: Words/characters per second analysis

## Performance and Scaling

### Memory Considerations

**AudioBatch Memory Usage**:

* Metadata storage scales linearly with batch size
* Audio files are loaded on demand, not cached in memory
* Large batches increase processing efficiency but consume more RAM

**Conversion Overhead**:

* `AudioBatch` to `DocumentBatch` conversion is lightweight
* Metadata copying has minimal performance impact
* Batch size affects conversion performance

### Processing Efficiency

**Sequential vs. Parallel Integration**:

**Sequential Processing**: Audio to Text (lower memory, slower)

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.modules.score_filter import ScoreFilter
from nemo_curator.filters import WordCountFilter  # Example filter

# Define a text quality filter
text_quality_filter = WordCountFilter(min_words=10)

# Process audio completely first
# ("executor" is an executor instance configured elsewhere)
audio_pipeline = Pipeline(
    name="audio_processing",
    stages=[
        InferenceAsrNemoStage(model_name="stt_en_fastconformer_transducer_large"),
        AudioToDocumentStage(),
    ],
)
audio_results = audio_pipeline.run(executor)

# Then process text
text_pipeline = Pipeline(
    name="text_processing",
    stages=[
        ScoreFilter(filter_obj=text_quality_filter),
    ],
)
final_results = text_pipeline.run(executor, initial_tasks=audio_results)
```

**Parallel Processing**: Audio and Text (higher memory, faster)

```python
# Run a single pipeline that includes both audio and text stages using an executor
# (reuses the stage imports and text_quality_filter from the sequential example)
from nemo_curator.pipeline import Pipeline

pipeline = Pipeline(
    name="audio_text",
    stages=[
        InferenceAsrNemoStage(model_name="stt_en_fastconformer_transducer_large"),
        GetPairwiseWerStage(),
        AudioToDocumentStage(),
        ScoreFilter(filter_obj=text_quality_filter),
    ],
)
results = pipeline.run()
```

### Scaling Strategies

**Horizontal Scaling**: Distribute across several workers

* Partition audio files across workers
* Independent processing with final aggregation
* Load balancing based on audio duration

**Vertical Scaling**: Optimize single-machine performance

* GPU acceleration for ASR inference
* Batch size optimization for hardware
* Memory management for large datasets

## Design Principles

### Modularity

**Separation of Concerns**: Audio and text processing remain independent

* Audio stages focus on speech-specific operations
* Text stages handle language processing
* Integration stages manage cross-modal operations

**Modular Architecture**: Mix and match audio and text processing stages

* Flexible pipeline construction
* Reusable stage components
* Configurable integration points

### Extensibility

**Custom Integration Patterns**: Support for domain-specific workflows

* Custom conversion logic
* Specialized quality metrics
* Domain-specific filtering rules

**Plugin Architecture**: Easy addition of new integration methods

* Custom stage implementations
* External tool integration
* Specialized format support
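As one concrete example of a domain-specific filtering rule, a minimal speech-rate check in the spirit of `get_wordrate()` might look like the following. The helper names and thresholds here are illustrative assumptions, not NeMo Curator APIs:

```python
def word_rate(text: str, duration_sec: float) -> float:
    """Words per second of audio (a simplified stand-in for get_wordrate())."""
    return len(text.split()) / duration_sec

def plausible_speech_rate(
    text: str,
    duration_sec: float,
    min_wps: float = 0.5,  # illustrative lower bound, words per second
    max_wps: float = 4.0,  # illustrative upper bound, words per second
) -> bool:
    """Reject samples whose speech rate is implausibly slow or fast."""
    if duration_sec <= 0:
        return False
    return min_wps <= word_rate(text, duration_sec) <= max_wps

# Six words over 3.0 s of audio is 2.0 words/sec, inside the plausible band
print(plausible_speech_rate("this is a short test sentence", 3.0))  # True
```

A rule like this can be wrapped in a custom stage or filter and slotted between the ASR and text-curation stages, which is exactly the kind of configurable integration point the design principles above describe.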