
Audio-Text Integration Concepts


This guide covers how audio processing integrates with text curation workflows in NeMo Curator, enabling seamless multi-modal data preparation and cross-modal quality assessment.

Integration Architecture

Audio-text integration in NeMo Curator operates on several levels:

Data Structure Integration

Format Conversion: AudioBatch to DocumentBatch

  • All fields carry over without remapping (text and pred_text are unchanged)
  • Audio metadata is preserved as extra fields for downstream processing

Metadata Preservation: Audio characteristics survive the conversion

  • File paths stay available for traceability and debugging
  • Quality metrics (WER, duration) stay available for filtering operations
  • Audio-specific metadata stays available for downstream processing stages

Pipeline Integration

Sequential Processing: Audio to Text to Multi-Modal

The AudioToDocumentStage provides the conversion bridge between audio and text processing workflows.

Parallel Processing: Simultaneous audio and text analysis

Cross-Modal Quality Assessment

Audio-Informed Text Quality

Use audio characteristics to enhance text quality assessment:

Speech Rate Analysis: Detect unnaturally fast or slow speech patterns using the get_wordrate() function

Duration-Text Consistency: Ensure transcription length matches audio duration

  • Short audio with long text: Potential transcription errors
  • Long audio with short text: Potential missing content
  • Optimal ratio: ~3-5 characters per second of audio
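
For example, a lightweight consistency check can compute the character rate and flag outliers. The sketch below is illustrative only: the helper name and the default 3-5 characters-per-second bounds (taken from the guideline above) are assumptions you should tune for your data.

# Illustrative duration-text consistency check (helper name and bounds are assumptions)
def is_duration_text_consistent(
    sample: dict,
    min_chars_per_sec: float = 3.0,
    max_chars_per_sec: float = 5.0,
) -> bool:
    """Return True when transcription length roughly matches audio duration."""
    duration = sample.get("duration", 0.0)
    text = sample.get("pred_text") or sample.get("text", "")
    if duration <= 0 or not text:
        return False
    chars_per_sec = len(text) / duration
    return min_chars_per_sec <= chars_per_sec <= max_chars_per_sec


sample = {"pred_text": "asr prediction", "duration": 3.4}
print(is_duration_text_consistent(sample))  # True: ~4.1 characters per second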

Text-Informed Audio Quality

Use text characteristics to assess audio quality:

Transcription Completeness: Detect incomplete or truncated speech

  • Sentence fragments without proper endings
  • Unusual punctuation patterns
  • Incomplete words or phrases

Content Coherence: Assess semantic consistency

  • Logical flow and coherence in transcriptions
  • Domain-appropriate vocabulary usage
  • Language consistency throughout sample
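
A minimal sketch of the transcription-completeness signals above, using simple punctuation heuristics (the patterns are assumptions; production pipelines would use language-aware checks):

import re

# Illustrative completeness heuristics (patterns are assumptions, not a built-in check)
def looks_truncated(text: str) -> bool:
    text = text.strip()
    if not text:
        return True
    if text[-1] not in ".!?":          # sentence fragment without a proper ending
        return True
    if re.search(r"[,\-]{3,}", text):  # unusual runs of punctuation
        return True
    return False


print(looks_truncated("the meeting starts at"))     # True: no sentence ending
print(looks_truncated("The meeting starts at 9."))  # False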

Workflow Patterns

Audio-First Workflows

Start with audio processing, then apply text curation:
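
A sketch of this pattern, reusing the stage names from later in this guide (the stage order is the point here; exact arguments may vary):

# Audio-first sketch: ASR, audio quality metrics, then hand-off to text curation
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage

audio_first = Pipeline(
    name="audio_first",
    stages=[
        InferenceAsrNemoStage(model_name="stt_en_fastconformer_transducer_large"),
        GetPairwiseWerStage(),   # score ASR output against ground-truth text
        AudioToDocumentStage(),  # convert to DocumentBatch for text stages
        # ... text curation stages go here ...
    ],
)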

Use Cases:

  • Speech dataset curation for ASR training
  • Podcast transcription and processing
  • Lecture and educational content preparation

Text-First Workflows

Start with text processing, then verify with audio:
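
There is no dedicated built-in stage for this pattern; one way to sketch it is to curate the existing transcripts first, then use ASR plus WER as an audio-based verification step (stage arguments are illustrative):

# Text-first sketch: curate transcripts, then verify them against the audio via WER
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage

text_first = Pipeline(
    name="text_first_verification",
    stages=[
        # ... text cleaning/filtering stages operating on the "text" field go here ...
        InferenceAsrNemoStage(model_name="stt_en_fastconformer_transducer_large"),
        GetPairwiseWerStage(),  # high WER flags transcripts that do not match their audio
    ],
)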

Use Cases:

  • Validating existing transcriptions with audio
  • Creating audio-text pairs from separate sources
  • Quality control for crowd-sourced transcriptions

Data Flow Concepts

Conversion Mechanisms

AudioBatch to DocumentBatch:

NeMo Curator provides the AudioToDocumentStage for converting audio processing results to text processing format:

from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.tasks import AudioBatch  # task type that wraps audio records (path may vary by version)

# Create the conversion stage
converter = AudioToDocumentStage()

# Example audio record produced by earlier audio stages
audio_data = {
    "audio_filepath": "/audio.wav",
    "text": "ground truth",
    "pred_text": "asr prediction",
    "wer": 15.2,
    "duration": 3.4,
}

# Wrap the record in an AudioBatch (constructor fields may differ slightly across versions)
audio_batch = AudioBatch(task_id="example", dataset_name="example", data=[audio_data])

# The stage returns a list containing one DocumentBatch
# with the same fields preserved as a pandas DataFrame
document_batches = converter.process(audio_batch)
document_batch = document_batches[0]  # Extract the single DocumentBatch

# All fields are preserved in the DocumentBatch:
# - audio_filepath, text, pred_text, wer, duration

Note: A built-in DocumentBatch to AudioBatch conversion stage is not provided. Create a custom stage if you need reverse conversion.
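
If you do need the reverse direction, a custom stage can rebuild audio records from DocumentBatch rows. The sketch below is hypothetical: it assumes a ProcessingStage base class with a process() hook, a DocumentBatch.to_pandas() accessor, and AudioBatch constructor fields as shown, all of which may differ in your version.

# Hypothetical reverse conversion stage (base-class and task APIs assumed; verify against your version)
from nemo_curator.stages.base import ProcessingStage
from nemo_curator.tasks import AudioBatch, DocumentBatch


class DocumentToAudioStage(ProcessingStage[DocumentBatch, AudioBatch]):
    """Convert DocumentBatch rows back into audio records."""

    def process(self, task: DocumentBatch) -> AudioBatch:
        records = task.to_pandas().to_dict(orient="records")
        return AudioBatch(
            task_id=task.task_id,
            dataset_name=task.dataset_name,
            data=records,
        )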

For practical usage examples and step-by-step implementation, refer to Text integration.

Metadata Flow

Additive Processing: Processing stages typically add metadata without removing existing fields

# Stage 1: Initial loading
stage1_output = {"audio_filepath": "/audio.wav", "text": "transcription"}

# Stage 2: ASR inference
stage2_output = {**stage1_output, "pred_text": "asr result"}

# Stage 3: Quality assessment
stage3_output = {**stage2_output, "wer": 15.2, "duration": 3.4}

# Stage 4: Text processing (after conversion)
stage4_output = {**stage3_output, "word_count": 6, "language": "en"}

Quality Assessment Integration

Available Quality Metrics

NeMo Curator provides these audio quality assessment capabilities:

Word Error Rate (WER) Analysis:

  • Lower WER indicates more accurate transcription
  • Available through GetPairwiseWerStage
  • Measures percentage of incorrect words between ground truth and ASR predictions
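
For example, WER can be computed per sample and then used to drop noisy pairs. In the sketch below the key names, the PreserveByValueStage arguments, and the 75.0 threshold are illustrative assumptions:

# Compute per-sample WER, then keep only samples at or below an illustrative threshold
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage
from nemo_curator.stages.audio.common import PreserveByValueStage

wer_filtering = Pipeline(
    name="wer_filtering",
    stages=[
        GetPairwiseWerStage(text_key="text", pred_text_key="pred_text", wer_key="wer"),
        PreserveByValueStage(input_value_key="wer", target_value=75.0, operator="le"),
    ],
)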

Duration and Speech Rate Analysis:

  • Duration validation using GetAudioDurationStage
  • Speech rate calculation using get_wordrate() function
  • Character rate calculation using get_charrate() function
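
A small sketch of these helpers, assuming get_wordrate() and get_charrate() take the transcription and the duration in seconds, and that GetAudioDurationStage writes a duration field (argument names are assumptions):

# Measure duration with a stage, then derive speech and character rates from the transcript
from nemo_curator.stages.audio.common import GetAudioDurationStage
from nemo_curator.stages.audio.metrics.get_wer import get_charrate, get_wordrate

duration_stage = GetAudioDurationStage(
    audio_filepath_key="audio_filepath",
    duration_key="duration",
)

sample = {"text": "ground truth transcription", "duration": 3.4}
words_per_sec = get_wordrate(sample["text"], sample["duration"])
chars_per_sec = get_charrate(sample["text"], sample["duration"])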

Individual Quality Dimensions:

  • Technical Quality: File integrity, format compliance, duration validation
  • Content Quality: Transcription accuracy via WER/CER metrics
  • Speech Rate Quality: Words/characters per second analysis

Performance and Scaling

Memory Considerations

AudioBatch Memory Usage:

  • Metadata storage scales linearly with batch size
  • Audio files loaded on-demand, not cached in memory
  • Large batches increase processing efficiency but consume more RAM

Conversion Overhead:

  • AudioBatch → DocumentBatch conversion is lightweight
  • Metadata copying has minimal performance impact
  • Batch size affects conversion performance

Processing Efficiency

Sequential vs. Parallel Integration:

Sequential Processing: Audio to Text (lower memory, slower)

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.modules.score_filter import ScoreFilter
from nemo_curator.filters import WordCountFilter  # Example filter

# Define a text quality filter
text_quality_filter = WordCountFilter(min_words=10)

# Process audio completely first
# ("executor" is an executor instance created beforehand)
audio_pipeline = Pipeline(
    name="audio_processing",
    stages=[
        InferenceAsrNemoStage(model_name="stt_en_fastconformer_transducer_large"),
        AudioToDocumentStage(),
    ],
)
audio_results = audio_pipeline.run(executor)

# Then process text
text_pipeline = Pipeline(
    name="text_processing",
    stages=[
        ScoreFilter(filter_obj=text_quality_filter),
    ],
)
final_results = text_pipeline.run(executor, initial_tasks=audio_results)

Parallel Processing: Audio and Text (higher memory, faster)

# Run a single pipeline that includes both audio and text stages using an executor
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage

# Reuses InferenceAsrNemoStage, AudioToDocumentStage, ScoreFilter, and
# text_quality_filter from the sequential example above
pipeline = Pipeline(
    name="audio_text",
    stages=[
        InferenceAsrNemoStage(model_name="stt_en_fastconformer_transducer_large"),
        GetPairwiseWerStage(),
        AudioToDocumentStage(),
        ScoreFilter(filter_obj=text_quality_filter),
    ],
)
results = pipeline.run()

Scaling Strategies

Horizontal Scaling: Distribute across several workers

  • Partition audio files across workers
  • Independent processing with final aggregation
  • Load balancing based on audio duration

Vertical Scaling: Optimize single-machine performance

  • GPU acceleration for ASR inference
  • Batch size optimization for hardware
  • Memory management for large datasets

Design Principles

Modularity

Separation of Concerns: Audio and text processing remain independent

  • Audio stages focus on speech-specific operations
  • Text stages handle language processing
  • Integration stages manage cross-modal operations

Modular Architecture: Mix and match audio and text processing stages

  • Flexible pipeline construction
  • Reusable stage components
  • Configurable integration points

Extensibility

Custom Integration Patterns: Support for domain-specific workflows

  • Custom conversion logic
  • Specialized quality metrics
  • Domain-specific filtering rules

Plugin Architecture: Easy addition of new integration methods

  • Custom stage implementations
  • External tool integration
  • Specialized format support