Process Data for Audio Curation


Process audio data you’ve loaded into AudioTask objects using NeMo Curator’s comprehensive audio processing capabilities.

NeMo Curator provides a specialized suite of tools for processing speech and audio data as part of the AI training pipeline. These tools help you transcribe, analyze, filter, and integrate audio datasets to ensure high-quality input for ASR model training and multimodal applications.

How it Works

NeMo Curator’s audio processing capabilities are organized into six main categories:

  1. ASR Inference: Transcribe audio using NeMo Framework’s pretrained ASR models
  2. Quality Assessment: Calculate transcription accuracy metrics and filter on them
  3. Quality Filtering: Segment, filter, and diarize raw audio into clean single-speaker training segments
  4. Audio Analysis: Extract audio characteristics such as duration and validate formats
  5. ALM Data Curation: Extract fixed-duration windows from diarized segments for audio language model training
  6. Text Integration: Hand processed audio data off to text processing workflows

Each category provides GPU-accelerated implementations optimized for different speech curation needs. The result is a cleaned and filtered audio dataset with high-quality transcriptions ready for model training.
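The staged flow described above can be sketched as a minimal pipeline in plain Python. The record fields and stage functions here (`AudioRecord`, `fake_transcribe`, `filter_short`) are illustrative placeholders, not NeMo Curator's actual API:

```python
from dataclasses import dataclass

# One audio record flowing through the pipeline; field names are
# illustrative stand-ins for manifest-style metadata.
@dataclass
class AudioRecord:
    audio_filepath: str
    duration: float = 0.0
    pred_text: str = ""

def run_pipeline(records, stages):
    """Apply each stage (a function: list -> list) in order."""
    for stage in stages:
        records = stage(records)
    return records

# Illustrative stages: a real pipeline would run ASR models, quality
# scorers, and so on at each step.
def fake_transcribe(records):
    for r in records:
        r.pred_text = "hello world"   # stand-in for real ASR output
    return records

def filter_short(records, min_duration=1.0):
    return [r for r in records if r.duration >= min_duration]

records = [AudioRecord("a.wav", duration=3.2), AudioRecord("b.wav", duration=0.4)]
kept = run_pipeline(records, [fake_transcribe,
                              lambda rs: filter_short(rs, min_duration=1.0)])
print([r.audio_filepath for r in kept])  # only a.wav survives the duration gate
```

The point of the shape is composability: each stage consumes and emits the same record list, so categories can be mixed and reordered per curation need.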


ASR Inference

Transcribe audio files using NeMo Framework’s state-of-the-art ASR models with GPU acceleration.
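As a concrete setup step, NeMo-style ASR tooling conventionally reads a JSON-lines manifest with one `audio_filepath` entry per utterance. The paths and durations below are invented for illustration, and the commented model call shows only the general shape of a transcription pass, not an exact signature:

```python
import json

# Invented example utterances; in practice these come from scanning your dataset.
entries = [
    {"audio_filepath": "clips/utt_0001.wav", "duration": 4.7},
    {"audio_filepath": "clips/utt_0002.wav", "duration": 2.1},
]

# One JSON object per line is the manifest convention.
manifest = "\n".join(json.dumps(e) for e in entries)
print(manifest)

# A GPU-accelerated transcription pass would then consume these paths, roughly:
#   model = ASRModel.from_pretrained("<pretrained-model-name>")   # hypothetical name
#   predictions = model.transcribe([e["audio_filepath"] for e in entries])
```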

Quality Assessment

Evaluate and filter audio quality using transcription accuracy and audio characteristics.
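The standard transcription-accuracy metric here is word error rate (WER): word-level edit distance between reference and hypothesis, divided by the reference length. A self-contained sketch of the computation, independent of any curation library:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

score = wer("the cat sat on the mat", "the cat sat on mat")
print(round(score, 3))  # one deletion over six reference words -> 0.167
```

Filtering then amounts to keeping records whose WER falls below a chosen threshold (e.g. 0.5 for noisy web data, tighter for curated corpora).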

Quality Filtering

Compose VAD, band, UTMOS, SIGMOS, and speaker-separation stages to extract clean single-speaker training segments from raw audio.
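Composing these stages reduces to applying a sequence of per-segment gates. The score fields below (`speech_prob`, `utmos`, `n_speakers`) are illustrative stand-ins for signals that VAD, UTMOS/SIGMOS, and speaker-separation stages might attach, and the thresholds are arbitrary examples:

```python
# Each segment is a dict of quality signals attached by earlier stages
# (field names and values invented for illustration).
segments = [
    {"start": 0.0,  "end": 4.2,  "speech_prob": 0.97, "utmos": 3.8, "n_speakers": 1},
    {"start": 4.2,  "end": 6.0,  "speech_prob": 0.31, "utmos": 3.9, "n_speakers": 1},  # mostly non-speech
    {"start": 6.0,  "end": 11.5, "speech_prob": 0.95, "utmos": 2.1, "n_speakers": 1},  # low perceptual quality
    {"start": 11.5, "end": 15.0, "speech_prob": 0.96, "utmos": 4.1, "n_speakers": 2},  # overlapping speakers
]

filters = [
    lambda s: s["speech_prob"] >= 0.5,   # VAD-style speech gate
    lambda s: s["utmos"] >= 3.0,         # perceptual-quality gate
    lambda s: s["n_speakers"] == 1,      # single-speaker gate
]

# A segment survives only if every gate passes.
clean = [s for s in segments if all(f(s) for f in filters)]
print(len(clean))  # only the first segment passes all three gates
```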

Audio Analysis

Extract and analyze audio file characteristics for quality control and metadata generation.
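Basic characteristics such as duration, sample rate, and channel count can be read straight from a WAV header. A minimal sketch using only the Python standard library (the demo synthesizes a silent clip in memory so it is self-contained):

```python
import io
import struct
import wave

def make_silent_wav(seconds: float, rate: int = 16000) -> bytes:
    """Create an in-memory mono 16-bit PCM WAV of silence (demo input only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(struct.pack("<h", 0) * int(seconds * rate))
    return buf.getvalue()

def wav_stats(data: bytes) -> dict:
    """Read duration and format metadata from the WAV header."""
    with wave.open(io.BytesIO(data), "rb") as w:
        return {
            "sample_rate": w.getframerate(),
            "channels": w.getnchannels(),
            "duration": w.getnframes() / w.getframerate(),
        }

stats = wav_stats(make_silent_wav(2.5))
print(stats)  # {'sample_rate': 16000, 'channels': 1, 'duration': 2.5}
```

The same check doubles as format validation: `wave.open` raises an error on malformed or non-WAV input, which is a cheap first gate before heavier processing.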

ALM Data Curation

Curate training data for audio language models by extracting fixed-duration windows from diarized audio segments.
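The core windowing operation can be sketched independently of any library: slide a fixed-duration window across a diarized segment and keep only windows that fit entirely inside it. The segment bounds, window length, and hop below are invented for illustration:

```python
def extract_windows(seg_start: float, seg_end: float, window: float, hop: float):
    """Slide a fixed-duration window across a diarized segment,
    keeping only windows fully contained in the segment."""
    out = []
    t = seg_start
    while t + window <= seg_end:
        out.append((round(t, 3), round(t + window, 3)))
        t += hop
    return out

# Illustrative 10 s single-speaker segment, 4 s windows with a 2 s hop.
windows = extract_windows(0.0, 10.0, window=4.0, hop=2.0)
print(windows)  # [(0.0, 4.0), (2.0, 6.0), (4.0, 8.0), (6.0, 10.0)]
```

Overlapping hops trade more training windows for some redundancy; a hop equal to the window length yields disjoint coverage instead.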

Text Integration

Convert processed audio records into text documents so they can flow into text processing workflows for multimodal applications.
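Conceptually, this handoff promotes each transcription to a text document while keeping a pointer back to the source audio. A minimal sketch with invented field names (`pred_text`, `source_audio`), not NeMo Curator's actual schema:

```python
# Illustrative curated audio records with model transcriptions attached.
audio_records = [
    {"audio_filepath": "clips/utt_0001.wav", "pred_text": "turn left at the light"},
    {"audio_filepath": "clips/utt_0002.wav", "pred_text": "set a timer"},
]

def to_text_documents(records):
    """Promote each transcription to a text document, keeping a reference
    to the source audio so multimodal pairing remains possible downstream."""
    return [
        {"text": r["pred_text"], "source_audio": r["audio_filepath"]}
        for r in records
    ]

docs = to_text_documents(audio_records)
print(docs[0]["text"])  # turn left at the light
```

Once in this form, the records can enter the same deduplication, filtering, and classification stages used for ordinary text corpora.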