Process Data for Audio Curation

Process audio data you’ve loaded into AudioBatch objects using NeMo Curator’s comprehensive audio processing capabilities.

NeMo Curator provides a specialized suite of tools for processing speech and audio data as part of the AI training pipeline. These tools help you transcribe, analyze, filter, and integrate audio datasets to ensure high-quality input for ASR model training and multimodal applications.

How it Works

NeMo Curator’s audio processing capabilities are organized into four main categories:

  1. ASR Inference: Transcribe audio using NeMo Framework’s pretrained ASR models

  2. Quality Assessment: Calculate and filter based on transcription accuracy metrics

  3. Audio Analysis: Extract audio characteristics like duration and validate formats

  4. Text Integration: Convert processed audio data into text documents that text processing workflows can consume

Each category provides GPU-accelerated implementations optimized for different speech curation needs. The result is a cleaned and filtered audio dataset with high-quality transcriptions ready for model training.


ASR Inference

Transcribe audio files using NeMo Framework’s state-of-the-art ASR models with GPU acceleration.

NeMo ASR Models

Use pretrained NeMo ASR models for accurate speech recognition

Batch Processing

Efficiently process large audio datasets with configurable batch sizes

ASR Inference
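As a rough illustration of what a configurable batch size means in practice, the sketch below splits a list of audio file paths into fixed-size batches that could then be handed to a transcription step one batch at a time. The function name `iter_batches` is hypothetical and is not part of NeMo Curator; it only shows the chunking idea.

```python
from typing import Iterator, List, Sequence


def iter_batches(paths: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `batch_size` file paths.

    Hypothetical helper for illustration; not a NeMo Curator API.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(paths), batch_size):
        # The final batch may be smaller than batch_size.
        yield list(paths[start:start + batch_size])


if __name__ == "__main__":
    files = ["a.wav", "b.wav", "c.wav", "d.wav", "e.wav"]
    for batch in iter_batches(files, batch_size=2):
        print(batch)
```

Larger batches generally make better use of the GPU per call, at the cost of more memory per batch, so the right size depends on the model and the length of the audio files.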

Quality Assessment

Evaluate audio quality and filter samples using transcription accuracy and audio characteristics.

WER Filtering

Filter audio samples based on Word Error Rate thresholds

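To make the threshold idea concrete, here is a minimal pure-Python sketch of Word Error Rate: the word-level edit distance (substitutions, insertions, deletions) between a reference transcript and a predicted one, divided by the number of reference words. The record keys `text` and `pred_text` and the function names are assumptions chosen for illustration, not NeMo Curator's API, and WER is returned as a fraction rather than a percentage.

```python
from typing import Dict, List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def filter_by_wer(samples: List[Dict[str, str]], max_wer: float) -> List[Dict[str, str]]:
    """Keep samples whose WER is at or below max_wer (keys are hypothetical)."""
    return [s for s in samples
            if word_error_rate(s["text"], s["pred_text"]) <= max_wer]
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference, so a threshold does not have to be capped at 100%.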
Duration Filtering

Filter audio samples by duration ranges and speech rate metrics

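A duration and speech-rate filter can be sketched as a simple predicate over per-sample metadata. The example below assumes each sample is a dictionary with a `duration` in seconds and a transcript under `text`; those keys, the function name, and the use of words per second as the speech-rate metric are illustrative assumptions rather than NeMo Curator's actual interface.

```python
from typing import Any, Dict


def keep_sample(sample: Dict[str, Any],
                min_duration: float,
                max_duration: float,
                max_words_per_second: float) -> bool:
    """Return True if the sample passes the duration and speech-rate checks."""
    duration = sample["duration"]
    # Reject non-positive durations up front to avoid dividing by zero.
    if duration <= 0 or not (min_duration <= duration <= max_duration):
        return False
    words_per_second = len(sample["text"].split()) / duration
    return words_per_second <= max_words_per_second


if __name__ == "__main__":
    sample = {"duration": 2.0, "text": "one two three four"}
    print(keep_sample(sample, 1.0, 10.0, 3.0))
```

An unusually high speech rate often signals a transcript that does not match its audio segment, which is why the two checks are commonly applied together.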

Audio Analysis

Extract and analyze audio file characteristics for quality control and metadata generation.

Duration Calculation

Calculate precise audio duration using the soundfile library

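Duration is the number of audio frames divided by the sample rate. The sketch below shows that calculation with Python's standard-library `wave` module, swapped in for soundfile so the example runs without extra dependencies; it handles PCM WAV files only. The function name `wav_duration_seconds` is hypothetical.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds.

    Uses the stdlib wave module for illustration. With the soundfile
    package installed, soundfile.info(path).duration reports the same
    quantity and covers more formats.
    """
    with wave.open(path, "rb") as wav:
        # Frames counts samples per channel, so channel count does not matter.
        return wav.getnframes() / wav.getframerate()
```

Reading only the header in this way avoids decoding the full audio payload, which matters when scanning large datasets.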
Format Validation

Validate audio file formats and detect corrupted files

Audio Format Support
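One simple way to detect corrupted files is to try to open each file and treat any parsing failure as invalid. The sketch below does this for PCM WAV files with the standard-library `wave` module; it is an illustrative stand-in, not NeMo Curator's validation logic, and the function name `is_valid_wav` is hypothetical.

```python
import wave


def is_valid_wav(path: str) -> bool:
    """Return True if `path` opens as a non-empty PCM WAV file."""
    try:
        with wave.open(path, "rb") as wav:
            # A readable header with no frames is still unusable for training.
            return wav.getnframes() > 0 and wav.getframerate() > 0
    except (wave.Error, EOFError, OSError):
        # wave.Error: malformed or non-WAV content; EOFError: truncated file;
        # OSError: missing or unreadable path.
        return False
```

A header check like this catches truncated or mislabeled files cheaply, though it will not notice damage confined to the audio payload itself.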

Text Integration

Convert processed audio data into text documents so it can feed text processing workflows for multimodal applications.

Audio-to-Text Conversion

Convert AudioBatch objects to DocumentBatch for text processing

Text Integration for Audio Data
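Conceptually, the conversion keeps the transcript as the document text and carries the remaining audio fields along as metadata. The sketch below shows that idea with plain dictionaries; it does not use the real AudioBatch or DocumentBatch classes, and the keys `audio_filepath` and `pred_text` as well as the function name are assumptions made for illustration.

```python
from typing import Any, Dict, List


def audio_records_to_documents(records: List[Dict[str, Any]],
                               text_key: str = "pred_text") -> List[Dict[str, Any]]:
    """Turn transcribed audio records into text documents.

    Hypothetical stand-in for an audio-to-text conversion step.
    """
    documents = []
    for record in records:
        text = (record.get(text_key) or "").strip()
        if not text:
            # Records without a transcript have nothing to pass downstream.
            continue
        documents.append({
            "id": record["audio_filepath"],
            "text": text,
            # Keep every non-text field so audio metrics stay available.
            "metadata": {k: v for k, v in record.items() if k != text_key},
        })
    return documents
```

Keeping the audio fields as metadata lets later text filters still reference values such as duration without going back to the audio files.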