Copyright © 2026, NVIDIA Corporation.


Process Data for Audio Curation


Process audio data you’ve loaded into AudioTask objects using NeMo Curator’s comprehensive audio processing capabilities.

NeMo Curator provides a specialized suite of tools for processing speech and audio data as part of the AI training pipeline. These tools help you transcribe, analyze, filter, and integrate audio datasets to ensure high-quality input for ASR model training and multimodal applications.

How it Works

NeMo Curator’s audio processing capabilities are organized into six main categories:

  1. ASR Inference: Transcribe audio using NeMo Framework’s pretrained ASR models
  2. Quality Assessment: Calculate transcription accuracy metrics and filter samples on them
  3. Quality Filtering: Segment, filter, and diarize raw audio into clean single-speaker training segments
  4. Audio Analysis: Extract audio characteristics such as duration and validate file formats
  5. ALM Data Curation: Extract fixed-duration training windows from diarized audio segments for audio language models
  6. Text Integration: Hand processed audio data off to text processing workflows

Each category provides GPU-accelerated implementations optimized for different speech curation needs. The result is a cleaned and filtered audio dataset with high-quality transcriptions ready for model training.


ASR Inference

Transcribe audio files using NeMo Framework’s state-of-the-art ASR models with GPU acceleration.

NeMo ASR Models

Use pretrained NeMo ASR models for accurate speech recognition.
Tags: pretrained, multilingual, gpu-accelerated

Batch Processing

Efficiently process large audio datasets with configurable batch sizes.
Tags: batch-inference, memory-optimization, scalable
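
As a rough illustration of batch-size-controlled inference, the sketch below splits a list of audio paths into fixed-size batches and passes each batch to a transcription callable. The names `batches`, `transcribe_all`, and `transcribe_batch` are hypothetical and are not part of NeMo Curator's API. In a real pipeline the callable would wrap an ASR model.

```python
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def transcribe_all(
    paths: Sequence[str],
    transcribe_batch: Callable[[Sequence[str]], List[str]],  # hypothetical model wrapper
    batch_size: int = 16,
) -> List[str]:
    """Run the callable batch by batch so only one batch is in flight at a time."""
    results: List[str] = []
    for batch in batches(paths, batch_size):
        results.extend(transcribe_batch(batch))
    return results
```

A smaller batch size lowers peak memory use at the cost of more calls. A larger one does the opposite.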

Quality Assessment

Evaluate audio quality and filter samples using transcription accuracy and audio characteristics.

WER Filtering

Filter audio samples based on Word Error Rate thresholds.
Tags: accuracy, quality-metrics, filtering
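
To make the WER threshold concrete, here is a minimal, dependency-free sketch. It computes WER as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words, expressed as a percentage. The manifest keys `text` and `pred_text`, the function names, and the default threshold are assumptions for illustration, not NeMo Curator's implementation.

```python
from typing import Dict, List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference word count * 100."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        # Convention chosen for this sketch: an empty reference scores 0 only
        # when the hypothesis is empty as well.
        return 0.0 if not hyp else 100.0
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def filter_by_wer(samples: List[Dict[str, str]], max_wer: float = 75.0) -> List[Dict[str, str]]:
    """Keep samples whose predicted transcript stays at or below max_wer percent."""
    return [s for s in samples if word_error_rate(s["text"], s["pred_text"]) <= max_wer]
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.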

Duration Filtering

Filter audio samples by duration ranges and speech rate metrics.
Tags: duration, speech-rate, range-filtering
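
A minimal sketch of range filtering on duration and speech rate follows. It assumes each sample is a dictionary with a `duration` in seconds and a transcript under `text`, and it measures speech rate in words per second. The key names, the function name, and every default bound are illustrative assumptions, not values taken from NeMo Curator.

```python
from typing import Any, Dict


def keep_sample(
    sample: Dict[str, Any],
    min_duration: float = 1.0,   # seconds, assumed bound
    max_duration: float = 20.0,  # seconds, assumed bound
    min_wps: float = 0.5,        # words per second, assumed bound
    max_wps: float = 5.0,        # words per second, assumed bound
) -> bool:
    """Return True when duration and speech rate both fall inside their ranges."""
    duration = sample["duration"]
    if duration <= 0 or not (min_duration <= duration <= max_duration):
        return False
    words_per_second = len(sample["text"].split()) / duration
    return min_wps <= words_per_second <= max_wps
```

An implausibly low or high speech rate often points to a transcript that does not match the audio, which is why it is checked alongside duration.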

Quality Filtering

Compose VAD, band, UTMOS, SIGMOS, and speaker-separation stages to extract clean single-speaker training segments from raw audio.

Quality Filtering Overview

End-to-end pipeline of preprocessing, segmentation, and filtering stages.
Tags: vad, mos-scoring, diarization

AudioDataFilterStage Composite

Single composite stage that decomposes into the full filtering pipeline from a YAML config.
Tags: composite, yaml-config, end-to-end

Audio Analysis

Extract and analyze audio file characteristics for quality control and metadata generation.

Duration Calculation

Calculate precise audio duration using the soundfile library.
Tags: soundfile, precision, metadata
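
Duration is the number of audio frames divided by the sample rate. The page names the soundfile library for this step. To stay dependency-free, the sketch below swaps in Python's standard `wave` module, which reads WAV files only. The helper names are hypothetical, and `write_silence` exists only to produce a file to measure.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file as frame count divided by sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_silence(path: str, seconds: float, sample_rate: int = 16000) -> None:
    """Write a mono, 16-bit WAV file containing only silence (demo helper)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
```

Reading the frame count from the header avoids decoding the whole file, so measuring duration stays cheap even on long recordings.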

Format Validation

Validate audio file formats and detect corrupted files.
Tags: validation, error-handling, format-support
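
One simple way to detect unreadable files is to try opening each one and treat any parse failure as a validation error. The sketch below does this for WAV files with Python's standard `wave` module. The function name and the returned `(is_valid, reason)` tuple are assumptions for illustration, and other formats would need a different reader.

```python
import wave
from typing import Tuple


def validate_wav(path: str) -> Tuple[bool, str]:
    """Return (is_valid, reason) for a WAV file without raising on bad input."""
    try:
        with wave.open(path, "rb") as wav:
            if wav.getnframes() == 0:
                return False, "no audio frames"
    except (wave.Error, EOFError, OSError) as exc:
        # wave.Error: malformed header; EOFError: truncated file;
        # OSError: missing or unreadable file.
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"
```

Returning a reason instead of raising lets a pipeline record why a file was rejected and keep processing the rest of the dataset.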

ALM Data Curation

Curate training data for audio language models by extracting fixed-duration windows from diarized audio segments.

ALM Data Builder

Construct candidate training windows from consecutive segments with quality filtering.
Tags: windowing, speaker-count, bandwidth
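
The windowing idea can be sketched as follows: starting from each diarized segment, extend the window over consecutive segments until its span reaches a target duration within a tolerance, then keep the window only if its speaker count is acceptable. The `Segment` type, the function name, and all defaults are hypothetical. The bandwidth check mentioned in the tags is omitted, and this is not NeMo Curator's actual algorithm.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Segment:
    start: float   # seconds
    end: float     # seconds
    speaker: str   # diarization label


def build_windows(
    segments: Sequence[Segment],   # assumed sorted by start time
    target: float = 120.0,         # assumed target window length in seconds
    tolerance: float = 0.1,        # accept spans within +/- 10% of target
    min_speakers: int = 2,
    max_speakers: int = 5,
) -> List[Tuple[float, float]]:
    """Return (start, end) spans of consecutive segments near the target duration."""
    lower, upper = target * (1 - tolerance), target * (1 + tolerance)
    windows: List[Tuple[float, float]] = []
    for i in range(len(segments)):
        for j in range(i, len(segments)):
            span = segments[j].end - segments[i].start
            if span > upper:
                break  # adding more segments only makes the span longer
            if span >= lower:
                speakers = {s.speaker for s in segments[i:j + 1]}
                if min_speakers <= len(speakers) <= max_speakers:
                    windows.append((segments[i].start, segments[j].end))
                break  # take the shortest fitting span per start segment
    return windows
```

Because every segment can start a window, neighboring candidates usually overlap heavily, so some form of overlap filtering is a natural follow-up step.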

ALM Overlap Filtering

Remove redundant overlapping windows based on configurable thresholds.
Tags: deduplication, overlap-ratio, target-duration
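
One possible reading of threshold-based overlap filtering is a greedy pass: rank windows by how close their length is to the target duration, then keep a window only if its overlap ratio with every window already kept stays at or below a threshold. The overlap ratio used here, the intersection divided by the shorter window's length, is an assumption. So are the function names and defaults, and NeMo Curator may define these differently.

```python
from typing import List, Sequence, Tuple

Window = Tuple[float, float]  # (start, end) in seconds


def overlap_ratio(a: Window, b: Window) -> float:
    """Intersection length divided by the shorter window's length (0.0 if disjoint)."""
    intersection = min(a[1], b[1]) - max(a[0], b[0])
    if intersection <= 0:
        return 0.0
    return intersection / min(a[1] - a[0], b[1] - b[0])


def filter_overlaps(
    windows: Sequence[Window],
    max_overlap: float = 0.5,  # assumed threshold
    target: float = 120.0,     # assumed target duration in seconds
) -> List[Window]:
    """Greedily keep windows closest to the target length that do not overlap too much."""
    ranked = sorted(windows, key=lambda w: abs((w[1] - w[0]) - target))
    kept: List[Window] = []
    for window in ranked:
        if all(overlap_ratio(window, k) <= max_overlap for k in kept):
            kept.append(window)
    return sorted(kept)
```

Ranking by closeness to the target duration means that when two windows conflict, the one nearer the desired length survives.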

Text Integration

Convert processed audio data into a format that text processing workflows can consume, for use in multimodal applications.

Audio-to-Text Conversion

Convert AudioTask objects to DocumentBatch for text processing.
Tags: format-conversion, pipeline-integration, multimodal