
Copyright © 2026, NVIDIA Corporation.

Audio Curation Pipeline (Overview)

This guide provides an overview of the end-to-end audio curation workflow in NVIDIA NeMo Curator. It covers data ingestion and validation, optional ASR inference, quality assessment, filtering, and export or conversion. For detailed ASR pipeline information, refer to ASR Pipeline.

High-Level Flow

Core Components

Data Ingestion and Validation:

  • AudioTask file existence checks using validate() and validate_item()
  • Manifest format validation and metadata consistency
  • Recommended JSONL manifest format
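
The checks above can be illustrated with a minimal, standard-library sketch. It does not use the NeMo Curator API: `validate_manifest` is a hypothetical helper, and the field names `audio_filepath` and `text` are an assumption about the manifest layout rather than something this page specifies.

```python
import json
import os

# Assumed manifest fields; adjust to match your own manifest layout.
REQUIRED_KEYS = ("audio_filepath", "text")


def validate_manifest(manifest_path):
    """Split the lines of a JSONL manifest into (valid_entries, errors).

    An entry is valid when it parses as a JSON object, carries every
    required key, and its audio file exists on disk. Each error is a
    (line_number, message) tuple.
    """
    valid, errors = [], []
    with open(manifest_path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                entry = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append((line_no, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(entry, dict):
                errors.append((line_no, "entry is not a JSON object"))
                continue
            missing = [key for key in REQUIRED_KEYS if key not in entry]
            if missing:
                errors.append((line_no, f"missing keys: {missing}"))
            elif not os.path.isfile(entry["audio_filepath"]):
                errors.append((line_no, "audio file not found"))
            else:
                valid.append(entry)
    return valid, errors
```

Collecting errors with their line numbers, instead of raising on the first failure, makes it easier to repair a large manifest in one pass.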

Optional ASR Inference:

  • InferenceAsrNemoStage for automatic speech recognition
  • Configurable batch processing with batch_size and resources parameters
  • Support for multiple NeMo ASR models
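
To show what a batch_size setting controls, here is a rough sketch of batched inference in plain Python. It is not the InferenceAsrNemoStage implementation: `transcribe_batch` is a hypothetical callable standing in for a model, and the `audio_filepath` and `pred_text` field names are assumptions.

```python
from itertools import islice


def chunked(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def transcribe_in_batches(entries, transcribe_batch, batch_size=16):
    """Return copies of entries with a 'pred_text' field added.

    transcribe_batch is a hypothetical callable that maps a list of
    file paths to a list of transcriptions in the same order. It is
    called once per batch rather than once per file.
    """
    results = []
    for batch in chunked(entries, batch_size):
        texts = transcribe_batch([entry["audio_filepath"] for entry in batch])
        if len(texts) != len(batch):
            raise ValueError("model returned a different number of texts")
        for entry, text in zip(batch, texts):
            results.append({**entry, "pred_text": text})
    return results
```

A larger batch size means fewer model calls but more memory per call, which is why the setting is usually tuned together with the resources assigned to the stage.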

Quality Assessment:

  • Audio duration analysis with GetAudioDurationStage
  • Word Error Rate (WER) and Character Error Rate (CER) calculation
  • Speech rate metrics including words per second and characters per second
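
The metrics listed above follow standard definitions: WER and CER are the edit distance between reference and hypothesis, divided by the reference length in words or characters. The sketch below computes them in plain Python. The function names are hypothetical, reporting the rates as percentages is a choice made here, and the character count includes spaces; NeMo Curator's own stages may differ on those details.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_item != hyp_item),  # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word Error Rate as a percentage of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate as a percentage of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def speech_rates(text, duration_seconds):
    """Return (words per second, characters per second) for a transcript."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return len(text.split()) / duration_seconds, len(text) / duration_seconds
```

Because the denominator is the reference length, both rates can exceed 100 when the hypothesis contains many insertions.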

Filtering and Quality Control:

  • Threshold-based filtering using PreserveByValueStage
  • Configurable quality thresholds for WER, duration, and speech rate
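
Threshold-based filtering amounts to comparing one field of each entry against a target value. The following sketch shows the idea with a hypothetical `keep_by_threshold` helper; it is not the PreserveByValueStage implementation, and the operator names and the `wer` and `duration` field names are assumptions chosen for illustration.

```python
import operator

# Comparison names mapped to functions from the standard library.
COMPARATORS = {
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
    "ne": operator.ne,
    "ge": operator.ge,
    "gt": operator.gt,
}


def keep_by_threshold(entries, key, target, op="le"):
    """Keep entries where `entry[key] <op> target` holds.

    Entries that lack the key are dropped, since they cannot be
    checked against the threshold.
    """
    if op not in COMPARATORS:
        raise ValueError(f"unknown operator: {op!r}")
    compare = COMPARATORS[op]
    return [e for e in entries if key in e and compare(e[key], target)]


# Usage: chain two filters to keep low-WER clips of at least one second.
entries = [
    {"id": "a", "wer": 12.0, "duration": 4.2},
    {"id": "b", "wer": 80.0, "duration": 3.0},
    {"id": "c", "wer": 5.0, "duration": 0.4},
]
kept = keep_by_threshold(entries, "wer", 50.0, op="le")
kept = keep_by_threshold(kept, "duration", 1.0, op="ge")
```

Chaining one filter per metric keeps each threshold independently configurable.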

Export and Format Conversion:

  • Conversion of AudioTask entries to DocumentBatch with AudioToDocumentStage
  • Integration with text processing workflows
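
Conceptually, the conversion step turns each audio entry into one row of tabular data that text tooling can read. The sketch below illustrates this with plain dictionaries and JSONL output. The helpers `to_document_rows` and `write_jsonl` are hypothetical, as are the column names; the real stage produces a DocumentBatch rather than a list of dictionaries.

```python
import io
import json


def to_document_rows(entries, columns):
    """Build one row per audio entry, restricted to the given columns.

    Columns missing from an entry are filled with None so that every
    row has the same shape.
    """
    return [{column: entry.get(column) for column in columns} for entry in entries]


def write_jsonl(rows, stream):
    """Write one JSON object per line to a text stream."""
    for row in rows:
        stream.write(json.dumps(row, ensure_ascii=False) + "\n")


# Usage: export the path, predicted text, and WER of each entry.
entries = [
    {"audio_filepath": "a.wav", "pred_text": "hello world", "wer": 0.0, "extra": 1},
    {"audio_filepath": "b.wav", "pred_text": "good morning"},
]
rows = to_document_rows(entries, ["audio_filepath", "pred_text", "wer"])
buffer = io.StringIO()
write_jsonl(rows, buffer)
```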

Common Workflows

ASR-First Workflow (Most Common):

  1. Load audio files into AudioTask format
  2. Apply ASR inference to generate transcriptions
  3. Calculate quality metrics (WER, duration, speech rate)
  4. Apply threshold-based filtering
  5. Convert to DocumentBatch for text processing integration
  6. Export filtered, high-quality audio-text pairs
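
The ordering of steps 2 to 5 can be sketched as a chain of small functions, each taking and returning a list of entries. Everything here is a stand-in: the ASR step looks up transcriptions in a dictionary instead of running a model, the only metric is words per second, and all function names, field names, and thresholds are hypothetical.

```python
def apply_asr(entries, transcripts):
    """Step 2 stand-in: look up a transcription instead of running a model."""
    return [{**e, "pred_text": transcripts[e["audio_filepath"]]} for e in entries]


def add_word_rate(entries):
    """Step 3 stand-in: words per second computed from the predicted text."""
    return [
        {**e, "word_rate": len(e["pred_text"].split()) / e["duration"]}
        for e in entries
    ]


def filter_range(entries, key, low, high):
    """Step 4: keep entries whose value lies within [low, high]."""
    return [e for e in entries if low <= e[key] <= high]


def to_rows(entries):
    """Step 5: reduce each entry to a text row plus its source path."""
    return [
        {"text": e["pred_text"], "audio_filepath": e["audio_filepath"]}
        for e in entries
    ]


def run_asr_first(entries, transcripts, min_rate=1.0, max_rate=5.0):
    """Run the stand-in steps in workflow order and return document rows."""
    staged = apply_asr(entries, transcripts)
    staged = add_word_rate(staged)
    staged = filter_range(staged, "word_rate", min_rate, max_rate)
    return to_rows(staged)
```

Metrics are computed before filtering so that every threshold has a value to test, and conversion comes last so that only surviving entries reach the text tooling.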

Quality-First Workflow (No ASR Required):

  1. Load audio files with existing transcriptions
  2. Extract audio characteristics (duration, format, sample rate)
  3. Apply basic quality filters
  4. Export validated audio dataset
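
For uncompressed WAV files, steps 2 and 3 can be approximated with Python's standard `wave` module, as in the sketch below. The helper names and the threshold defaults are hypothetical, and other formats such as FLAC or MP3 would need a third-party audio library.

```python
import wave


def wav_characteristics(path):
    """Read duration, sample rate, and channel layout from a WAV header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "duration": frames / rate,  # seconds
            "sample_rate": rate,        # Hz
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
        }


def passes_basic_filters(info, min_duration=1.0, max_duration=30.0,
                         min_sample_rate=16000):
    """Apply illustrative duration and sample-rate thresholds."""
    return (
        min_duration <= info["duration"] <= max_duration
        and info["sample_rate"] >= min_sample_rate
    )
```

Reading only the header keeps this check cheap, so it can run over a large dataset before any heavier processing.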