Audio and video ingestion

Use this page for speech and audio extraction with Parakeet ASR and for video workflows that combine audio with OCR on frames or derived images.

For air-gapped or disconnected deployments, see Air-gapped and disconnected deployment.

Sections: Speech and audio (Parakeet) · Run Parakeet on the cluster (Helm) · Parakeet with hosted inference (build.nvidia.com) · Video and frame OCR

Speech and audio extraction

This documentation describes two ways to run NeMo Retriever Library with the parakeet-1-1b-ctc-en-us ASR NIM microservice (nvcr.io/nim/nvidia/parakeet-1-1b-ctc-en-us) to extract speech from audio files:

  • Run the NIM locally on your cluster with the NeMo Retriever Helm chart
  • Use NVIDIA Cloud Functions (NVCF) endpoints for cloud-based inference

Supported file types for speech extraction today:

NeMo Retriever Library supports extracting speech from audio for Retrieval Augmented Generation (RAG). Just as the multimodal document pipeline uses detection and OCR microservices, NeMo Retriever Library uses the parakeet-1-1b-ctc-en-us ASR NIM to transcribe speech to text, and then generates embeddings through the NeMo Retriever embedding path.

Before running audio extraction from Python with either self-hosted or hosted Parakeet, install the multimedia extra so the Parakeet ASR client can decode and resample audio:

pip install "nemo-retriever[multimedia]"
# For local GPU inference, include both extras:
pip install "nemo-retriever[local,multimedia]"

The Python package includes the ffmpeg-python wrapper, and the multimedia extra adds Python libraries for audio decoding and resampling. These Python dependencies do not install the ffmpeg or ffprobe command-line binaries. For audio and video workflows, install system FFmpeg so both binaries are on PATH:

sudo apt-get update && sudo apt-get install -y --no-install-recommends ffmpeg

Containers use the FFmpeg package from the base Ubuntu image, rather than a source-built FFmpeg release. If your workflow depends on exact FFmpeg version or codec behavior, verify the package inside the image against those requirements.
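As a quick sanity check before running audio or video ingestion, the following minimal sketch confirms that both binaries resolve on PATH and prints the version line each one reports. It uses only the Python standard library; the helper names `find_ffmpeg_tools` and `version_line` are illustrative and are not part of NeMo Retriever Library.

```python
import shutil
import subprocess

# The two command-line binaries that the Python dependencies do not install.
REQUIRED_TOOLS = ("ffmpeg", "ffprobe")


def find_ffmpeg_tools(which=shutil.which):
    """Map each required binary name to its resolved path, or None if it is not on PATH."""
    return {name: which(name) for name in REQUIRED_TOOLS}


def version_line(binary_path):
    """Return the first line printed by `<binary> -version`, or "" if nothing was printed."""
    result = subprocess.run(
        [binary_path, "-version"], capture_output=True, text=True
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    for name, path in find_ffmpeg_tools().items():
        if path is None:
            print(f"{name}: not found on PATH")
        else:
            print(f"{name}: {path} ({version_line(path)})")
```

Running the same script inside the container image is one way to compare the packaged FFmpeg version against a workflow's requirements.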

For Kubernetes deployments with network access to package repositories, set service.installFfmpeg=true in the Helm chart to install ffmpeg/ffprobe at service startup. This runtime path requires package-repository network egress, a writable root filesystem, and a security policy that allows the image's scoped sudo use. For air-gapped clusters, see Air-gapped and disconnected deployment.

Important

Due to limitations in available VRAM controls in the current release, the parakeet-1-1b-ctc-en-us ASR NIM must run on a dedicated additional GPU. For the full list of requirements, refer to the Pre-Requisites & Support Matrix.

This pipeline enables retrieval at the speech segment level when you enable segmenting (see examples below).

Overview diagram

Run Parakeet on the cluster (Helm)

Use the following procedure to run the NIM on your own infrastructure. Self-hosted Parakeet runs on Kubernetes via the NeMo Retriever Helm chart. Enable the ASR NIM per Optional Helm NIMs and the Helm chart — NIM operator sub-stack; pin the workload to a dedicated GPU and wire the ASR endpoint in your pipeline.

Important

Pin the Parakeet workload to the dedicated GPU with your Helm values or the NIM Operator (for example, node selectors, resource limits, or device requests appropriate to your cluster).

  1. Deploy or upgrade with the NeMo Retriever Helm chart and enable Parakeet for your release (see Optional Helm NIMs). Follow Deployment options.

  2. If the service will process audio or video files, set service.installFfmpeg=true in the Helm chart when your cluster allows runtime package installation; for air-gapped clusters, see Air-gapped and disconnected deployment and the Helm chart README for service.image overrides.

  3. After the services are running, interact with the pipeline from Python.

    • The Ingestor object initializes the ingestion process.
    • The files method specifies the input files to process.
    • The extract_audio method runs audio extraction.
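    # Ingestor is assumed to be imported already; this snippet does not show that
    # import (see the Python Quick Start Guide).
    from nemo_retriever.params.models import ASRParams

    # Extract speech from every WAV file under ./data, one element per ASR segment.
    ingestor = (
        Ingestor()
        .files("./data/*.wav")
        .extract_audio(
            asr_params=ASRParams(segment_audio=True),
        )
    )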
    

    To generate one extracted element for each sentence-like ASR segment, pass asr_params=ASRParams(segment_audio=True) to .extract_audio(...). This option applies when audio extraction runs against a self-hosted Parakeet NIM or build.nvidia.com hosted inference. It has no effect with the local Hugging Face Parakeet model.

    Tip

    For more Python examples, refer to Python Quick Start Guide.
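Before handing a pattern such as "./data/*.wav" to the files method, it can help to preview what it matches. The sketch below uses only the standard library and assumes the pattern is expanded with ordinary glob semantics, which this page does not confirm. `preview_inputs` is a hypothetical helper, not a library function.

```python
import glob
import os


def preview_inputs(pattern):
    """Return (matched_paths, empty_paths) for a glob pattern such as "./data/*.wav"."""
    # Keep regular files only, in a stable order.
    matched = sorted(p for p in glob.glob(pattern) if os.path.isfile(p))
    # Zero-byte files cannot contain audio, so flag them separately.
    empty = [p for p in matched if os.path.getsize(p) == 0]
    return matched, empty


if __name__ == "__main__":
    matched, empty = preview_inputs("./data/*.wav")
    print(f"{len(matched)} file(s) matched")
    for path in empty:
        print(f"warning: {path} is empty")
```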

Parakeet with hosted inference (build.nvidia.com)

Instead of running the Parakeet NIM on your own cluster, you can call Parakeet through build.nvidia.com hosted inference.

  1. On the Parakeet model page on build.nvidia.com, create or copy an API key and note the function ID for hosted access. You need both before making API calls.

  2. Run inference from Python with the hosted gRPC endpoint and credentials from that page (the example below uses the default hosted gRPC hostname; confirm values in the Get API Key flow for your deployment).

    • The Ingestor object initializes the ingestion process.
    • The files method specifies the input files to process.
    • The extract_audio method runs audio extraction.
    • The hosted gRPC endpoint, function ID, and API key are routed through ASRParams. Pass them via asr_params=ASRParams(...); the ASR actor reads audio_endpoints, function_id, and auth_token from that object.
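    # Ingestor is assumed to be imported already; this snippet does not show that
    # import (see the Python Quick Start Guide).
    from nemo_retriever.params.models import ASRParams

    ingestor = (
        Ingestor()
        .files("./data/*.mp3")
        .extract_audio(
            asr_params=ASRParams(
                # Tuple order is (grpc_endpoint, http_endpoint); no HTTP endpoint is set here.
                audio_endpoints=("grpc.nvcf.nvidia.com:443", None),
                function_id="<function ID>",  # from the Parakeet model page
                auth_token="<API key>",  # from the Parakeet model page
                segment_audio=True,
            ),
        )
    )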
    

    Tip

    For more Python examples, refer to Python Quick Start Guide.
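To keep the API key out of source code, the hosted settings can be read from environment variables and passed to ASRParams as keyword arguments. The following is a minimal sketch using only the standard library. The variable names NVIDIA_API_KEY, PARAKEET_FUNCTION_ID, and PARAKEET_GRPC_ENDPOINT are assumptions chosen for illustration, and `hosted_asr_kwargs` is a hypothetical helper. The field names and the default gRPC hostname come from the example above.

```python
import os

# Default hosted gRPC hostname used in the example on this page.
DEFAULT_GRPC_ENDPOINT = "grpc.nvcf.nvidia.com:443"


def hosted_asr_kwargs(env=os.environ):
    """Build ASRParams keyword arguments from environment variables (names are assumptions)."""
    required = ("NVIDIA_API_KEY", "PARAKEET_FUNCTION_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        # Tuple order is (grpc_endpoint, http_endpoint).
        "audio_endpoints": (env.get("PARAKEET_GRPC_ENDPOINT", DEFAULT_GRPC_ENDPOINT), None),
        "function_id": env["PARAKEET_FUNCTION_ID"],
        "auth_token": env["NVIDIA_API_KEY"],
        "segment_audio": True,
    }
```

With the variables exported, the call in the hosted example would become asr_params=ASRParams(**hosted_asr_kwargs()).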

Video and frame OCR

For video assets, NeMo Retriever Library can combine audio or speech processing (see Speech and audio extraction above) with visual text extraction when OCR applies to frames or derived images.

For OCR-oriented extract methods on scanned or image-heavy content, see OCR and scanned documents, text and layout extraction, and Nemotron Parse for advanced visual parsing.

Container formats and early-access video types are listed under supported file types and formats (see What is NeMo Retriever Library? for the full list).

For end-to-end RAG stacks that include multimodal ingestion, see the NVIDIA AI Blueprints catalog and related solution pages on NVIDIA Build.