
Extract Speech with NeMo Retriever Library

This documentation describes two methods to run NeMo Retriever Library with the parakeet-1-1b-ctc-en-us ASR NIM microservice (nvcr.io/nim/nvidia/parakeet-1-1b-ctc-en-us) to extract speech from audio files.

  • Run the NIM locally by using Docker Compose
  • Use NVIDIA Cloud Functions (NVCF) endpoints for cloud-based inference

Note

NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.

Currently, you can extract speech from the following file types:

  • mp3
  • wav
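Because only these two file types are listed, it can help to filter a directory before ingestion. The following is a minimal sketch that uses only the Python standard library; find_supported_audio and SUPPORTED_AUDIO_EXTENSIONS are hypothetical names for illustration and are not part of NeMo Retriever Library.

```python
from pathlib import Path

# Extensions listed as supported in this document.
SUPPORTED_AUDIO_EXTENSIONS = {".mp3", ".wav"}


def find_supported_audio(directory):
    """Return sorted file paths in `directory` that end in .mp3 or .wav.

    Hypothetical helper for illustration; the extension check is case-insensitive.
    """
    root = Path(directory)
    return sorted(
        str(path)
        for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_AUDIO_EXTENSIONS
    )
```

The returned list could then be handed to the files method, assuming that method accepts a list of paths as well as a glob pattern.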

Overview

NeMo Retriever Library supports extracting speech from audio files for Retrieval Augmented Generation (RAG) applications. Just as the multimodal document extraction pipeline uses object detection and image OCR microservices, NeMo Retriever Library uses the parakeet-1-1b-ctc-en-us ASR NIM microservice to transcribe speech to text. The transcribed text is then embedded by using the NeMo Retriever embedding NIM.

Important

Due to limitations in available VRAM controls in the current release, the parakeet-1-1b-ctc-en-us ASR NIM microservice must run on a dedicated additional GPU. For the full list of requirements, refer to Support Matrix.

This pipeline enables users to retrieve speech files at the segment level.

Overview diagram

Run the NIM Locally by Using Docker Compose

Use the following procedure to run the NIM locally.

Important

The parakeet-1-1b-ctc-en-us ASR NIM microservice must run on a dedicated additional GPU. Edit docker-compose.yaml and set device_ids for the service to a dedicated GPU, for example device_ids: ["1"] or a higher index.
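Before editing docker-compose.yaml, you may want to confirm that the host has a GPU other than index 0. The following is a rough sketch that assumes nvidia-smi is installed and on the PATH. The function names are hypothetical, and treating index 0 as already in use by other services is an assumption drawn from the device_ids example above.

```python
import subprocess


def parse_gpu_indices(nvidia_smi_output):
    """Parse the output of `nvidia-smi --query-gpu=index --format=csv,noheader`."""
    return [int(line) for line in nvidia_smi_output.splitlines() if line.strip()]


def dedicated_gpu_candidates(indices):
    """Return GPU indices of 1 or higher as strings, ready for device_ids."""
    return [str(index) for index in indices if index >= 1]


def list_gpu_indices():
    """Query the local GPUs; raises if nvidia-smi is missing or fails."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_indices(result.stdout)


# Example (requires an NVIDIA driver on the host):
#   print(dedicated_gpu_candidates(list_gpu_indices()))
```

An empty result suggests that the host has only one GPU, in which case the dedicated-GPU requirement above cannot be met on that machine.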

  1. To access the required container images, log in to the NVIDIA Container Registry (nvcr.io). Use your NGC key as the password. Run the following command in your terminal.

    • Replace <your-ngc-key> with your actual NGC API key.
    • The username is always the literal string $oauthtoken.

    $ docker login nvcr.io
    Username: $oauthtoken
    Password: <your-ngc-key>
    
  2. For convenience and security, store your NGC key in an environment variable file (.env). This enables services to access it without needing to enter the key manually each time. Create a .env file in your working directory and add the following line. Replace <your-ngc-key> with your actual NGC key.

    NGC_API_KEY=<your-ngc-key>
    
  3. Start the retriever services with the retrieval and audio profiles by using the following command. The --profile audio flag launches the speech-specific services that audio processing requires. For more information, refer to Profile Information.

    docker compose --profile retrieval --profile audio up
    
  4. After the services are running, you can interact with the pipeline by using Python.

    • The Ingestor object initializes the ingestion process.
    • The files method specifies the input files to process.
    • The extract method tells the pipeline to extract information from WAV audio files.
    • The document_type parameter is optional, because Ingestor should detect the file type automatically.

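    # Assumed import path (nv-ingest Python client); adjust it if your installed package differs.
    from nv_ingest_client.client import Ingestor

    ingestor = (
        Ingestor()
        .files("./data/*.wav")
        .extract(
            document_type="wav",  # Ingestor should detect type automatically in most cases
            extract_method="audio",
            extract_audio_params={
                "segment_audio": True,  # One extracted element per sentence-like ASR segment
            },
        )
    )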
    
    To generate one extracted element for each sentence-like ASR segment, include extract_audio_params={"segment_audio": True} when calling .extract(...). This option applies when audio extraction runs with a Parakeet NIM (either locally through Docker or remotely via NVCF) but has no effect when using the local Hugging Face Parakeet model.

    Tip

    For more Python examples, refer to NV-Ingest: Python Client Quick Start Guide.
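As a pre-flight check before starting the services, you may want to confirm that the .env file from step 2 defines NGC_API_KEY and that the placeholder was replaced. The following is a minimal sketch that handles only simple KEY=VALUE lines, not the full Docker Compose .env syntax. read_env_file and has_ngc_api_key are hypothetical helper names.

```python
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def has_ngc_api_key(path=".env"):
    """Return True if the file defines NGC_API_KEY with a non-placeholder value."""
    key = read_env_file(path).get("NGC_API_KEY", "")
    return bool(key) and key != "<your-ngc-key>"
```

For example, has_ngc_api_key(".env") returns False when the file still contains the <your-ngc-key> placeholder.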

Use NVCF Endpoints for Cloud-Based Inference

Instead of running the pipeline locally, you can use NVCF to perform inference by using remote endpoints.

  1. NVCF requires an authentication token and a function ID for access. Ensure you have these credentials ready before making API calls.

  2. Run inference by using Python. Provide an NVCF endpoint along with authentication details.

    • The Ingestor object initializes the ingestion process.
    • The files method specifies the input files to process.
    • The extract method tells the pipeline to extract information from MP3 audio files.
    • The document_type parameter is optional, because Ingestor should detect the file type automatically.

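    # Assumed import path (nv-ingest Python client); adjust it if your installed package differs.
    from nv_ingest_client.client import Ingestor

    ingestor = (
        Ingestor()
        .files("./data/*.mp3")
        .extract(
            document_type="mp3",
            extract_method="audio",
            extract_audio_params={
                "grpc_endpoint": "grpc.nvcf.nvidia.com:443",  # Remote NVCF gRPC endpoint
                "auth_token": "<API key>",  # Replace with your authentication token
                "function_id": "<function ID>",  # Replace with your NVCF function ID
                "use_ssl": True,
                "segment_audio": True,
            },
        )
    )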
    

    Tip

    For more Python examples, refer to NV-Ingest: Python Client Quick Start Guide.
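To avoid hard-coding the authentication token and function ID in source files, the extract_audio_params dictionary can be built from environment variables. The following is a minimal sketch. The variable names NVCF_AUTH_TOKEN and NVCF_FUNCTION_ID and the helper nvcf_audio_params are hypothetical and are not defined by NeMo Retriever Library; the dictionary keys and the endpoint are the ones shown in the preceding example.

```python
import os


def nvcf_audio_params(segment_audio=True, environ=None):
    """Build extract_audio_params for an NVCF endpoint from environment variables.

    NVCF_AUTH_TOKEN and NVCF_FUNCTION_ID are hypothetical variable names.
    """
    env = os.environ if environ is None else environ
    required = ("NVCF_AUTH_TOKEN", "NVCF_FUNCTION_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "grpc_endpoint": "grpc.nvcf.nvidia.com:443",
        "auth_token": env["NVCF_AUTH_TOKEN"],
        "function_id": env["NVCF_FUNCTION_ID"],
        "use_ssl": True,
        "segment_audio": segment_audio,
    }
```

The result can be passed as extract_audio_params=nvcf_audio_params() in the extract call. Failing early on a missing variable gives a clearer error than a rejected request to the remote endpoint.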