Extract Speech with NeMo Retriever Library
This documentation describes two methods to run NeMo Retriever Library
with the parakeet-1-1b-ctc-en-us ASR NIM microservice
(nvcr.io/nim/nvidia/parakeet-1-1b-ctc-en-us) to extract speech from audio files.
- Run the NIM locally by using Docker Compose
- Use NVIDIA Cloud Functions (NVCF) endpoints for cloud-based inference
Note
NVIDIA Ingest (nv-ingest) has been renamed NeMo Retriever Library.
Currently, you can extract speech from the following file types:
- mp3
- wav
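As a minimal sketch, the supported file types can be applied as a pre-filter before any paths are handed to the library. The helper name find_supported_audio is hypothetical and not part of NeMo Retriever Library. It checks file extensions only and does not validate the audio content.

```python
from pathlib import Path

# Extensions taken from the supported file types listed above.
SUPPORTED_AUDIO_EXTENSIONS = {".mp3", ".wav"}


def find_supported_audio(directory):
    """Return sorted paths under `directory` that end in .mp3 or .wav.

    Hypothetical helper: the match is case-insensitive and based only on
    the file extension, not on the file contents.
    """
    root = Path(directory)
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_AUDIO_EXTENSIONS
    )
```

For example, find_supported_audio("./data") returns only the .mp3 and .wav files below ./data, skipping everything else.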
Overview
NeMo Retriever Library supports extracting speech from audio files for Retrieval-Augmented Generation (RAG) applications. Just as the multimodal document extraction pipeline uses object detection and image OCR microservices, NeMo Retriever Library uses the parakeet-1-1b-ctc-en-us ASR NIM microservice to transcribe speech to text. The resulting text is then embedded by using the NeMo Retriever embedding NIM.
Important
Due to limitations in available VRAM controls in the current release, the parakeet-1-1b-ctc-en-us ASR NIM microservice must run on a dedicated additional GPU. For the full list of requirements, refer to Support Matrix.
This pipeline enables users to retrieve speech files at the segment level.

Run the NIM Locally by Using Docker Compose
Use the following procedure to run the NIM locally.
Important
The parakeet-1-1b-ctc-en-us ASR NIM microservice must run on a dedicated additional GPU. Edit docker-compose.yaml and set device_ids for the service to a dedicated GPU, for example device_ids: ["1"] or higher.
- To access the required container images, log in to the NVIDIA Container Registry (nvcr.io). Use your NGC key as the password. Run the following command in your terminal.

  - Replace <your-ngc-key> with your actual NGC API key.
  - The username is always $oauthtoken.

      $ docker login nvcr.io
      Username: $oauthtoken
      Password: <your-ngc-key>

- For convenience and security, store your NGC key in an environment variable file (.env). This enables services to access it without needing to enter the key manually each time. Create a .env file in your working directory and add the following line. Replace <your-ngc-key> with your actual NGC key.

      NGC_API_KEY=<your-ngc-key>

- Start the retriever services with the audio profile. This profile includes the necessary components for audio processing. Use the following command. The --profile audio flag ensures that speech-specific services are launched. For more information, refer to Profile Information.

      docker compose --profile retrieval --profile audio up

- After the services are running, you can interact with the pipeline by using Python.

  - The Ingestor object initializes the ingestion process.
  - The files method specifies the input files to process.
  - The extract method tells the pipeline to extract information from WAV audio files.
  - The document_type parameter is optional, because Ingestor should detect the file type automatically.

      ingestor = (
          Ingestor()
          .files("./data/*.wav")
          .extract(
              document_type="wav",  # Ingestor should detect type automatically in most cases
              extract_method="audio",
              extract_audio_params={
                  "segment_audio": True,
              },
          )
      )

  To generate one extracted element for each sentence-like ASR segment, include extract_audio_params={"segment_audio": True} when calling .extract(...). This option applies when audio extraction runs with a Parakeet NIM (either locally through Docker or remotely via NVCF) but has no effect when using the local Hugging Face Parakeet model.

  Tip
  For more Python examples, refer to NV-Ingest: Python Client Quick Start Guide.
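The .env step in the procedure above can be sanity-checked before the services are started. The following sketch is not part of the library. The helpers read_env_file and has_ngc_key are hypothetical, and the parser assumes simple KEY=VALUE lines without quoting or variable expansion.

```python
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from an env file into a dict.

    Blank lines, comment lines starting with '#', and lines without '='
    are skipped. Returns an empty dict if the file does not exist.
    """
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def has_ngc_key(env_values):
    """Return True if NGC_API_KEY is set and is not the doc placeholder."""
    key = env_values.get("NGC_API_KEY", "")
    return bool(key) and key != "<your-ngc-key>"
```

Calling has_ngc_key(read_env_file(".env")) in the working directory returns False when the file is missing, the key is absent, or the placeholder was left unchanged.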
Use NVCF Endpoints for Cloud-Based Inference
Instead of running the pipeline locally, you can use NVCF to perform inference by using remote endpoints.
- NVCF requires an authentication token and a function ID for access. Ensure you have these credentials ready before making API calls.

- Run inference by using Python. Provide an NVCF endpoint along with authentication details.

  - The Ingestor object initializes the ingestion process.
  - The files method specifies the input files to process.
  - The extract method tells the pipeline to extract information from MP3 audio files.
  - The document_type parameter is optional, because Ingestor should detect the file type automatically.

      ingestor = (
          Ingestor()
          .files("./data/*.mp3")
          .extract(
              document_type="mp3",
              extract_method="audio",
              extract_audio_params={
                  "grpc_endpoint": "grpc.nvcf.nvidia.com:443",
                  "auth_token": "<API key>",
                  "function_id": "<function ID>",
                  "use_ssl": True,
                  "segment_audio": True,
              },
          )
      )

  Tip
  For more Python examples, refer to NV-Ingest: Python Client Quick Start Guide.
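To avoid hardcoding the NVCF credentials in source files, the parameter dictionary from the example above could be assembled from environment variables. This is a sketch under stated assumptions: NVCF_AUTH_TOKEN and NVCF_FUNCTION_ID are hypothetical variable names chosen for illustration, and build_nvcf_audio_params is not a library function. The dictionary keys and the endpoint value are copied from the example above.

```python
import os


def build_nvcf_audio_params(segment_audio=True, environ=None):
    """Build an extract_audio_params dict for an NVCF-hosted Parakeet NIM.

    NVCF_AUTH_TOKEN and NVCF_FUNCTION_ID are hypothetical environment
    variable names used only in this sketch. Raises RuntimeError if
    either one is missing or empty.
    """
    env = os.environ if environ is None else environ
    required = ("NVCF_AUTH_TOKEN", "NVCF_FUNCTION_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "grpc_endpoint": "grpc.nvcf.nvidia.com:443",
        "auth_token": env["NVCF_AUTH_TOKEN"],
        "function_id": env["NVCF_FUNCTION_ID"],
        "use_ssl": True,
        "segment_audio": segment_audio,
    }
```

The returned dictionary can then be passed as extract_audio_params=build_nvcf_audio_params() in the .extract(...) call.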