Enable Audio Ingestion Support for NVIDIA RAG Blueprint#
Enabling audio ingestion support allows the NVIDIA RAG Blueprint system to process and transcribe audio files (.mp3 and .wav) during document ingestion. This enables better search and retrieval capabilities for audio content in your documents.
After you deploy the blueprint, follow these steps to enable audio ingestion support:
Using on-prem audio transcription model#
Docker Compose Flow#
Deploy the audio transcription model on-prem. This model requires a GPU. For a list of supported GPUs, see the NVIDIA Riva ASR Support Matrix.
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml --profile audio up -d
Make sure the audio container is up and running:
docker ps --filter "name=audio" --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"
Example output:
NAMES              STATUS
compose-audio-1    Up 5 minutes (healthy)
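The same check can be scripted, for example to gate ingestion on the container being healthy. The following is a minimal Python sketch, not part of the blueprint: the function names are hypothetical, and it assumes the `docker` CLI is on the PATH. It uses a tab-separated `--format` template (without the `table` prefix) so the output has no header row and is easy to parse.

```python
import subprocess


def parse_status(ps_output: str) -> dict[str, str]:
    """Map container name -> status from tab-separated `docker ps` lines."""
    statuses = {}
    for line in ps_output.splitlines():
        if not line.strip():
            continue
        name, _, status = line.partition("\t")
        statuses[name.strip()] = status.strip()
    return statuses


def is_healthy(status: str) -> bool:
    """True when the status reads like 'Up 5 minutes (healthy)'."""
    return status.startswith("Up") and "(healthy)" in status


def audio_container_statuses() -> dict[str, str]:
    """Run `docker ps` filtered to containers whose name contains 'audio'."""
    result = subprocess.run(
        # The "\t" below is a real tab character passed to docker's template.
        ["docker", "ps", "--filter", "name=audio",
         "--format", "{{.Names}}\t{{.Status}}"],
        capture_output=True, text=True, check=True,
    )
    return parse_status(result.stdout)
```

Calling `audio_container_statuses()` and passing each status to `is_healthy` gives a simple readiness check; a container that is still starting reports a status such as `(health: starting)` and is treated as not ready.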
The ingestor-server is already configured to handle audio files. You can now ingest audio files (.mp3 or .wav) using the ingestion API as shown in the ingestion API usage notebook.
Example usage with the ingestion API:
FILEPATHS = [
    '../data/audio/sample.mp3',
    '../data/audio/sample.wav'
]

await upload_documents(collection_name="audio_data")
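Because only .mp3 and .wav files are listed as supported, it can help to filter a mixed file list before building `FILEPATHS`. The sketch below is an illustration only: `select_audio_files` is a hypothetical helper, not part of the ingestion API, and matching extensions case-insensitively is an assumption.

```python
from pathlib import Path

# Extensions the documentation lists as supported for audio ingestion.
AUDIO_EXTENSIONS = {".mp3", ".wav"}


def select_audio_files(paths):
    """Split paths into (supported audio files, everything else).

    Extension matching is case-insensitive, which is an assumption made
    for this sketch rather than documented ingestor behavior.
    """
    supported, skipped = [], []
    for path in paths:
        if Path(path).suffix.lower() in AUDIO_EXTENSIONS:
            supported.append(path)
        else:
            skipped.append(path)
    return supported, skipped
```

The first returned list can be assigned to `FILEPATHS`, and the second can be logged so that unsupported files are not silently dropped.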
Note
The audio transcription service requires GPU resources. Make sure you have sufficient GPU resources available before enabling this feature.
Customizing GPU Usage for Audio Service (Optional)#
By default, the audio service uses GPU ID 0. You can customize which GPU to use by setting the AUDIO_MS_GPU_ID environment variable before starting the service:
export AUDIO_MS_GPU_ID=3 # Use GPU 3 instead of GPU 0
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml --profile audio up -d
Alternatively, you can modify the nims.yaml file directly to change the GPU assignment:
# In deploy/compose/nims.yaml, locate the audio service and modify:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ["${AUDIO_MS_GPU_ID:-0}"]  # Change 0 to your desired GPU ID
          capabilities: [gpu]
Note
Ensure the specified GPU is available and has sufficient memory for the audio transcription model. The Riva ASR model typically requires at least 8GB of GPU memory.
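One way to check a GPU before assigning it to `AUDIO_MS_GPU_ID` is to query `nvidia-smi` for free memory. The following Python sketch is an assumption-laden example, not part of the blueprint: the function names are hypothetical, the 8 GB guideline is interpreted as 8192 MiB of free memory, and `nvidia-smi` is assumed to be on the PATH.

```python
import subprocess

# The note above cites at least 8 GB; this sketch treats that as 8192 MiB free.
MIN_FREE_MIB = 8 * 1024


def parse_gpu_memory(csv_text: str) -> dict[int, int]:
    """Parse 'index, memory' CSV lines into {gpu_index: memory_mib}."""
    gpus = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, memory = (field.strip() for field in line.split(","))
        gpus[int(index)] = int(memory)
    return gpus


def gpu_is_suitable(gpus: dict[int, int], gpu_id: int,
                    min_mib: int = MIN_FREE_MIB) -> bool:
    """True if the GPU exists and reports at least min_mib of memory."""
    return gpu_id in gpus and gpus[gpu_id] >= min_mib


def query_free_memory() -> dict[int, int]:
    """Ask nvidia-smi for free memory per GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)
```

For example, `gpu_is_suitable(query_free_memory(), 3)` indicates whether GPU 3 currently has enough free memory under the assumed threshold.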
Helm Flow#
If you’re using Helm for deployment, follow these steps to enable audio ingestion:
Enable Riva NIM by setting nv-ingest.riva-nim.deployed to true in values.yaml.
nv-ingest:
  riva-nim:
    deployed: true
Ensure that audio extraction dependencies are installed by setting nv-ingest.envVars.INSTALL_AUDIO_EXTRACTION_DEPS to true in values.yaml.
nv-ingest:
  envVars:
    INSTALL_AUDIO_EXTRACTION_DEPS: "true"
Apply the updated Helm chart by running the following command.
helm upgrade --install rag -n rag https://helm.ngc.nvidia.com/0648981100760671/charts/nvidia-blueprint-rag-v2.4.0-dev.tgz \
  --username '$oauthtoken' \
  --password "${NGC_API_KEY}" \
  --set imagePullSecret.password=$NGC_API_KEY \
  --set ngcApiSecret.password=$NGC_API_KEY \
  -f deploy/helm/nvidia-blueprint-rag/values.yaml
Verify that the riva-nim pod is running:
kubectl get pods -n rag | grep riva-nim
Output:
nv-ingest-riva-nim-6578f4579f-4q75k 1/1 Running 0 3m29s
Verify that the riva-nim service is available:
kubectl get svc -n rag | grep riva-nim
Output:
nv-ingest-riva-nim ClusterIP 10.103.184.78 <none> 9000/TCP,50051/TCP 4m27s
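The pod check in the steps above can also be automated. The sketch below parses the text printed by `kubectl get pods` and reports whether every riva-nim pod is running with all containers ready. It is a hedged illustration: the function names are hypothetical, and it assumes the default column order (NAME, READY, STATUS, ...).

```python
def parse_pods(text: str) -> list[dict]:
    """Parse `kubectl get pods` rows into name/ready/status records."""
    pods = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, if present.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[:3]
        ready_count, _, total = ready.partition("/")
        pods.append({
            "name": name,
            "ready": ready_count == total,  # e.g. "1/1" means all containers ready
            "status": status,
        })
    return pods


def riva_nim_ready(text: str) -> bool:
    """True if at least one riva-nim pod exists and all are Running and ready."""
    riva = [p for p in parse_pods(text) if "riva-nim" in p["name"]]
    return bool(riva) and all(
        p["ready"] and p["status"] == "Running" for p in riva
    )
```

Passing the output of `kubectl get pods -n rag` to `riva_nim_ready` returns `True` for a line such as the example output shown above.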
Important
When you deploy with Helm, the Riva NIM service requires an additional H100 or B200 GPU, bringing the total GPU requirement to 9xH100 without MIG slicing.
Audio Segmentation#
The APP_NVINGEST_SEGMENTAUDIO environment variable controls whether audio segmentation is enabled during the ingestion process.
When set to True, NV-Ingest segments audio files based on commas and other punctuation marks, resulting in more granular audio chunks. This can improve downstream processing and retrieval accuracy for audio content. Splitting on captions occurs regardless of this setting; enabling APP_NVINGEST_SEGMENTAUDIO adds further segmentation based on punctuation.
To enable audio segmentation, add the following export command to your environment configuration:
export APP_NVINGEST_SEGMENTAUDIO=True
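To give a rough sense of what punctuation-based segmentation does to a transcript, the following Python sketch splits text after commas and other punctuation marks. It only illustrates the idea and is not NV-Ingest's actual implementation; the function name and the exact set of punctuation marks are assumptions.

```python
import re


def split_on_punctuation(transcript: str) -> list[str]:
    """Split a transcript after , . ; : ! ? when followed by whitespace.

    Each punctuation mark stays attached to the segment it ends.
    """
    parts = re.split(r"(?<=[,.;:!?])\s+", transcript.strip())
    return [part for part in parts if part]
```

For a transcript such as `"Welcome, everyone. Today we cover RAG; then audio."`, this yields four short segments instead of one long one, which is the kind of finer granularity the setting is described as providing.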