Enable Audio Ingestion Support for NVIDIA RAG Blueprint
Audio ingestion support lets the NVIDIA RAG Blueprint process and transcribe audio files (.mp3, .wav, .mp4, .avi, .mov, and .mkv) during document ingestion, which improves search and retrieval for audio content in your documents.
After you deploy the blueprint, follow these steps to enable audio ingestion support:
Using an on-prem audio transcription model
Docker Compose Flow
Deploy the audio transcription model on-prem. You need a GPU to deploy this model. For a list of supported GPUs, see NVIDIA Riva ASR Support Matrix.
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml --profile audio up -d
Make sure the audio container is up and running:
docker ps --filter "name=audio" --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"
Example output:
NAMES             STATUS
compose-audio-1   Up 5 minutes (healthy)
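If you want to script this check rather than read the table by eye, the following is a minimal sketch in Python. It assumes the `docker` CLI is on your PATH and uses a tab-separated format string without the `table` header, which is a variation on the command above. The helper names (`parse_docker_ps`, `is_healthy`, `audio_container_statuses`) are hypothetical and are not part of the blueprint.

```python
import subprocess


def parse_docker_ps(output: str) -> dict[str, str]:
    """Map container name -> status from tab-separated `docker ps` output."""
    statuses = {}
    for line in output.strip().splitlines():
        name, _, status = line.partition("\t")
        if name and status:
            statuses[name.strip()] = status.strip()
    return statuses


def is_healthy(status: str) -> bool:
    """True when the status string reports a running container marked healthy."""
    return status.startswith("Up") and "(healthy)" in status


def audio_container_statuses() -> dict[str, str]:
    """Ask Docker for containers whose name contains 'audio' (requires Docker)."""
    result = subprocess.run(
        ["docker", "ps", "--filter", "name=audio",
         "--format", "{{.Names}}\t{{.Status}}"],
        capture_output=True, text=True, check=True,
    )
    return parse_docker_ps(result.stdout)
```

Calling `audio_container_statuses()` and passing each status to `is_healthy` gives a simple pass/fail signal that you could use before starting ingestion.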
The ingestor-server is already configured to handle audio files. You can now ingest audio files (.mp3, .wav, .mp4, .avi, .mov or .mkv) using the ingestion API as shown in the ingestion API usage notebook.
Example usage with the ingestion API:
FILEPATHS = [
    '../data/audio/sample.mp3',
    '../data/audio/sample.wav'
]

await upload_documents(collection_name="audio_data")
Note
The audio transcription service requires GPU resources. Make sure you have sufficient GPU resources available before enabling this feature.
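When a directory mixes audio with other files, it may help to build the file list programmatically instead of typing each path. The sketch below collects only files whose extension appears in the supported list from this guide. The function name `collect_audio_files` and the directory layout are assumptions for illustration; `upload_documents` comes from the ingestion API usage notebook and is not redefined here.

```python
from pathlib import Path

# Extensions this guide lists as supported for audio ingestion.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".mp4", ".avi", ".mov", ".mkv"}


def collect_audio_files(root: str) -> list[str]:
    """Return sorted paths under `root` whose extension is in the supported set."""
    return sorted(
        str(path)
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

For example, `FILEPATHS = collect_audio_files("../data/audio")` would produce a list that can be used in place of the hand-written one above.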
Customizing GPU Usage for Audio Service (Optional)
By default, the audio service uses GPU ID 0. You can customize which GPU to use by setting the AUDIO_MS_GPU_ID environment variable before starting the service:
export AUDIO_MS_GPU_ID=3 # Use GPU 3 instead of GPU 0
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml --profile audio up -d
Alternatively, you can modify the nims.yaml file directly to change the GPU assignment:
# In deploy/compose/nims.yaml, locate the audio service and modify:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ["${AUDIO_MS_GPU_ID:-0}"]  # Change 0 to your desired GPU ID
          capabilities: [gpu]
Note
Ensure the specified GPU is available and has sufficient memory for the audio transcription model. The Riva ASR model typically requires at least 8GB of GPU memory.
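One way to check the memory guideline before choosing a GPU ID is to query `nvidia-smi` and compare free memory against the 8 GB figure from the note. The following is a rough sketch, assuming `nvidia-smi` is installed on the host; the helper names and the choice to compare free (rather than total) memory are assumptions, not part of the blueprint.

```python
import subprocess

MIN_MEMORY_MIB = 8 * 1024  # 8 GB guideline from the note above


def parse_gpu_memory(csv_output: str) -> dict[int, int]:
    """Map GPU index -> memory in MiB from `index, memory` CSV lines."""
    memory = {}
    for line in csv_output.strip().splitlines():
        index, amount = (field.strip() for field in line.split(","))
        memory[int(index)] = int(amount)
    return memory


def gpu_is_suitable(memory: dict[int, int], gpu_id: int) -> bool:
    """True when the GPU exists and reports at least MIN_MEMORY_MIB."""
    return memory.get(gpu_id, 0) >= MIN_MEMORY_MIB


def query_free_gpu_memory() -> dict[int, int]:
    """Read free memory per GPU from nvidia-smi (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)
```

For instance, `gpu_is_suitable(query_free_gpu_memory(), 3)` indicates whether GPU 3 currently meets the guideline before you export `AUDIO_MS_GPU_ID=3`.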
Helm Flow
If you’re using Helm for deployment, follow these steps to enable audio ingestion:
Modify values.yaml to enable audio ingestion:
# Enable audio NIM service
nv-ingest:
  nimOperator:
    audio:
      enabled: true

  envVars:
    # ... existing configurations ...
    # Ensure audio extraction dependencies are installed
    INSTALL_AUDIO_EXTRACTION_DEPS: "true"
Apply the updated Helm chart:
After modifying values.yaml, apply the changes as described in Change a Deployment. For detailed Helm deployment instructions, see Helm Deployment Guide.
Verify that the audio pod is running:
kubectl get pods -n rag | grep audio
Output:
audio-pod 1/1 Running 0 3m29s
Check the audio service:
kubectl get svc -n rag | grep audio
Output:
audio ClusterIP 10.103.184.78 <none> 9000/TCP,50051/TCP 4m27s
Check the NIMService status:
kubectl get nimservice -n rag | grep audio
Output:
audio Ready 4m30s
Important
In a Helm deployment, the audio NIM service requires an additional GPU.
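The pod check above can also be automated. The sketch below parses the plain-text output of `kubectl get pods -n rag` and reports whether every pod with "audio" in its line is Running with all containers ready. The function name `audio_pods_ready` is hypothetical, and the column layout (NAME, READY, STATUS) is assumed to match the sample output shown in this section.

```python
def audio_pods_ready(kubectl_output: str) -> bool:
    """True when at least one 'audio' pod is listed and all are Running and ready."""
    pods = [line.split() for line in kubectl_output.splitlines() if "audio" in line]
    if not pods:
        return False
    for columns in pods:
        if len(columns) < 3:
            return False
        ready, status = columns[1], columns[2]
        ready_count, _, total = ready.partition("/")
        # A pod counts as ready only when READY shows n/n and STATUS is Running.
        if status != "Running" or ready_count != total:
            return False
    return True
```

You could feed it the captured output of the `kubectl get pods` command and retry on a timer until it returns `True`.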
Audio Segmentation
The APP_NVINGEST_SEGMENTAUDIO environment variable controls whether audio segmentation is enabled during the ingestion process.
When set to True, NeMo Retriever Library segments audio files at commas and other punctuation marks, which produces more granular audio chunks. This can improve downstream processing and retrieval accuracy for audio content. Splitting on captions occurs regardless of this setting; enabling APP_NVINGEST_SEGMENTAUDIO adds punctuation-based segmentation on top of it.
To enable audio segmentation, add the following export command to your environment configuration:
export APP_NVINGEST_SEGMENTAUDIO=True
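To build intuition for what punctuation-based segmentation does to a transcript, here is a toy illustration. It is not the NeMo Retriever Library implementation, whose exact splitting rules are not described in this guide; the function name and the set of punctuation marks are assumptions chosen for the example.

```python
import re


def segment_on_punctuation(transcript: str) -> list[str]:
    """Toy example: split a transcript at whitespace that follows punctuation."""
    # The lookbehind keeps each punctuation mark attached to its segment.
    parts = re.split(r"(?<=[,.;:!?])\s+", transcript.strip())
    return [part for part in parts if part]


segments = segment_on_punctuation("Welcome to the call, everyone. Shall we begin?")
print(segments)
```

Smaller segments like these give the retriever finer-grained chunks to match against a query, at the cost of less context per chunk.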