Enable Audio Ingestion Support for NVIDIA RAG Blueprint#

Enabling audio ingestion support allows the NVIDIA RAG Blueprint system to process and transcribe audio files (.mp3, .wav, .mp4, .avi, .mov, and .mkv) during document ingestion, making audio content in your documents searchable and retrievable.

To enable audio ingestion support after you have deployed the blueprint, follow these steps:

Using an on-prem audio transcription model#

Docker Compose Flow#

  1. Deploy the audio transcription model on-prem. You need a GPU to deploy this model. For a list of supported GPUs, see NVIDIA Riva ASR Support Matrix.

    USERID=$(id -u) docker compose -f deploy/compose/nims.yaml --profile audio up -d
    
  2. Make sure the audio container is up and running:

    docker ps --filter "name=audio" --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"
    

    Example Output

    NAMES                                   STATUS
    compose-audio-1                         Up 5 minutes (healthy)
    
  3. The ingestor-server is already configured to handle audio files. You can now ingest audio files (.mp3, .wav, .mp4, .avi, .mov, or .mkv) using the ingestion API as shown in the ingestion API usage notebook.

    Example usage with the ingestion API:

    FILEPATHS = [
        '../data/audio/sample.mp3',
        '../data/audio/sample.wav'
    ]

    # Pass the file list to the upload helper from the notebook; the exact
    # parameter name may differ in your version of the notebook.
    await upload_documents(filepaths=FILEPATHS, collection_name="audio_data")
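
Before calling the ingestion API, you can filter a candidate file list down to the extensions this feature supports. The helper below is a small, hypothetical pre-flight check (not part of the ingestion API):

```python
from pathlib import Path

# Extensions supported by audio ingestion, per this guide.
SUPPORTED_AUDIO_EXTS = {".mp3", ".wav", ".mp4", ".avi", ".mov", ".mkv"}

def supported_audio_files(paths):
    """Return only the paths whose extension is a supported audio/video type."""
    return [p for p in paths if Path(p).suffix.lower() in SUPPORTED_AUDIO_EXTS]
```

Running a filter like this before uploading avoids ingestion errors caused by unsupported file types slipping into the batch.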
    

Note

The audio transcription service requires GPU resources. Make sure you have sufficient GPU resources available before enabling this feature.

Customizing GPU Usage for Audio Service (Optional)#

By default, the audio service uses GPU ID 0. You can customize which GPU to use by setting the AUDIO_MS_GPU_ID environment variable before starting the service:

export AUDIO_MS_GPU_ID=3  # Use GPU 3 instead of GPU 0
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml --profile audio up -d

Alternatively, you can modify the nims.yaml file directly to change the GPU assignment:

# In deploy/compose/nims.yaml, locate the audio service and modify:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ["${AUDIO_MS_GPU_ID:-0}"]  # Change 0 to your desired GPU ID
          capabilities: [gpu]
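
Before pointing `AUDIO_MS_GPU_ID` at a device, you can confirm that the ID actually exists on the host. A minimal sketch, assuming `nvidia-smi --query-gpu=index --format=csv,noheader` is available; the parsing helpers below are illustrative and not part of the blueprint:

```python
import subprocess

def available_gpu_ids(smi_output: str) -> set:
    """Parse `nvidia-smi --query-gpu=index --format=csv,noheader` output into ID strings."""
    return {line.strip() for line in smi_output.splitlines() if line.strip()}

def gpu_id_exists(gpu_id: str, smi_output: str) -> bool:
    """Return True if gpu_id appears in the parsed device list."""
    return gpu_id in available_gpu_ids(smi_output)

def host_gpu_ids() -> set:
    """Query the local driver for GPU indices (requires nvidia-smi on PATH)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return available_gpu_ids(out)
```

For example, `gpu_id_exists("3", ...)` returning False would mean the `AUDIO_MS_GPU_ID=3` setting above points at a device the driver does not report.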

Note

Ensure the specified GPU is available and has sufficient memory for the audio transcription model. The Riva ASR model typically requires at least 8GB of GPU memory.

Helm Flow#

If you’re using Helm for deployment, follow these steps to enable audio ingestion:

  1. Modify values.yaml to enable audio ingestion:

    # Enable audio NIM service
    nv-ingest:
      nimOperator:
        audio:
          enabled: true
      
      envVars:
        # ... existing configurations ...
        
        # Ensure audio extraction dependencies are installed
        INSTALL_AUDIO_EXTRACTION_DEPS: "true"
    
  2. Apply the updated Helm chart:

    After modifying values.yaml, apply the changes as described in Change a Deployment.

    For detailed Helm deployment instructions, see Helm Deployment Guide.

  3. Verify that the audio pod is running:

    kubectl get pods -n rag | grep audio
    

    Output:

       audio-pod                                         1/1     Running   0             3m29s
    

    Check the audio service:

    kubectl get svc -n rag | grep audio
    

    Output:

       audio                           ClusterIP   10.103.184.78    <none>        9000/TCP,50051/TCP   4m27s
    

    Check the NIMService status:

    kubectl get nimservice -n rag | grep audio
    

    Output:

       audio                               Ready      4m30s
    

Important

When using Helm deployment, the Audio NIM service requires an additional GPU.
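
The manual `kubectl` checks above can also be automated. The sketch below is a convenience helper, not part of the blueprint; the namespace `rag` and the `audio` name fragment follow the examples in this guide:

```python
import subprocess
import time

def pod_is_running(kubectl_output: str, name_fragment: str = "audio") -> bool:
    """Return True if a pod whose name contains name_fragment shows STATUS Running."""
    for line in kubectl_output.splitlines():
        fields = line.split()
        if fields and name_fragment in fields[0] and "Running" in fields:
            return True
    return False

def wait_for_audio_pod(namespace: str = "rag", timeout_s: int = 300) -> bool:
    """Poll `kubectl get pods` until the audio pod is Running or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        out = subprocess.run(
            ["kubectl", "get", "pods", "-n", namespace],
            capture_output=True, text=True,
        ).stdout
        if pod_is_running(out):
            return True
        time.sleep(10)
    return False
```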

Audio Segmentation#

The APP_NVINGEST_SEGMENTAUDIO environment variable controls whether audio segmentation is enabled during the ingestion process.

When set to True, NV-Ingest will segment audio files based on commas and other punctuation marks, resulting in more granular audio chunks. This can improve downstream processing and retrieval accuracy for audio content. Note that splitting on captions will occur regardless of this setting; enabling APP_NVINGEST_SEGMENTAUDIO simply adds additional segmentation based on punctuation.

To enable audio segmentation, add the following export command to your environment configuration:

export APP_NVINGEST_SEGMENTAUDIO=True
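
Conceptually, punctuation-based segmentation works like the sketch below. This is an illustration of the idea, not NV-Ingest's actual implementation:

```python
import re

def segment_transcript(text: str):
    """Split a transcript into chunks at commas and other punctuation marks."""
    parts = re.split(r"[,.;:!?]+", text)
    return [p.strip() for p in parts if p.strip()]
```

More granular chunks like these can improve retrieval precision for spoken content, at the cost of producing more chunks to embed and store.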