How do I Use Speaker Diarization with Riva ASR?#

This tutorial walks you through the speaker diarization feature available with Riva ASR.

NVIDIA Riva Overview#

NVIDIA Riva is a GPU-accelerated SDK for building speech AI applications that are customized for your use case and deliver real-time performance.
Riva offers a rich set of speech and natural language understanding services such as:

  • Automatic speech recognition (ASR)

  • Text-to-speech synthesis (TTS)

  • A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification

In this tutorial, we show how to use the speaker diarization feature of Riva ASR to obtain a transcript in which each word is tagged with the ID of the speaker who spoke it.
To understand the basics of Riva ASR APIs, refer to Getting started with Riva ASR in Python.

For more information about Riva, refer to the Riva developer documentation.

Speaker Diarization with Riva ASR APIs#

Speaker diarization is the process of segmenting audio recordings by speaker labels; it aims to answer the question “who spoke when?” Riva ASR supports speaker diarization, which can be enabled by passing a SpeakerDiarizationConfig with enable_speaker_diarization set to True. Riva speaker diarization segments the input audio, extracts speaker embeddings for the segments, counts the number of speakers, and then assigns each audio segment a corresponding speaker tag. When speaker diarization is enabled, Riva ASR returns the ASR transcript to the client, along with a speaker tag for each word in the transcript. Speaker diarization is language agnostic and works with any language supported by Riva ASR.
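
For a quick preview, the following minimal sketch shows how diarization is switched on at the request level; the full, runnable walkthrough follows below. The commented-out direct field access assumes the diarization_config field name from the riva_asr proto:

import riva.client

# Build a recognition config and enable speaker diarization on it
config = riva.client.RecognitionConfig(language_code="en-US")
riva.client.asr.add_speaker_diarization_to_config(config, diarization_enable=True)

# Assuming the riva_asr proto field names, the utility call above is
# equivalent to setting the submessage field directly:
# config.diarization_config.enable_speaker_diarization = True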

Note:#

Speaker diarization is only supported with the Riva ASR offline API. Speaker diarization is an alpha feature, and enabling it increases ASR latency. Refer to Performance for more information.

Requirements and Setup#

  1. Enable the speaker diarization model.
    The speaker diarization model is optional, so it must be enabled in config.sh by uncommenting the line containing rmir_diarizer_offline. Since speaker diarization works only with the Riva ASR offline API, make sure that the offline ASR model is also enabled in config.sh.

  2. Deploy the models and start the Riva Speech Skills server.
    Deploy the models enabled in the previous step by running bash riva_init.sh and then start the Riva server by running bash riva_start.sh. Refer to the Riva Skills Quick Start Guide for more information.

  3. Install the Riva client library.
    Perform the steps in the requirements and setup for the Riva client section to install the Riva client library.

Import the Riva Client Libraries#

Let’s import some of the required libraries, including the Riva client libraries.

import io
import IPython.display as ipd
import grpc

import riva.client

Create a Riva Client and Connect to the Riva Speech API Server#

The following URI assumes a local deployment of the Riva Speech API server on the default port. If the server is deployed on a different host or through a Helm chart on Kubernetes, use the appropriate URI.
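
For example, a deployment on another host can be reached by passing that host's URI to riva.client.Auth; the hostname below is a placeholder:

import riva.client

# Hypothetical remote deployment; replace with your server's host and port
auth = riva.client.Auth(uri='riva-server.example.com:50051')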

# Instantiate client
auth = riva.client.Auth(uri='localhost:50051')
riva_asr = riva.client.ASRService(auth)
# Load a sample audio file from local disk
# This example uses a .wav file with LINEAR_PCM encoding.
# Sample file taken from https://freesound.org/people/SamKolber/sounds/203020/
path = "audio_samples/interview-with-bill.wav"
with io.open(path, 'rb') as fh:
    content = fh.read()
ipd.Audio(path)

# Create a RecognitionConfig for the offline recognition request
config = riva.client.RecognitionConfig(
  language_code="en-US",
  max_alternatives=1,
  enable_automatic_punctuation=True,
  enable_word_time_offsets=True,
)

# Use utility function to add SpeakerDiarizationConfig with enable_speaker_diarization=True
# The max_speaker_count value in SpeakerDiarizationConfig currently has no effect; it will be honored in a future release.
riva.client.asr.add_speaker_diarization_to_config(config, diarization_enable=True)

# ASR inference call with Recognize
response = riva_asr.offline_recognize(content, config)
print("ASR Transcript with Speaker Diarization:\n", response)

The ASR transcript is split into multiple results based on speech pauses. For each result, every word in the transcript is assigned a speaker tag indicating which speaker spoke that word.

# Pretty print the transcript with color-coded speaker tags. Black text indicates that no speaker tag was assigned.
for result in response.results:
    for word in result.alternatives[0].words:
        color = '\033[' + str(30 + word.speaker_tag % 8) + 'm'  # map tags to ANSI colors 30-37
        print(color, word.word, end="")
print('\033[0m')  # reset the terminal color
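
Beyond color coding, the per-word speaker tags can be folded into speaker turns. The following is a minimal sketch of that post-processing; it reuses the response object from the call above:

# Group consecutive words that share a speaker tag into speaker turns
turns = []  # list of [speaker_tag, text] pairs
for result in response.results:
    for word in result.alternatives[0].words:
        if turns and turns[-1][0] == word.speaker_tag:
            turns[-1][1] += " " + word.word
        else:
            turns.append([word.speaker_tag, word.word])

for speaker_tag, text in turns:
    print(f"Speaker {speaker_tag}: {text}")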

This completes the tutorial for using speaker diarization with Riva ASR.

Go Deeper into Riva Capabilities#

Now that you have a basic introduction to the Riva ASR APIs, you can try:

Additional Riva Tutorials#

Check out more Riva tutorials here to learn how to use some of the advanced features of Riva ASR, including customizing ASR for your specific needs.

Sample Applications#

Riva comes with various sample applications that demonstrate how to use its APIs. Refer to Riva Sample Apps for more information.

Additional Resources#

For more information about each of the Riva APIs and their functionalities, refer to the documentation.