Getting Started#
Prerequisites#
Check the Support Matrix to make sure that you have the supported hardware and software stack.
NGC Authentication#
Generate an API key#
To access NGC resources, you need an NGC API key. You can generate a key here: Generate Personal Key.
When creating an NGC personal API key, ensure that at least “NGC Catalog” is selected from the “Services Included” dropdown. You can include more services if you plan to reuse the key for other purposes.
Note
Personal keys allow you to configure an expiration date, revoke or delete the key using an action button, and rotate the key as needed. For more information about key types, refer to the NGC User Guide.
Export the API key#
Pass the value of the API key to the docker run command in the next section as the NGC_API_KEY environment variable to download the appropriate models and resources when starting the NIM.
If you’re not familiar with how to create the NGC_API_KEY environment variable, the simplest way is to export it in your terminal:
export NGC_API_KEY=<value>
To make the key available in every new shell session, run one of the following commands:
# If using bash
echo "export NGC_API_KEY=<value>" >> ~/.bashrc
# If using zsh
echo "export NGC_API_KEY=<value>" >> ~/.zshrc
Note
Other, more secure options include saving the value in a file, so that you can retrieve it with cat $NGC_API_KEY_FILE, or using a password manager.
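As a minimal sketch of the file-based option, the following shell function reads the key from the file named by NGC_API_KEY_FILE and exports it as NGC_API_KEY. The default path ~/.ngc/api_key and the function name load_ngc_api_key are assumptions made for illustration, not NGC conventions.

```shell
# Assumed default location; point NGC_API_KEY_FILE at wherever you saved the key.
: "${NGC_API_KEY_FILE:=$HOME/.ngc/api_key}"

# Hypothetical helper: export NGC_API_KEY from the key file.
load_ngc_api_key() {
    if [ ! -r "$NGC_API_KEY_FILE" ]; then
        echo "Cannot read key file: $NGC_API_KEY_FILE" >&2
        return 1
    fi
    # Command substitution drops the trailing newline that most editors add.
    NGC_API_KEY="$(cat "$NGC_API_KEY_FILE")"
    export NGC_API_KEY
}
```

Restricting the file to your own user (for example, with chmod 600) and calling load_ngc_api_key in each new terminal keeps the key out of ~/.bashrc or ~/.zshrc.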
Docker Login to NGC#
To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry with the following command:
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
Use $oauthtoken as the username and the value of NGC_API_KEY as the password. The $oauthtoken username is a special name that indicates that you will authenticate with an API key and not a user name and password.
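If NGC_API_KEY is empty, the login command sends an empty password and the resulting error can be confusing. The following sketch wraps the same login command in a guard that fails early. The function name ngc_docker_login is hypothetical.

```shell
# Hypothetical wrapper around the login command shown above.
ngc_docker_login() {
    # Fail early with a clear message instead of sending an empty password.
    if [ -z "${NGC_API_KEY:-}" ]; then
        echo "NGC_API_KEY is not set; export it before logging in." >&2
        return 1
    fi
    # The single quotes keep the shell from expanding $oauthtoken.
    printf '%s' "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
}
```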
Launching the NIM#
Supported models are available in two formats:
Pre-generated: These optimized models use TensorRT. You can download and use them directly on the corresponding GPU. Choose this model format if it is available for your GPU.
RMIR/Generic Model: This intermediate model requires an additional deployment step before you can use it. You can optimize this model with TensorRT and deploy it on any supported GPU. Choose this model format if a pre-generated model is not available for your GPU.
The following examples show how to launch a container for each model format.
Note
Refer to the table of Supported Models and specify NIM_MANIFEST_PROFILE according to the selected model and target GPU.
Download the pre-generated model and start the NIM.
# Set the appropriate profile for the pre-generated model.
export NIM_MANIFEST_PROFILE=<nim_manifest_profile>
# Deploy the pre-generated optimized model.
docker run -it --rm --name=riva-speech \
--runtime=nvidia \
--gpus '"device=0"' \
--shm-size=8GB \
-e NGC_API_KEY \
-e NIM_MANIFEST_PROFILE \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-p 9000:9000 \
-p 50051:50051 \
nvcr.io/nim/nvidia/riva-speech:1.2.0
On startup, the container downloads the pre-generated model from NGC. You can skip this download step on future runs by caching the model locally using a cache directory.
# Create the cache directory on the host machine.
export LOCAL_NIM_CACHE=~/.cache/nim_asr
mkdir -p "$LOCAL_NIM_CACHE"
chmod 777 "$LOCAL_NIM_CACHE"
# Set the appropriate profile for the pre-generated model.
export NIM_MANIFEST_PROFILE=<nim_manifest_profile>
# Deploy the pre-generated optimized model. The model is stored in the cache directory.
docker run -it --rm --name=riva-speech \
--runtime=nvidia \
--gpus '"device=0"' \
--shm-size=8GB \
-e NGC_API_KEY \
-e NIM_MANIFEST_PROFILE \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-p 9000:9000 \
-p 50051:50051 \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
nvcr.io/nim/nvidia/riva-speech:1.2.0
After the initial deployment succeeds, run the same command for subsequent deployments to reuse the models stored in the cache directory.
Caution
When using the model cache, if you change NIM_MANIFEST_PROFILE to load a different model, clear the contents of the cache directory on the host machine before starting the NIM container. This ensures that only the requested model profile is loaded.
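One way to avoid forgetting this step is to record which profile the cache was last used with and clear the cache when the profile differs. The sketch below does that with a marker file stored next to the cache directory. The function name and the marker file are assumptions of this example; the NIM container does not read or write the marker.

```shell
# Hypothetical helper: empty the cache directory when the profile changes.
# $1 = cache directory on the host, $2 = value of NIM_MANIFEST_PROFILE.
reset_cache_if_profile_changed() {
    cache_dir="$1"
    profile="$2"
    # Marker file kept beside the cache directory (an assumption of this sketch).
    marker="${cache_dir}.profile"

    mkdir -p "$cache_dir"
    if [ -f "$marker" ] && [ "$(cat "$marker")" != "$profile" ]; then
        # Delete everything inside the cache directory but keep the directory itself.
        find "$cache_dir" -mindepth 1 -delete
    fi
    # Remember the profile for the next run.
    printf '%s\n' "$profile" > "$marker"
}
```

Call it as reset_cache_if_profile_changed "$LOCAL_NIM_CACHE" "$NIM_MANIFEST_PROFILE" before docker run. If the container wrote the cached files as a different user, the deletion may need elevated permissions.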
Download the RMIR model, optimize it using TensorRT for the target GPU, and then start the NIM.
# Set the appropriate profile for the RMIR model.
export NIM_MANIFEST_PROFILE=<nim_manifest_profile>
# Generate optimized model using RMIR
docker run -it --rm --name=riva-speech \
--runtime=nvidia \
--gpus '"device=0"' \
--shm-size=8GB \
-e NGC_API_KEY \
-e NIM_MANIFEST_PROFILE \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-p 9000:9000 \
-p 50051:50051 \
-e NIM_OPTIMIZE=True \
nvcr.io/nim/nvidia/riva-speech:1.2.0
Download the RMIR model and optimize it using TensorRT for the target GPU. The container exits after model generation is complete.
# Create a directory to store the optimized model, then update the directory permissions.
mkdir exported_model
chmod 777 exported_model
# Set the appropriate profile for the RMIR model.
export NIM_MANIFEST_PROFILE=<nim_manifest_profile>
# Generate optimized model using RMIR
docker run -it --rm --name=riva-speech \
--runtime=nvidia \
--gpus '"device=0"' \
--shm-size=8GB \
-e NGC_API_KEY \
-e NIM_MANIFEST_PROFILE \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-p 9000:9000 \
-p 50051:50051 \
-e NIM_OPTIMIZE=True \
-v "$PWD/exported_model:/export" \
-e NIM_EXPORT_URL=/export \
nvcr.io/nim/nvidia/riva-speech:1.2.0
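Because the container exits once generation finishes, it can help to confirm that the export directory was populated before starting the NIM with model download disabled. The following sketch only checks that the directory exists and is not empty; it does not validate the model files. The function name is hypothetical.

```shell
# Hypothetical check: succeed only if the export directory contains at least one entry.
# $1 = export directory (defaults to ./exported_model, as used above).
export_dir_has_model() {
    dir="${1:-$PWD/exported_model}"
    # ls -A lists hidden entries too and prints nothing for an empty directory.
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}
```

For example, export_dir_has_model || echo "exported_model is empty" reports a failed or skipped export before the next step.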
Start the NIM using the locally exported optimized model.
# Start NIM using optimized model
docker run -it --rm --name=riva-speech \
--runtime=nvidia \
--gpus '"device=0"' \
--shm-size=8GB \
-e NGC_API_KEY \
-e NIM_MANIFEST_PROFILE \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-p 9000:9000 \
-p 50051:50051 \
-v "$PWD/exported_model:/opt/nim/.cache" \
-e NIM_DISABLE_MODEL_DOWNLOAD=True \
nvcr.io/nim/nvidia/riva-speech:1.2.0
Note
It may take up to 30 minutes for the Docker container to be ready and start accepting requests, depending on your network speed.
Supported Models#
| Model | Language | Model format/GPU | GPU Memory (GB) | NIM_MANIFEST_PROFILE |
|---|---|---|---|---|
|  | en-US | Pre-generated / H100 | 15 |  |
|  | en-US | RMIR / Generic | 15 |  |
|  | es-US | Pre-generated / H100 | 10 |  |
|  | es-US | RMIR / Generic | 10 |  |
|  | Multilingual | Pre-generated / H100 | 13 |  |
|  | Multilingual | RMIR / Generic | 13 |  |
Running Inference#
Open a new terminal and run the following command to check if the service is ready to handle inference requests:
curl -X 'GET' 'http://localhost:9000/v1/health/ready'
If the service is ready, you get a response similar to the following.
{"status":"ready"}
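Since startup can take a while, the readiness check is often run in a loop. The sketch below polls the same endpoint until curl reports success or a timeout expires. The function name, the 10-second polling interval, and the default 1800-second timeout (chosen to match the 30-minute note above) are assumptions. It also assumes that the endpoint refuses connections or returns an HTTP error status until the service is ready.

```shell
# Hypothetical helper: poll the readiness endpoint until it answers or time runs out.
# $1 = URL, $2 = timeout in seconds, $3 = delay between attempts in seconds.
wait_for_nim_ready() {
    url="${1:-http://localhost:9000/v1/health/ready}"
    timeout_s="${2:-1800}"
    interval_s="${3:-10}"
    waited=0
    # curl -f returns non-zero on HTTP errors, so the loop continues until success.
    until curl -sf "$url" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout_s" ]; then
            echo "Service not ready after ${timeout_s}s" >&2
            return 1
        fi
        sleep "$interval_s"
        waited=$((waited + interval_s))
    done
    echo "Service is ready"
}
```

Running wait_for_nim_ready with no arguments uses the defaults; a script can then continue to the inference steps only when the function returns success.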
Install the Riva Python client.
Riva uses gRPC APIs. You can download the proto files from Riva gRPC Proto files and compile them to a target language using the Protoc compiler. Riva clients in C++ and Python are available at the following locations.
Install Riva Python client
sudo apt-get install python3-pip
pip install -r https://raw.githubusercontent.com/nvidia-riva/python-clients/main/requirements.txt
pip install --force-reinstall git+https://github.com/nvidia-riva/python-clients.git
Download Riva sample client
git clone https://github.com/nvidia-riva/python-clients.git
Run Speech-to-Text (STT) inference.
Riva ASR supports mono, 16-bit audio in WAV, OPUS, and FLAC formats. If you do not have a speech file available, you can use a sample speech file embedded in the Docker container launched in the previous section.
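For WAV input, the mono and 16-bit requirements can be checked locally before sending a file to the service. The sketch below reads the channel count and bits per sample directly from the file header. It assumes the canonical PCM WAV layout, in which the fmt chunk comes first, the channel count is at byte offset 22, and the bits per sample are at byte offset 34. Files with extra chunks before fmt, and OPUS or FLAC files, are not handled. Both function names are hypothetical.

```shell
# Print the unsigned 16-bit little-endian integer at byte offset $2 of file $1.
read_u16_le() {
    file="$1"
    offset="$2"
    # od prints the two bytes as decimal numbers, for example " 16 0".
    set -- $(od -An -tu1 -j "$offset" -N 2 "$file")
    [ $# -eq 2 ] || return 1
    echo $(( $1 + $2 * 256 ))
}

# Succeed only if the WAV header declares one channel and 16 bits per sample.
is_mono_16bit_wav() {
    channels="$(read_u16_le "$1" 22)" || return 1
    bits="$(read_u16_le "$1" 34)" || return 1
    [ "$channels" -eq 1 ] && [ "$bits" -eq 16 ]
}
```

For example, is_mono_16bit_wav my_audio.wav || echo "convert the file first" flags stereo or 24-bit recordings before they reach the transcription scripts.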
Transcription using gRPC API
# Copy the sample WAV file from the running NIM container to the host machine.
docker cp riva-speech:/opt/riva/wav/en-US_sample.wav .
# Streaming mode: Input speech file is streamed to the service chunk-by-chunk. Transcript is printed on the console.
python3 python-clients/scripts/asr/transcribe_file.py --server 0.0.0.0:50051 \
--language-code en-US --input-file en-US_sample.wav
# Offline mode: Input speech file is sent to the service in one shot. Transcript is printed on the console.
python3 python-clients/scripts/asr/transcribe_file_offline.py --server 0.0.0.0:50051 \
--language-code en-US --input-file en-US_sample.wav
Transcription using gRPC API
# Copy the sample WAV file from the running NIM container to the host machine.
docker cp riva-speech:/opt/riva/wav/es-US_sample.wav .
# Streaming mode: Input speech file is streamed to the service chunk-by-chunk. Transcript is printed on the console.
python3 python-clients/scripts/asr/transcribe_file.py --server 0.0.0.0:50051 \
--language-code es-US --input-file es-US_sample.wav
# Offline mode: Input speech file is sent to the service in one shot. Transcript is printed on the console.
python3 python-clients/scripts/asr/transcribe_file_offline.py --server 0.0.0.0:50051 \
--language-code es-US --input-file es-US_sample.wav
Whisper supports transcription in multiple languages. See Supported Languages for the list of all available languages and their corresponding codes. Specifying the input language is optional but recommended, because it improves accuracy and latency.
Copy an example audio file or use your own.
# Copy the sample WAV file from the running NIM container to the host machine.
docker cp riva-speech:/opt/riva/wav/en-US_sample.wav .
Transcription using gRPC API
# Offline mode: Input speech file is sent to the service in one shot. Transcript is printed on the console.
python3 python-clients/scripts/asr/transcribe_file_offline.py --server 0.0.0.0:50051 \
--input-file en-US_sample.wav --custom-configuration source_language:en
Transcription using HTTP API
# Invoke the HTTP endpoint for transcription.
curl -s http://localhost:9000/v1/audio/transcriptions -F language=en -F file="@en-US_sample.wav"
Whisper supports translation from multiple languages to English. See Supported Languages for the list of all available languages and their corresponding codes. Specifying the input language is optional but recommended, because it improves accuracy and latency.
Copy an example audio file or use your own.
# Copy the sample WAV file from the running NIM container to the host machine.
docker cp riva-speech:/opt/riva/wav/es-US_sample.wav .
Translation to English using gRPC API
# Offline mode: Input speech file is sent to the service in one shot. Transcript is printed on the console.
python3 python-clients/scripts/asr/transcribe_file_offline.py --server 0.0.0.0:50051 \
--input-file es-US_sample.wav --custom-configuration source_language:es,task:translate
Translation to English using HTTP API
# Invoke the HTTP endpoint for translation.
curl -s http://localhost:9000/v1/audio/translations -F language=es -F file="@es-US_sample.wav"
Note
Refer to the Customization page for more information on customizing model behavior.
Runtime Parameters for the Container#
| Flags | Description |
|---|---|
| -it |  |
| --rm | Delete the container after it stops (see Docker docs). |
| --name=riva-speech | Give a name to the NIM container. Use any preferred value. |
| --runtime=nvidia | Ensure NVIDIA drivers are accessible in the container. |
| --gpus '"device=0"' | Expose NVIDIA GPU 0 inside the container. If you are running on a host with multiple GPUs, you need to specify which GPU to use. See GPU Enumeration for further information on mounting specific GPUs. |
| --shm-size=8GB | Allocate host memory for multi-GPU communication. |
| -e NGC_API_KEY | Provide the container with the token necessary to download the appropriate models and resources from NGC. See NGC Authentication. |
| -e NIM_MANIFEST_PROFILE | Specify the model to load. |
| -e NIM_HTTP_API_PORT=9000 | Specify the port to use for the HTTP endpoint. The port can have any value except 8000. |
| -e NIM_GRPC_API_PORT=50051 | Specify the port to use for the gRPC endpoint. |
| -p 9000:9000 | Forward the port where the NIM HTTP server is published inside the container to access from the host system. The left-hand side of |
| -p 50051:50051 | Forward the port where the NIM gRPC server is published inside the container to access from the host system. The left-hand side of |
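In Docker's -p host:container syntax, the host port and the container port do not have to match. The sketch below prints the pair of flags needed to serve the HTTP endpoint on a chosen host port, and rejects 8000 as the container-side port, following the restriction on the HTTP port in the table. The function name is hypothetical.

```shell
# Hypothetical helper: print the -e/-p flags for the HTTP endpoint.
# $1 = port on the host, $2 = port inside the container (must not be 8000).
nim_http_port_flags() {
    host_port="$1"
    container_port="$2"
    for p in "$host_port" "$container_port"; do
        case "$p" in
            # Reject empty values and anything containing a non-digit.
            ''|*[!0-9]*) echo "Not a port number: $p" >&2; return 1 ;;
        esac
    done
    if [ "$container_port" -eq 8000 ]; then
        echo "NIM_HTTP_API_PORT cannot be 8000" >&2
        return 1
    fi
    echo "-e NIM_HTTP_API_PORT=$container_port -p $host_port:$container_port"
}
```

For example, nim_http_port_flags 9001 9000 prints flags that publish the container's port 9000 on host port 9001, so the health check would then target http://localhost:9001.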
Stopping the Container#
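The following commands stop and remove the running Docker container. If you started the container with --rm, as in the examples above, docker stop also removes it, so docker rm reports that the container does not exist.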
docker stop riva-speech
docker rm riva-speech
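In scripts, it can be convenient to stop the container only when it is actually running, so that repeated runs do not fail. The following sketch checks docker ps first. The function name is hypothetical, and the name filter matches substrings, so it assumes no other running container has a name containing riva-speech.

```shell
# Hypothetical helper: stop the NIM container only if it is currently running.
# $1 = container name (defaults to riva-speech, as used above).
stop_nim_container() {
    name="${1:-riva-speech}"
    # docker ps --quiet prints the ID of each matching running container, or nothing.
    running_id="$(docker ps --quiet --filter "name=$name")"
    if [ -n "$running_id" ]; then
        docker stop "$name"
    else
        echo "Container $name is not running"
    fi
}
```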