Sample application connecting to Audio2Face-3D#

A sample application is provided to demonstrate how to communicate with the Audio2Face-3D microservices. This Python application interacts with the A2F-3D NIM.

Assumptions#

The Audio2Face-3D NIM is up and running.

Setting up the sample application#

Clone the repository: NVIDIA/Audio2Face-3D-Samples

Go to the scripts/audio2face_3d_microservices_interaction_app subfolder.

Then follow the setup instructions in the README.md.
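For example, assuming the repository is hosted on GitHub under the name above, the setup steps look like this:

$ git clone https://github.com/NVIDIA/Audio2Face-3D-Samples.git
$ cd Audio2Face-3D-Samples/scripts/audio2face_3d_microservices_interaction_app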

To check that the application was set up correctly and to see how to use it, run:

$ python3 a2f_3d.py --help
usage: a2f_3d.py [-h] {health_check,run_inference} ...

Sample python3 application to send audio and receive animation data and emotion data through the A2F-3D pipeline.

positional arguments:
  {health_check,run_inference}
    health_check        Check GRPC service health
    run_inference       Send GRPC request and run inference for an audio file

options:
  -h, --help            show this help message and exit

NVIDIA CORPORATION. All rights reserved.

Health checking#

To check that the Audio2Face-3D service is up and running, run:

$ python3 a2f_3d.py health_check --url <ip>:<port>
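The health_check subcommand wraps a standard gRPC health probe. If you want to perform the same check from your own code, here is a minimal sketch; it assumes the service exposes the standard gRPC Health Checking Protocol (grpc.health.v1) and that the grpcio and grpcio-health-checking packages are installed:

# Minimal gRPC health probe against the A2F-3D NIM.
import grpc
from grpc_health.v1 import health_pb2, health_pb2_grpc

channel = grpc.insecure_channel("127.0.0.1:52000")  # replace with <ip>:<port>
stub = health_pb2_grpc.HealthStub(channel)
# An empty service name queries the overall health of the server.
response = stub.Check(health_pb2.HealthCheckRequest(service=""))
print(health_pb2.HealthCheckResponse.ServingStatus.Name(response.status))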

Interacting with Audio2Face-3D#

The sample Python application can be used as follows:

$ python3 a2f_3d.py run_inference <audio_file.wav> <config.yml> -u <ip>:<port> [--skip-print-to-files] [--print-fps]

For example:

$ python3 a2f_3d.py run_inference audio.wav config_mark_v2.yml -u 127.0.0.1:52000
  • The script requires two parameters: an audio file in 16-bit PCM format and a YAML configuration file containing the emotion parameters. (A quick format check is sketched after this list.)

  • Additionally, it accepts a -u parameter specifying the URL of the A2F-3D NIM. For a quick-start deployment, use 127.0.0.1:52000.

  • To test the script, you must provide your own audio file.

  • Optionally, you can skip writing the results to files by passing --skip-print-to-files.

  • Optionally, you can print performance-measurement data by passing --print-fps.
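As mentioned in the first bullet, the audio file must be 16-bit PCM. A quick way to verify this before sending the file is to inspect the WAV header; the snippet below is a hypothetical helper, not part of the sample application:

# Check that an input file is a 16-bit PCM WAV, as the script expects.
import wave

with wave.open("audio.wav", "rb") as wav:
    assert wav.getsampwidth() == 2, "expected 16-bit samples"
    print(f"channels={wav.getnchannels()}, rate={wav.getframerate()} Hz, "
          f"frames={wav.getnframes()}")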

Results#

Running the inference command produces an output folder containing four files. You can explore the results by running the following command, replacing <output_folder> with the folder name printed by the a2f_3d.py script:

$ ls -l <output_folder>/
-rw-rw-r-- 1 user user    328 Nov 14 15:46 a2f_3d_input_emotions.csv
-rw-rw-r-- 1 user user  65185 Nov 14 15:46 a2f_3d_smoothed_emotion_output.csv
-rw-rw-r-- 1 user user 291257 Nov 14 15:46 animation_frames.csv
-rw-rw-r-- 1 user user 406444 Nov 14 15:46 out.wav
  • out.wav: contains the received audio

  • animation_frames.csv: contains the blendshapes

  • a2f_3d_input_emotions.csv: contains the emotions provided as input in the gRPC request

  • a2f_3d_smoothed_emotion_output.csv: contains the emotions smoothed over time
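The exact CSV columns depend on the service version, so a schema-agnostic way to take a first look at the animation output is to print only what is present (a sketch, assuming the file names above):

# Print the header and frame count of the blendshape output.
import csv

with open("animation_frames.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    print(f"{len(header)} columns, first few: {header[:5]}")
    print(f"{sum(1 for _ in reader)} animation frames")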

Note

In previous versions, this sample application generated a file named a2e_emotion_output.csv, which contained a blend of a2f_input_emotions and the emotions inferred by our Audio2Emotion models, prior to any post-processing or smoothing. In the new Audio2Face-3D microservice, these steps are integrated for optimal performance, so intermediate inferred emotions are no longer available.

Resampling Audio Data#

Audio2Face-3D processes audio at an optimal sample rate of 16 kHz. This is the recommended sample rate for the best performance and quality during inference.

Note

It is crucial to use audio data sampled at 16 kHz to achieve the most accurate results. Our system allows for both downsampling and upsampling directly within the audio processing pipeline; however, we strongly recommend against upsampling.

Resampling Guidelines#

  1. Optimal Sample Rate: The system is optimized for audio data at 16 kHz. This sample rate offers the best balance between performance and audio quality.

  2. Downsampling: If your audio data is sampled above 16 kHz, downsampling is necessary and is handled directly by our processing pipeline, ensuring that your audio matches the optimal sample rate for our application. You can also resample client-side; see the sketch after this list.

  3. Upsampling: The pipeline can upsample audio sampled below 16 kHz. However, be aware that upsampling often leads to significant quality degradation: the interpolation it requires can introduce artifacts and distortions that adversely affect inference.
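If you prefer to resample client-side before sending audio (see item 2 above), here is a minimal sketch. The soundfile and scipy packages are an assumption, not dependencies of the sample application; any equivalent resampler works:

# Resample a WAV file to 16 kHz, 16-bit PCM, before sending it to the service.
from math import gcd

import soundfile as sf
from scipy.signal import resample_poly

TARGET_RATE = 16_000

data, rate = sf.read("input.wav", dtype="float32")
if rate != TARGET_RATE:
    # resample_poly takes integer up/down factors; reduce them with the gcd.
    g = gcd(rate, TARGET_RATE)
    data = resample_poly(data, TARGET_RATE // g, rate // g, axis=0)
sf.write("input_16k.wav", data, TARGET_RATE, subtype="PCM_16")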

Warning

Using audio data sampled at less than 16 kHz is not recommended. While our system supports upsampling, it may result in poor inference outcomes. For best results, always use or convert your audio to 16 kHz or higher before processing.

By adhering to these guidelines, you can ensure that your audio data is processed in the most effective and quality-preserving manner.