Notes

TRT Engine Generation

During deployment, a TRT engine must be generated to optimize the models for the given GPU. The engine must be regenerated whenever the deployment environment changes, especially when the GPU changes to one with a different architecture or compute capability. A generated TRT engine can potentially be reused on machines with exactly the same controlled configuration (same hardware + Docker). We recommend always regenerating the TRT engine whenever hardware changes are made.
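One way to apply the reuse rule above is to store a fingerprint of the environment next to the generated engine and regenerate when it no longer matches. The following is a minimal sketch of that idea, not part of the product. The function names and the three inputs (GPU name, compute capability, container image) are assumptions chosen for illustration, and how you obtain those values is left to your deployment tooling.

```python
import hashlib
import json


def environment_fingerprint(gpu_name, compute_capability, container_image):
    """Hash the settings the engine is assumed to depend on (hypothetical helper)."""
    payload = json.dumps(
        {
            "gpu_name": gpu_name,
            "compute_capability": compute_capability,
            "container_image": container_image,
        },
        sort_keys=True,  # stable key order so equal inputs give equal hashes
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_regeneration(stored_fingerprint, current_fingerprint):
    """Regenerate when nothing was stored yet or the environment differs."""
    return stored_fingerprint is None or stored_fingerprint != current_fingerprint
```

When in doubt, regenerating remains the safer choice; the fingerprint only helps skip regeneration on machines known to be identical.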

Resampling Audio Data

Our application performs audio processing using an optimal sample rate of 16 kHz. This is the recommended sample rate for the best performance and quality during inference.

Note

It is crucial to use audio data sampled at 16 kHz to achieve the most accurate results. Our system allows for both downsampling and upsampling directly within the audio processing pipeline; however, we strongly recommend against upsampling.

Resampling Guidelines

  1. Optimal Sample Rate: The system is optimized for audio data at 16 kHz. This sample rate offers the best balance between performance and audio quality.

  2. Downsampling: If your audio data is sampled above 16 kHz, downsampling is necessary and can be handled directly by our processing pipeline. This ensures that your audio data matches the optimal sample rate for our application.

  3. Upsampling: We allow for upsampling in our pipeline if your audio is sampled below 16 kHz. However, please be aware that upsampling often leads to significant quality degradation. The interpolation required in upsampling can introduce artifacts and distortions that adversely affect the performance of the inference process.

Warning

Using audio data sampled at less than 16 kHz is not recommended. While our system supports upsampling, it may result in poor inference outcomes. For best results, always use or convert your audio to 16 kHz or higher before processing.

By adhering to these guidelines, you can ensure that your audio data is processed in the most effective and quality-preserving manner.
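The guidelines above can be checked on the client side before audio is sent. The sketch below uses only the Python standard library to read the sample rate of a WAV file and report which case applies. The function names and return labels are illustrative assumptions and are not part of the application's API.

```python
import wave

TARGET_RATE_HZ = 16000  # optimal sample rate stated in this section


def classify_sample_rate(rate_hz):
    """Map a sample rate to the matching resampling guideline."""
    if rate_hz == TARGET_RATE_HZ:
        return "optimal"      # no resampling needed
    if rate_hz > TARGET_RATE_HZ:
        return "downsample"   # handled by the pipeline
    return "upsample"         # supported, but not recommended


def classify_wav_file(source):
    """Read the sample rate from a WAV path or file-like object and classify it."""
    with wave.open(source, "rb") as wav_file:
        return classify_sample_rate(wav_file.getframerate())
```

A result of "upsample" is a signal to look for a higher-quality source recording rather than to rely on the pipeline's upsampling.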

Known issues

  • Some emotions on Claire v1.3 have a lighter impact than others (e.g., anger and cheekiness). You can set a higher upperFaceStrength to increase the effect of the emotion on the avatar.

  • Non-verbal human sounds (e.g., “hmmmm…”) and non-human audio do not translate well into facial expressions and result in random lip motions. This is an area identified for future improvement.

Q&A

The Avatar doesn’t close its mouth at the end of the audio clip. How can I fix it?

Several requirements must be met for the mouth to close. We suggest checking them in this order:

  • The audio clip needs to end with some silence. If it does not, append some silence at the end.

  • The emotions must be reset to neutral. If they are not, send a neutral emotion (e.g., an empty map or joy=0) followed by some silent audio.

Note

If you set the emotion to neutral at the very end of the audio clip, you will not see any change, because Audio2Face applies some smoothing to make emotion changes less abrupt. You therefore need to push some silent audio after the emotion change.

For example, at the end of an audio clip in the Audio2Face NIM, we set all emotions to 0 and send 1.5 seconds of silence.

If preferred_emotion_strength from EmotionPostProcessingParameters is close to zero, the neutral emotion will not be taken into account. You might need to increase that value to allow the mouth to close.

  • The selected face parameters might prevent the mouth from closing. Try adjusting them to make sure the mouth closes.

  • The blend shape multipliers and offsets might prevent the mouth from closing. Try adjusting them to make sure the mouth closes.
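The first step in the list above, appending silence to the end of a clip, can be sketched as follows. This assumes raw signed PCM audio (for example 16-bit), where zero-valued samples are silent. The function name and parameters are illustrative, and the 1.5-second default simply mirrors the example given in the note above.

```python
def append_silence(pcm_bytes, sample_rate_hz, seconds=1.5,
                   channels=1, sample_width_bytes=2):
    """Return the raw signed PCM buffer with zero-valued samples appended."""
    silent_frames = int(round(sample_rate_hz * seconds))
    # One frame holds one sample per channel; zero bytes are silence
    # for signed PCM (this would not hold for unsigned 8-bit audio).
    padding = b"\x00" * (silent_frames * channels * sample_width_bytes)
    return pcm_bytes + padding
```

For 16 kHz mono 16-bit audio, 1.5 seconds of silence adds 24,000 frames, or 48,000 bytes, to the buffer.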

gRPC Response Status Code

Users will receive a gRPC error message with the code nvidia_ace::status::v1::Status_Code_ERROR in the following cases, among others:

  • The gRPC request is missing a data field or audio header.

  • The audio header is invalid (e.g., unsupported number of audio channels, sample rate, or bit depth).

  • The audio buffer is empty in the data field.

  • The id (request_id, stream_id, target_object_id) exceeds maxLenUUID or contains unsupported characters.

  • The number of concurrent streams has reached the streamNumber limit.

  • The FPS is too low.

Otherwise, the gRPC request will return with the code nvidia_ace::status::v1::Status_Code_SUCCESS.
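Some of the error cases listed above can be caught on the client before a request is sent. The sketch below is a hypothetical pre-flight check and is not part of the gRPC API: the function name, its arguments, and the error strings are assumptions. The maximum ID length is passed in by the caller and should mirror the deployed maxLenUUID setting, whose value is not stated here. The set of supported ID characters is also not stated, so it is not checked.

```python
def preflight_errors(audio_header, audio_buffer, ids, max_id_length):
    """Return a list of problems that would likely trigger Status_Code_ERROR.

    audio_header:  any object, or None if no header was built
    audio_buffer:  raw audio bytes for the data field
    ids:           dict such as {"request_id": "...", "stream_id": "..."}
    max_id_length: caller-supplied limit mirroring maxLenUUID (assumption)
    """
    errors = []
    if audio_header is None:
        errors.append("missing audio header")
    if not audio_buffer:
        errors.append("empty audio buffer")
    for name, value in ids.items():
        if len(value) > max_id_length:
            errors.append(f"{name} exceeds {max_id_length} characters")
    return errors
```

An empty list means none of these checks failed; the server can still reject the request for the other reasons listed above, such as an invalid header or the stream limit.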