Release Notes

v1.3.0

SDK Versions

  • Audio2Face & Audio2Emotion: 0.5.1

Audio2Face and Audio2Emotion were merged into a single SDK.

Features

  • Added optional tongue blendshape support in the gRPC animation output. For an example of how to use this, see the Flexible Configuration Management section in the Audio2Face-3D NIM Container Deployment and Configuration Guide.

  • Added TLS/mTLS support for secure gRPC communication (see the client-side sketch after this list).

  • Added one-click deployment for Azure and AWS.

  • Added auto-profile selection and a command to list available profiles. For more information, see the Optimized Configurations section under the Support Matrix.
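
A minimal client-side sketch of opening a secure channel with Python's grpc package. The endpoint address and certificate paths below are placeholders, not values shipped with the service; for server-only TLS, pass just the CA certificate and omit the client key/certificate pair.

```python
import grpc

# Placeholder paths: the CA certificate that signed the server's
# certificate, plus a client key/cert pair if the server requires mTLS.
with open("ca.crt", "rb") as f:
    ca_cert = f.read()
with open("client.key", "rb") as f:
    client_key = f.read()
with open("client.crt", "rb") as f:
    client_cert = f.read()

# TLS only: grpc.ssl_channel_credentials(root_certificates=ca_cert)
# mTLS: also supply the client's private key and certificate chain.
credentials = grpc.ssl_channel_credentials(
    root_certificates=ca_cert,
    private_key=client_key,
    certificate_chain=client_cert,
)

# Placeholder target; substitute the deployed service's address and port.
channel = grpc.secure_channel("a2f-3d.example.com:52000", credentials)
```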

Limitations

  • Deploying mark_v2.3 with the same number of streams as james_v2.3 or claire_v2.3 may lead to CUDA out-of-memory errors, because mark_v2.3 uses more memory than the other two models. This issue has been observed on GPUs with lower VRAM, such as the 5080.

Deprecation Warnings

The following items are deprecated and will be removed in the next release:

  • The unidirectional gRPC endpoints. The bidirectional gRPC endpoint will become the only supported endpoint for inference.

  • The 1.0 version of the Audio2Face-3D Microservice.

  • The Python wheel used in our sample application will move to the PyPI repository; we will stop maintaining it at NVIDIA/Audio2Face-3D-Samples.

v1.2.0

SDK Versions

  • Audio2Face: 0.22.4

  • Audio2Emotion: 0.7.9

Features

  • The new service is now available as a downloadable NIM, seamlessly integrating into the NVIDIA NIM ecosystem.

  • The new James 2.3 inference model provides better lip-sync quality, stronger upper-face expressions for different emotions, and fewer lip-stretch artifacts during silence.

  • The new Claire 2.3 inference model provides better lip-sync quality, including the F, V, M, B, P, U, and S sounds, and stronger upper-face expressions for different emotions.

  • The new Mark 2.3 inference model provides better lip-sync quality, including the F, V, M, B, P, U, and S sounds.

  • Introduced support for bidirectional streaming with gRPC, enabling real-time communication between clients and the service while eliminating the need for the previously required A2F Controller. A minimal client sketch follows this list.

  • Added runtime control for clamping blendshape values between 0 and 1.

  • Integrated OpenTelemetry for advanced observability, providing unified tracing and metrics.

  • Added functionality to download pre-built TensorRT (TRT) engines from NVCF, reducing service setup complexity.

  • Introduced an experimental gRPC endpoint for exporting the configuration of a running service instance.

  • Updated the logging system to output application logs in structured JSON format.
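
As a sketch of the bidirectional pattern described above, the snippet below streams audio chunks up and consumes animation responses back on the same call. The module, stub, RPC, and field names (a2f_3d_pb2, A2F3DServiceStub, ProcessAudioStream, audio_buffer) are assumptions for illustration; the actual names are defined by the protos shipped with the microservice.

```python
import grpc

# Assumed generated-binding names, for illustration only.
from a2f_3d_pb2 import AudioStreamRequest
from a2f_3d_pb2_grpc import A2F3DServiceStub

def request_stream(pcm_chunks):
    """Client -> service direction: yield one request per audio chunk."""
    for chunk in pcm_chunks:
        yield AudioStreamRequest(audio_buffer=chunk)

def run(target, pcm_chunks):
    # A secure channel (as sketched under v1.3.0) can be substituted here.
    with grpc.insecure_channel(target) as channel:
        stub = A2F3DServiceStub(channel)
        # A single bidirectional call: requests stream up while animation
        # responses stream back, with no A2F Controller in between.
        for response in stub.ProcessAudioStream(request_stream(pcm_chunks)):
            print(response)  # e.g. apply blendshapes / queue output audio
```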

v1.0.0

SDK Versions

  • Audio2Face: 0.17.0

  • Audio2Emotion: 0.2.2

Features

  • The new Claire 1.3 inference model provides enhanced lip movement and better accuracy for P and M sounds.

  • The new Mark 2.2 inference model provides better lip sync and facial performance quality when used with MetaHuman characters.

  • Users can now specify preferred emotions, enabling personalized outputs tailored to specific applications such as interactive avatars and virtual assistants.

  • Added emotional output to the microservice to help align other downstream animation components.

  • New output audio sampling rates are supported in addition to 16 kHz: 22.05 kHz, 44.1 kHz, and 48 kHz.

  • Added the ability to tune each stream at runtime with unique face parameters, emotion parameters, blendshape multipliers, and blendshape offsets (illustrated after this list).
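
A sketch of what per-stream runtime tuning could look like as a plain data structure. Every key and value below is hypothetical, chosen only to illustrate the four control groups named above; the authoritative parameter names live in the service's protos and configuration guide.

```python
# Hypothetical per-stream tuning payload; field names are illustrative,
# not the service's actual schema.
stream_tuning = {
    "face_parameters": {"lower_face_smoothing": 0.006},
    "emotion_parameters": {
        # Preferred emotions bias the emotional state for this stream.
        "preferred_emotion": {"joy": 0.8, "amazement": 0.2},
        "emotion_strength": 0.6,
    },
    # Presumably applied per blendshape as value * multiplier + offset.
    "blendshape_multipliers": {"JawOpen": 1.2},
    "blendshape_offsets": {"BrowInnerUp": 0.05},
}
```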

Key Improvements

  • Improved the gRPC protocol to use less data and provide a more efficient stream for scalability. The USD parser is no longer required.

  • Improved blendshape-solve threading for better scalability.