Release Notes#

v2.0#

Added blendshape streaming controls in advanced_config.yaml: pipeline_parameters.burst_mode and pipeline_parameters.blendshape_streaming_fps.
Updated default model variants (e.g., james_v2.3.1 and claire_v2.3.1).
Added diffusion model presets for Multi v3.2 (multi_v3.2_{james,claire,mark}) via PERF_A2F_MODEL (see Getting Started). Note: Diffusion models require more GPU memory than regression models, resulting in lower maximum batch sizes (see Support Matrix).
Added diffusion configuration option a2f.diffusion_model.constant_noise for controlling deterministic vs. non-deterministic diffusion inference.
Tongue blendshape output is configurable via enable_tongue_blendshapes; when enabled, A2F outputs 68 blendshape weights (52 face + 16 tongue) (see Audio2Face-3D Microservice).
Added GPU blendshape solving option via a2f.use_gpu_solver in advanced_config.yaml. When enabled (default), blendshape solving runs entirely on GPU, improving performance by avoiding CPU-GPU data transfers during the solve step.
Updated the Python interaction app wheel used in examples to nvidia_ace-1.2.0.
Logging config values were updated: err → error and off → disabled (see deployment_config.yaml).

Fixed HIGH severity CVEs in container dependencies:
- CVE-2025-68973: Upgraded gnupg packages
- GHSA-8rrh-rw8j-w5fx: Upgraded wheel to 0.46.2+
- GHSA-58pv-8j8x-9vj2: Upgraded setuptools to 75.0.0+
Updated nimtools stack: nimtools 1.6.1, nimlib 0.13.10

The legacy unidirectional streaming gRPC endpoints have been removed. Only the bidirectional streaming endpoint is supported for inference. The use_bidirectional and unidirectional fields in deployment_config.yaml are still parsed for backwards compatibility but have no effect.
Stylization configuration schema changed in v2.0 (see Audio2Face-3D NIM Container Deployment and Configuration Guide).
a2e.enabled: false no longer implies “all emotions are zero”. In v2.0 it disables audio-based (GPU) emotion inference, but the emotion post-processing path can still run (for example, to apply preferred/runtime emotion inputs). See Audio2Face-3D NIM Container Deployment and Configuration Guide.

GPU device ID mismatch warning in nim_list_model_profiles: When running nim_list_model_profiles, you may see “No compatible profiles found” with all profiles listed as “Incompatible with system”, even on supported GPUs like RTX 5090, RTX 5080, or RTX 6000. This warning does not prevent the NIM from running—the NIM includes fallback logic that matches by GPU name rather than device ID. See GPU Device ID Mismatch Warning (nim_list_model_profiles) for details and workarounds.
protobuf CVE (GHSA-7gcm-g887-7qv7): Cannot be fixed yet as the required protobuf version (5.29.6) does not exist. The current nimlib requires protobuf <6.0.0 but the latest 5.x is 5.29.5. This requires an upstream fix in nimlib.

Audio2Face and Audio2Emotion were merged into a single SDK.

Added optional tongue blendshape support in the gRPC animation output. For an example of how to use this, see Flexible Configuration Management section in Audio2Face-3D NIM Container Deployment and Configuration Guide.
Added TLS/mTLS support for secure gRPC communication.
Added One-click deployment for Azure and AWS.
Added auto-profile selection and a command to list available profiles. For more information, see the Optimized Configurations section under Support Matrix.

Using mark_v2.3 instead of james_v2.3 or claire_v2.3 may lead to CUDA out of memory errors when deploying with the same number of streams, since mark_v2.3 uses more memory than the other two models. This issue has been observed on GPUs with lower VRAM, such as the 5080.
When starting the Audio2Face-3D NIM container, version number is printed as Audio2Face-3D GA 1.3.14 instead of the correct version of 1.3.15. There is no effect on functionality.

The following items are deprecated and will be removed in the next release:

The unidirectional gRPC endpoints are deprecated and will be removed. The bidirectional gRPC endpoint will be the only supported endpoint for inference.
The 1.0 version of Audio2Face-3D Microservice is deprecated and will be removed.
The Python wheel used in our sample application will move to the PyPI repository. We will stop maintaining it at NVIDIA/Audio2Face-3D-Samples

The new service is now available as a downloadable NIM, seamlessly integrating into the NVIDIA NIM ecosystem.
New James 2.3 inference model provides better lip sync quality, stronger upperface expression for different emotions and less lip stretch artifact during silence.
New Claire 2.3 inference model provides better lip sync quality including F V M B P U S sounds and stronger upperface expression for different emotions.
New Mark 2.3 inference model provides better lip sync quality including F V M B P U S sounds.
Introduced support for bidirectional streaming with gRPC, enabling real-time communication between clients and the service while eliminating the need for the previously required A2F Controller.
Added runtime control for clamping blendshape values between 0 and 1. When enabled (enable_clamping_bs_weight: true), output weights are constrained to the [0.0, 1.0] range expected by most animation systems. Recommended for production use.
Integrated OpenTelemetry for advanced observability, providing unified tracing and metrics.
Added functionality to download pre-built TensorRT (TRT) engines from NVCF, reducing service setup complexity.
Introduced an experimental gRPC endpoint for exporting configurations for a running service instance.
Updated the logging system to output application logs in structured JSON format.

New Claire 1.3 inference model provides enhanced lip movement and better accuracy for P and M sounds.
New Mark 2.2 inference model provides better lip sync and facial performance quality when used with Metahuman characters.
Users can now specify preferred emotions, enabling personalized outputs tailored to specific applications such as interactive avatars and virtual assistants.
Added emotional output to the microservice to help align other downstream animation components.
New output audio sampling rates supported in addition to 16kHz: 22.05kHz, 44.1kHz, 48kHz.
Added the ability to tune each stream at runtime with unique face parameters, emotions parameters, blendshape multipliers, and blendshape offsets.

Improved the gRPC protocol to use less data and provide a more efficient stream for scalability. USD parser is no longer required.
Improved blendshape solve threading to improve scalability.