Audio2Face-3D Microservice#

Bidirectional gRPC Communication#

Input#

The A2F-3D microservice receives its data over a bidirectional streaming gRPC. The data is composed of:

  1. an audio stream header containing information about the upcoming audio data, as well as face parameters, post-processing options, blendshape parameters and emotion parameters.

  2. audio data, as well as emotion data with a time code indicating when to start applying the emotion (see the sketch below).
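To make the ordering concrete, the sketch below shows the request-side message flow in Python: a single audio stream header first, then audio buffers that can optionally carry emotion updates with time codes. The class and field names here are hypothetical placeholders, not the actual protobuf-generated types (see the gRPC protocol buffers section for those).

# Minimal sketch of the request-side message order: one header, then data.
# The classes below are placeholders for the real protobuf-generated messages.
from dataclasses import dataclass, field
from typing import Dict, Iterator

@dataclass
class AudioStreamHeader:                # placeholder for the real header message
    sample_rate: int = 16000            # mono 16-bit PCM assumed
    face_parameters: Dict[str, float] = field(default_factory=dict)

@dataclass
class AudioWithEmotion:                 # placeholder for the real data message
    audio_buffer: bytes = b""
    emotions: Dict[str, float] = field(default_factory=dict)
    time_code: float = 0.0

def request_messages(pcm_chunks: Iterator[bytes]) -> Iterator[object]:
    # 1. The audio stream header is always sent first.
    yield AudioStreamHeader(face_parameters={"upperFaceStrength": 1.0})
    # 2. Audio buffers follow, optionally carrying emotion updates with time codes
    #    (time codes below assume 100 ms chunks, purely for illustration).
    for index, chunk in enumerate(pcm_chunks):
        emotions = {"joy": 1.0} if index == 0 else {}
        yield AudioWithEmotion(audio_buffer=chunk, emotions=emotions,
                               time_code=index * 0.1)

if __name__ == "__main__":
    for message in request_messages(iter([b"\x00" * 3200, b"\x00" * 3200])):
        print(type(message).__name__)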

The sample runtime configuration file for the gRPC interactions documents the different parameters that can be used as part of the gRPC request. You can find it in our GitHub repo. We also provide an example below:

Warning

This configuration file is a runtime configuration file. Although it looks similar to the deployment-time one, the runtime and deployment-time configuration files should not be confused. The two files differ in case convention (snake_case vs. camelCase) and in structure. Please refer to our examples for guidance in A2F-3D NIM Manual Container Deployment and Configuration.

config_james.yml
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Parameters related to the face
face_parameters:
  # Controls the range of motion on the upper regions of the face
  # Min value: 0.0
  # Max value: 2.0
  upperFaceStrength: 1
  # Applies temporal smoothing to the upper face motion
  # Min value: 0.0
  # Max value: 0.1
  upperFaceSmoothing: 0.001
  # Controls the range of motion on the lower regions of the face
  # Min value: 0.0
  # Max value: 2.0
  lowerFaceStrength: 1.2
  # Applies temporal smoothing to the lower face motion
  # Min value: 0.0
  # Max value: 0.1
  lowerFaceSmoothing: 0.006
  # Determines the boundary between the upper and lower regions of the face
  # Min value: 0.0
  # Max value: 1.0
  faceMaskLevel: 0.6
  # Determines how smoothly the upper and lower face regions blend on the boundary
  # Min value: 0.001
  # Max value: 0.5
  faceMaskSoftness: 0.0085
  # Controls the range of motion of the skin.
  # Min value: 0.0
  # Max value: 2.0
  skinStrength: 1.0
  # Adjusts the default pose of eyelid open-close
  # Min value: -1.0
  # Max value: 1.0
  eyelidOpenOffset: 0.06
  # Adjusts the default pose of lip close-open
  # Min value: -0.02
  # Max value: 0.2
  lipOpenOffset: -0.02

# Contains multipliers and offsets to be applied to the inference result
# For more information about blendshapes see:
# https://developer.apple.com/documentation/arkit/arfaceanchor/blendshapelocation
blendshape_parameters:
  enable_clamping_bs_weight: False
  multipliers:
    EyeBlinkLeft: 1.0
    EyeLookDownLeft: 0.0
    EyeLookInLeft: 0.0
    EyeLookOutLeft: 0.0
    EyeLookUpLeft: 0.0
    EyeSquintLeft: 1.0
    EyeWideLeft: 1.0
    EyeBlinkRight: 1.0
    EyeLookDownRight: 0.0
    EyeLookInRight: 0.0
    EyeLookOutRight: 0.0
    EyeLookUpRight: 0.0
    EyeSquintRight: 1.0
    EyeWideRight: 1.0
    JawForward: 0.7
    JawLeft: 0.2
    JawRight: 0.2
    JawOpen: 0.8
    MouthClose: 0.3
    MouthFunnel: 1.0
    MouthPucker: 1.0
    MouthLeft: 0.2
    MouthRight: 0.2
    MouthSmileLeft: 1.2
    MouthSmileRight: 1.2
    MouthFrownLeft: 0.5
    MouthFrownRight: 0.5
    MouthDimpleLeft: 0.8
    MouthDimpleRight: 0.8
    MouthStretchLeft: 0.05
    MouthStretchRight: 0.05
    MouthRollLower: 0.8
    MouthRollUpper: 0.5
    MouthShrugLower: 1.0
    MouthShrugUpper: 0.4
    MouthPressLeft: 0.8
    MouthPressRight: 0.8
    MouthLowerDownLeft: 0.8
    MouthLowerDownRight: 0.8
    MouthUpperUpLeft: 0.8
    MouthUpperUpRight: 0.8
    BrowDownLeft: 1.2
    BrowDownRight: 1.2
    BrowInnerUp: 1.3
    BrowOuterUpLeft: 0.8
    BrowOuterUpRight: 0.8
    CheekPuff: 0.2
    CheekSquintLeft: 1.0
    CheekSquintRight: 1.0
    NoseSneerLeft: 0.8
    NoseSneerRight: 0.8
    TongueOut: 0.0
  offsets:
    EyeBlinkLeft: 0.0
    EyeLookDownLeft: 0.0
    EyeLookInLeft: 0.0
    EyeLookOutLeft: 0.0
    EyeLookUpLeft: 0.0
    EyeSquintLeft: 0.0
    EyeWideLeft: 0.0
    EyeBlinkRight: 0.0
    EyeLookDownRight: 0.0
    EyeLookInRight: 0.0
    EyeLookOutRight: 0.0
    EyeLookUpRight: 0.0
    EyeSquintRight: 0.0
    EyeWideRight: 0.0
    JawForward: 0.0
    JawLeft: 0.0
    JawRight: 0.0
    JawOpen: 0.0
    MouthClose: 0.0
    MouthFunnel: 0.0
    MouthPucker: 0.0
    MouthLeft: 0.0
    MouthRight: 0.0
    MouthSmileLeft: 0.0
    MouthSmileRight: 0.0
    MouthFrownLeft: 0.0
    MouthFrownRight: 0.0
    MouthDimpleLeft: 0.0
    MouthDimpleRight: 0.0
    MouthStretchLeft: 0.0
    MouthStretchRight: 0.0
    MouthRollLower: 0.0
    MouthRollUpper: 0.0
    MouthShrugLower: 0.0
    MouthShrugUpper: 0.0
    MouthPressLeft: 0.0
    MouthPressRight: 0.0
    MouthLowerDownLeft: 0.0
    MouthLowerDownRight: 0.0
    MouthUpperUpLeft: 0.0
    MouthUpperUpRight: 0.0
    BrowDownLeft: 0.0
    BrowDownRight: 0.0
    BrowInnerUp: 0.0
    BrowOuterUpLeft: 0.0
    BrowOuterUpRight: 0.0
    CheekPuff: 0.0
    CheekSquintLeft: 0.0
    CheekSquintRight: 0.0
    NoseSneerLeft: 0.0
    NoseSneerRight: 0.0
    TongueOut: 0.0

# Parameter related to temporal smoothing of emotion
# Expected value range (0, inf]
live_transition_time: 0.0001

# Parameter that sets the beginning emotion in the A2E engine
# Can provide any subset of the following params
# Set 'enable_preferred_emotion=false' when testing this
beginning_emotion:
  amazement: 1.0
  anger: 0.0
  # cheekiness: 0.0
  disgust: 0.0
  fear: 1.0
  # grief: 0.0
  # joy: 0.0
  outofbreath: 0.0
  pain: 1.0
  sadness: 0.0

# Parameters related to the post-processing of emotions.
post_processing_parameters:
  # Increases the spread between emotion values by pushing them higher or lower
  emotion_contrast: 1.0
  # Coefficient for smoothing emotions over time
  live_blend_coef: 0.7
  # Tells the A2F pipeline whether to use the emotion weights defined under this section
  enable_preferred_emotion: false
  # Sets the strength of the preferred emotion (if it is loaded) relative to generated emotions
  preferred_emotion_strength: 0.5
  # Sets the strength of emotions relative to neutral emotion
  emotion_strength: 0.6
  # Sets a hard limit on the number of emotion sliders engaged by A2E - emotions with the highest weights are prioritized
  max_emotions: 3

# Manual emotion input
# Below is a list of `emotion with timecode` entries
# You can add more of them if needed.
# An emotion with timecode is an emotion that is applied at a specific timecode
# Emotions must be between 0.0 and 1.0
# Timecodes must be >= 0.0
emotion_with_timecode_list:
  # this first emotion with timecode will apply joy at the very beginning of the
  # audio clip
  emotion_with_timecode1:
    time_code: 0.0
    emotions:
      amazement: 0.0
      anger: 0.0
      cheekiness: 0.0
      disgust: 0.0
      fear: 0.0
      grief: 0.0
      joy: 1.0
      outofbreath: 0.0
      pain: 0.0
      sadness: 0.0
  # this second emotion with timecode will apply fear after 1 second of audio in the
  # audio clip
  emotion_with_timecode2:
    time_code: 1.0
    emotions:
      amazement: 0.0
      anger: 0.0
      cheekiness: 0.0
      disgust: 0.0
      fear: 1.0
      grief: 0.0
      joy: 0.0
      outofbreath: 0.0
      pain: 0.1
      sadness: 0.0
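
Because the runtime configuration is plain YAML, it can be loaded and adjusted programmatically before being turned into a gRPC request. Below is a minimal sketch, assuming the file above is saved locally as config_james.yml and that PyYAML is available; the printed fields are only illustrative.

# Minimal sketch: load the runtime configuration shown above.
import yaml  # assumes PyYAML is installed (pip install pyyaml)

def load_runtime_config(path: str = "config_james.yml") -> dict:
    """Load the A2F-3D runtime configuration from a YAML file."""
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

if __name__ == "__main__":
    config = load_runtime_config()

    # The top-level keys mirror the sections documented above.
    face = config["face_parameters"]
    post = config["post_processing_parameters"]
    print("upperFaceStrength:", face["upperFaceStrength"])
    print("max_emotions:", post["max_emotions"])

    # Illustrative in-place tweak before building the gRPC request.
    post["emotion_strength"] = min(post["emotion_strength"], 0.6)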

Currently, we only support the mono 16-bit PCM audio format. Arbitrary sample rates are supported, but we advise using 16 kHz for best performance. The minimum and maximum accepted sample rates are defined in the configuration files of the microservice, but we cannot guarantee good results outside of the default values.
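
As an illustration, the sketch below shows one way a client could verify that a local WAV file matches these constraints (mono, 16-bit PCM) and cut it into small buffers before streaming. It uses only the Python standard library; the 100 ms chunk size is an arbitrary choice, not a requirement of the service.

# Minimal sketch: validate a WAV file against the documented constraints
# (mono, 16-bit PCM) and cut it into ~100 ms chunks for streaming.
import wave

def load_pcm_chunks(path: str, chunk_ms: int = 100):
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("Audio must be mono")
        if wav.getsampwidth() != 2:  # 2 bytes per sample == 16-bit PCM
            raise ValueError("Audio must be 16-bit PCM")
        sample_rate = wav.getframerate()
        if sample_rate != 16000:
            print(f"Warning: {sample_rate} Hz is accepted, but 16 kHz is recommended")

        frames_per_chunk = sample_rate * chunk_ms // 1000
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield sample_rate, data

if __name__ == "__main__":
    for sample_rate, chunk in load_pcm_chunks("speech.wav"):
        # Each chunk would be wrapped in an audio message and sent on the
        # bidirectional stream, after the audio stream header.
        print(f"chunk of {len(chunk)} bytes at {sample_rate} Hz")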

Output#

The A2F-3D microservice outputs the following on the bidirectional streaming gRPC:

  1. an animation header containing information about the blendshape names, audio output format, etc.

  2. blendshape data with time code in sync with audio data.

A detailed description of the gRPC protocol buffers is available in the gRPC protocol buffers section.
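
The sketch below mirrors the output structure described above: one animation header, followed by blendshape frames carrying time codes. The message classes are hypothetical placeholders for the real protobuf-generated types documented in the gRPC protocol buffers section.

# Minimal sketch of consuming the A2F-3D output: one animation header,
# then blendshape frames with time codes. The classes below are
# hypothetical placeholders for the real protobuf-generated messages.
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class AnimationHeader:            # placeholder for the real header message
    blendshape_names: List[str]
    audio_sample_rate: int

@dataclass
class BlendshapeFrame:            # placeholder for the real frame message
    time_code: float              # seconds from the start of the audio
    values: List[float]           # one value per name in the header

def consume(stream: Iterable[object]) -> None:
    header = None
    for message in stream:
        if isinstance(message, AnimationHeader):
            header = message      # arrives once, before any frames
        elif isinstance(message, BlendshapeFrame) and header is not None:
            named = dict(zip(header.blendshape_names, message.values))
            print(f"t={message.time_code:.3f}s JawOpen={named.get('JawOpen')}")

if __name__ == "__main__":
    demo = [
        AnimationHeader(["JawOpen", "MouthClose"], 16000),
        BlendshapeFrame(0.0, [0.1, 0.0]),
        BlendshapeFrame(1 / 30, [0.2, 0.05]),
    ]
    consume(demo)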

Frame Rate

The Audio2Face-3D Microservice performs 30 inferences per second of audio.

This means that the output data from Audio2Face-3D must be played back at 30 FPS.

However, the processing speed of Audio2Face-3D is not limited to 30 inferences per second of compute. For example, if a stream reports 300 FPS in the Audio2Face-3D logs, it means that 10 seconds (300 frames / 30 FPS) of audio are processed per second of compute.

So, when receiving data from Audio2Face-3D, you need to buffer it and replay it at 30 FPS. The high output rate prevents jitter caused by network instabilities.

Currently in Audio2Face-3D, FPS logs are printed to stdout twice per second.

Note

While higher stream rates can introduce a brief initial latency due to the extra buffering required, this latency does not persist during playback.
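
To make the buffering requirement concrete, here is a minimal client-side sketch: frames are pushed into a queue as fast as they arrive, and a separate playback loop drains the queue at a fixed 30 FPS. The render_frame callback is a placeholder for whatever drives your renderer; none of this is part of the Audio2Face-3D API.

# Minimal sketch: receive frames as fast as the service sends them,
# but replay them at a fixed 30 FPS.
import queue
import threading
import time

FPS = 30
FRAME_DURATION = 1.0 / FPS

frame_queue: "queue.Queue" = queue.Queue()

def receive_frames(stream) -> None:
    """Producer: push frames into the buffer as soon as they arrive."""
    for frame in stream:
        frame_queue.put(frame)
    frame_queue.put(None)  # sentinel: end of stream

def play_frames(render_frame) -> None:
    """Consumer: pop one frame every 1/30 s and hand it to the renderer."""
    next_deadline = time.monotonic()
    while True:
        frame = frame_queue.get()
        if frame is None:
            break
        render_frame(frame)
        next_deadline += FRAME_DURATION
        time.sleep(max(0.0, next_deadline - time.monotonic()))

if __name__ == "__main__":
    # Fake stream: 90 frames (3 seconds of animation) arriving instantly.
    fake_stream = ({"index": i} for i in range(90))
    threading.Thread(target=receive_frames, args=(fake_stream,)).start()
    play_frames(lambda frame: print("rendering frame", frame["index"]))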

Blendshapes

Audio2Face-3D outputs blend shapes. See ARKit blendShape documentation for more information.

Audio2Face-3D does not animate head, tongue, or eye movement.

The following blend shape values will always be 0 in the Audio2Face-3D output:

  • EyeLookDownRight

  • EyeLookInRight

  • EyeLookOutRight

  • EyeLookUpRight

  • EyeLookDownLeft

  • EyeLookInLeft

  • EyeLookOutLeft

  • EyeLookUpLeft

  • TongueOut

  • HeadRoll

  • HeadPitch

  • HeadYaw

Note

The definition of the blendshape MouthClose deviates from the standard ARKit version. The shape includes the opening of the jaw.
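
When retargeting the output onto a character, the always-zero shapes listed above can simply be dropped so that another system (for example, a gaze or head-motion controller) can drive them. Below is a small illustrative filter, assuming each frame has already been converted into a name-to-value dictionary as in the earlier sketch:

# Illustrative only: drop the blendshapes that Audio2Face-3D never animates
# so another system (gaze, head motion) can drive them instead.
ALWAYS_ZERO = {
    "EyeLookDownRight", "EyeLookInRight", "EyeLookOutRight", "EyeLookUpRight",
    "EyeLookDownLeft", "EyeLookInLeft", "EyeLookOutLeft", "EyeLookUpLeft",
    "TongueOut", "HeadRoll", "HeadPitch", "HeadYaw",
}

def animated_weights(frame: dict) -> dict:
    # Keep only the weights actually driven by Audio2Face-3D.
    # Remember that MouthClose here already includes the jaw opening.
    return {name: value for name, value in frame.items() if name not in ALWAYS_ZERO}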

Legacy Communication#

Note

While this type of communication is supported, we recommend that first-time users of Audio2Face-3D use the bidirectional endpoint.

In this mode, the bidirectional streaming gRPC is split into two client-streaming gRPCs, where Audio2Face-3D acts as both a server and a client.

Input#

The A2F-3D microservice receives its data from a client-streaming RPC. The data is composed of:

  1. an audio stream header containing information about the upcoming audio data, as well as face parameters, post-processing options, blendshape parameters and emotion parameters.

  2. audio data, as well as emotion data with a time code indicating when to start applying the emotion.

Output#

The A2F-3D microservice outputs its data through a client-streaming RPC. The output data is composed of:

  1. an animation header containing information about the blendshape names, audio output format, etc.

  2. blendshape data with time code in sync with audio data.

A detailed description of the gRPC protocol buffers is available in the gRPC protocol buffers section.

Configuration#

See the dedicated Configuration page.