Audio2Face Microservice

Overview

The Audio2Face (A2F) microservice is a key component of our facial animation technology stack, designed to process audio input and generate corresponding facial animations. A2F integrates both server and client functionalities using gRPC to seamlessly handle data streams within a larger pipeline. This service can operate standalone or be coupled with the A2F Controller for enhanced usability through a bi-directional API.

Communication

Input

The A2F microservice receives its data from a client-streaming RPC. The data is composed of:

  1. an audio stream header containing information about the upcoming audio data, as well as face parameters, post-processing options and blendshape parameters.

  2. audio data, as well as emotion data with a time code indicating when to start applying the emotion.

Currently, only mono 16-bit PCM audio is supported. Arbitrary sample rates are accepted, but we advise using 16 kHz for best performance.
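
As an illustration, a client could run a quick pre-flight check on its audio before streaming it. The sketch below uses Python's standard wave module; the function name is hypothetical and not part of the A2F API.

import wave

def check_wav_for_a2f(path: str) -> None:
    # Illustrative pre-flight check: mono, 16-bit PCM, ideally 16 kHz.
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("A2F expects mono audio")
        if wav.getsampwidth() != 2:  # 2 bytes per sample == 16-bit PCM
            raise ValueError("A2F expects 16-bit PCM samples")
        if wav.getframerate() != 16000:
            print(f"Note: sample rate is {wav.getframerate()} Hz; "
                  "16 kHz is recommended for best performance")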

Face Parameters

The face parameters supported by Audio2Face are:

Parameter            Min     Max    Description
skinStrength         0.0     2.0    Controls the skin’s range of motion.
upperFaceStrength    0.0     2.0    Controls the range of motion of the upper region of the face.
lowerFaceStrength    0.0     2.0    Controls the range of motion of the lower region of the face.
eyelidOpenOffset     -1.0    1.0    Adjusts the default pose of the eyelids (-1.0 means fully closed, 1.0 means fully open).
lipOpenOffset        -0.2    0.2    Adjusts the default pose of the lips (-1.0 means fully closed, 1.0 means fully open).
upperFaceSmoothing   0.0     0.1    Smooths the motions on the upper region of the face.
lowerFaceSmoothing   0.0     0.1    Smooths the motions on the lower region of the face.
faceMaskLevel        0.0     1.0    Determines the boundary between the upper and lower regions of the face.
faceMaskSoftness     0.001   0.5    Determines how smoothly the upper and lower face regions blend at the mask boundary.

Additional parameters may appear occasionally in the configuration files; however, they do not impact the avatar’s facial expressions. Examples of such parameters include blinkStrength, tongueStrength, tongueHeightOffset, and tongueDepthOffset.
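
For illustration, a client might bundle these parameters into the audio stream header as a simple mapping. The values below are examples within the documented ranges; the exact message layout is defined by the gRPC prototypes, so treat this as a sketch rather than the actual API.

# Hypothetical face parameter values, kept inside the documented ranges.
face_parameters = {
    "skinStrength": 1.0,         # 0.0 .. 2.0
    "upperFaceStrength": 1.0,    # 0.0 .. 2.0
    "lowerFaceStrength": 1.0,    # 0.0 .. 2.0
    "eyelidOpenOffset": 0.0,     # -1.0 .. 1.0
    "lipOpenOffset": 0.0,        # -0.2 .. 0.2
    "upperFaceSmoothing": 0.001, # 0.0 .. 0.1
    "lowerFaceSmoothing": 0.001, # 0.0 .. 0.1
    "faceMaskLevel": 0.6,        # 0.0 .. 1.0
    "faceMaskSoftness": 0.0085,  # 0.001 .. 0.5
}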

Output

The A2F microservice outputs its data as a client-streaming RPC. The output data is composed of:

  1. an animation header containing information about the blendshape names, audio output format, etc.

  2. blendshape data with time codes, in sync with the audio data.

A detailed description of the gRPC prototypes can be found in the gRPC prototypes section.
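
As a rough mental model of what a consumer receives, the hypothetical data classes below mirror the two message kinds; they are not the actual gRPC message types.

from dataclasses import dataclass
from typing import List

@dataclass
class AnimationHeader:
    # Hypothetical mirror of the animation header (blendshape names, audio output format, ...).
    blendshape_names: List[str]
    audio_sample_rate: int
    audio_bits_per_sample: int

@dataclass
class BlendshapeFrame:
    # Hypothetical per-frame payload: blendshape weights plus the time code
    # that keeps them in sync with the audio data.
    time_code: float       # seconds from the start of the stream
    weights: List[float]   # one weight per entry in blendshape_names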

Frame Rate

The Audio2Face Microservice performs 30 inferences per second of audio, which means the output data from Audio2Face must be played back at 30 FPS.

However, the processing speed of Audio2Face is not limited to 30 inferences per second of compute. For example, if a stream reports 300 FPS in the Audio2Face logs, it means that 10 seconds (300 frames / 30 FPS) of audio are processed per second of compute.

So, when receiving data from Audio2Face, you need to buffer it and replay it at 30 FPS. The high output rate prevents jitter caused by network instabilities.
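
One way to decouple reception from 30 FPS playback is a simple producer/consumer buffer; the sketch below is illustrative only, and the callback and rendering functions are assumptions.

import queue
import time

FPS = 30
FRAME_PERIOD = 1.0 / FPS

frame_buffer = queue.Queue()

def on_frame_received(frame):
    # Called as fast as A2F delivers frames (possibly hundreds per second).
    frame_buffer.put(frame)

def playback_loop(render_frame):
    # Drains the buffer at a steady 30 FPS, absorbing network jitter.
    while True:
        frame = frame_buffer.get()
        render_frame(frame)
        time.sleep(FRAME_PERIOD)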

Currently in Audio2Face, FPS logs are printed to stdout 2 times per second.

Blendshapes

Audio2Face outputs blendshapes. See the ARKit blendshape documentation for more information.

Audio2Face does not animate head, tongue, or eye movement.

The following blendshape values will always be 0 in the Audio2Face output:

  • EyeLookDownRight

  • EyeLookInRight

  • EyeLookOutRight

  • EyeLookUpRight

  • EyeLookDownLeft

  • EyeLookInLeft

  • EyeLookOutLeft

  • EyeLookUpLeft

  • TongueOut

  • HeadRoll

  • HeadPitch

  • HeadYaw

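Because these channels stay at 0, a downstream consumer that drives eye gaze, tongue, or head rotation from another source can simply overwrite them. The merge helper below is a hypothetical sketch, not part of the A2F output format.

# Blendshape channels that Audio2Face always leaves at 0.
A2F_UNANIMATED_CHANNELS = {
    "EyeLookDownRight", "EyeLookInRight", "EyeLookOutRight", "EyeLookUpRight",
    "EyeLookDownLeft", "EyeLookInLeft", "EyeLookOutLeft", "EyeLookUpLeft",
    "TongueOut", "HeadRoll", "HeadPitch", "HeadYaw",
}

def merge_with_external_animation(a2f_weights: dict, external_weights: dict) -> dict:
    # Keep A2F's facial animation; take eye, tongue and head channels from elsewhere.
    merged = dict(a2f_weights)
    for name in A2F_UNANIMATED_CHANNELS:
        if name in external_weights:
            merged[name] = external_weights[name]
    return merged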

Note

The definition of the blendshape mouthClose deviates from the standard ARKit version. The shape includes the opening of the jaw.

Stream Number

Audio2Face performs batched inference to optimize compute and provide a stable frame rate for all streams. When deploying the microservice you need to provide a stream number. If you deploy with N streams, the A2F pipeline will process the audio data of at most N gRPC clients at the same time. Additional clients will be rejected.

The higher this stream number is, the lower the per-stream FPS will be, as compute is shared among more clients. A higher stream number also increases GPU RAM usage.

E.g.: on an RTX 4090 with the mark_v2.1 fp16 TRT model:

  • a 1-stream Audio2Face Microservice consumes ~2.2GB of GPU RAM

  • a 5-stream Audio2Face Microservice consumes ~3.0GB of GPU RAM

  • a 10-stream Audio2Face Microservice consumes ~4.2GB of GPU RAM

The selected stream number must be adjusted to the number of concurrent clients you expect to serve. If your goal is to serve as many clients as possible, make sure the selected stream number keeps the FPS of all streams above 30. An output FPS lower than 30 will cause stuttering.
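
Assuming you have measured the aggregate inference throughput of your GPU from the Audio2Face FPS logs, a rough capacity check could look like this; the measured value is an input you must supply, not something this snippet obtains.

PLAYBACK_FPS = 30  # output must be replayed at 30 frames per second of audio

def max_safe_stream_number(total_inference_fps: float) -> int:
    # Rough upper bound on the stream number so that every stream stays above 30 FPS.
    # total_inference_fps is the aggregate frame rate the GPU sustains across all streams.
    return max(1, int(total_inference_fps // PLAYBACK_FPS))

# Example: a GPU sustaining ~300 frames per second in aggregate can serve about 10 streams.
assert max_safe_stream_number(300) == 10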

Configuration

a2f_config.yaml
common:
  # Determines:
  # * the maximum number of clients connected at the same time
  # * the batch size for the inference
  # This number must be as close as possible to your use-case
  # If it's too low you won't be able to serve all the clients in parallel
  # If it's too high the performance of the service will degrade
  stream_number: 10
  # Adds 1.5 seconds of silence at the end of the audio clip and resets the emotion to neutral
  # This can be useful for specific use-cases
  # E.g.: if you want to make sure that the mouths of all avatars close and go back to neutral
  # after processing the audio clip
  # However, we recommend not using it, to give more flexibility to clients connecting to the
  # service, as these clients can also take care of sending this silence and neutral emotion
  add-silence-padding-after-audio: false
  # In the current design of the A2F service there are queues of a specific maximum size
  # between the processing nodes:
  # Streammux => queue-after-streammux => A2E => queue-after-a2e => A2F => queue-after-a2f
  # The maximum number of buffers stored in these queues is controlled here.
  # If you are unsure, keep the values from the default config file
  queue-size-after-streammux: 1
  queue-size-after-a2e: 1
  queue-size-after-a2f: 300
  # Maximum size of the IDs provided in the gRPC header of `a2x-interface`
  max-len-uuid: 50
  # Minimum allowed sample rate
  min-sample-rate: 16000
  # Maximum allowed sample rate
  max-sample-rate: 144000

grpc_input:
  # Input port
  port: 50000
  # Minimum audio FPS that clients should provide
  # If the client FPS is too low (client FPS < `low-fps`)
  # for more than `low-fps-max-duration-second` seconds,
  # then the A2F service considers the client faulty and interrupts
  # the connection, as the output streaming quality would be too
  # low.
  low-fps: 29
  low-fps-max-duration-second: 7

grpc_output:
  # Where to connect to send the animation data
  ip: 0.0.0.0
  port: 51000

A2E:
  # Whether to enable A2E
  enabled: true
  # How often to perform A2E inference on the given data
  inference-interval: 10
  # Where the A2E network is located
  model_path: "/opt/nvidia/a2f_pipeline/a2e_data/data/networks/"
  # Post-processing emotion config
  emotions:
    # Increases the spread between emotion values by pushing them higher or lower.
    # Default value: 1
    # Min: 0.3
    # Max: 3
    emotion_contrast: 1.0
    # Coefficient for smoothing emotions over time
    #  0 means no smoothing at all (can be jittery)
    #  1 means extreme smoothing (emotion values not updated over time)
    # Default value: 0.7
    # Min: 0
    # Max: 1
    live_blend_coef: 0.7
    # Sets the strength of the preferred emotions (passed as input) relative to emotions detected by A2E.
    # 0 means only A2E output will be used for emotion rendering.
    # 1 means only the preferred emotions will be used for emotion rendering.
    # Default value: 0.5
    # Min: 0
    # Max: 1
    preferred_emotion_strength: 0.5
    # Activate blending between the preferred emotions (passed as input) and the emotions detected by A2E.
    # Default: True
    enable_preferred_emotion: true
    # Sets the strength of generated emotions relative to neutral emotion.
    # This multiplier is applied globally after the mix of emotion is done.
    # If set to 0, emotion will be neutral.
    # If set to 1, the blend of emotion will be fully used. (can be too intense)
    # Default value: 0.6
    # Min: 0
    # Max: 1
    emotion_strength: 0.6
    # Sets a hard limit on the number of emotion sliders engaged by A2E
    # Emotions with the highest weights are prioritized
    # Default value: 3
    # Min: 1
    # Max: 6
    max_emotions: 3


A2F:
  # A2F model path to use; this is a path internal to the Docker container
  model_path: "/opt/nvidia/a2f_pipeline/a2f_data/data/networks/claire_v1.3"
  # Default multipliers to apply to the blendshape output of A2F
  api:
    bs_weight_multipliers: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
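
For completeness, the sketch below loads and sanity-checks such a file before starting the service. It assumes the PyYAML package and the file name a2f_config.yaml, neither of which is mandated by the service itself.

import yaml  # PyYAML, assumed to be available

with open("a2f_config.yaml") as f:
    config = yaml.safe_load(f)

common = config["common"]
# Basic sanity checks against the constraints described above.
assert common["stream_number"] >= 1, "at least one stream is required"
assert common["min-sample-rate"] <= 16000 <= common["max-sample-rate"], \
    "16 kHz (the recommended rate) should fall within the allowed range"

multipliers = config["A2F"]["api"]["bs_weight_multipliers"]
print(f"{len(multipliers)} blendshape weight multipliers loaded")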