Audio2Face-3D Authoring Microservice

Overview

The Audio2Face-3D (A2F-3D) Authoring microservice is a complementary component of our facial animation technology stack. It lets you iterate on face and emotion parameters for a specified audio clip and see each parameter update on an avatar's face in real time.

Communication

The A2F-3D Authoring microservice answers RPC requests from a client. There are two kinds of requests:

  1. UploadAudioClip:

    This RPC uploads the audio clip to the AuthoringService to be processed. It returns a hash value corresponding to the audio clip and a list of blendshape keys.

  2. GetAvatarFacePose:

    This RPC requests a single animation frame at the specified timecode. It returns a list of blendshape values and emotion values corresponding to that audio timecode and the supplied parameters. To explore what can be authored, review the Protobuf definition for the GetAvatarFacePose method’s input in Protobuf data.
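The two-step flow described above can be sketched with an in-memory stand-in. Everything in this block is hypothetical: the FakeAuthoringService class, its method names and arguments, the use of SHA-256, the blendshape key names, and the returned values are assumptions made for illustration. The actual request and response messages are defined by the service's Protobuf files.

```python
import hashlib


class FakeAuthoringService:
    """Hypothetical in-memory stand-in; NOT the real gRPC service."""

    # Assumed subset of blendshape keys, for illustration only.
    BLENDSHAPE_KEYS = ["JawOpen", "MouthClose"]

    def __init__(self):
        self._clips = {}  # clip hash -> raw audio bytes

    def upload_audio_clip(self, pcm_bytes):
        # Assumption: SHA-256 stands in for whatever hash the service uses.
        clip_hash = hashlib.sha256(pcm_bytes).hexdigest()
        self._clips[clip_hash] = pcm_bytes
        return clip_hash, list(self.BLENDSHAPE_KEYS)

    def get_avatar_face_pose(self, clip_hash, timecode_s, face_params):
        if clip_hash not in self._clips:
            raise KeyError("unknown audio clip hash; upload the clip first")
        # A real service would run inference here; the stand-in returns zeros.
        return {key: 0.0 for key in self.BLENDSHAPE_KEYS}


service = FakeAuthoringService()

# Step 1: upload the clip once and keep the returned hash.
clip_hash, keys = service.upload_audio_clip(b"\x00\x00" * 16000)

# Step 2: request one frame per call, reusing the hash with new parameters.
pose = service.get_avatar_face_pose(clip_hash, 0.5, {"skinStrength": 1.0})
```

The point of the sketch is the shape of the exchange: the clip is sent once, and each later frame request carries only the hash, a timecode, and the parameters being tuned.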

Note

The Audio2Face-3D Authoring microservice processes frames individually for inference. Post-processing parameters such as emotion_contrast and live_blend_coef rely on multiple frames for smoothing, so they do not apply in this context.

Currently, we only support mono 16-bit PCM audio, at an arbitrary sample rate.
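The format constraint above can be checked on the client before upload. The following is a minimal sketch that assumes the clip is stored as a WAV file; the function name check_wav_format is hypothetical. It relies on Python's standard wave module, which only reads uncompressed PCM data.

```python
import io
import wave


def check_wav_format(wav_file):
    """Return the sample rate if the WAV is mono 16-bit PCM, else raise ValueError."""
    with wave.open(wav_file, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError(f"expected mono audio, got {wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:
            raise ValueError(f"expected 16-bit samples, got {8 * wav.getsampwidth()}-bit")
        # Any sample rate is accepted, so it is only reported back.
        return wav.getframerate()


# Build a tiny in-memory mono 16-bit clip to exercise the check.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(22050)
    out.writeframes(b"\x00\x00" * 100)
buffer.seek(0)

sample_rate = check_wav_format(buffer)
```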

Face Parameters

Audio2Face-3D Authoring supports the following face parameters:

| Parameter | Min | Max | Description |
|---|---|---|---|
| skinStrength | 0.0 | 2.0 | Controls the skin’s range of motion. |
| upperFaceStrength | 0.0 | 2.0 | Controls the range of the motion of the upper region of the face. |
| lowerFaceStrength | 0.0 | 2.0 | Controls the range of the motion of the lower region of the face. |
| eyelidOpenOffset | -1.0 | 1.0 | Adjusts the default pose of the eyelid (-1.0 means fully closed. 1.0 means fully open.) |
| lipOpenOffset | -0.2 | 0.2 | Adjusts the default pose of lip (-1.0 means fully closed. 1.0 means fully open.) |
| upperFaceSmoothing | 0.0 | 0.1 | Smooths the motions on the upper region of the face. |
| lowerFaceSmoothing | 0.0 | 0.1 | Smooths the motions on the lower region of the face. |
| faceMaskLevel | 0.0 | 1.0 | Determines the boundary between the upper and lower region of the face. |
| faceMaskSoftness | 0.001 | 0.5 | Determines how smoothly the upper and lower face regions blend on the mask boundary. |

Additional parameters may appear occasionally in the configuration files; however, they do not impact the avatar’s facial expressions. Examples of such parameters include blinkStrength, tongueStrength, tongueHeightOffset, and tongueDepthOffset.
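The documented ranges can be applied on the client before a request is sent. This is a minimal sketch: the ranges are copied from the face parameter list above, while the function name clamp_face_parameters and the choice to clamp rather than reject are assumptions. How the service itself treats out-of-range values is not stated here.

```python
# (min, max) per parameter, copied from the documented ranges.
FACE_PARAMETER_RANGES = {
    "skinStrength": (0.0, 2.0),
    "upperFaceStrength": (0.0, 2.0),
    "lowerFaceStrength": (0.0, 2.0),
    "eyelidOpenOffset": (-1.0, 1.0),
    "lipOpenOffset": (-0.2, 0.2),
    "upperFaceSmoothing": (0.0, 0.1),
    "lowerFaceSmoothing": (0.0, 0.1),
    "faceMaskLevel": (0.0, 1.0),
    "faceMaskSoftness": (0.001, 0.5),
}


def clamp_face_parameters(params):
    """Clamp each value into its documented range; reject unknown names."""
    clamped = {}
    for name, value in params.items():
        if name not in FACE_PARAMETER_RANGES:
            raise KeyError(f"unsupported face parameter: {name}")
        low, high = FACE_PARAMETER_RANGES[name]
        clamped[name] = min(max(value, low), high)
    return clamped


clamped = clamp_face_parameters(
    {"skinStrength": 3.0, "lipOpenOffset": -0.5, "faceMaskLevel": 0.6}
)
```

Here skinStrength and lipOpenOffset are pulled back to 2.0 and -0.2, and faceMaskLevel is left unchanged.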

Note

The Audio2Face-3D Authoring microservice processes frames individually for inference. Parameters like upperFaceSmoothing and lowerFaceSmoothing rely on multiple frames for smoothing, making them inapplicable in this context.

Blendshapes

Audio2Face-3D Authoring outputs blendshapes. See the ARKit blendShape documentation for more information.

Audio2Face-3D does not animate head, tongue, or eye movement.

The following blendshape values are always 0 in the Audio2Face-3D Authoring output:

  • EyeLookDownRight

  • EyeLookInRight

  • EyeLookOutRight

  • EyeLookUpRight

  • EyeLookDownLeft

  • EyeLookInLeft

  • EyeLookOutLeft

  • EyeLookUpLeft

  • TongueOut

  • HeadRoll_deprecated

  • HeadPitch_deprecated

  • HeadYaw_deprecated
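A client that drives eye, tongue, or head channels from another source may want to separate these always-zero entries from the animated ones. The sketch below assumes a pose is available as a name-to-value dictionary; that representation, the function name split_pose, and the sample key JawOpen are illustrative assumptions. The set of names is copied from the list above.

```python
ALWAYS_ZERO_BLENDSHAPES = frozenset({
    "EyeLookDownRight", "EyeLookInRight", "EyeLookOutRight", "EyeLookUpRight",
    "EyeLookDownLeft", "EyeLookInLeft", "EyeLookOutLeft", "EyeLookUpLeft",
    "TongueOut",
    "HeadRoll_deprecated", "HeadPitch_deprecated", "HeadYaw_deprecated",
})


def split_pose(pose):
    """Split a {name: value} pose into animated and always-zero entries."""
    animated = {k: v for k, v in pose.items() if k not in ALWAYS_ZERO_BLENDSHAPES}
    static = {k: v for k, v in pose.items() if k in ALWAYS_ZERO_BLENDSHAPES}
    return animated, static


# Illustrative frame; the values are made up, not service output.
frame = {"JawOpen": 0.42, "TongueOut": 0.0, "EyeLookUpLeft": 0.0}
animated, static = split_pose(frame)
```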

Note

The definition of the blendshape mouthClose deviates from the standard ARKit version. The shape includes the opening of the jaw.

Batch size

Audio2Face-3D Authoring performs batched inference to optimize compute and serve multiple users at the same time. When deploying the microservice, you can change the batch size in the configuration file.

The higher this batch size is:

  • the better the overall throughput will be

  • the higher the latency will be

  • the higher GPU RAM usage will be

The overall throughput is limited by the GPU's processing power.
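The trade-off above can be illustrated with a toy cost model. This is not a measurement of the service: the linear model (a fixed per-batch overhead plus a per-frame cost) and both constants are assumptions chosen only to show the trend, and GPU memory is not modeled.

```python
def batch_metrics(batch_size, overhead_ms=5.0, per_frame_ms=1.0):
    """Toy model: return (latency in ms, throughput in frames per second)."""
    # Assumption: one batch costs a fixed overhead plus a per-frame cost.
    batch_time_ms = overhead_ms + per_frame_ms * batch_size
    throughput_fps = 1000.0 * batch_size / batch_time_ms
    return batch_time_ms, throughput_fps


results = {size: batch_metrics(size) for size in (1, 10, 100)}
for size, (latency_ms, throughput_fps) in results.items():
    print(f"batch={size:3d}  latency={latency_ms:6.1f} ms  throughput={throughput_fps:6.1f} fps")
```

Under these assumptions, latency grows with the batch size while throughput rises toward a ceiling of 1000 / per_frame_ms frames per second, which plays the role of the GPU limit in this model.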

Configuration

The following configuration files are used for the A2F-3D Authoring microservice.

Note

The clib_db_ttl_refresh_on_use and clib_db_ttl_check_interval_seconds options contain a typo in their names (clib instead of clip). This will be corrected in upcoming versions.

For James:

james_v2.3-config.json
{
  "a2e_batch_size": 10,
  "a2e_config_path": "/app/configs/a2e-config.json",
  "a2f_batch_size": 10,
  "a2f_config_path": "/app/configs/james_v2.3-proc-config.json",
  "endpoint": "0.0.0.0:50051",
  "clip_db_ttl_seconds": 3600,
  "clip_db_max_size_bytes": 10737418240,
  "clib_db_ttl_refresh_on_use": false,
  "clib_db_ttl_check_interval_seconds": 60
}

For Mark:

mark_v2.3-config.json
{
  "a2e_batch_size": 10,
  "a2e_config_path": "/app/configs/a2e-config.json",
  "a2f_batch_size": 10,
  "a2f_config_path": "/app/configs/mark_v2.3-proc-config.json",
  "endpoint": "0.0.0.0:50051",
  "clip_db_ttl_seconds": 3600,
  "clip_db_max_size_bytes": 10737418240,
  "clib_db_ttl_refresh_on_use": false,
  "clib_db_ttl_check_interval_seconds": 60
}

For Claire:

claire_v2.3-config.json
{
  "a2e_batch_size": 10,
  "a2e_config_path": "/app/configs/a2e-config.json",
  "a2f_batch_size": 10,
  "a2f_config_path": "/app/configs/claire_v2.3-proc-config.json",
  "endpoint": "0.0.0.0:50051",
  "clip_db_ttl_seconds": 3600,
  "clip_db_max_size_bytes": 10737418240,
  "clib_db_ttl_refresh_on_use": false,
  "clib_db_ttl_check_interval_seconds": 60
}

We recommend changing only the following parameters in the config file, and only if you need to:

  • a2e_batch_size: Set this to a value between 1 and 256; we recommend 10.

  • a2f_batch_size: Set this to a value between 1 and 256; we recommend 10.

  • endpoint: You can change the port from 50051 to another port if needed. Default is "0.0.0.0:50051".

  • clip_db_max_size_bytes: Maximum size, in bytes, of the audio storage. Default is 10737418240 bytes (10 GiB).

  • clip_db_ttl_seconds: Time, in seconds, that an audio file remains on the server. Default is 3600 seconds (1 hour).

  • clib_db_ttl_refresh_on_use: Indicates whether the TTL countdown for an audio file resets upon use (e.g., during a service request). Default is false.

  • clib_db_ttl_check_interval_seconds: The interval, in seconds, at which the system checks for expired TTLs. Default is 60 seconds (1 minute).
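A configuration file can be sanity-checked against the recommendations above before deployment. This is a minimal sketch: the function names are hypothetical, the 1 to 256 batch-size bounds come from the list above, and the remaining checks (a host:port endpoint, positive integers for sizes and durations) are assumptions rather than documented rules. The key names, including the clib spelling, are kept as they appear in the shipped files.

```python
import json

# The James configuration, embedded so the sketch is self-contained.
CONFIG_TEXT = """
{
  "a2e_batch_size": 10,
  "a2e_config_path": "/app/configs/a2e-config.json",
  "a2f_batch_size": 10,
  "a2f_config_path": "/app/configs/james_v2.3-proc-config.json",
  "endpoint": "0.0.0.0:50051",
  "clip_db_ttl_seconds": 3600,
  "clip_db_max_size_bytes": 10737418240,
  "clib_db_ttl_refresh_on_use": false,
  "clib_db_ttl_check_interval_seconds": 60
}
"""


def is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_config(config):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    for key in ("a2e_batch_size", "a2f_batch_size"):
        value = config.get(key)
        if not is_int(value) or not 1 <= value <= 256:
            problems.append(f"{key} must be an integer between 1 and 256")
    # Assumption: the endpoint has the form "host:port".
    host, _, port = str(config.get("endpoint", "")).rpartition(":")
    if not host or not port.isdigit() or not 1 <= int(port) <= 65535:
        problems.append("endpoint must look like host:port")
    # Assumption: sizes and durations are positive integers.
    for key in ("clip_db_ttl_seconds", "clip_db_max_size_bytes",
                "clib_db_ttl_check_interval_seconds"):
        value = config.get(key)
        if not is_int(value) or value <= 0:
            problems.append(f"{key} must be a positive integer")
    if not isinstance(config.get("clib_db_ttl_refresh_on_use"), bool):
        problems.append("clib_db_ttl_refresh_on_use must be true or false")
    return problems


config = json.loads(CONFIG_TEXT)
problems = validate_config(config)
max_size_gib = config["clip_db_max_size_bytes"] / 2**30  # bytes to GiB
```

For the default file, no problems are reported, and the storage limit converts to 10 GiB, which matches the documented default.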

For more information about the audio clip storage see this page.
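The interaction between the three TTL options can be sketched with a small in-memory model. This is a hypothetical illustration, not the service's implementation: the ClipStore class and its methods are invented for the example, time is passed in explicitly as seconds, the size limit is not modeled, and the assumption that an expired clip stays usable until the next periodic check is a guess.

```python
class ClipStore:
    """Hypothetical TTL store; not the service's implementation."""

    def __init__(self, ttl_seconds=3600, refresh_on_use=False):
        self.ttl_seconds = ttl_seconds
        self.refresh_on_use = refresh_on_use
        self._expiry = {}  # clip hash -> expiry time in seconds

    def add(self, clip_hash, now):
        self._expiry[clip_hash] = now + self.ttl_seconds

    def use(self, clip_hash, now):
        """Return True if the clip is still stored, refreshing its TTL if enabled."""
        if clip_hash not in self._expiry:
            return False
        if self.refresh_on_use:
            self._expiry[clip_hash] = now + self.ttl_seconds
        return True

    def sweep(self, now):
        """Periodic check: drop and return every clip whose TTL has elapsed."""
        expired = [h for h, t in self._expiry.items() if t <= now]
        for clip_hash in expired:
            del self._expiry[clip_hash]
        return expired


# Same timeline with refresh-on-use off and on: add at t=0, use at t=3000.
fixed = ClipStore(ttl_seconds=3600, refresh_on_use=False)
fixed.add("abc", now=0)
fixed.use("abc", now=3000)
fixed_expired = fixed.sweep(now=3600)

sliding = ClipStore(ttl_seconds=3600, refresh_on_use=True)
sliding.add("abc", now=0)
sliding.use("abc", now=3000)
sliding_expired_early = sliding.sweep(now=3600)
sliding_expired_late = sliding.sweep(now=6600)
```

With refresh-on-use disabled, the clip expires 3600 seconds after upload regardless of use. With it enabled, the use at 3000 seconds moves the expiry to 6600 seconds in this model.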