Animation Data Format

The animation data format is used to send avatar pose or animation data between animation-related microservices. It is currently used to send animation data from an animation source (e.g. the Audio2Face microservice) to an animation compositor (e.g. the Animation Graph microservice) and from there to a renderer (e.g. the Omniverse Renderer microservice). The format can hold a single avatar pose frame with the corresponding audio, or an entire animation sequence with corresponding audio. In the future, this format may be extended to support camera poses and other data.

Animation Data Service

The animation data service offers two RPCs: one to push animation data from a client to a server, and one to pull animation data from a server to a client.

nvidia_ace.services.animation_data.v1.proto
syntax = "proto3";

package nvidia_ace.services.animation_data.v1;

import "nvidia_ace.animation_data.v1.proto";
import "nvidia_ace.animation_id.v1.proto";
import "nvidia_ace.status.v1.proto";

// Two RPCs exist to provide a stream of animation data.
// Which RPC to implement depends on whether the service
// is a client or a server.
// E.g. the Animation Graph Microservice implements both RPCs:
// one to receive animation data and one to send it.
service AnimationDataService {
  // When the service creating the animation data is a client of the service
  // receiving it, this push RPC must be used.
  // An example is the Audio2Face Microservice creating animation data and
  // sending it to the Animation Graph Microservice.
  rpc PushAnimationDataStream(stream nvidia_ace.animation_data.v1.AnimationDataStream)
      returns (nvidia_ace.status.v1.Status) {}
  // When the service creating the animation data acts as a server for the
  // service receiving it, this pull RPC must be used.
  // An example is the Omniverse Renderer Microservice requesting animation
  // data from the Animation Graph Microservice.
  rpc PullAnimationDataStream(nvidia_ace.animation_id.v1.AnimationIds)
      returns (stream nvidia_ace.animation_data.v1.AnimationDataStream) {}
}
//nvidia_ace.services.animation_data.v1
//v1.0.0
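
The following is a minimal, hypothetical client sketch for the push direction, written in Python and assuming stubs generated from the .proto files in this document (e.g. with grpcio-tools). The module names (animation_data_pb2, animation_id_pb2, status_pb2, animation_data_service_pb2_grpc), the server address, and the field values are placeholders for illustration, not part of the format.

import time

import grpc

import animation_data_pb2                # generated from nvidia_ace.animation_data.v1.proto
import animation_id_pb2                  # generated from nvidia_ace.animation_id.v1.proto
import status_pb2                        # generated from nvidia_ace.status.v1.proto
import animation_data_service_pb2_grpc   # generated from nvidia_ace.services.animation_data.v1.proto


def generate_stream():
    # The header must be sent as the first message of the stream.
    ids = animation_id_pb2.AnimationIds(
        request_id="8b09637f-737e-488c-872e-e367e058aa15",
        stream_id="17f1fefd-3812-4211-94e8-7af1ef723d7f",
        target_object_id="AceModel",
    )
    header = animation_data_pb2.AnimationDataStreamHeader(
        animation_ids=ids,
        source_service_id="A2F MS",
        start_time_code_since_epoch=time.time(),
    )
    yield animation_data_pb2.AnimationDataStream(animation_data_stream_header=header)

    # One or more animation data messages follow the header.
    yield animation_data_pb2.AnimationDataStream(
        animation_data=animation_data_pb2.AnimationData()
    )

    # A status message closes the stream.
    yield animation_data_pb2.AnimationDataStream(
        status=status_pb2.Status(
            code=status_pb2.Status.Code.SUCCESS, message="Stream complete"
        )
    )


with grpc.insecure_channel("localhost:50051") as channel:
    stub = animation_data_service_pb2_grpc.AnimationDataServiceStub(channel)
    result = stub.PushAnimationDataStream(generate_stream())
    print(result.code, result.message)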

Animation Data Stream

Both RPCs for pushing and pulling animation data implement the same animation data stream protocol. The animation data payload is USDA-inspired, but it includes several changes to improve streaming. For example, the avatar pose data has been separated into a header that is only sent once at the beginning of the stream and one or more animation data chunks that are sent after the header.

nvidia_ace.animation_data.v1.proto
syntax = "proto3";

package nvidia_ace.animation_data.v1;

import "nvidia_ace.animation_id.v1.proto";
import "nvidia_ace.audio.v1.proto";
import "nvidia_ace.status.v1.proto";

import "google/protobuf/any.proto";

// IMPORTANT NOTE: this is an AnimationDataStreamHeader WITH IDs.
// A similar AudioStreamHeader exists in nvidia_ace.controller.v1.proto,
// but that one does NOT contain IDs.
message AnimationDataStreamHeader {
  nvidia_ace.animation_id.v1.AnimationIds animation_ids = 1;

  // This is required to identify from which animation source (e.g. A2F) the
  // request originates. This allows us to map the incoming animation data
  // stream to the correct pose provider animation graph node. The animation
  // source MSs (e.g. A2F MS) should populate this with their name.
  // Example Value: "A2F MS"
  optional string source_service_id = 2;

  // Metadata of the audio buffers. This defines the audio clip properties
  // at the beginning of the streaming process.
  optional nvidia_ace.audio.v1.AudioHeader audio_header = 3;

  // Metadata containing the blendshape and joint names.
  // This defines the names of the blendshapes and joints flowing through a stream.
  optional nvidia_ace.animation_data.v1.SkelAnimationHeader
      skel_animation_header = 4;

  // Animation data streams use time codes (`time_code`) to define the temporal
  // position of audio (e.g. `AudioWithTimeCode`), animation key frames (e.g.
  // `SkelAnimation`), etc. relative to the beginning of the stream. The unit of
  // `time_code` is seconds. In addition, the `AnimationDataStreamHeader` also
  // provides the `start_time_code_since_epoch` field, which defines the
  // absolute start time of the animation data stream. This start time is stored
  // in seconds elapsed since the Unix time epoch.
  double start_time_code_since_epoch = 5;

  // A generic metadata field to attach use case specific data (e.g. session id
  // or user id).
  // map<string, string> metadata = 6;
  // map<string, google.protobuf.Any> metadata = 6;
}

// This message represents a single message of an animation data stream.
message AnimationDataStream {
  oneof stream_part {
    // The header must be sent as the first message.
    AnimationDataStreamHeader animation_data_stream_header = 1;
    // Then one or more animation data messages must be sent.
    nvidia_ace.animation_data.v1.AnimationData animation_data = 2;
    // A status must be sent as the last message and may also be sent in between.
    nvidia_ace.status.v1.Status status = 3;
  }
}

message AnimationData {
  optional SkelAnimation skel_animation = 1;
  optional AudioWithTimeCode audio = 2;
  optional Camera camera = 3;

  // Metadata such as emotion aggregates, etc...
  map<string, google.protobuf.Any> metadata = 4;
}

message AudioWithTimeCode {
  // The time code is relative to the `start_time_code_since_epoch`.
  // Example Value: 0.0 (for the very first audio buffer flowing out of a service)
  double time_code = 1;
  // Audio data in bytes; refer to the audio header for how to interpret
  // these bytes.
  bytes audio_buffer = 2;
}

message SkelAnimationHeader {
  // Names of the blendshapes, only sent once in the header.
  // The position of these names matches the position of the values
  // in the blendshape messages.
  // As an example, if the blendshape names are ["Eye Left", "Eye Right", "Jaw"],
  // then when receiving blendshape data over the streaming process,
  // e.g. [0.1, 0.5, 0.2] & timecode = 0.0,
  // the pairing will be: for timecode=0.0, "Eye Left"=0.1, "Eye Right"=0.5, "Jaw"=0.2
  repeated string blend_shapes = 1;
  // Names of the joints only sent once in the header
  repeated string joints = 2;
}

message SkelAnimation {
  // Time codes must be strictly monotonically increasing.
  // Two successive SkelAnimation messages must not have overlapping time code
  // ranges.
  repeated FloatArrayWithTimeCode blend_shape_weights = 1;
  repeated Float3ArrayWithTimeCode translations = 2;
  repeated QuatFArrayWithTimeCode rotations = 3;
  repeated Float3ArrayWithTimeCode scales = 4;
}

message Camera {
  repeated Float3WithTimeCode position = 1;
  repeated QuatFWithTimeCode rotation = 2;

  repeated FloatWithTimeCode focal_length = 3;
  repeated FloatWithTimeCode focus_distance = 4;
}

message FloatArrayWithTimeCode {
  double time_code = 1;
  repeated float values = 2;
}

message Float3ArrayWithTimeCode {
  double time_code = 1;
  repeated Float3 values = 2;
}

message QuatFArrayWithTimeCode {
  double time_code = 1;
  repeated QuatF values = 2;
}

message Float3WithTimeCode {
  double time_code = 1;
  Float3 value = 2;
}

message QuatFWithTimeCode {
  double time_code = 1;
  QuatF value = 2;
}

message FloatWithTimeCode {
  double time_code = 1;
  float value = 2;
}

message QuatF {
  float real = 1;
  float i = 2;
  float j = 3;
  float k = 4;
}

message Float3 {
  float x = 1;
  float y = 2;
  float z = 3;
}
//nvidia_ace.animation_data.v1
//v1.0.0
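
On the receiving side, the blendshape names from the SkelAnimationHeader are paired by index with the values of each FloatArrayWithTimeCode, and each time_code is relative to start_time_code_since_epoch. The following Python sketch illustrates this pairing; the function and variable names are illustrative only.

def absolute_time(header, time_code):
    # time_code is expressed in seconds since the start of the stream.
    return header.start_time_code_since_epoch + time_code


def blend_shape_frames(header, animation_data):
    # E.g. names = ["Eye Left", "Eye Right", "Jaw"] and frame.values = [0.1, 0.5, 0.2]
    # yields (0.0, {"Eye Left": 0.1, "Eye Right": 0.5, "Jaw": 0.2}).
    names = header.skel_animation_header.blend_shapes
    for frame in animation_data.skel_animation.blend_shape_weights:
        yield frame.time_code, dict(zip(names, frame.values))
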
nvidia_ace.animation_id.v1.proto
syntax = "proto3";

package nvidia_ace.animation_id.v1;

message AnimationIds {

  // This is required to track a single animation source (e.g. A2X) request
  // through the animation pipeline. This is going to allow e.g. the controller
  // to stop a request after it has been sent to the animation compositor (e.g.
  // animation graph).
  // Example Value: "8b09637f-737e-488c-872e-e367e058aa15"
  // Note1: The above value is an example UUID (https://en.wikipedia.org/wiki/Universally_unique_identifier)
  // Note2: You don't need to provide a UUID specifically; any text should work.
  // However, UUIDs are recommended for their low chance of collision.
  string request_id = 1;

  // The stream id is shared across the animation pipeline and identifies all
  // animation data streams that belong to the same stream. Thus, there will be
  // multiple requests all belonging to the same stream. Different user sessions
  // will usually result in a new stream id. This is required for stateful MSs
  // (e.g. anim graph) to map different requests to the same stream.
  // Example Value: "17f1fefd-3812-4211-94e8-7af1ef723d7f"
  // Note1: The above value is an example UUID (https://en.wikipedia.org/wiki/Universally_unique_identifier)
  // Note2: You don't need to provide a UUID specifically; any text should work.
  // However, UUIDs are recommended for their low chance of collision.
  string stream_id = 2;

  // This identifies the target avatar or object the animation data applies to.
  // This is required when there are multiple avatars or objects in the scene.
  // Example Value: "AceModel"
  string target_object_id = 3;
}
//nvidia_ace.animation_id.v1
//v1.0.0
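
As a counterpart to the push example above, the following hypothetical sketch shows the pull direction, e.g. a renderer requesting the animation data stream for a given set of IDs. The module names and the server address are again placeholders.

ids = animation_id_pb2.AnimationIds(
    request_id="8b09637f-737e-488c-872e-e367e058aa15",
    stream_id="17f1fefd-3812-4211-94e8-7af1ef723d7f",
    target_object_id="AceModel",
)

with grpc.insecure_channel("localhost:50051") as channel:
    stub = animation_data_service_pb2_grpc.AnimationDataServiceStub(channel)
    for message in stub.PullAnimationDataStream(ids):
        part = message.WhichOneof("stream_part")
        if part == "animation_data_stream_header":
            header = message.animation_data_stream_header
        elif part == "animation_data":
            pass  # apply blendshape weights and joint transforms, queue audio
        elif part == "status":
            print(message.status.code, message.status.message)
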
nvidia_ace.audio.v1.proto
syntax = "proto3";

package nvidia_ace.audio.v1;

message AudioHeader {
  enum AudioFormat { AUDIO_FORMAT_PCM = 0; }

  // Example value: AUDIO_FORMAT_PCM
  AudioFormat audio_format = 1;

  // Currently, only mono sound is supported.
  // Example value: 1
  uint32 channel_count = 2;

  // Defines the sample rate of the provided audio data
  // Example value: 16000
  uint32 samples_per_second = 3;

  // Currently, only 16 bits per sample are supported.
  // Example value: 16
  uint32 bits_per_sample = 4;
}
//nvidia_ace.audio.v1
//v1.0.0
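
The sketch below shows how a consumer might interpret the audio buffers, assuming the example header values above (PCM, 1 channel, 16000 samples per second, 16 bits per sample) and little-endian sample order; the endianness is an assumption, not something the format specifies.

import struct

def decode_audio(audio_header, audio_with_time_code):
    bytes_per_sample = audio_header.bits_per_sample // 8
    sample_count = len(audio_with_time_code.audio_buffer) // bytes_per_sample
    # 16-bit PCM samples decoded as signed little-endian integers (assumption).
    samples = struct.unpack(f"<{sample_count}h", audio_with_time_code.audio_buffer)
    duration = sample_count / (audio_header.samples_per_second * audio_header.channel_count)
    return samples, duration
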
nvidia_ace.status.v1.proto
syntax = "proto3";

package nvidia_ace.status.v1;

// This status message indicates the result of an operation
// Refer to the rpc using it for more information
message Status {
  enum Code {
    SUCCESS = 0;
    INFO = 1;
    WARNING = 2;
    ERROR = 3;
  }
  // Type of message returned by the service
  // Example value: SUCCESS
  Code code = 1;
  // Message returned by the service
  // Example value: "Audio processing completed successfully!"
  string message = 2;
}
//nvidia_ace.status.v1
//v1.0.0
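
A minimal sketch of how a caller might check the Status returned by PushAnimationDataStream (or carried as the last message of a pulled stream), using the generated Python enum; the function name is illustrative.

def check_status(status):
    # SUCCESS, INFO and WARNING are treated as non-fatal here.
    if status.code == status_pb2.Status.Code.ERROR:
        raise RuntimeError(f"Animation data stream failed: {status.message}")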

Standard Rig

The animation data format is very generic and can technically support various rigs. To increase the interoperability of the animation microservices, they use a standardized biped character rig. The rig’s topology and joint naming are illustrated below.

The standard biped rig and the corresponding joint topology and naming

The face part of the rig is based on the blendshapes defined by Apple’s ARKit.

Warning

The definition of the blendshape mouthClose deviates from the standard ARKit version. The shape includes the opening of the jaw as illustrated below.

The definition of the blendshape mouthClose deviates from the standard ARKit version.

Coordinate Frame

The coordinate frame used in animation data is defined so that the Y-axis points up, the X-axis points to the left (viewed from the avatar), and the Z-axis points to the front of the avatar.
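
For illustration, the basis directions of this frame can be expressed with the Float3 message from the animation data proto (assuming the same generated Python module as in the sketches above):

up      = animation_data_pb2.Float3(x=0.0, y=1.0, z=0.0)  # above the avatar
left    = animation_data_pb2.Float3(x=1.0, y=0.0, z=0.0)  # to the avatar's left
forward = animation_data_pb2.Float3(x=0.0, y=0.0, z=1.0)  # in front of the avatar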