Animation Data Format

The animation data format is used to send avatar pose or animation data between animation-related microservices. It is currently used to send animation data from an animation source (e.g. the Audio2Face microservice) to an animation compositor (e.g. the Animation Graph microservice) and from there to a renderer (e.g. the Omniverse Renderer microservice). The format can hold a single avatar pose frame with the corresponding audio, or an entire animation sequence with corresponding audio. In the future, this format may be extended to support camera poses and other data.

Animation Data Service

The animation data service offers two RPCs: one to push animation data from a client to a server, and one to pull animation data from a server to a client.

nvidia_ace.services.animation_data.v1.proto
syntax = "proto3";

package nvidia_ace.services.animation_data.v1;

import "nvidia_ace.animation_data.v1.proto";
import "nvidia_ace.animation_id.v1.proto";
import "nvidia_ace.status.v1.proto";

// Two RPCs exist to provide a stream of animation data.
// Which RPC to implement depends on whether the service
// is a client or a server.
// E.g. the Animation Graph Microservice implements both RPCs:
// one to receive animation data and one to send it.
service AnimationDataService {
  // When the service creating the animation data is a client of the service
  // receiving it, this push RPC must be used.
  // An example is the Audio2Face Microservice creating animation data and
  // sending it to the Animation Graph Microservice.
  rpc PushAnimationDataStream(stream nvidia_ace.animation_data.v1.AnimationDataStream)
      returns (nvidia_ace.status.v1.Status) {}
  // When the service creating the animation data acts as a server for the
  // service receiving it, this pull RPC must be used.
  // An example is the Omniverse Renderer Microservice requesting animation
  // data from the Animation Graph Microservice.
  rpc PullAnimationDataStream(nvidia_ace.animation_id.v1.AnimationIds)
      returns (stream nvidia_ace.animation_data.v1.AnimationDataStream) {}
}
//nvidia_ace.services.animation_data.v1
//v1.0.0
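
The following is a minimal, hypothetical client sketch for the push direction, written in Python and assuming stubs generated from the .proto files in this document (e.g. with grpcio-tools). The module names (animation_data_pb2, animation_id_pb2, status_pb2, animation_data_service_pb2_grpc), the server address, and the field values are placeholders for illustration, not part of the format.

import time

import grpc

import animation_data_pb2                # generated from nvidia_ace.animation_data.v1.proto
import animation_id_pb2                  # generated from nvidia_ace.animation_id.v1.proto
import status_pb2                        # generated from nvidia_ace.status.v1.proto
import animation_data_service_pb2_grpc   # generated from nvidia_ace.services.animation_data.v1.proto


def generate_stream():
    # The header must be sent as the first message of the stream.
    ids = animation_id_pb2.AnimationIds(
        request_id="8b09637f-737e-488c-872e-e367e058aa15",
        stream_id="17f1fefd-3812-4211-94e8-7af1ef723d7f",
        target_object_id="AceModel",
    )
    header = animation_data_pb2.AnimationDataStreamHeader(
        animation_ids=ids,
        source_service_id="A2F MS",
        start_time_code_since_epoch=time.time(),
    )
    yield animation_data_pb2.AnimationDataStream(animation_data_stream_header=header)

    # One or more animation data messages follow the header.
    yield animation_data_pb2.AnimationDataStream(
        animation_data=animation_data_pb2.AnimationData()
    )

    # A status message closes the stream.
    yield animation_data_pb2.AnimationDataStream(
        status=status_pb2.Status(
            code=status_pb2.Status.Code.SUCCESS, message="Stream complete"
        )
    )


with grpc.insecure_channel("localhost:50051") as channel:
    stub = animation_data_service_pb2_grpc.AnimationDataServiceStub(channel)
    result = stub.PushAnimationDataStream(generate_stream())
    print(result.code, result.message)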

Animation Data Stream

Both RPCs for pushing and pulling animation data implement the same animation data stream protocol. The animation data payload is USDA-inspired, but it includes several changes to improve streaming. For example, the avatar pose data has been separated into a header that is only sent once at the beginning of the stream and one or more animation data chunks that are sent after the header.

nvidia_ace.animation_data.v1.proto
syntax = "proto3";

package nvidia_ace.animation_data.v1;

import "nvidia_ace.animation_id.v1.proto";
import "nvidia_ace.audio.v1.proto";
import "nvidia_ace.status.v1.proto";

import "google/protobuf/any.proto";

// IMPORTANT NOTE: this is an AnimationDataStreamHeader WITH IDs.
// A similar AudioStreamHeader exists in nvidia_ace.controller.v1.proto,
// but that one does NOT contain IDs.
message AnimationDataStreamHeader {
  nvidia_ace.animation_id.v1.AnimationIds animation_ids = 1;

  // This is required to identify from which animation source (e.g. A2F) the
  // request originates. This allows us to map the incoming animation data
  // stream to the correct pose provider animation graph node. The animation
  // source MSs (e.g. A2F MS) should populate this with their name.
  // Example Value: "A2F MS"
  optional string source_service_id = 2;

  // Metadata of the audio buffers. This defines the audio clip properties
  // at the beginning of the streaming process.
  optional nvidia_ace.audio.v1.AudioHeader audio_header = 3;

  // Metadata containing the blendshape and joint names.
  // This defines the names of the blendshapes and joints flowing through a stream.
  optional nvidia_ace.animation_data.v1.SkelAnimationHeader
      skel_animation_header = 4;

  // Animation data streams use time codes (`time_code`) to define the temporal
  // position of audio (e.g. `AudioWithTimeCode`), animation key frames (e.g.
  // `SkelAnimation`), etc. relative to the beginning of the stream. The unit of
  // `time_code` is seconds. In addition, the `AnimationDataStreamHeader` also
  // provides the `start_time_code_since_epoch` field, which defines the
  // absolute start time of the animation data stream. This start time is stored
  // in seconds elapsed since the Unix time epoch.
  double start_time_code_since_epoch = 5;

  // A generic metadata field to attach use case specific data (e.g. session id
  // or user id).
  // map<string, string> metadata = 6;
  // map<string, google.protobuf.Any> metadata = 6;
}

// This message represents a single message of an animation data stream.
message AnimationDataStream {
  oneof stream_part {
    // The header must be sent as the first message.
    AnimationDataStreamHeader animation_data_stream_header = 1;
    // Then one or more animation data messages must be sent.
    nvidia_ace.animation_data.v1.AnimationData animation_data = 2;
    // A status must be sent as the last message and may also be sent in between.
    nvidia_ace.status.v1.Status status = 3;
  }
}

message AnimationData {
  optional SkelAnimation skel_animation = 1;
  optional AudioWithTimeCode audio = 2;
  optional Camera camera = 3;

  // Metadata such as emotion aggregates, etc...
  map<string, google.protobuf.Any> metadata = 4;
}

message AudioWithTimeCode {
  // The time code is relative to the `start_time_code_since_epoch`.
  // Example Value: 0.0 (for the very first audio buffer flowing out of a service)
  double time_code = 1;
  // Audio data in bytes; refer to the audio header for how to interpret
  // these bytes.
  bytes audio_buffer = 2;
}

message SkelAnimationHeader {
  // Names of the blendshapes, only sent once in the header.
  // The position of these names matches the position of the values
  // in the blendshape messages.
  // As an example, if the blendshape names are ["Eye Left", "Eye Right", "Jaw"],
  // then when receiving blendshape data over the streaming process,
  // e.g. [0.1, 0.5, 0.2] & timecode = 0.0,
  // the pairing will be: for timecode=0.0, "Eye Left"=0.1, "Eye Right"=0.5, "Jaw"=0.2
  repeated string blend_shapes = 1;
  // Names of the joints only sent once in the header
  repeated string joints = 2;
}

message SkelAnimation {
  // Time codes must be strictly monotonically increasing.
  // Two successive SkelAnimation messages must not have overlapping time code
  // ranges.
  repeated FloatArrayWithTimeCode blend_shape_weights = 1;
  repeated Float3ArrayWithTimeCode translations = 2;
  repeated QuatFArrayWithTimeCode rotations = 3;
  repeated Float3ArrayWithTimeCode scales = 4;
}

message Camera {
  repeated Float3WithTimeCode position = 1;
  repeated QuatFWithTimeCode rotation = 2;

  repeated FloatWithTimeCode focal_length = 3;
  repeated FloatWithTimeCode focus_distance = 4;
}

message FloatArrayWithTimeCode {
  double time_code = 1;
  repeated float values = 2;
}

message Float3ArrayWithTimeCode {
  double time_code = 1;
  repeated Float3 values = 2;
}

message QuatFArrayWithTimeCode {
  double time_code = 1;
  repeated QuatF values = 2;
}

message Float3WithTimeCode {
  double time_code = 1;
  Float3 value = 2;
}

message QuatFWithTimeCode {
  double time_code = 1;
  QuatF value = 2;
}

message FloatWithTimeCode {
  double time_code = 1;
  float value = 2;
}

message QuatF {
  float real = 1;
  float i = 2;
  float j = 3;
  float k = 4;
}

message Float3 {
  float x = 1;
  float y = 2;
  float z = 3;
}
//nvidia_ace.animation_data.v1
//v1.0.0
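
On the receiving side, the blendshape names from the SkelAnimationHeader are paired by index with the values of each FloatArrayWithTimeCode, and each time_code is relative to start_time_code_since_epoch. The following Python sketch illustrates this pairing; the function and variable names are illustrative only.

def absolute_time(header, time_code):
    # time_code is expressed in seconds since the start of the stream.
    return header.start_time_code_since_epoch + time_code


def blend_shape_frames(header, animation_data):
    # E.g. names = ["Eye Left", "Eye Right", "Jaw"] and frame.values = [0.1, 0.5, 0.2]
    # yields (0.0, {"Eye Left": 0.1, "Eye Right": 0.5, "Jaw": 0.2}).
    names = header.skel_animation_header.blend_shapes
    for frame in animation_data.skel_animation.blend_shape_weights:
        yield frame.time_code, dict(zip(names, frame.values))
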
nvidia_ace.animation_id.v1.proto
syntax = "proto3";

package nvidia_ace.animation_id.v1;

message AnimationIds {

  // This is required to track a single animation source (e.g. A2X) request
  // through the animation pipeline. This is going to allow e.g. the controller
  // to stop a request after it has been sent to the animation compositor (e.g.
  // animation graph).
  // Example Value: "8b09637f-737e-488c-872e-e367e058aa15"
  // Note1: The above value is an example UUID (https://en.wikipedia.org/wiki/Universally_unique_identifier)
  // Note2: You don't need to provide a UUID specifically; any text should work.
  // However, UUIDs are recommended for their low chance of collision.
  string request_id = 1;

  // The stream id is shared across the animation pipeline and identifies all
  // animation data streams that belong to the same stream. Thus, there will be
  // multiple requests all belonging to the same stream. Different user sessions
  // will usually result in a new stream id. This is required for stateful MSs
  // (e.g. anim graph) to map different requests to the same stream.
  // Example Value: "17f1fefd-3812-4211-94e8-7af1ef723d7f"
  // Note1: The above value is an example UUID (https://en.wikipedia.org/wiki/Universally_unique_identifier)
  // Note2: You don't need to provide a UUID specifically; any text should work.
  // However, UUIDs are recommended for their low chance of collision.
  string stream_id = 2;

  // This identifies the target avatar or object the animation data applies to.
  // This is required when there are multiple avatars or objects in the scene.
  // Example Value: "AceModel"
  string target_object_id = 3;
}
//nvidia_ace.animation_id.v1
//v1.0.0
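
As a counterpart to the push example above, the following hypothetical sketch shows the pull direction, e.g. a renderer requesting the animation data stream for a given set of IDs. The module names and the server address are again placeholders.

ids = animation_id_pb2.AnimationIds(
    request_id="8b09637f-737e-488c-872e-e367e058aa15",
    stream_id="17f1fefd-3812-4211-94e8-7af1ef723d7f",
    target_object_id="AceModel",
)

with grpc.insecure_channel("localhost:50051") as channel:
    stub = animation_data_service_pb2_grpc.AnimationDataServiceStub(channel)
    for message in stub.PullAnimationDataStream(ids):
        part = message.WhichOneof("stream_part")
        if part == "animation_data_stream_header":
            header = message.animation_data_stream_header
        elif part == "animation_data":
            pass  # apply blendshape weights and joint transforms, queue audio
        elif part == "status":
            print(message.status.code, message.status.message)
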
nvidia_ace.audio.v1.proto
syntax = "proto3";

package nvidia_ace.audio.v1;

message AudioHeader {
  enum AudioFormat { AUDIO_FORMAT_PCM = 0; }

  // Example value: AUDIO_FORMAT_PCM
  AudioFormat audio_format = 1;

  // Currently, only mono sound is supported.
  // Example value: 1
  uint32 channel_count = 2;

  // Defines the sample rate of the provided audio data
  // Example value: 16000
  uint32 samples_per_second = 3;

  // Currently, only 16 bits per sample are supported.
  // Example value: 16
  uint32 bits_per_sample = 4;
}
//nvidia_ace.audio.v1
//v1.0.0
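
The sketch below shows how a consumer might interpret the audio buffers, assuming the example header values above (PCM, 1 channel, 16000 samples per second, 16 bits per sample) and little-endian sample order; the endianness is an assumption, not something the format specifies.

import struct

def decode_audio(audio_header, audio_with_time_code):
    bytes_per_sample = audio_header.bits_per_sample // 8
    sample_count = len(audio_with_time_code.audio_buffer) // bytes_per_sample
    # 16-bit PCM samples decoded as signed little-endian integers (assumption).
    samples = struct.unpack(f"<{sample_count}h", audio_with_time_code.audio_buffer)
    duration = sample_count / (audio_header.samples_per_second * audio_header.channel_count)
    return samples, duration
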
nvidia_ace.status.v1.proto
syntax = "proto3";

package nvidia_ace.status.v1;

// This status message indicates the result of an operation
// Refer to the rpc using it for more information
message Status {
  enum Code {
    SUCCESS = 0;
    INFO = 1;
    WARNING = 2;
    ERROR = 3;
  }
  // Type of message returned by the service
  // Example value: SUCCESS
  Code code = 1;
  // Message returned by the service
  // Example value: "Audio processing completed successfully!"
  string message = 2;
}
//nvidia_ace.status.v1
//v1.0.0
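
A minimal sketch of how a caller might check the Status returned by PushAnimationDataStream (or carried as the last message of a pulled stream), using the generated Python enum; the function name is illustrative.

def check_status(status):
    # SUCCESS, INFO and WARNING are treated as non-fatal here.
    if status.code == status_pb2.Status.Code.ERROR:
        raise RuntimeError(f"Animation data stream failed: {status.message}")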

Standard Rig

The animation data format is very generic and can technically support various rigs. To increase the interoperability of the animation microservices, they use a standardized biped character rig. The rig’s topology and joint naming are illustrated below.

The standard biped rig and the corresponding joint topology and naming

The face part of the rig is based on the blendshapes defined by Apple’s ARKit.

Warning

The definition of the blendshape mouthClose deviates from the standard ARKit version. The shape includes the opening of the jaw as illustrated below.

The definition of the blendshape mouthClose deviates from the standard ARKit version.

Coordinate Frame

The coordinate frame used in animation data is defined so that the Y-axis points up, the X-axis points to the left (viewed from the avatar), and the Z-axis points to the front of the avatar.
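
For illustration, the basis directions of this frame can be expressed with the Float3 message from the animation data proto (assuming the same generated Python module as in the sketches above):

up      = animation_data_pb2.Float3(x=0.0, y=1.0, z=0.0)  # above the avatar
left    = animation_data_pb2.Float3(x=1.0, y=0.0, z=0.0)  # to the avatar's left
forward = animation_data_pb2.Float3(x=0.0, y=0.0, z=1.0)  # in front of the avatar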