Migrating from EA3

Audio2Face Controller

Service Interface

The old interface exposed the following proto:

service A2XServiceInterface {
  rpc ConvertAudioToAnimData(stream A2XAudioStream) returns (stream A2XAnimDataStream) {}
}

This has been renamed to:

service A2FControllerService {
  rpc ProcessAudioStream(stream nvidia_ace.controller.v1.AudioStream)
      returns (stream nvidia_ace.controller.v1.AnimationDataStream) {}
}

It is still a bidirectional stream, so not much changes in that respect.
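
As a rough sketch, assuming Python bindings generated from the new protos (the module, stub and address names below are placeholders and depend on your build), the new call looks like this:

import grpc

# Placeholder module names: adjust to whatever your protoc / grpcio-tools run
# generates for the nvidia_ace protos.
import a2f_controller_pb2_grpc as controller_grpc
import a2f_controller_pb2 as controller_pb2


def request_generator():
    # Yield AudioStream messages in order: one audio_stream_header, then
    # audio_with_emotion chunks, then one end_of_audio marker.
    # The following sections show how each part is built.
    yield from []  # replace with real AudioStream messages


with grpc.insecure_channel("localhost:52000") as channel:  # example address
    stub = controller_grpc.A2FControllerServiceStub(channel)
    # ProcessAudioStream is bidirectional: it consumes an iterator of
    # AudioStream messages and returns an iterator of AnimationDataStream.
    for animation_data in stub.ProcessAudioStream(request_generator()):
        print(animation_data)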

Audio data

Old protobuf audio data serialization:

message A2XAudioStream {
  bytes audio_chunk = 1;
  map<string, float> emotion_map = 2;
  string posture_var = 3;
  PacketType type = 4;
}

The A2XAudioStream message has been replaced by the AudioStream message:

message AudioStream {
  message EndOfAudio {}

  oneof stream_part {
    AudioStreamHeader audio_stream_header = 1;
    nvidia_ace.a2f.v1.AudioWithEmotion audio_with_emotion = 2;
    EndOfAudio end_of_audio = 3;
  }
}

Now, instead of a PacketType, you send either AudioStreamHeader, AudioWithEmotion, or EndOfAudio content.
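
As a minimal sketch, assuming Python bindings (the module name below is a placeholder), each of the three variants is selected simply by setting the corresponding oneof field:

# Placeholder module name for the generated nvidia_ace.controller.v1 messages.
import a2f_controller_pb2 as controller_pb2

# First message of the stream: the header (fields detailed in the next sections).
header_msg = controller_pb2.AudioStream()
header_msg.audio_stream_header.SetInParent()

# Subsequent messages: audio buffers, optionally with emotions.
audio_msg = controller_pb2.AudioStream()
audio_msg.audio_with_emotion.audio_buffer = b"\x00\x00" * 160

# Last message: the empty end-of-audio marker.
end_msg = controller_pb2.AudioStream()
end_msg.end_of_audio.SetInParent()  # marks the empty oneof field as set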

Audio header

The PacketType.BEGIN has been replaced by a proper AudioHeader, which has the following fields:

message AudioHeader {
  enum AudioFormat { AUDIO_FORMAT_PCM = 0; }

  AudioFormat audio_format = 1;

  // Currently only mono sound must be supported.
  uint32 channel_count = 2;

  // Defines the sample rate of the provided audio data
  uint32 samples_per_second = 3;

  // Currently only 16 bits per sample must be supported.
  uint32 bits_per_sample = 4;
}

You need to fill in these fields according to the specifics of your audio content. Note that only 16-bit mono PCM audio is currently supported.
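
For example, a header for 16-bit mono PCM sampled at 16 kHz could be filled as follows, assuming Python bindings (the module name is a placeholder and 16 kHz is only an example rate):

# Placeholder module name for the generated nvidia_ace.audio.v1 messages.
import nvidia_ace_audio_pb2 as audio_pb2

audio_header = audio_pb2.AudioHeader()
audio_header.audio_format = audio_pb2.AudioHeader.AUDIO_FORMAT_PCM
audio_header.channel_count = 1           # only mono is supported
audio_header.samples_per_second = 16000  # sample rate of your audio data
audio_header.bits_per_sample = 16        # only 16 bits per sample is supported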

The AudioStreamHeader adds further fields that parametrize the output of A2F:

message AudioStreamHeader {
  nvidia_ace.audio.v1.AudioHeader audio_header = 1;

  // New additional parameters, see documentation.
  nvidia_ace.a2f.v1.FaceParameters face_params = 2;
  nvidia_ace.a2f.v1.EmotionPostProcessingParameters emotion_post_processing_params = 3;
  nvidia_ace.a2f.v1.BlendShapeParameters blendshape_params = 4;
}
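
A minimal sketch of building this header in Python, reusing the AudioHeader fields above and leaving the A2F parameters at their defaults (module names are placeholders; see the documentation for the tunable parameter fields):

# Placeholder module names for the generated controller and audio messages.
import a2f_controller_pb2 as controller_pb2
import nvidia_ace_audio_pb2 as audio_pb2

stream_header = controller_pb2.AudioStreamHeader()
stream_header.audio_header.audio_format = audio_pb2.AudioHeader.AUDIO_FORMAT_PCM
stream_header.audio_header.channel_count = 1
stream_header.audio_header.samples_per_second = 16000
stream_header.audio_header.bits_per_sample = 16

# Keep the A2F parameters at their default values here;
# see the documentation for the available fields.
stream_header.face_params.SetInParent()
stream_header.emotion_post_processing_params.SetInParent()
stream_header.blendshape_params.SetInParent()

# The header is sent as the first message of the audio stream.
first_message = controller_pb2.AudioStream(audio_stream_header=stream_header)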

Audio content

The audio_chunk and emotion_map fields have been moved into the AudioWithEmotion message and renamed.

message AudioWithEmotion {
  bytes audio_buffer = 1;

  repeated nvidia_ace.emotion_with_timecode.v1.EmotionWithTimeCode emotions = 2;
}

The new version of A2F allows emotions to change during processing, so instead of a single emotion_map you now send a list of EmotionWithTimeCode objects, each containing the following fields:

message EmotionWithTimeCode {
  double time_code = 1;
  map<string, float> emotion = 2;
}
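
As an illustration, an audio chunk carrying two emotion updates at different time codes could be built like this, assuming Python bindings (the module name and the emotion key names are placeholders; check the documentation for the exact time-code semantics and the supported emotion keys):

# Placeholder module name for the generated nvidia_ace.a2f.v1 messages.
import nvidia_ace_a2f_pb2 as a2f_pb2

pcm_bytes = b"\x00\x00" * 16000  # one second of 16-bit mono silence at 16 kHz

chunk = a2f_pb2.AudioWithEmotion()
chunk.audio_buffer = pcm_bytes

# Emotion applied at the start of this audio chunk...
start_emotion = chunk.emotions.add()
start_emotion.time_code = 0.0
start_emotion.emotion["joy"] = 1.0

# ...and a different emotion later in the clip.
later_emotion = chunk.emotions.add()
later_emotion.time_code = 1.5
later_emotion.emotion["sadness"] = 0.8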

End of audio

The end of the audio is now marked with an empty EndOfAudio message instead of the PacketType enum.

Animation data

The old A2XAnimDataStream prototype contained the following three messages:

message A2XAnimDataStreamHeader {
  bool success = 1;
  string message = 2;
}

message A2XAnimDataStreamInformation {
  int32 code = 1;
  string message = 2;
}

message A2XAnimDataStreamContent {
  string usda = 1;
  map<string, bytes> files = 2;
}

The animation data has migrated from a USD string format to USD represented by gRPC messages, which allows for better data compression. The new AnimationDataStream contains the following fields:

message AnimationDataStream {
  oneof stream_part {
    AnimationDataStreamHeader animation_data_stream_header = 1;
    nvidia_ace.animation_data.v1.AnimationData animation_data = 2;
    Event event = 3;
    nvidia_ace.status.v1.Status status = 4;
  }
}

message AnimationData {
  optional SkelAnimation skel_animation = 1;
  optional AudioWithTimeCode audio = 2;
  optional Camera camera = 3;

  // Metadata such as emotion aggregates, etc...
  map<string, google.protobuf.Any> metadata = 4;
}

Animation data is now carried in SkelAnimation objects inside the AnimationData message. See the gRPC prototypes documentation for further explanation.
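
A sketch of the receiving side, dispatching on the stream_part oneof (module, stub and helper names are placeholders; request_generator builds the AudioStream messages as shown in the audio sections above):

import grpc

# Placeholder module name for the generated stubs.
import a2f_controller_pb2_grpc as controller_grpc


def request_generator():
    # Yield the AudioStream messages built as shown in the audio sections above.
    yield from []


with grpc.insecure_channel("localhost:52000") as channel:  # example address
    stub = controller_grpc.A2FControllerServiceStub(channel)
    for msg in stub.ProcessAudioStream(request_generator()):
        part = msg.WhichOneof("stream_part")
        if part == "animation_data_stream_header":
            print("header:", msg.animation_data_stream_header)
        elif part == "animation_data":
            # The blendshape animation lives in skel_animation.
            print("skel animation:", msg.animation_data.skel_animation)
        elif part == "event":
            print("event:", msg.event)
        elif part == "status":
            print("status:", msg.status)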

Audio2Face without controller

The A2F prototypes are fairly similar to those of the A2F Controller, except that the AudioStreamHeader message adds extra metadata, animation_ids, used for request handling.

message AudioStreamHeader {
  // IDs of the current stream
  nvidia_ace.animation_id.v1.AnimationIds animation_ids = 1;

  nvidia_ace.audio.v1.AudioHeader audio_header = 2;

  // Parameters for updating the facial characteristics of an avatar
  // See the documentation for more information
  FaceParameters face_params = 3;

  // Parameters relative to the emotion blending and processing
  // before using it to generate blendshapes
  // See the documentation for more information
  EmotionPostProcessingParameters emotion_post_processing_params = 4;

  // Multipliers and offsets to apply to the generated blendshape values
  BlendShapeParameters blendshape_params = 5;
}
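
A sketch of filling this header, using only the fields shown above (the module name and the ID values are placeholders):

# Placeholder module name for the generated nvidia_ace.a2f.v1 messages.
import nvidia_ace_a2f_pb2 as a2f_pb2

header = a2f_pb2.AudioStreamHeader()

# IDs used to track this request and stream (see the note on IDs below).
header.animation_ids.request_id = "audio-clip-001"
header.animation_ids.stream_id = "avatar-stream-42"

header.audio_header.channel_count = 1
header.audio_header.samples_per_second = 16000
header.audio_header.bits_per_sample = 16

# The remaining parameters can stay at their defaults, as in the controller case.
header.face_params.SetInParent()
header.emotion_post_processing_params.SetInParent()
header.blendshape_params.SetInParent()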

Following the same pattern, the AnimationDataStreamHeader coming out of A2F also contains this metadata.

message AnimationDataStreamHeader {
  nvidia_ace.animation_id.v1.AnimationIds animation_ids = 1;

  optional string source_service_id = 2;

  optional nvidia_ace.audio.v1.AudioHeader audio_header = 3;

  optional nvidia_ace.animation_data.v1.SkelAnimationHeader skel_animation_header = 4;

  double start_time_code_since_epoch = 5;
}

These IDs allow internal tracking of:
* the audio clips being processed, via request_id
* the current 3D model to which multiple audio clips are applied, via stream_id
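
On the output side, the same IDs can be read back from the header message of the response stream, for example to correlate animation data with the audio clip that produced it. The hypothetical helper below assumes the A2F output uses the same stream_part oneof layout as the controller's AnimationDataStream shown earlier:

def log_stream_ids(responses):
    # "responses" is the iterator returned by the streaming RPC,
    # as in the earlier sketches on this page.
    first = next(iter(responses))
    if first.WhichOneof("stream_part") == "animation_data_stream_header":
        header = first.animation_data_stream_header
        print("request_id:", header.animation_ids.request_id)  # which audio clip
        print("stream_id:", header.animation_ids.stream_id)    # which 3D model
        print("start time code:", header.start_time_code_since_epoch)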