Using the AR Features#

This section provides information about how to use the AR features.

Face Detection and Tracking#

This section provides information about how to use the Face Detection and Tracking feature.

Face Detection for Static Frames (Images)#

To obtain detected bounding boxes, you can explicitly instantiate and run the face detection feature, passing an image buffer as input.

The following example runs the Face Detection AR feature with an input image buffer and output memory to hold bounding boxes:

//Set input image buffer
NvAR_SetObject(faceDetectHandle, NvAR_Parameter_Input(Image), &inputImageBuffer, sizeof(NvCVImage));

//Set output memory for bounding boxes
NvAR_BBoxes output_bboxes{};
output_bboxes.boxes = new NvAR_Rect[25];
output_bboxes.max_boxes = 25;
NvAR_SetObject(faceDetectHandle, NvAR_Parameter_Output(BoundingBoxes), &output_bboxes, sizeof(NvAR_BBoxes));

//Optional: If desired, set memory for bounding-box confidence values
NvAR_Run(faceDetectHandle);

Face Tracking for Temporal Frames (Videos)#

If Temporal is enabled, for example when you process a video stream instead of a single image, only one face is returned: the largest face detected in the first frame, which is then tracked over the subsequent frames.
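If face detection is run explicitly, temporal optimization might be enabled by setting the Temporal configuration property on the handle before the feature is loaded, as in the following minimal sketch (see Configuration Properties for Face Detection and Tracking for the exact property type and default):

//Enable optimization for temporally related frames (set before NvAR_Load)
NvAR_SetU32(faceDetectHandle, NvAR_Parameter_Config(Temporal), 1);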

However, explicitly calling the face detection feature is not the only way to obtain a bounding box that denotes detected faces. For more information about how to use the Landmark Detection or Face3D Reconstruction AR features and return a face bounding box, see Landmark Detection and Tracking and Face 3D Mesh and Tracking.

Landmark Detection and Tracking#

This section provides information about how to use the Landmark Detection and Tracking feature.

Landmark Detection for Static Frames (Images)#

Typically, the inputs to the landmark detection feature are an image and a batch of bounding boxes; currently, the maximum supported batch size is 1. These boxes denote the regions of the image that contain the faces on which you want to run landmark detection.

The following example runs the Landmark Detection AR feature after obtaining bounding boxes from Face Detection:

//Set input image buffer
NvAR_SetObject(landmarkDetectHandle, NvAR_Parameter_Input(Image),
  &inputImageBuffer, sizeof(NvCVImage));

//Pass output bounding boxes from face detection as an input on which
//landmark detection is to be run
NvAR_SetObject(landmarkDetectHandle,
  NvAR_Parameter_Input(BoundingBoxes), &output_bboxes,
  sizeof(NvAR_BBoxes));

//Set landmark detection mode: Performance (0; default) or Quality (1)
unsigned int mode = 0; // Choose performance mode
NvAR_SetU32(landmarkDetectHandle, NvAR_Parameter_Config(Mode), mode);

//Set output buffer to hold detected facial keypoints
std::vector<NvAR_Point2f> facial_landmarks;
facial_landmarks.assign(OUTPUT_SIZE_KPTS, {0.f, 0.f});
NvAR_SetObject(landmarkDetectHandle, NvAR_Parameter_Output(Landmarks),
  facial_landmarks.data(),sizeof(NvAR_Point2f));

NvAR_Run(landmarkDetectHandle);

Alternative Usage of Landmark Detection#

As described in Configuration Properties for Landmark Tracking, the Landmark Detection AR feature supports some optional parameters that determine how the feature can be run.

If bounding boxes are not provided to the Landmark Detection AR feature as inputs, face detection is automatically run on the input image, and the largest face bounding box is selected on which to run landmark detection.

If BoundingBoxes is set as an output property, the property is populated with the selected bounding box that contains the face on which the landmark detection was run. Landmarks is not an optional property; to explicitly run this feature, this property must be set with a provided output buffer.

Landmark Tracking for Temporal Frames (Videos)#

Additionally, if Temporal is enabled, such as when you process a video stream and face detection is run explicitly, only one bounding box is supported as an input for landmark detection.

When face detection is not explicitly run, by providing an input image instead of a bounding box, the largest detected face is automatically selected. The detected face and landmarks are then tracked as an optimization across temporally related frames.

Note

The internally determined bounding box can be queried from this feature, but is not required for the feature to run.

The following example uses the Landmark Detection AR feature to obtain landmarks directly from the image, without first explicitly running Face Detection:

//Set input image buffer
NvAR_SetObject(landmarkDetectHandle, NvAR_Parameter_Input(Image),
  &inputImageBuffer, sizeof(NvCVImage));

//Set output memory for landmarks
std::vector<NvAR_Point2f> facial_landmarks;
facial_landmarks.assign(batchSize * OUTPUT_SIZE_KPTS, {0.f, 0.f});
NvAR_SetObject(landmarkDetectHandle, NvAR_Parameter_Output(Landmarks),
  facial_landmarks.data(),sizeof(NvAR_Point2f));

//Set landmark detection mode: Performance (0; default) or Quality (1)
unsigned int mode = 0; // Choose performance mode
NvAR_SetU32(landmarkDetectHandle, NvAR_Parameter_Config(Mode), mode);

//Optional: If desired, set memory for bounding box
NvAR_BBoxes output_bboxes{};
output_bboxes.boxes = new NvAR_Rect[25];
output_bboxes.max_boxes = 25;
NvAR_SetObject(landmarkDetectHandle,
  NvAR_Parameter_Output(BoundingBoxes), &output_bboxes,
  sizeof(NvAR_BBoxes));

//Optional: If desired, set memory for pose, landmark confidence, or
//even bounding box confidence

NvAR_Run(landmarkDetectHandle);

Face 3D Mesh and Tracking#

This section provides information about how to use the Face 3D Mesh and Tracking feature.

Face 3D Mesh for Static Frames (Images)#

Typically, the inputs to the Face 3D Mesh feature are an image and a set of detected landmark points that correspond to the face on which you want to run 3D reconstruction.

The following example demonstrates the typical usage of this feature, where the detected facial keypoints from the Landmark Detection feature are passed as input to this feature:

//Set facial keypoints from Landmark Detection as an input
err = NvAR_SetObject(faceFitHandle, NvAR_Parameter_Input(Landmarks),
  facial_landmarks.data(),sizeof(NvAR_Point2f));

//Set output memory for face mesh
NvAR_FaceMesh *face_mesh = new NvAR_FaceMesh();
unsigned int n;
err = NvAR_GetU32(faceFitHandle, NvAR_Parameter_Config(VertexCount),
  &n);
face_mesh->num_vertices = n;
err = NvAR_GetU32(faceFitHandle, NvAR_Parameter_Config(TriangleCount),
  &n);
face_mesh->num_triangles = n;
face_mesh->vertices = new NvAR_Vector3f[face_mesh->num_vertices];
face_mesh->tvi = new NvAR_Vector3u16[face_mesh->num_triangles];
err = NvAR_SetObject(faceFitHandle, NvAR_Parameter_Output(FaceMesh),
  face_mesh, sizeof(NvAR_FaceMesh));

//Set output memory for rendering parameters
NvAR_RenderingParams *rendering_params = new NvAR_RenderingParams();
err = NvAR_SetObject(faceFitHandle,
  NvAR_Parameter_Output(RenderingParams), rendering_params,
  sizeof(NvAR_RenderingParams));

err = NvAR_Run(faceFitHandle);

Alternative Usage of the Face 3D Mesh Feature#

Similar to the alternative usage of the Landmark Detection feature, the Face 3D Mesh AR feature can be used to determine the detected face bounding box, the facial keypoints, and a 3D face mesh and its rendering parameters.

Instead of the facial keypoints of a face, if an input image is provided, the face and the facial keypoints are automatically detected and used to run the face mesh fitting. When run this way, if BoundingBoxes, Landmarks, or both are set as optional output properties for this feature, these properties are populated with the bounding box that contains the face and the detected facial keypoints, respectively.

FaceMesh and RenderingParams are not optional properties for this feature. To run the feature, these properties must be set with user-provided output buffers.

Additionally, if this feature is run without providing facial keypoints as an input, the path pointed to by the ModelDir configuration property must also contain the face and landmark detection TRT package files. Optionally, CUDAStream and the Temporal flag can be set for those features.
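The following sketch shows how these configuration properties might be set before NvAR_Load is called; modelPath and stream are placeholders for a model directory string and a CUDA stream created by the application:

//Point the feature at the directory that contains the required TRT
//package files (modelPath is a placeholder)
NvAR_SetString(faceFitHandle, NvAR_Parameter_Config(ModelDir), modelPath);

//Optional: set a CUDA stream and enable temporal optimization for the
//internally run face and landmark detection (stream is a placeholder)
NvAR_SetCudaStream(faceFitHandle, NvAR_Parameter_Config(CUDAStream), stream);
NvAR_SetU32(faceFitHandle, NvAR_Parameter_Config(Temporal), 1);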

The expression coefficients can be used to drive the expressions of an avatar.

Face 3D Mesh Tracking for Temporal Frames (Videos)#

If the Temporal flag is set and face and landmark detection are run internally, these features are optimized for temporally related frames.

This means that face and facial keypoints are tracked across frames, and only one bounding box is returned, if requested, as an output. The Temporal flag is not supported by the Face 3D Mesh feature if the Landmark Detection or Face Detection features are called explicitly. In that case, you must provide the flag directly to those features.
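For example, when the detection features are created explicitly, the flag might be set on each of those handles before they are loaded (a sketch; the handle names follow the earlier examples):

//Set the Temporal flag directly on the explicitly created features
NvAR_SetU32(faceDetectHandle, NvAR_Parameter_Config(Temporal), 1);
NvAR_SetU32(landmarkDetectHandle, NvAR_Parameter_Config(Temporal), 1);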

Note

The facial keypoints and the face bounding box that were determined internally can be queried from this feature but are not required for the feature to run.

The following example uses the Mesh Tracking AR feature to obtain the face mesh directly from the image, without explicitly running Landmark Detection or Face Detection:

//Set input image buffer instead of providing facial keypoints
NvAR_SetObject(faceFitHandle, NvAR_Parameter_Input(Image),
  &inputImageBuffer, sizeof(NvCVImage));

//Set output memory for face mesh
NvAR_FaceMesh *face_mesh = new NvAR_FaceMesh();

unsigned int n;
err = NvAR_GetU32(faceFitHandle, NvAR_Parameter_Config(VertexCount),
  &n);
face_mesh->num_vertices = n;
err = NvAR_GetU32(faceFitHandle, NvAR_Parameter_Config(TriangleCount),
  &n);
face_mesh->num_triangles = n;
face_mesh->vertices = new NvAR_Vector3f[face_mesh->num_vertices];

face_mesh->tvi = new NvAR_Vector3u16[face_mesh->num_triangles];

NvAR_SetObject(faceFitHandle, NvAR_Parameter_Output(FaceMesh),
  face_mesh, sizeof(NvAR_FaceMesh));

//Set output memory for rendering parameters
NvAR_RenderingParams *rendering_params = new NvAR_RenderingParams();

NvAR_SetObject(faceFitHandle, NvAR_Parameter_Output(RenderingParams),
  rendering_params, sizeof(NvAR_RenderingParams));

//Optional: Set facial keypoints as an output
NvAR_SetObject(faceFitHandle, NvAR_Parameter_Output(Landmarks),
  facial_landmarks.data(),sizeof(NvAR_Point2f));

//Optional: Set output memory for bounding boxes or other parameters,
//such as pose, bounding box confidence, or landmarks confidence

NvAR_Run(faceFitHandle);

Eye Contact#

This feature estimates the gaze of a person from an eye patch that is extracted by using landmarks and redirects the eyes to make the person look at the camera within a permissible range of eye and head angles. The feature also supports a mode where the estimation can be obtained without redirection. The Eye Contact feature can be invoked by using the GazeRedirection feature ID.

The Eye Contact feature has the following options:

  • Gaze Estimation

  • Gaze Redirection

In this release, gaze estimation and redirection are supported for only one face in the frame.

Gaze Estimation#

Gaze estimation requires face detection and landmarks as input. The inputs to the gaze estimator are an image buffer and buffers to hold facial landmarks and confidence scores. The output of gaze estimation is the gaze vector (pitch, yaw) in radians. A float array must be set as the output buffer to hold the estimated gaze. The GazeRedirect parameter must be set to false.

The following example runs the Gaze Estimation with an input image buffer and output memory to hold the estimated gaze vector:

bool bGazeRedirect=false;
NvAR_SetU32(gazeRedirectHandle, NvAR_Parameter_Config(GazeRedirect),
  bGazeRedirect);

//Set input image buffer
NvAR_SetObject(gazeRedirectHandle, NvAR_Parameter_Input(Image),
  &inputImageBuffer, sizeof(NvCVImage));

//Set output memory for gaze vector
float gaze_angles_vector[2];
NvAR_SetF32Array(gazeRedirectHandle,
  NvAR_Parameter_Output(OutputGazeVector), gaze_angles_vector, batchSize
  * 2);

//Optional: Set output memory for landmarks, head pose, head
//translation, and gaze direction
std::vector<NvAR_Point2f> facial_landmarks;
facial_landmarks.assign(batchSize * OUTPUT_SIZE_KPTS, {0.f, 0.f});
NvAR_SetObject(gazeRedirectHandle, NvAR_Parameter_Output(Landmarks),
  facial_landmarks.data(),sizeof(NvAR_Point2f));

NvAR_Quaternion head_pose;
NvAR_SetObject(gazeRedirectHandle, NvAR_Parameter_Output(HeadPose),
  &head_pose, sizeof(NvAR_Quaternion));

float head_translation[3] = {0.f};
NvAR_SetF32Array(gazeRedirectHandle,
  NvAR_Parameter_Output(OutputHeadTranslation), head_translation,
  batchSize * 3);

NvAR_Run(gazeRedirectHandle);

Gaze Redirection#

Gaze Redirection takes the same inputs as gaze estimation. In addition to the outputs of gaze estimation, an output image buffer of the same size as the input image buffer must be set to store the gaze-redirected image. The gaze is redirected to look at the camera within a certain range of gaze angles and head poses. Outside this range, the feature disengages. Head pose, head translation, and gaze direction can be optionally set as outputs. The GazeRedirect parameter must be set to true.

The following example runs Gaze Redirection with an input image buffer, output memory to hold the estimated gaze vector, and an output image buffer to hold the gaze-redirected image:

bool bGazeRedirect=true;
NvAR_SetU32(gazeRedirectHandle, NvAR_Parameter_Config(GazeRedirect),
  bGazeRedirect);

//Set input image buffer
NvAR_SetObject(gazeRedirectHandle, NvAR_Parameter_Input(Image),
  &inputImageBuffer, sizeof(NvCVImage));

//Set output memory for gaze vector
float gaze_angles_vector[2];
NvAR_SetF32Array(gazeRedirectHandle,
  NvAR_Parameter_Output(OutputGazeVector), gaze_angles_vector, batchSize
  * 2);

//Set output image buffer
NvAR_SetObject(gazeRedirectHandle, NvAR_Parameter_Output(Image),
  &outputImageBuffer, sizeof(NvCVImage));

NvAR_Run(gazeRedirectHandle);

Randomized Look Away#

A continuous redirection of gaze to look at the camera might give a perception of “stare,” which some users might find unnatural or undesired. To occasionally break eye contact, gaze redirection provides optional randomized look-aways. While the gaze is always expected to be redirected toward the camera within the range of operation, enabling look away makes the user occasionally break gaze lock with the camera through a micro-movement of the eyes at randomly chosen time intervals. The EnableLookAway parameter must be set to true to enable this behavior. Additionally, the optional LookAwayOffsetMax, LookAwayIntervalMin, and LookAwayIntervalRange parameters can be used to tune the extent and frequency of the look-aways. For a detailed description and default settings of these parameters, see Configuration Properties for Eye Contact.
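The following sketch shows how these parameters might be set; the numeric values are illustrative only, and the defaults and exact types are listed in Configuration Properties for Eye Contact:

//Enable randomized look away (illustrative values)
NvAR_SetU32(gazeRedirectHandle, NvAR_Parameter_Config(EnableLookAway), 1);

//Optional tuning: maximum look-away offset (degrees) and the minimum
//interval and interval range (frames) between look-aways
NvAR_SetU32(gazeRedirectHandle, NvAR_Parameter_Config(LookAwayOffsetMax), 5);
NvAR_SetU32(gazeRedirectHandle, NvAR_Parameter_Config(LookAwayIntervalMin),
  100);
NvAR_SetU32(gazeRedirectHandle, NvAR_Parameter_Config(LookAwayIntervalRange),
  250);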

Range Control#

The gaze redirection feature redirects the eyes to look at the camera within a certain range of head and eye motion in which eye contact is desired and looks natural. Beyond this range, the feature gradually transitions away from looking at the camera toward the estimated gaze and eventually turns off in a seamless manner. To accommodate various use cases and user preferences, optional range parameters let you control the range of gaze angles and head poses in which gaze redirection occurs and the range in which the transition occurs before redirection is turned off.

GazePitchThresholdLow and GazeYawThresholdLow define the parameters for the pitch and yaw angles of the estimated gaze within which gaze is redirected toward the camera. Beyond these angles, redirected gaze transitions away from the camera and toward the estimated gaze, turning off redirection beyond GazePitchThresholdHigh and GazeYawThresholdHigh, respectively. Similarly, for head pose, HeadPitchThresholdLow and HeadYawThresholdLow define the parameters for pitch and yaw angles of the head pose within which gaze is redirected toward the camera. Beyond these angles, redirected gaze transitions away from the camera and toward the estimated gaze, turning off redirection beyond HeadPitchThresholdHigh and HeadYawThresholdHigh. For a detailed description and default settings of these parameters, see Configuration Properties for Eye Contact.
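The following sketch shows how one pair of these thresholds might be set; the values are illustrative and assume floating-point angles in degrees, and the remaining thresholds follow the same pattern (see Configuration Properties for Eye Contact for the defaults and exact types):

//Illustrative range control for gaze pitch: redirect fully below 20
//degrees and turn off beyond 30 degrees (assumed float, in degrees)
NvAR_SetF32(gazeRedirectHandle,
  NvAR_Parameter_Config(GazePitchThresholdLow), 20.0f);
NvAR_SetF32(gazeRedirectHandle,
  NvAR_Parameter_Config(GazePitchThresholdHigh), 30.0f);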

3D Body Pose Tracking#

This feature relies on temporal information to track the person in the scene, where the keypoints information from the previous frame is used to estimate the keypoints of the next frame.

3D Body Pose Tracking consists of the following parts:

  • Body Detection

  • 3D Keypoint Detection

The feature supports single or multiple people in the frame and both full-body and upper-body images and videos.

3D Body Pose Tracking for Static Frames (Images)#

To obtain the bounding boxes that encapsulate the people in the scene, you can explicitly instantiate and run body detection, passing the image buffer as input.

The following example runs the Body Detection with an input image buffer and output memory to hold bounding boxes:

//Set input image buffer
NvAR_SetObject(bodyDetectHandle, NvAR_Parameter_Input(Image),
  &inputImageBuffer, sizeof(NvCVImage));

//Set output memory for bounding boxes
NvAR_BBoxes output_bboxes{};
output_bboxes.boxes = new NvAR_Rect[25];
output_bboxes.max_boxes = 25;
NvAR_SetObject(bodyDetectHandle, NvAR_Parameter_Output(BoundingBoxes),
  &output_bboxes, sizeof(NvAR_BBoxes));

//Optional: If desired, set memory for bounding-box confidence values

NvAR_Run(bodyDetectHandle);

The input to 3D Keypoint Detection is an image. The feature outputs the 2D keypoints, 3D keypoints, keypoint confidence scores, and a bounding box that encapsulates the person.

The following example runs the 3D Body Pose Detection AR feature:

//Set input image buffer
NvAR_SetObject(keyPointDetectHandle, NvAR_Parameter_Input(Image),
  &inputImageBuffer, sizeof(NvCVImage));

//Pass output bounding boxes from body detection as an input on which
//landmark detection is to be run
NvAR_SetObject(keyPointDetectHandle,
  NvAR_Parameter_Input(BoundingBoxes), &output_bboxes,
  sizeof(NvAR_BBoxes));

//Set output buffer to hold detected keypoints
std::vector<NvAR_Point2f> keypoints;
std::vector<NvAR_Point3f> keypoints3D;
std::vector<NvAR_Point3f> jointAngles;
std::vector<float> keypoints_confidence;

// Get the number of keypoints
unsigned int numKeyPoints;

NvAR_GetU32(keyPointDetectHandle, NvAR_Parameter_Config(NumKeyPoints),
  &numKeyPoints);
keypoints.assign(batchSize * numKeyPoints, {0.f, 0.f});
keypoints3D.assign(batchSize * numKeyPoints, {0.f, 0.f, 0.f});
jointAngles.assign(batchSize * numKeyPoints, {0.f, 0.f, 0.f});
keypoints_confidence.assign(batchSize * numKeyPoints, 0.f);
NvAR_SetObject(keyPointDetectHandle, NvAR_Parameter_Output(KeyPoints),
  keypoints.data(), sizeof(NvAR_Point2f));
NvAR_SetObject(keyPointDetectHandle,
  NvAR_Parameter_Output(KeyPoints3D), keypoints3D.data(),
  sizeof(NvAR_Point3f));
NvAR_SetF32Array(keyPointDetectHandle,
  NvAR_Parameter_Output(KeyPointsConfidence),
  keypoints_confidence.data(), batchSize * numKeyPoints);
NvAR_SetObject(keyPointDetectHandle,
  NvAR_Parameter_Output(JointAngles), jointAngles.data(),
  sizeof(NvAR_Point3f));

//Set output memory for bounding boxes
NvAR_BBoxes output_bboxes{};
output_bboxes.boxes = new NvAR_Rect[25];
output_bboxes.max_boxes = 25;
NvAR_SetObject(keyPointDetectHandle,
  NvAR_Parameter_Output(BoundingBoxes), &output_bboxes,
  sizeof(NvAR_BBoxes));

NvAR_Run(keyPointDetectHandle);

3D Body Pose Tracking for Temporal Frames (Videos)#

The feature relies on temporal information to track the person in the scene. The keypoints information from the previous frame is used to estimate the keypoints of the next frame.

The following example uses the 3D Body Pose Tracking AR feature to obtain 3D Body Pose Keypoints directly from the image:

//Set input image buffer
NvAR_SetObject(keyPointDetectHandle, NvAR_Parameter_Input(Image),
  &inputImageBuffer, sizeof(NvCVImage));

//Pass output bounding boxes from body detection as an input on which
//landmark detection is to be run
NvAR_SetObject(keyPointDetectHandle,
  NvAR_Parameter_Input(BoundingBoxes), &output_bboxes,
  sizeof(NvAR_BBoxes));

//Set output buffer to hold detected keypoints
std::vector<NvAR_Point2f> keypoints;
std::vector<NvAR_Point3f> keypoints3D;
std::vector<NvAR_Point3f> jointAngles;
std::vector<float> keypoints_confidence;

// Get the number of keypoints
unsigned int numKeyPoints;
NvAR_GetU32(keyPointDetectHandle, NvAR_Parameter_Config(NumKeyPoints),
  &numKeyPoints);
keypoints.assign(batchSize * numKeyPoints, {0.f, 0.f});
keypoints3D.assign(batchSize * numKeyPoints, {0.f, 0.f, 0.f});
jointAngles.assign(batchSize * numKeyPoints, {0.f, 0.f, 0.f});
keypoints_confidence.assign(batchSize * numKeyPoints, 0.f);
NvAR_SetObject(keyPointDetectHandle, NvAR_Parameter_Output(KeyPoints),
  keypoints.data(), sizeof(NvAR_Point2f));
NvAR_SetObject(keyPointDetectHandle,
  NvAR_Parameter_Output(KeyPoints3D), keypoints3D.data(),
  sizeof(NvAR_Point3f));
NvAR_SetF32Array(keyPointDetectHandle,
  NvAR_Parameter_Output(KeyPointsConfidence),
  keypoints_confidence.data(), batchSize * numKeyPoints);
NvAR_SetObject(keyPointDetectHandle,
  NvAR_Parameter_Output(JointAngles), jointAngles.data(),
  sizeof(NvAR_Point3f));

//Set output memory for bounding boxes
NvAR_BBoxes output_bboxes{};
output_bboxes.boxes = new NvAR_Rect[25];
output_bboxes.max_boxes = 25;
NvAR_SetObject(keyPointDetectHandle,
  NvAR_Parameter_Output(BoundingBoxes), &output_bboxes,
  sizeof(NvAR_BBoxes));

NvAR_Run(keyPointDetectHandle);

Multi-Person Tracking for 3D Body Pose Tracking#

The feature provides the ability to track multiple people in the following ways:

  • In the scene across different frames.

  • When they leave the scene and enter the scene again.

  • When they are completely occluded by an object or another person and reappear (controlled using Shadow Tracking Age).

Shadow Tracking Age is the period of time during which a target is still tracked in the background even when it is not associated with a detector object. When a target is not associated with a detector object in a frame, shadowTrackingAge, an internal counter of the target, is incremented. After the target is associated with a detector object again, shadowTrackingAge is reset to zero. When the counter reaches the shadow tracking age, the target is discarded and is no longer tracked. The value is measured in frames; the default is 90.

Probation Age is the length of the probationary period. After an object reaches this age, it is considered valid and is assigned an ID. This helps filter out false positives, where spurious objects are detected for only a few frames. The value is measured in frames; the default is 10.

Maximum Targets Tracked is the maximum number of targets to be tracked, which includes targets that are active in the frame and targets in shadow-tracking mode. When you select this value, account for both active and inactive targets. The minimum is 1; the default is 30.

Note

Currently, at most eight people are actively tracked in the scene. More than eight people can appear over the course of the video, but at most eight are tracked in any given frame. Temporal mode is not supported for Multi-Person Tracking. The batch size must be 8 when Multi-Person Tracking is enabled.

The following example uses the 3D Body Pose Tracking AR feature to enable multi-person tracking and obtain the tracking ID for each person:

// Set input image buffer
NvAR_SetObject(keyPointDetectHandle, NvAR_Parameter_Input(Image),
  &inputImageBuffer, sizeof(NvCVImage));

// Enable Multi-Person Tracking
NvAR_SetU32(keyPointDetectHandle, NvAR_Parameter_Config(TrackPeople),
  bEnablePeopleTracking);

// Set Shadow Tracking Age
NvAR_SetU32(keyPointDetectHandle,
  NvAR_Parameter_Config(ShadowTrackingAge), shadowTrackingAge);

// Set Probation Age
NvAR_SetU32(keyPointDetectHandle, NvAR_Parameter_Config(ProbationAge),
  probationAge);

// Set Maximum Targets to be tracked
NvAR_SetU32(keyPointDetectHandle,
  NvAR_Parameter_Config(MaxTargetsTracked), maxTargetsTracked);

// Set output buffer to hold detected keypoints
std::vector<NvAR_Point2f> keypoints;
std::vector<NvAR_Point3f> keypoints3D;
std::vector<NvAR_Point3f> jointAngles;
std::vector<float> keypoints_confidence;

// Get the number of keypoints
unsigned int numKeyPoints;
NvAR_GetU32(keyPointDetectHandle, NvAR_Parameter_Config(NumKeyPoints),
  &numKeyPoints);
keypoints.assign(batchSize * numKeyPoints, {0.f, 0.f});
keypoints3D.assign(batchSize * numKeyPoints, {0.f, 0.f, 0.f});
jointAngles.assign(batchSize * numKeyPoints, {0.f, 0.f, 0.f});
keypoints_confidence.assign(batchSize * numKeyPoints, 0.f);
NvAR_SetObject(keyPointDetectHandle, NvAR_Parameter_Output(KeyPoints),
  keypoints.data(), sizeof(NvAR_Point2f));
NvAR_SetObject(keyPointDetectHandle,
  NvAR_Parameter_Output(KeyPoints3D), keypoints3D.data(),
  sizeof(NvAR_Point3f));
NvAR_SetF32Array(keyPointDetectHandle,
  NvAR_Parameter_Output(KeyPointsConfidence),
  keypoints_confidence.data(), batchSize * numKeyPoints);
NvAR_SetObject(keyPointDetectHandle,
  NvAR_Parameter_Output(JointAngles), jointAngles.data(),
  sizeof(NvAR_Point3f));

// Set output memory for tracking bounding boxes
NvAR_TrackingBBoxes output_tracking_bboxes{};
std::vector<NvAR_TrackingBBox> output_tracking_bbox_data;
output_tracking_bbox_data.assign(maxTargetsTracked, { 0.f, 0.f, 0.f,
  0.f, 0 });
output_tracking_bboxes.boxes = output_tracking_bbox_data.data();
output_tracking_bboxes.max_boxes =
  (uint8_t)output_tracking_bbox_data.size();

NvAR_SetObject(keyPointDetectHandle,
  NvAR_Parameter_Output(TrackingBoundingBoxes), &output_tracking_bboxes,
  sizeof(NvAR_TrackingBBoxes));

NvAR_Run(keyPointDetectHandle);

Facial Expression Estimation#

This section provides information about how to use the Facial Expression Estimation feature.

Facial Expression Estimation for Static Frames (Images)#

Typically, the inputs to the Facial Expression Estimation feature are an image and a set of detected landmark points that correspond to the face on which you want to estimate facial expression coefficients.

The following example shows the typical usage of this feature, where the detected facial keypoints from the Landmark Detection feature are passed as input to this feature:

//Set facial keypoints from Landmark Detection as an input
err = NvAR_SetObject(faceExpressionHandle,
  NvAR_Parameter_Input(Landmarks), facial_landmarks.data(),
  sizeof(NvAR_Point2f));

//Set output memory for expression coefficients
unsigned int expressionCount;
err = NvAR_GetU32(faceExpressionHandle,
  NvAR_Parameter_Config(ExpressionCount), &expressionCount);

float *expressionCoeffs = new float[expressionCount];
err = NvAR_SetF32Array(faceExpressionHandle,
  NvAR_Parameter_Output(ExpressionCoefficients), expressionCoeffs,
  expressionCount);

//Set output memory for pose rotation quaternion
NvAR_Quaternion *pose = new NvAR_Quaternion();
err = NvAR_SetObject(faceExpressionHandle, NvAR_Parameter_Output(Pose),
  pose, sizeof(NvAR_Quaternion));

//Optional: If desired, set memory for bounding boxes and their confidences

err = NvAR_Run(faceExpressionHandle);

Alternative Usage of the Facial Expression Estimation Feature#

Like the alternative usage of the Landmark Detection feature and the Face 3D Mesh feature, the Facial Expression Estimation feature can be used to determine the detected face bounding box, the facial keypoints, and a 3D face mesh and its rendering parameters.

When you provide an input image instead of the facial keypoints of a face, the face and the facial keypoints are automatically detected and are used to run the expression estimation. This way, if BoundingBoxes, Landmarks, or both are set as optional output properties for this feature, these properties are populated with the bounding box that contains the face and the detected facial keypoints.

ExpressionCoefficients and Pose are not optional properties for this feature. To run the feature, these properties must be set with user-provided output buffers. If this feature is also run without providing facial keypoints as an input, the path to which the ModelDir configuration property points must also contain the face and landmark detection TRT package files. Optionally, CUDAStream and the Temporal flag can be set for those features.
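For example, the model directory and an optional CUDA stream might be configured as follows (a sketch; modelPath and stream are placeholders):

//Directory that contains the required TRT package files (placeholder path)
NvAR_SetString(faceExpressionHandle, NvAR_Parameter_Config(ModelDir),
  modelPath);

//Optional: CUDA stream for the internally run detection features
NvAR_SetCudaStream(faceExpressionHandle, NvAR_Parameter_Config(CUDAStream),
  stream);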

The expression coefficients can be used to drive the expressions of an avatar.

Note

The facial keypoints or the face bounding box that were determined internally can be queried from this feature but are not required for the feature to run.

The following example uses the Facial Expression Estimation feature to obtain the face expression coefficients directly from the image, without explicitly running Landmark Detection or Face Detection:

//Set input image buffer instead of providing facial keypoints
NvAR_SetObject(faceExpressionHandle, NvAR_Parameter_Input(Image),
  &inputImageBuffer, sizeof(NvCVImage));

//Set output memory for expression coefficients
unsigned int expressionCount;
err = NvAR_GetU32(faceExpressionHandle,
  NvAR_Parameter_Config(ExpressionCount), &expressionCount);
float *expressionCoeffs = new float[expressionCount];
err = NvAR_SetF32Array(faceExpressionHandle,
  NvAR_Parameter_Output(ExpressionCoefficients), expressionCoeffs,
  expressionCount);

//Set output memory for pose rotation quaternion
NvAR_Quaternion *pose = new NvAR_Quaternion();
err = NvAR_SetObject(faceExpressionHandle, NvAR_Parameter_Output(Pose),
  pose, sizeof(NvAR_Quaternion));

//Optional: Set facial keypoints as an output
NvAR_SetObject(faceExpressionHandle, NvAR_Parameter_Output(Landmarks),
  facial_landmarks.data(),sizeof(NvAR_Point2f));

//Optional: Set output memory for bounding boxes or other parameters,
//such as pose, bounding box confidence, and landmarks confidence

NvAR_Run(faceExpressionHandle);

Facial Expression Estimation Tracking for Temporal Frames (Videos)#

If the Temporal flag is set and face and landmark detection are run internally, these features are optimized for temporally related frames. This means that face and facial keypoints are tracked across frames and, if requested, only one bounding box is returned as an output. If the Face Detection and Landmark Detection features are explicitly used, they need their own Temporal flags to be set. This flag also affects the Facial Expression Estimation feature through the NVAR_TEMPORAL_FILTER_FACIAL_EXPRESSIONS, NVAR_TEMPORAL_FILTER_FACIAL_GAZE, and NVAR_TEMPORAL_FILTER_ENHANCE_EXPRESSIONS bits.
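For example, the filter bits might be combined and applied to the Temporal configuration property of the Facial Expression Estimation feature (a sketch; the bit names are defined in NvAR_defs.h):

//Combine the temporal filter bits and apply them before NvAR_Load
unsigned int temporalFlags = NVAR_TEMPORAL_FILTER_FACIAL_EXPRESSIONS |
  NVAR_TEMPORAL_FILTER_FACIAL_GAZE |
  NVAR_TEMPORAL_FILTER_ENHANCE_EXPRESSIONS;
NvAR_SetU32(faceExpressionHandle, NvAR_Parameter_Config(Temporal),
  temporalFlags);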

Video Live Portrait#

This section provides information about how to use the Video Live Portrait feature. Video Live Portrait, also known as Photo Animation, animates a person’s portrait photo (the source image) using a driving video by matching the head movement and facial expressions in it.

Video Live Portrait Mode#

The feature supports three modes:

  • Mode 1 (Native mode): A face crop is extracted from the portrait photo. Video Live Portrait drives the face crop and outputs images with a fixed resolution of 512 × 512 (with performance model) or 1024 × 1024 (with quality model).

  • Mode 2 (Registration & blending mode): A face crop is extracted from the portrait photo. Video Live Portrait drives the face crop. The animated crop is registered and blended back into the portrait photo. The result covers the full portrait photo, maintaining the same resolution as the original. If the portrait photo includes only the face and shoulders, we recommend using mode 1 for best results.

  • Mode 3 (Inset & blending mode): Mode 3 is primarily useful for face-animation workflows like narration use cases where the driving video has only limited or stable head positions and movements. Unlike mode 2, this mode does not account for the interplay between head and body/neck movements, offering only a blended output.

The following example uses mode 2 to generate output:

// Allocate source image
NvCVImage_Realloc(&source_image, source_width, source_height, NVCV_BGR,
  NVCV_U8, NVCV_CHUNKY, NVCV_GPU, 1);

// Allocate driving image (not necessarily the same resolution as the
// source image)
NvCVImage_Realloc(&drive_image, drive_width, drive_height, NVCV_BGR,
  NVCV_U8, NVCV_CHUNKY, NVCV_GPU, 1);

// Allocate generated image (same resolution as the source image)
NvCVImage_Realloc(&gen_image, source_width, source_height, NVCV_BGR,
  NVCV_U8, NVCV_CHUNKY, NVCV_GPU, 1);

// Set mode 2
NvAR_SetU32(lp_han, NvAR_Parameter_Config(Mode), 2);

// load is required before setting Images
NvAR_Load(lp_han);

// Set input and output images
NvAR_SetObject(lp_han, NvAR_Parameter_Input(SourceImage), &source_image,
  sizeof(NvCVImage));

NvAR_SetObject(lp_han, NvAR_Parameter_Input(DriveImage), &drive_image,
  sizeof(NvCVImage));

NvAR_SetObject(lp_han, NvAR_Parameter_Output(GeneratedImage),
  &gen_image, sizeof(NvCVImage));

NvAR_Run(lp_han);

Neutral Drive Image Reset#

We provide an input parameter named NeutralDriveImage that can be used to update the neutral drive image on the fly. Setting NeutralDriveImage is required only if the neutral drive image must be reset from a previous one. Simply setting DriveImage as demonstrated in the preceding section is sufficient if the first drive image should be used as the neutral.

NvAR_SetObject(lp_han, NvAR_Parameter_Input(NeutralDriveImage),
  neutral_drive_image,sizeof(NvCVImage));

For more information on selecting a neutral driving image, see Frame Selection.

Bounding Boxes and Temporal Stability#

You can obtain the face bounding boxes that encapsulate the people in the driving video. To obtain the detected face bounding boxes, you can explicitly instantiate and run Video Live Portrait as shown in the following example:

// Create bounding box objects
NvAR_BBoxes *m_bboxes = new NvAR_BBoxes;
std::vector<NvAR_Rect> m_face_boxes_data(25);
m_bboxes->boxes = m_face_boxes_data.data();
m_bboxes->max_boxes = 25;
m_bboxes->num_boxes = 0;

// Set LivePortrait output
NvAR_SetObject(lp_han, NvAR_Parameter_Output(BoundingBoxes), m_bboxes,
  sizeof(NvAR_BBoxes));

// The m_bboxes are populated after each Run call

For best temporal stability, Video Live Portrait keeps reusing the first detected face bounding box in the driving video as long as the person stays inside it. Otherwise, detection is re-triggered and the bounding box is updated.

RGBA Source Image#

Video Live Portrait supports an RGBA source image as input, with the alpha channel used as a segmentation mask. The generated image is also in RGBA format, with the alpha channel representing a segmentation mask. This enables the application to replace the original background with a customer-supplied background image, which is particularly useful when the background in the source image is busy.

The following example shows how to create a BGRA NvCVImage for both the source image and the generated image. The driving image does not need to be in RGBA format.

// Allocate source image
NvCVImage_Realloc(&source_image, source_width, source_height, NVCV_BGRA,
  NVCV_U8, NVCV_CHUNKY, NVCV_GPU, 1);

// Allocate generated image (same resolution as the source image in
// mode 2 or mode 3)
NvCVImage_Realloc(&gen_image, source_width, source_height, NVCV_BGRA,
  NVCV_U8, NVCV_CHUNKY, NVCV_GPU, 1);

The Video Live Portrait sample application demonstrates how to do background replacement using the RGBA generated image. Refer to it for the details.

Frame Selection#

To ensure good quality in the Video Live Portrait effect, the driving video should start with a front-facing face, frontal gaze, and neutral expression. This can be accomplished in either of the following ways:

  1. Advise the person in the driving video to keep a front-facing face, frontal gaze, and neutral expression in the very first frame.

  2. Use the Frame Selection effect from the AR SDK.

Frame Selection is a standalone effect in the AR SDK that can help pick a good neutral frame from a video. Your application can query the Frame Selection status until a good neutral status is obtained. Then your application can feed that frame as the first driving frame to start Video Live Portrait.

The following example demonstrates how to use Frame Selection to query the frame neutrality status. For more information on Frame Selection status codes, refer to NvAR_defs.h.

// The minimum gap (in frames) between two good neutral frames (to
// avoid frequent updating)
NvAR_SetU32(fs_han, NvAR_Parameter_Config(GoodFrameMinInterval), 30);

// Frame Selection will return expired status before reaching the number
// of ActiveDuration frames
NvAR_SetU32(fs_han, NvAR_Parameter_Config(ActiveDuration), 150);

// Strategy 1: Improving threshold
// Strategy 0: Fixed threshold
NvAR_SetU32(fs_han, NvAR_Parameter_Config(Strategy), 1);

// Load
NvAR_Load(fs_han);

// Set Frame Selection input image
NvAR_SetObject(fs_han, NvAR_Parameter_Input(Image), &drive_image,
  sizeof(NvCVImage));

// Run Frame Selection effect
NvAR_Run(fs_han);

// Get the Frame Selection status of current input image
NvAR_GetU32(fs_han, NvAR_Parameter_Output(FrameSelectorStatus),
  &frame_selector_status);

// Check whether the current frame is a good neutral frame. If yes,
// reset the neutral driving image of the Video Live Portrait effect.
if (frame_selector_status == NVAR_FRAME_SELECTOR_SUCCESS){
  NvAR_SetObject(lp_han, NvAR_Parameter_Input(NeutralDriveImage),
    &current_image,sizeof(NvCVImage));
}

Speech Live Portrait#

This section provides information about how to use the Speech Live Portrait feature. Speech Live Portrait animates a person’s portrait photo (the source image) using an audio input by animating the lip motion to match that of the audio.

Speech Live Portrait Mode#

The feature supports three modes:

  • Mode 1 (Native mode): A face crop is extracted from the portrait photo. Speech Live Portrait drives the face crop and outputs images with a fixed resolution of 512 × 512 (for both the performance model and the quality model).

  • Mode 2 (Registration and blending mode): A face crop is extracted from the portrait photo. Speech Live Portrait drives the face crop. The animated crop is registered and blended back into the portrait photo. The output image includes both the animated face crop and the surrounding area, with the same resolution as the portrait photo. The registration algorithm considers the relationships between the face crop and body/neck movements. If the portrait photo includes only the face and shoulders, we recommend using Mode 1 for best results.

  • Mode 3 (Inset and blending mode): Mode 3 is a lightweight and faster version of mode 2, without registration. The resulting output also covers the full portrait photo. If quality is the primary concern, mode 2 is usually preferred over mode 3.

The following example uses mode 2 to generate output:

// Allocate source image
NvCVImage_Realloc(&source_image, source_width, source_height, NVCV_BGR,
  NVCV_U8, NVCV_CHUNKY, NVCV_GPU, 1);

// Allocate generated image (same resolution as the source image)
NvCVImage_Realloc(&gen_image, source_width, source_height, NVCV_BGR,
  NVCV_U8, NVCV_CHUNKY, NVCV_GPU, 1);

// Set mode 2
NvAR_SetU32(slp_han, NvAR_Parameter_Config(Mode), 2);

// load is required before setting images
NvAR_Load(slp_han);

// Set input and output images
NvAR_SetObject(slp_han, NvAR_Parameter_Input(SourceImage),
  &source_image, sizeof(NvCVImage));

NvAR_SetF32Array(slp_han, NvAR_Parameter_Input(AudioFrameBuffer),
  audio_frame.data(), audio_frame.size());

NvAR_SetObject(slp_han, NvAR_Parameter_Output(GeneratedImage),
  &gen_image, sizeof(NvCVImage));

NvAR_Run(slp_han);

LipSync#

This section provides information about how to use the LipSync feature. LipSync uses an audio input to modify a video of a person, animating the person’s lips and lower face to match the audio.

LipSync Processing#

The LipSync feature takes synchronized audio samples and video frames as inputs, and produces modified video frames as output. The following example demonstrates how to process video and audio frames with the LipSync feature:

// Allocate source image
NvCVImage_Realloc(&source_image, source_width, source_height, NVCV_BGR, NVCV_U8, NVCV_CHUNKY, NVCV_GPU, 1);

// Allocate source audio frame
std::vector<float> audio_frame(audio_frame_length);

// Allocate generated image (same resolution as the source image)
NvCVImage_Realloc(&gen_image, source_width, source_height, NVCV_BGR, NVCV_U8, NVCV_CHUNKY, NVCV_GPU, 1);

// Load is required before setting images
NvAR_Load(lipsync_han);

// Set input and output images
NvAR_SetObject(lipsync_han, NvAR_Parameter_Input(Image), &source_image, sizeof(NvCVImage));
NvAR_SetF32Array(lipsync_han, NvAR_Parameter_Input(AudioFrameBuffer), audio_frame.data(), audio_frame.size());
NvAR_SetObject(lipsync_han, NvAR_Parameter_Output(Image), &gen_image, sizeof(NvCVImage));

// Run the feature
NvAR_Run(lipsync_han);

Input/Output Latency#

There is a fixed amount of latency between the input and output video frames; that is, at the beginning of a video, the feature reads a fixed number of input frames before it generates the first output video frame. At the end of a video, the feature generates the final output frames without requiring new input video frames.

The following example shows how to query the latency between input and output frames.

// Number of initial input video frames to process before retrieving an output frame.
uint32_t latency_frame_cnt = 0;
NvAR_GetU32(lipsync_han, NvAR_Parameter_Config(NumInitialFrames), &latency_frame_cnt);

An output parameter indicates whether the first output frame has been generated.

unsigned int output_ready = 0;
NvAR_SetU32Array(lipsync_han, NvAR_Parameter_Output(Ready), &output_ready, 1);
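
For example, a processing loop might feed synchronized audio and video frames each iteration and only start consuming output once the Ready flag becomes nonzero (a sketch; fetchNextFrame and consumeOutputFrame are hypothetical application helpers):

// Sketch of a per-frame loop; the input, audio, and output buffers were
// bound to the feature in the earlier examples
while (fetchNextFrame(&source_image, audio_frame)) { // hypothetical helper
  NvAR_Run(lipsync_han);

  // output_ready stays 0 for the first NumInitialFrames frames
  if (output_ready) {
    consumeOutputFrame(&gen_image); // hypothetical helper
  }
}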