Using the AR Features#
This section provides information about how to use the AR features.
Face Detection and Tracking#
This section provides information about how to use the Face Detection and Tracking feature.
Face Detection for Static Frames (Images)#
To obtain detected bounding boxes, explicitly instantiate and run the face detection feature, which takes an image buffer as input.
The following example runs the Face Detection AR feature with an input image buffer and output memory to hold bounding boxes:
//Set input image buffer
NvAR_SetObject(faceDetectHandle, NvAR_Parameter_Input(Image), &inputImageBuffer, sizeof(NvCVImage));
//Set output memory for bounding boxes
NvAR_BBoxes output_bboxes{};
output_bboxes.boxes = new NvAR_Rect[25];
output_bboxes.max_boxes = 25;
NvAR_SetObject(faceDetectHandle, NvAR_Parameter_Output(BoundingBoxes), &output_bboxes, sizeof(NvAR_BBoxes));
//Optional: If desired, set memory for bounding-box confidence values
NvAR_Run(faceDetectHandle);
Face Tracking for Temporal Frames (Videos)#
If Temporal is enabled, such as when you process a video frame instead of an image, only one face is returned. The largest face appears for the first frame, and this face is subsequently tracked over the following frames.
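A minimal sketch of enabling this behavior, assuming the Temporal configuration property is exposed as an unsigned integer on the face detection handle and is set before the feature is loaded:

```cpp
// Enable temporal tracking for video input (sketch; set before NvAR_Load)
unsigned int bTemporal = 1;
NvAR_SetU32(faceDetectHandle, NvAR_Parameter_Config(Temporal), bTemporal);
```

With this flag set, subsequent calls to NvAR_Run on consecutive frames return the single tracked face rather than re-detecting all faces.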
However, explicitly calling the face detection feature is not the only way to obtain a bounding box that denotes detected faces. For more information about how to use the Landmark Detection AR feature and return a face bounding box, see Landmark Detection and Tracking.
Landmark Detection and Tracking#
This section provides information about how to use the Landmark Detection and Tracking feature.
Landmark Detection for Static Frames (Images)#
Typically, the inputs to the landmark detection feature are an input image and a batch of bounding boxes. Currently, the maximum batch size is 1. These boxes denote the regions of the image that contain the faces on which you want to run landmark detection.
The following example runs the Landmark Detection AR feature after obtaining bounding boxes from Face Detection:
//Set input image buffer
NvAR_SetObject(landmarkDetectHandle, NvAR_Parameter_Input(Image),
&inputImageBuffer, sizeof(NvCVImage));
//Pass output bounding boxes from face detection as an input on which
//landmark detection is to be run
NvAR_SetObject(landmarkDetectHandle,
NvAR_Parameter_Input(BoundingBoxes), &output_bboxes,
sizeof(NvAR_BBoxes));
//Set landmark detection mode: Performance (0; default) or Quality (1)
unsigned int mode = 0; // Choose performance mode
NvAR_SetU32(landmarkDetectHandle, NvAR_Parameter_Config(Mode), mode);
//Set output buffer to hold detected facial keypoints
std::vector<NvAR_Point2f> facial_landmarks;
facial_landmarks.assign(OUTPUT_SIZE_KPTS, {0.f, 0.f});
NvAR_SetObject(landmarkDetectHandle, NvAR_Parameter_Output(Landmarks),
facial_landmarks.data(),sizeof(NvAR_Point2f));
NvAR_Run(landmarkDetectHandle);
Alternative Usage of Landmark Detection#
As described in Configuration Properties for Landmark Tracking, the Landmark Detection AR feature supports some optional parameters that determine how the feature can be run.
If bounding boxes are not provided to the Landmark Detection AR feature as inputs, face detection is automatically run on the input image, and the largest face bounding box is selected on which to run landmark detection.
If BoundingBoxes is set as an output property, the property is populated
with the selected bounding box that contains the face on which the
landmark detection was run. Landmarks is not an optional property;
to explicitly run this feature, this property must be set with a
provided output buffer.
Landmark Tracking for Temporal Frames (Videos)#
Additionally, if Temporal is enabled, such as when you process a video stream and face detection is run explicitly, only one bounding box is supported as an input for landmark detection.
When face detection is not explicitly run, and an input image is provided instead of a bounding box, the largest detected face is automatically selected. The detected face and landmarks are then tracked as an optimization across temporally related frames.
Note
The internally determined bounding box can be queried from this feature, but is not required for the feature to run.
The following example uses the Landmark Detection AR feature to obtain landmarks directly from the image, without first explicitly running Face Detection:
//Set input image buffer
NvAR_SetObject(landmarkDetectHandle, NvAR_Parameter_Input(Image),
&inputImageBuffer, sizeof(NvCVImage));
//Set output memory for landmarks
std::vector<NvAR_Point2f> facial_landmarks;
facial_landmarks.assign(batchSize * OUTPUT_SIZE_KPTS, {0.f, 0.f});
NvAR_SetObject(landmarkDetectHandle, NvAR_Parameter_Output(Landmarks),
facial_landmarks.data(),sizeof(NvAR_Point2f));
//Set landmark detection mode: Performance (0; default) or Quality (1)
unsigned int mode = 0; // Choose performance mode
NvAR_SetU32(landmarkDetectHandle, NvAR_Parameter_Config(Mode), mode);
//Optional: If desired, set memory for bounding box
NvAR_BBoxes output_bboxes{};
output_bboxes.boxes = new NvAR_Rect[25];
output_bboxes.max_boxes = 25;
NvAR_SetObject(landmarkDetectHandle,
NvAR_Parameter_Output(BoundingBoxes), &output_bboxes,
sizeof(NvAR_BBoxes));
//Optional: If desired, set memory for pose, landmark confidence, or
//even bounding box confidence
NvAR_Run(landmarkDetectHandle);
Eye Contact#
This feature estimates the gaze of a person from an eye patch that was
extracted by using landmarks and redirects the eyes to make the person
look at the camera in a permissible range of eye and head angles. The
feature also supports a mode where the estimation can be obtained
without redirection. The eye contact feature can be invoked by using the
GazeRedirection feature ID.
The Eye Contact feature has the following modes:
Gaze Estimation
Gaze Redirection
In this release, gaze estimation and redirection are supported for only one face in the frame.
Gaze Estimation#
The estimation of gaze requires face detection and landmarks as input.
The inputs to the gaze estimator are an input image buffer and buffers
to hold facial landmarks and confidence scores. The output of gaze
estimation is the gaze vector (pitch, yaw) values in radians. A float
array must be set as the output buffer to hold estimated gaze. The
GazeRedirect parameter must be set to false.
The following example runs the Gaze Estimation with an input image buffer and output memory to hold the estimated gaze vector:
bool bGazeRedirect=false;
NvAR_SetU32(gazeRedirectHandle, NvAR_Parameter_Config(GazeRedirect),
bGazeRedirect);
//Set input image buffer
NvAR_SetObject(gazeRedirectHandle, NvAR_Parameter_Input(Image),
&inputImageBuffer, sizeof(NvCVImage));
//Set output memory for gaze vector
float gaze_angles_vector[2];
NvAR_SetF32Array(gazeRedirectHandle,
NvAR_Parameter_Output(OutputGazeVector), gaze_angles_vector, batchSize
* 2);
//Optional: Set output memory for landmarks, head pose, head
//translation, and gaze direction
std::vector<NvAR_Point2f> facial_landmarks;
facial_landmarks.assign(batchSize * OUTPUT_SIZE_KPTS, {0.f, 0.f});
NvAR_SetObject(gazeRedirectHandle, NvAR_Parameter_Output(Landmarks),
facial_landmarks.data(),sizeof(NvAR_Point2f));
NvAR_Quaternion head_pose;
NvAR_SetObject(gazeRedirectHandle, NvAR_Parameter_Output(HeadPose),
&head_pose, sizeof(NvAR_Quaternion));
float head_translation[3] = {0.f};
NvAR_SetF32Array(gazeRedirectHandle,
NvAR_Parameter_Output(OutputHeadTranslation), head_translation,
batchSize * 3);
NvAR_Run(gazeRedirectHandle);
Gaze Redirection#
Gaze Redirection takes the same inputs as Gaze Estimation.
In addition to the outputs of gaze estimation, to store the gaze
redirected image, an output image buffer of the same size as the input
image buffer must be set. The gaze is redirected to look at the
camera within a certain range of gaze angles and head poses. Outside
this range, the feature disengages. Head pose, head translation, and
gaze direction can be optionally set as outputs. The GazeRedirect
parameter must be set to true.
The following example runs Gaze Redirection with an input image buffer, output memory to hold the estimated gaze vector, and an output image buffer to hold the gaze redirected image:
bool bGazeRedirect=true;
NvAR_SetU32(gazeRedirectHandle, NvAR_Parameter_Config(GazeRedirect),
bGazeRedirect);
//Set input image buffer
NvAR_SetObject(gazeRedirectHandle, NvAR_Parameter_Input(Image),
&inputImageBuffer, sizeof(NvCVImage));
//Set output memory for gaze vector
float gaze_angles_vector[2];
NvAR_SetF32Array(gazeRedirectHandle,
NvAR_Parameter_Output(OutputGazeVector), gaze_angles_vector, batchSize
* 2);
//Set output image buffer
NvAR_SetObject(gazeRedirectHandle, NvAR_Parameter_Output(Image),
&outputImageBuffer, sizeof(NvCVImage));
NvAR_Run(gazeRedirectHandle);
Randomized Look Away#
A continuous redirection of gaze to look at the camera might give
a perception of “stare,” which some users might find unnatural or
undesired. To occasionally break eye contact, gaze redirection provides
optional randomized look-aways. While the gaze is always expected to
redirect toward the camera within the range of operation, enabling
look-away makes the user occasionally break gaze lock with the camera
through a micro-movement of the eyes at randomly chosen time intervals.
The EnableLookAway parameter must be set to true to enable this feature.
Additionally, parameters LookAwayOffsetMax, LookAwayIntervalMin, and
LookAwayIntervalRange are optional parameters that can be used to tune
the extent and frequency of look away. For a detailed description and
default settings of these parameters, see
Configuration Properties for Eye
Contact.
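The following sketch shows how these parameters might be set; the parameter names come from this section, but the value types and the illustrative values shown here are assumptions (see the configuration reference for the actual types and defaults):

```cpp
// Enable randomized look-aways (sketch; values illustrative)
NvAR_SetU32(gazeRedirectHandle, NvAR_Parameter_Config(EnableLookAway), 1);
// Optional tuning of the extent and frequency of look-aways
NvAR_SetU32(gazeRedirectHandle, NvAR_Parameter_Config(LookAwayOffsetMax), 5);
NvAR_SetU32(gazeRedirectHandle, NvAR_Parameter_Config(LookAwayIntervalMin), 100);
NvAR_SetU32(gazeRedirectHandle, NvAR_Parameter_Config(LookAwayIntervalRange), 250);
```

Set these configuration properties on the gaze redirection handle before loading the feature.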
Range Control#
The gaze redirection feature redirects the eyes to look at the camera within a certain range of head and eye motion in which eye contact is desired and looks natural. Beyond this range, the feature gradually transitions away from looking at the camera toward the estimated gaze and eventually turns off in a seamless manner. To accommodate various use cases and user preferences, optional range parameters let you control the range of gaze angles and head poses in which gaze redirection occurs, and the range in which the transition occurs before redirection is turned off.
GazePitchThresholdLow and GazeYawThresholdLow define the
parameters for the pitch and yaw angles of the estimated gaze within
which gaze is redirected toward the camera. Beyond these angles,
redirected gaze transitions away from the camera and toward the
estimated gaze, turning off redirection beyond GazePitchThresholdHigh
and GazeYawThresholdHigh, respectively. Similarly, for head pose,
HeadPitchThresholdLow and HeadYawThresholdLow define the parameters for
pitch and yaw angles of the head pose within which gaze is redirected
toward the camera. Beyond these angles, redirected gaze transitions
away from the camera and toward the estimated gaze, turning off
redirection beyond HeadPitchThresholdHigh and HeadYawThresholdHigh.
For a detailed description and default settings of these parameters, see
Configuration Properties for Eye
Contact.
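As a hedged sketch, the threshold parameters named above might be set as 32-bit floats on the gaze redirection handle; the angle values below are illustrative only (see the configuration reference for the actual defaults and units):

```cpp
// Tighten the redirection range (sketch; values illustrative, in radians)
NvAR_SetF32(gazeRedirectHandle, NvAR_Parameter_Config(GazePitchThresholdLow), 0.35f);
NvAR_SetF32(gazeRedirectHandle, NvAR_Parameter_Config(GazePitchThresholdHigh), 0.50f);
NvAR_SetF32(gazeRedirectHandle, NvAR_Parameter_Config(GazeYawThresholdLow), 0.35f);
NvAR_SetF32(gazeRedirectHandle, NvAR_Parameter_Config(GazeYawThresholdHigh), 0.50f);
NvAR_SetF32(gazeRedirectHandle, NvAR_Parameter_Config(HeadPitchThresholdLow), 0.30f);
NvAR_SetF32(gazeRedirectHandle, NvAR_Parameter_Config(HeadPitchThresholdHigh), 0.50f);
NvAR_SetF32(gazeRedirectHandle, NvAR_Parameter_Config(HeadYawThresholdLow), 0.30f);
NvAR_SetF32(gazeRedirectHandle, NvAR_Parameter_Config(HeadYawThresholdHigh), 0.50f);
```

Each Low threshold marks where the transition away from the camera begins, and the corresponding High threshold marks where redirection turns off entirely.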
3D Body Pose Tracking#
This feature relies on temporal information to track the person in the scene, where the keypoints information from the previous frame is used to estimate the keypoints of the next frame.
3D Body Pose Tracking consists of the following parts:
Body Detection
3D Keypoint Detection
The feature supports single or multiple people in the frame and both full-body and upper-body images and videos.
3D Body Pose Tracking for Static Frames (Images)#
To obtain bounding boxes that encapsulate the people in the scene, explicitly instantiate and run body detection, passing the image buffer as input.
The following example runs the Body Detection with an input image buffer and output memory to hold bounding boxes:
//Set input image buffer
NvAR_SetObject(bodyDetectHandle, NvAR_Parameter_Input(Image),
&inputImageBuffer, sizeof(NvCVImage));
//Set output memory for bounding boxes
NvAR_BBoxes output_bboxes{};
output_bboxes.boxes = new NvAR_Rect[25];
output_bboxes.max_boxes = 25;
NvAR_SetObject(bodyDetectHandle, NvAR_Parameter_Output(BoundingBoxes),
&output_bboxes, sizeof(NvAR_BBoxes));
//Optional: If desired, set memory for bounding-box confidence values
NvAR_Run(bodyDetectHandle);
The input to 3D Keypoint Detection is an image. The feature outputs the 2D keypoints, 3D keypoints, keypoint confidence scores, and a bounding box encapsulating the person.
The following example runs the 3D Body Pose Detection AR feature:
//Set input image buffer
NvAR_SetObject(keyPointDetectHandle, NvAR_Parameter_Input(Image),
&inputImageBuffer, sizeof(NvCVImage));
//Pass output bounding boxes from body detection as an input on which
//keypoint detection is to be run
NvAR_SetObject(keyPointDetectHandle,
NvAR_Parameter_Input(BoundingBoxes), &output_bboxes,
sizeof(NvAR_BBoxes));
//Set output buffer to hold detected keypoints
std::vector<NvAR_Point2f> keypoints;
std::vector<NvAR_Point3f> keypoints3D;
std::vector<NvAR_Point3f> jointAngles;
std::vector<float> keypoints_confidence;
// Get the number of keypoints
unsigned int numKeyPoints;
NvAR_GetU32(keyPointDetectHandle, NvAR_Parameter_Config(NumKeyPoints),
&numKeyPoints);
keypoints.assign(batchSize * numKeyPoints, {0.f, 0.f});
keypoints3D.assign(batchSize * numKeyPoints, {0.f, 0.f, 0.f});
jointAngles.assign(batchSize * numKeyPoints, {0.f, 0.f, 0.f});
keypoints_confidence.assign(batchSize * numKeyPoints, 0.f);
NvAR_SetObject(keyPointDetectHandle, NvAR_Parameter_Output(KeyPoints),
keypoints.data(), sizeof(NvAR_Point2f));
NvAR_SetObject(keyPointDetectHandle,
NvAR_Parameter_Output(KeyPoints3D), keypoints3D.data(),
sizeof(NvAR_Point3f));
NvAR_SetF32Array(keyPointDetectHandle,
NvAR_Parameter_Output(KeyPointsConfidence),
keypoints_confidence.data(), batchSize * numKeyPoints);
NvAR_SetObject(keyPointDetectHandle,
NvAR_Parameter_Output(JointAngles), jointAngles.data(),
sizeof(NvAR_Point3f));
//Set output memory for bounding boxes
NvAR_BBoxes output_bboxes{};
output_bboxes.boxes = new NvAR_Rect[25];
output_bboxes.max_boxes = 25;
NvAR_SetObject(keyPointDetectHandle,
NvAR_Parameter_Output(BoundingBoxes), &output_bboxes,
sizeof(NvAR_BBoxes));
NvAR_Run(keyPointDetectHandle);
3D Body Pose Tracking for Temporal Frames (Videos)#
The feature relies on temporal information to track the person in the scene. The keypoints information from the previous frame is used to estimate the keypoints of the next frame.
The following example uses the 3D Body Pose Tracking AR feature to obtain 3D Body Pose Keypoints directly from the image:
//Set input image buffer
NvAR_SetObject(keyPointDetectHandle, NvAR_Parameter_Input(Image),
&inputImageBuffer, sizeof(NvCVImage));
//Pass output bounding boxes from body detection as an input on which
//keypoint detection is to be run
NvAR_SetObject(keyPointDetectHandle,
NvAR_Parameter_Input(BoundingBoxes), &output_bboxes,
sizeof(NvAR_BBoxes));
//Set output buffer to hold detected keypoints
std::vector<NvAR_Point2f> keypoints;
std::vector<NvAR_Point3f> keypoints3D;
std::vector<NvAR_Point3f> jointAngles;
std::vector<float> keypoints_confidence;
// Get the number of keypoints
unsigned int numKeyPoints;
NvAR_GetU32(keyPointDetectHandle, NvAR_Parameter_Config(NumKeyPoints),
&numKeyPoints);
keypoints.assign(batchSize * numKeyPoints, {0.f, 0.f});
keypoints3D.assign(batchSize * numKeyPoints, {0.f, 0.f, 0.f});
jointAngles.assign(batchSize * numKeyPoints, {0.f, 0.f, 0.f});
keypoints_confidence.assign(batchSize * numKeyPoints, 0.f);
NvAR_SetObject(keyPointDetectHandle, NvAR_Parameter_Output(KeyPoints),
keypoints.data(), sizeof(NvAR_Point2f));
NvAR_SetObject(keyPointDetectHandle,
NvAR_Parameter_Output(KeyPoints3D), keypoints3D.data(),
sizeof(NvAR_Point3f));
NvAR_SetF32Array(keyPointDetectHandle,
NvAR_Parameter_Output(KeyPointsConfidence),
keypoints_confidence.data(), batchSize * numKeyPoints);
NvAR_SetObject(keyPointDetectHandle,
NvAR_Parameter_Output(JointAngles), jointAngles.data(),
sizeof(NvAR_Point3f));
//Set output memory for bounding boxes
NvAR_BBoxes output_bboxes{};
output_bboxes.boxes = new NvAR_Rect[25];
output_bboxes.max_boxes = 25;
NvAR_SetObject(keyPointDetectHandle,
NvAR_Parameter_Output(BoundingBoxes), &output_bboxes,
sizeof(NvAR_BBoxes));
NvAR_Run(keyPointDetectHandle);
Multi-Person Tracking for 3D Body Pose Tracking#
The feature provides the ability to track multiple people in the following ways:
In the scene across different frames.
When they leave the scene and enter the scene again.
When they are completely occluded by an object or another person and reappear (controlled using Shadow Tracking Age).
Shadow Tracking Age is a parameter that represents the period
of time where a target is still being tracked in the background even
when the target is not associated with a detector object. When a target
is not associated with a detector object in a frame,
shadowTrackingAge, an internal counter for the target, is incremented.
After the target is associated with a detector object, shadowTrackingAge
is reset to zero. When the target age reaches the shadow tracking
age, the target is discarded and is no longer tracked. This is measured
by the number of frames; the default is 90.
Probation Age is the length of the probationary period. After an object reaches this age, it is considered valid and is assigned an ID. This helps reduce false positives, where spurious objects are detected for only a few frames. This is measured in frames; the default is 10.
Maximum Targets Tracked is the maximum number of targets to be tracked, which can be composed of the targets that are active in the frame and ones in shadow-tracking mode. When you select this value, keep the active and inactive targets in mind. The minimum is 1 and the default is 30.
Note
Currently, we actively track only eight people in the scene. More than eight people can appear throughout the video, but a maximum of eight people are tracked in any given frame. Temporal mode is not supported for Multi-Person Tracking. The batch size should be 8 when Multi-Person Tracking is enabled.
The following example uses the 3D Body Pose Tracking AR feature to enable multi-person tracking and obtain the tracking ID for each person:
// Set input image buffer
NvAR_SetObject(keyPointDetectHandle, NvAR_Parameter_Input(Image),
&inputImageBuffer, sizeof(NvCVImage));
// Enable Multi-Person Tracking
NvAR_SetU32(keyPointDetectHandle, NvAR_Parameter_Config(TrackPeople),
bEnablePeopleTracking);
// Set Shadow Tracking Age
NvAR_SetU32(keyPointDetectHandle, NvAR_Parameter_Config(ShadowTrackingAge),
shadowTrackingAge);
// Set Probation Age
NvAR_SetU32(keyPointDetectHandle, NvAR_Parameter_Config(ProbationAge),
probationAge);
// Set Maximum Targets to be tracked
NvAR_SetU32(keyPointDetectHandle, NvAR_Parameter_Config(MaxTargetsTracked),
maxTargetsTracked);
// Set output buffer to hold detected keypoints
std::vector<NvAR_Point2f> keypoints;
std::vector<NvAR_Point3f> keypoints3D;
std::vector<NvAR_Point3f> jointAngles;
std::vector<float> keypoints_confidence;
// Get the number of keypoints
unsigned int numKeyPoints;
NvAR_GetU32(keyPointDetectHandle, NvAR_Parameter_Config(NumKeyPoints),
&numKeyPoints);
keypoints.assign(batchSize * numKeyPoints, {0.f, 0.f});
keypoints3D.assign(batchSize * numKeyPoints, {0.f, 0.f, 0.f});
jointAngles.assign(batchSize * numKeyPoints, {0.f, 0.f, 0.f});
keypoints_confidence.assign(batchSize * numKeyPoints, 0.f);
NvAR_SetObject(keyPointDetectHandle, NvAR_Parameter_Output(KeyPoints),
keypoints.data(), sizeof(NvAR_Point2f));
NvAR_SetObject(keyPointDetectHandle,
NvAR_Parameter_Output(KeyPoints3D), keypoints3D.data(),
sizeof(NvAR_Point3f));
NvAR_SetF32Array(keyPointDetectHandle,
NvAR_Parameter_Output(KeyPointsConfidence),
keypoints_confidence.data(), batchSize * numKeyPoints);
NvAR_SetObject(keyPointDetectHandle,
NvAR_Parameter_Output(JointAngles), jointAngles.data(),
sizeof(NvAR_Point3f));
// Set output memory for tracking bounding boxes
NvAR_TrackingBBoxes output_tracking_bboxes{};
std::vector<NvAR_TrackingBBox> output_tracking_bbox_data;
output_tracking_bbox_data.assign(maxTargetsTracked, { 0.f, 0.f, 0.f, 0.f, 0 });
output_tracking_bboxes.boxes = output_tracking_bbox_data.data();
output_tracking_bboxes.max_boxes = (uint8_t)maxTargetsTracked;
NvAR_SetObject(keyPointDetectHandle,
NvAR_Parameter_Output(TrackingBoundingBoxes), &output_tracking_bboxes,
sizeof(NvAR_TrackingBBoxes));
NvAR_Run(keyPointDetectHandle);
Facial Expression Estimation#
This section provides information about how to use the Facial Expression Estimation feature.
Facial Expression Estimation for Static Frames (Images)#
Typically, the input to the Facial Expression Estimation feature is an input image and a set of detected landmark points that correspond to the face on which we want to estimate face expression coefficients.
The following example shows the typical usage of this feature, where the detected facial keypoints from the Landmark Detection feature are passed as input to this feature:
//Set facial keypoints from Landmark Detection as an input
err = NvAR_SetObject(faceExpressionHandle,
NvAR_Parameter_Input(Landmarks), facial_landmarks.data(),
sizeof(NvAR_Point2f));
//Set output memory for expression coefficients
unsigned int expressionCount;
err = NvAR_GetU32(faceExpressionHandle,
NvAR_Parameter_Config(ExpressionCount), &expressionCount);
float* expressionCoeffs = new float[expressionCount];
err = NvAR_SetF32Array(faceExpressionHandle,
NvAR_Parameter_Output(ExpressionCoefficients), expressionCoeffs,
expressionCount);
//Set output memory for pose rotation quaternion
NvAR_Quaternion pose;
err = NvAR_SetObject(faceExpressionHandle, NvAR_Parameter_Output(Pose),
&pose, sizeof(NvAR_Quaternion));
//Optional: If desired, set memory for bounding boxes and their confidences
err = NvAR_Run(faceExpressionHandle);
Alternative Usage of the Facial Expression Estimation Feature#
Like the alternative usage of the Landmark Detection feature, the Facial Expression Estimation feature can be used to determine the detected face bounding box, the facial keypoints, and a 3D face mesh and its rendering parameters.
When you provide an input image instead of the facial keypoints of a
face, the face and the facial keypoints are automatically detected and
are used to run the expression estimation. This way, if BoundingBoxes,
Landmarks, or both are set as optional output properties for this
feature, these properties are populated with the bounding box that
contains the face and the detected facial keypoints.
ExpressionCoefficients and Pose are not optional properties for
this feature. To run the feature, these properties must be set with
user-provided output buffers. If this feature is also run without
providing facial keypoints as an input, the path to which the
ModelDir configuration property points must also contain the face
and landmark detection TRT package files. Optionally, CUDAStream and
the Temporal flag can be set for those features.
The expression coefficients can be used to drive the expressions of an avatar.
Note
The facial keypoints or the face bounding box that were determined internally can be queried from this feature but are not required for the feature to run.
The following example uses the Facial Expression Estimation feature to obtain the face expression coefficients directly from the image, without explicitly running Landmark Detection or Face Detection:
//Set input image buffer instead of providing facial keypoints
NvAR_SetObject(faceExpressionHandle, NvAR_Parameter_Input(Image),
&inputImageBuffer, sizeof(NvCVImage));
//Set output memory for expression coefficients
unsigned int expressionCount;
err = NvAR_GetU32(faceExpressionHandle,
NvAR_Parameter_Config(ExpressionCount), &expressionCount);
float* expressionCoeffs = new float[expressionCount];
err = NvAR_SetF32Array(faceExpressionHandle,
NvAR_Parameter_Output(ExpressionCoefficients), expressionCoeffs,
expressionCount);
//Set output memory for pose rotation quaternion
NvAR_Quaternion pose;
err = NvAR_SetObject(faceExpressionHandle, NvAR_Parameter_Output(Pose),
&pose, sizeof(NvAR_Quaternion));
//Optional: Set facial keypoints as an output
NvAR_SetObject(faceExpressionHandle, NvAR_Parameter_Output(Landmarks),
facial_landmarks.data(),sizeof(NvAR_Point2f));
//Optional: Set output memory for bounding boxes or other parameters,
//such as pose, bounding box confidence, and landmarks confidence
NvAR_Run(faceExpressionHandle);
Facial Expression Estimation Tracking for Temporal Frames (Videos)#
If the Temporal flag is set and face and landmark detection are
run internally, these features are optimized for temporally related
frames. This means that face and facial keypoints are tracked across
frames and, if requested, only one bounding box is returned as an
output. If the Face Detection and Landmark Detection features are
explicitly used, they need their own Temporal flags to be set.
This flag also affects the Facial Expression Estimation feature through
the NVAR_TEMPORAL_FILTER_FACIAL_EXPRESSIONS,
NVAR_TEMPORAL_FILTER_FACIAL_GAZE, and
NVAR_TEMPORAL_FILTER_ENHANCE_EXPRESSIONS bits.
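As a sketch of how these bits might be combined, assuming the Temporal configuration property accepts a bitmask of the filter flags named above:

```cpp
// Enable selected temporal filters on the Facial Expression Estimation
// handle (sketch; this particular combination is illustrative)
unsigned int temporalFlags = NVAR_TEMPORAL_FILTER_FACIAL_EXPRESSIONS |
                             NVAR_TEMPORAL_FILTER_FACIAL_GAZE;
NvAR_SetU32(faceExpressionHandle, NvAR_Parameter_Config(Temporal),
temporalFlags);
```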
LipSync#
This section provides information about how to use the LipSync feature. LipSync uses an audio input to modify a video of a person, animating the person’s lips and lower face to match the audio.
LipSync Processing#
The LipSync feature takes synchronized audio samples and video frames as inputs, and produces modified video frames as output. The following example demonstrates how to process video and audio frames with the LipSync feature:
// Allocate source image
NvCVImage_Realloc(&source_image, source_width, source_height, NVCV_BGR, NVCV_U8, NVCV_CHUNKY, NVCV_GPU, 1);
// Allocate source audio frame
std::vector<float> audio_frame(audio_frame_length);
// Allocate generated image (same resolution as the source image)
NvCVImage_Realloc(&gen_image, source_width, source_height, NVCV_BGR, NVCV_U8, NVCV_CHUNKY, NVCV_GPU, 1);
// Set the video frame rate
NvAR_SetF32(lipsync_han, NvAR_Parameter_Config(VideoFPS), 30.0f);
// Load is required before setting images
NvAR_Load(lipsync_han);
// Set input and output images
NvAR_SetObject(lipsync_han, NvAR_Parameter_Input(Image), &source_image, sizeof(NvCVImage));
NvAR_SetF32Array(lipsync_han, NvAR_Parameter_Input(AudioFrameBuffer), audio_frame.data(), audio_frame.size());
NvAR_SetObject(lipsync_han, NvAR_Parameter_Output(Image), &gen_image, sizeof(NvCVImage));
// Run the feature
NvAR_Run(lipsync_han);
Region-Based LipSync#
The LipSync feature supports specifying speaker regions through the LipSyncRegionData input
property. This property specifies the area of the frame that contains the speaker’s face, as well as
per-region settings such as bypass and speaking flags.
Important
LipSync supports only a single region with is_speaking set to a nonzero value at any given time.
The feature smoothly activates or deactivates LipSync processing when the is_speaking flag changes,
or when regions with different tracking_id values are marked as is_speaking in subsequent frames.
If multiple regions are provided with is_speaking set, only one is processed.
If your application tracks multiple speakers, ensure that only one region is marked as is_speaking in each frame.
Attention
The SpeakerData input property is no longer supported. Use LipSyncRegionData instead.
The following example demonstrates how to set up region-based LipSync:
// Define the speaker region
NvAR_LipSyncRegion region{};
region.bbox = {x, y, width, height}; // Face bounding box
region.tracking_id = 0;
region.bypass = 0.0f; // 0 = fully animated, 1 = no animation
region.region_type = 0; // 0 = ROI (face detection within ROI), 1 = face box
region.is_speaking = 1; // Non-zero if this person is speaking
NvAR_LipSyncRegionData region_data{};
region_data.regions = &region;
region_data.num_regions = 1;
NvAR_SetObject(lipsync_han, NvAR_Parameter_Input(LipSyncRegionData),
&region_data, sizeof(NvAR_LipSyncRegionData));
LipSync Activation Output#
An output parameter provides information about the LipSync activation, including the activation strength and the face location and size:
NvAR_LipSyncActivation activation{};
NvAR_SetObject(lipsync_han, NvAR_Parameter_Output(Activation),
&activation, sizeof(NvAR_LipSyncActivation));
After NvAR_Run completes, the following information is available:
activation.strength: Indication of how much the face was modified (0–1).
activation.center_x, activation.center_y: Coordinates of the face center in pixels.
activation.size: Face size in pixels.
Input/Output Latency#
Between the input and output video frames is a fixed amount of latency; that is, at the beginning of a video
the first calls to NvAR_Run cause the feature to ingest the input video frame but do not generate an output frame.
After a fixed number of input-only calls to NvAR_Run, the feature generates the first output video frame.
At the end of a video, the feature generates the final output frames without requiring new input video frames.
The fixed latency between input and output can be queried using the NumInitialFrames parameter.
The following example shows how to query the latency between input and output frames:
// Number of initial input video frames to process before retrieving an output frame.
uint32_t latency_frame_cnt = 0;
NvAR_GetU32(lipsync_han, NvAR_Parameter_Config(NumInitialFrames), &latency_frame_cnt);
Alternatively, you can use the Ready output parameter to check directly whether an output frame has been
generated after each call to NvAR_Run.
unsigned int output_ready = 0;
NvAR_SetU32Array(lipsync_han, NvAR_Parameter_Output(Ready), &output_ready, 1);
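Putting the latency query and the Ready flag together, a processing loop might look like the following sketch (the frame-copy steps are placeholders, and the drain behavior at end of video follows the description above):

```cpp
// Sketch of a LipSync frame loop that accounts for the fixed latency.
// source_image, gen_image, audio_frame, lipsync_han, output_ready, and
// latency_frame_cnt are set up as shown earlier in this section.
for (int i = 0; i < num_frames; ++i) {
    // ...copy the next video frame into source_image and the matching
    //    audio samples into audio_frame...
    NvAR_Run(lipsync_han);
    if (output_ready) {
        // gen_image now holds a valid output frame; consume it here
    }
}
// After the last input frame, keep calling NvAR_Run to drain the
// remaining latency_frame_cnt output frames without new input.
```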
Active Speaker Detection#
This section provides information about how to use the Active Speaker Detection feature. Active Speaker Detection identifies which person in a video is currently speaking by analyzing both video frames and synchronized audio tracks. Typical use cases include in-studio capture (e.g. sports, news, interviews) with one or more speakers talking in turn without overlapping speech.
Input Requirements#
Video: Input resolution must be between 360 × 360 and 3840 × 2160 pixels (BGR, 8-bit, GPU).
Audio: For best results, prepare audio as follows:
Use diarized audio from the same (original, non-translated) source as the video, with frame-accurate synchronization from start to end. Tracks can terminate early if the speaker stops.
Provide one track per speaker. Each track must contain all the speech from any given speaker, and only that speaker’s speech. The number of input tracks must exactly match the number of unique speakers. Use silence (zeros) when that speaker is not talking or use NvAR_ActiveAudioIds to filter out inactive speakers. Ensure that tracks are clean and isolated (no background noise or music).
Each track must be mono. Supported sample rates are 16 kHz, 44.1 kHz, and 48 kHz.
Performance: The feature has startup latency on the order of a few seconds and is designed for real-time processing of a single speaker on supported hardware.
Active Speaker Detection Processing#
The Active Speaker Detection feature takes video frames and multiple audio tracks as inputs, and produces tracking data with speaker identification information as output. The following example demonstrates how to process video and audio frames with the Active Speaker Detection feature:
// Create feature
NvAR_FeatureHandle speaker_detection_handle;
NvAR_Create(NvAR_Feature_ActiveSpeakerDetection, &speaker_detection_handle);
// Set up CUDA stream
CUstream stream;
NvAR_CudaStreamCreate(&stream);
NvAR_SetCudaStream(speaker_detection_handle, NvAR_Parameter_Config(CUDAStream), stream);
// Configure the feature
NvAR_SetString(speaker_detection_handle, NvAR_Parameter_Config(ModelDir), model_path);
NvAR_SetF32(speaker_detection_handle, NvAR_Parameter_Config(VideoFPS), 30.0f);
NvAR_SetU32(speaker_detection_handle, NvAR_Parameter_Config(NumAudioStreams), 2);
NvAR_SetU32(speaker_detection_handle, NvAR_Parameter_Config(SampleRate), 44100);
uint32_t max_num_output_identities;
NvAR_GetU32(speaker_detection_handle, NvAR_Parameter_Config(MaxNumOutputIdentities), &max_num_output_identities);
// Load the feature
NvAR_Load(speaker_detection_handle);
// Allocate input image (BGR, U8, GPU)
NvCVImage input_image;
NvCVImage_Realloc(&input_image, width, height, NVCV_BGR, NVCV_U8, NVCV_CHUNKY, NVCV_GPU, 1);
// Set up audio frame data for multiple audio tracks
std::vector<NvAR_AudioFrame> audio_frames(num_audio_tracks);
std::vector<std::vector<float>> audio_buffers(num_audio_tracks);
for (unsigned int i = 0; i < num_audio_tracks; ++i) {
  audio_buffers[i].resize(audio_frame_length); // See "Variable Audio Frame Size" to determine audio_frame_length
  audio_frames[i].audio_data = audio_buffers[i].data(); // Audio samples for a single frame
  audio_frames[i].num_samples = audio_frame_length;
  audio_frames[i].audio_id = i;
}
NvAR_AudioFrameData audio_frame_data;
audio_frame_data.audio_frames = audio_frames.data();
audio_frame_data.num_audio_channels = num_audio_tracks;
// Set up active audio IDs
std::vector<uint32_t> active_audio_ids = {0, 1}; // Audio tracks 0 and 1 are active
NvAR_ActiveAudioIds active_audio_ids_data;
active_audio_ids_data.active_audio_ids = active_audio_ids.data();
active_audio_ids_data.num_active_audio_ids = active_audio_ids.size();
// Set up output tracking data
std::vector<NvAR_SpeakerTrackingBBox> output_boxes(max_num_output_identities);
NvAR_ActiveSpeakerTrackingData output_tracking_data;
output_tracking_data.boxes = output_boxes.data();
output_tracking_data.max_boxes = max_num_output_identities;
// Set up control parameters
uint32_t new_shot = NVARACTIVESPEAKERDETECTION_DETECT_SHOT_CHANGE; // Auto-detect shot changes
uint32_t flush = 0; // Normal processing
uint32_t ready = 0; // Output ready status
// Set inputs and outputs
NvAR_SetObject(speaker_detection_handle, NvAR_Parameter_Input(Image), &input_image, sizeof(NvCVImage));
NvAR_SetObject(speaker_detection_handle, NvAR_Parameter_Input(AudioFrameData), &audio_frame_data, sizeof(NvAR_AudioFrameData));
NvAR_SetObject(speaker_detection_handle, NvAR_Parameter_Input(ActiveAudioIDs), &active_audio_ids_data, sizeof(NvAR_ActiveAudioIds));
NvAR_SetU32Array(speaker_detection_handle, NvAR_Parameter_Input(NewShot), &new_shot, 1);
NvAR_SetU32Array(speaker_detection_handle, NvAR_Parameter_Input(Flush), &flush, 1);
NvAR_SetObject(speaker_detection_handle, NvAR_Parameter_Output(ActiveSpeakerTrackingData), &output_tracking_data, sizeof(NvAR_ActiveSpeakerTrackingData));
NvAR_SetU32Array(speaker_detection_handle, NvAR_Parameter_Output(Ready), &ready, 1);
// Run the feature
NvAR_Run(speaker_detection_handle);
// Check if output is ready
if (ready != 0) {
  // Process the tracking results
  for (unsigned int i = 0; i < output_tracking_data.num_boxes; ++i) {
    const NvAR_SpeakerTrackingBBox& bbox = output_tracking_data.boxes[i];
    // bbox.bbox contains the face bounding box (x, y, width, height)
    // bbox.tracking_id contains the unique person identifier
    // bbox.audio_id contains the associated audio track ID (-1 if none)
    // bbox.confidence contains the detection confidence score
    // bbox.is_speaking indicates if this person is currently speaking
  }
}
Multiple Audio Tracks#
The Active Speaker Detection feature supports processing multiple audio tracks simultaneously to determine which person is speaking.
Each audio track is identified by a unique audio_id and can be selectively enabled using the ActiveAudioIDs input parameter.
Each audio track must contain speech from one and only one speaker.
The following example shows how to configure multiple audio tracks:
// Set up 4 audio tracks with different audio IDs
std::vector<NvAR_AudioFrame> audio_frames(4);
std::vector<std::vector<float>> audio_buffers(4);
for (unsigned int i = 0; i < 4; ++i) {
  audio_buffers[i].resize(audio_frame_length);
  audio_frames[i].audio_data = audio_buffers[i].data();
  audio_frames[i].num_samples = audio_frame_length;
  audio_frames[i].audio_id = i; // Audio IDs: 0, 1, 2, 3
}
// Only activate audio tracks 0 and 2 for this frame
std::vector<uint32_t> active_audio_ids = {0, 2};
NvAR_ActiveAudioIds active_audio_ids_data;
active_audio_ids_data.active_audio_ids = active_audio_ids.data();
active_audio_ids_data.num_active_audio_ids = 2;
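When the set of active speakers changes from frame to frame, the list of active audio IDs can be rebuilt each frame. The following standalone sketch shows one way to do this from hypothetical per-track activity flags (for example, derived from your diarization metadata); the helper is illustrative and not an SDK call:

```cpp
#include <cstdint>
#include <vector>

// Illustrative helper (not an SDK call): derive the active audio IDs for
// the current frame from per-track "speaker is talking" flags. Tracks whose
// flag is false are omitted, which has the same effect as feeding silence
// for them.
std::vector<uint32_t> BuildActiveAudioIds(const std::vector<bool>& track_active) {
  std::vector<uint32_t> ids;
  for (uint32_t i = 0; i < static_cast<uint32_t>(track_active.size()); ++i) {
    if (track_active[i]) {
      ids.push_back(i);
    }
  }
  return ids;
}
```

With four tracks where only tracks 0 and 2 are talking, this yields the {0, 2} list used in the example above.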
Shot Change Detection#
The feature includes automatic shot change detection to maintain tracking consistency across video cuts.
The NewShot input parameter controls this behavior:
NVARACTIVESPEAKERDETECTION_DETECT_SHOT_CHANGE (255): Automatic shot change detection.
NVARACTIVESPEAKERDETECTION_SHOT_UNCHANGED (0): No shot change occurred.
NVARACTIVESPEAKERDETECTION_SHOT_CHANGED (1): Manual indication of shot change.
// Automatic shot change detection (default)
uint32_t new_shot = NVARACTIVESPEAKERDETECTION_DETECT_SHOT_CHANGE;
// Manual shot change indication
new_shot = NVARACTIVESPEAKERDETECTION_SHOT_CHANGED;
// Explicitly indicate no shot change
new_shot = NVARACTIVESPEAKERDETECTION_SHOT_UNCHANGED;
Output Synchronization#
The feature uses an internal buffer to accumulate frames before producing output.
The Ready output parameter indicates when valid output data are available.
The output is offset by the number of calls made to NvAR_Run() before the first output is generated.
uint32_t ready = 0;
NvAR_SetU32Array(speaker_detection_handle, NvAR_Parameter_Output(Ready), &ready, 1);
// After NvAR_Run()
if (ready != 0) {
  // Output data in output_tracking_data is valid
  // Process the tracking results...
} else {
  // Still accumulating frames, output not ready yet
}
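Because the output lags the input by the number of NvAR_Run() calls made before Ready first becomes nonzero, it can be useful to queue the index of each submitted frame and dequeue one entry per ready output, so each result can be matched back to the input frame it describes. A minimal sketch of that bookkeeping (OutputOffsetTracker is a hypothetical helper, not an SDK type):

```cpp
#include <cstdint>
#include <deque>

// Illustrative bookkeeping (not part of the SDK): record the index of each
// frame submitted for processing, and pop the oldest entry whenever an
// output becomes ready. The popped index is the input frame the output
// corresponds to.
struct OutputOffsetTracker {
  std::deque<uint32_t> pending;

  // Call before each NvAR_Run() with the index of the submitted frame.
  void OnSubmit(uint32_t frame_index) { pending.push_back(frame_index); }

  // Call when Ready is nonzero after NvAR_Run(); returns the index of the
  // input frame that the just-produced output describes.
  uint32_t OnOutputReady() {
    const uint32_t frame_index = pending.front();
    pending.pop_front();
    return frame_index;
  }
};
```

For example, if the first output appears on the fourth call to NvAR_Run(), the first dequeued index is frame 0, the next is frame 1, and so on.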
Flush Mode#
After reaching the end of input data, use flush mode to retrieve final outputs without providing new input frames:
// Enable flush mode at end of stream
uint32_t flush = 1;
NvAR_SetU32Array(speaker_detection_handle, NvAR_Parameter_Input(Flush), &flush, 1);
// Continue calling NvAR_Run() until the return code of the feature is NVCV_ERR_EOF
Variable Audio Frame Size#
When the number of audio samples per video frame is not a whole number (for example, a sample rate of 16,000 Hz at 30 FPS yields 533.33 samples per frame), the number of audio samples should alternate to maintain proper audio-video synchronization.
The following example demonstrates how to calculate the exact number of audio samples for each frame:
// Configuration
const float video_fps = 30.0f;
const uint32_t sample_rate = 16000;
const double frame_duration = 1.0 / video_fps; // 0.0333... seconds per frame
// State tracking
uint32_t last_audio_end_sample = 0;
uint32_t frame_count = 0;
// For each video frame
while (processing_video) {
  const double frame_timestamp = frame_count * frame_duration;
  // Calculate the exact audio window for this frame. Round the window end
  // to the nearest sample; plain truncation can drift by one sample due to
  // floating-point error.
  const uint32_t audio_start_sample = last_audio_end_sample;
  const uint32_t audio_end_sample = static_cast<uint32_t>(
      (frame_timestamp + frame_duration) * sample_rate + 0.5);
  const uint32_t audio_frame_length = audio_end_sample - audio_start_sample;
  // Update tracking for the next frame
  last_audio_end_sample = audio_end_sample;
  // Set the same number of samples for all tracks
  for (unsigned int track_idx = 0; track_idx < num_audio_tracks; ++track_idx) {
    audio_frames[track_idx].num_samples = audio_frame_length;
  }
  // audio_frame_length alternates between 533 and 534 samples
  // to maintain precise audio-video synchronization
  frame_count++;
  // ... set other inputs and run NvAR_Run() ...
}
This approach ensures the following result:
Frame 0: 533 samples (indices 0 to 532)
Frame 1: 534 samples (indices 533 to 1066)
Frame 2: 533 samples (indices 1067 to 1599)
Frame 3: 533 samples (indices 1600 to 2132)
And so on…
The sample range is half-open: audio_start_sample is inclusive and audio_end_sample is exclusive, so each sample belongs to exactly one frame and no samples are reused. Do not set audio_start_sample to last_audio_end_sample + 1; that would skip one sample per frame and break synchronization.
The alternating sample counts maintain audio-video synchronization by accumulating the fractional parts of the samples-per-frame calculation.
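As a sanity check of the windowing described above, the following standalone sketch recomputes the per-frame sample counts for 30 FPS video and 16 kHz audio. It rounds the window end to the nearest sample rather than truncating, which reproduces the frame pattern listed above (533, 534, 533, 533, ...) and avoids one-sample drift from floating-point error:

```cpp
#include <cstdint>
#include <vector>

// Standalone sketch: recompute the audio window length for each video
// frame. Rounding the window end to the nearest sample keeps the
// accumulated fractional samples from drifting.
std::vector<uint32_t> ComputeFrameLengths(uint32_t num_frames, float video_fps,
                                          uint32_t sample_rate) {
  const double frame_duration = 1.0 / video_fps;
  std::vector<uint32_t> lengths;
  uint32_t last_audio_end_sample = 0;
  for (uint32_t frame = 0; frame < num_frames; ++frame) {
    const double frame_timestamp = frame * frame_duration;
    const uint32_t audio_end_sample = static_cast<uint32_t>(
        (frame_timestamp + frame_duration) * sample_rate + 0.5);
    lengths.push_back(audio_end_sample - last_audio_end_sample);
    last_audio_end_sample = audio_end_sample;
  }
  return lengths;
}
```

Because the windows are half-open and contiguous, the counts over any full second of video sum to exactly the sample rate (16,000 here), confirming that no samples are dropped or duplicated.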