Using the AR Features#
This section provides information about how to use the AR features.
Face Detection and Tracking#
This section provides information about how to use the Face Detection and Tracking feature.
Face Detection for Static Frames (Images)#
To obtain detected bounding boxes, you can explicitly instantiate and run the face detection feature, which takes an image buffer as input.
The following example runs the Face Detection AR feature with an input image buffer and output memory to hold bounding boxes:
//Set input image buffer
NvAR_SetObject(faceDetectHandle, NvAR_Parameter_Input(Image), &inputImageBuffer, sizeof(NvCVImage));
//Set output memory for bounding boxes
NvAR_BBoxes output_bboxes{};
output_bboxes.boxes = new NvAR_Rect[25];
output_bboxes.max_boxes = 25;
NvAR_SetObject(faceDetectHandle, NvAR_Parameter_Output(BoundingBoxes), &output_bboxes, sizeof(NvAR_BBoxes));
//Optional: If desired, set memory for bounding-box confidence values
NvAR_Run(faceDetectHandle);
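After NvAR_Run returns, the num_boxes field of NvAR_BBoxes reports how many faces were written into the boxes array. The following is a minimal sketch of reading the results, assuming the output memory that was set above:
//Sketch: after NvAR_Run, read how many faces were detected and their boxes
for (unsigned int i = 0; i < output_bboxes.num_boxes; ++i) {
  const NvAR_Rect& face_box = output_bboxes.boxes[i];
  //face_box.x, face_box.y, face_box.width, and face_box.height describe the detection
}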
Face Tracking for Temporal Frames (Videos)#
If Temporal is enabled, such as when you process video frames instead of still images, only one face is returned. The largest face is selected in the first frame, and this face is subsequently tracked over the following frames.
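For example, the Temporal configuration property can be enabled on the face detection handle before the feature is loaded; the following is a minimal sketch, assuming the simple on/off usage of the property:
//Enable temporal tracking of faces in video input (set before NvAR_Load)
NvAR_SetU32(faceDetectHandle, NvAR_Parameter_Config(Temporal), 1);
NvAR_Load(faceDetectHandle);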
However, explicitly calling the face detection feature is not the only way to obtain a bounding box that denotes detected faces. For more information about how to use the Landmark Detection AR feature and return a face bounding box, see Landmark Detection and Tracking.
Landmark Detection and Tracking#
This section provides information about how to use the Landmark Detection and Tracking feature.
Landmark Detection for Static Frames (Images)#
Typically, the input to the landmark detection feature is an input image and a batch of bounding boxes. Currently, the maximum batch size is 1. These boxes denote the regions of the image that contain the faces on which you want to run landmark detection.
The following example runs the Landmark Detection AR feature after obtaining bounding boxes from Face Detection:
//Set input image buffer
NvAR_SetObject(landmarkDetectHandle, NvAR_Parameter_Input(Image),
&inputImageBuffer, sizeof(NvCVImage));
//Pass output bounding boxes from face detection as an input on which
//landmark detection is to be run
NvAR_SetObject(landmarkDetectHandle,
NvAR_Parameter_Input(BoundingBoxes), &output_bboxes,
sizeof(NvAR_BBoxes));
//Set landmark detection mode: Performance (0; default) or Quality (1)
unsigned int mode = 0; // Choose performance mode
NvAR_SetU32(landmarkDetectHandle, NvAR_Parameter_Config(Mode), mode);
//Set output buffer to hold detected facial keypoints
std::vector<NvAR_Point2f> facial_landmarks;
facial_landmarks.assign(OUTPUT_SIZE_KPTS, {0.f, 0.f});
NvAR_SetObject(landmarkDetectHandle, NvAR_Parameter_Output(Landmarks),
facial_landmarks.data(),sizeof(NvAR_Point2f));
NvAR_Run(landmarkDetectHandle);
Alternative Usage of Landmark Detection#
As described in Configuration Properties for Landmark Tracking, the Landmark Detection AR feature supports some optional parameters that determine how the feature can be run.
If bounding boxes are not provided to the Landmark Detection AR feature as inputs, face detection is automatically run on the input image, and the largest face bounding box is selected on which to run landmark detection.
If BoundingBoxes is set as an output property, the property is populated
with the selected bounding box that contains the face on which
landmark detection was run. Landmarks is not an optional property;
to run this feature explicitly, this property must be set with a
user-provided output buffer.
Landmark Tracking for Temporal Frames (Videos)#
Additionally, if Temporal is enabled, such as when you process a video stream and face detection is run explicitly, only one bounding box is supported as an input for landmark detection.
When face detection is not run explicitly, that is, when you provide an input image instead of a bounding box, the largest detected face is automatically selected. The detected face and landmarks are then tracked across temporally related frames as an optimization.
Note
The internally determined bounding box can be queried from this feature, but is not required for the feature to run.
The following example uses the Landmark Detection AR feature to obtain landmarks directly from the image, without first explicitly running Face Detection:
//Set input image buffer
NvAR_SetObject(landmarkDetectHandle, NvAR_Parameter_Input(Image),
&inputImageBuffer, sizeof(NvCVImage));
//Set output memory for landmarks
std::vector<NvAR_Point2f> facial_landmarks;
facial_landmarks.assign(batchSize * OUTPUT_SIZE_KPTS, {0.f, 0.f});
NvAR_SetObject(landmarkDetectHandle, NvAR_Parameter_Output(Landmarks),
facial_landmarks.data(),sizeof(NvAR_Point2f));
//Set landmark detection mode: Performance (0; default) or Quality (1)
unsigned int mode = 0; // Choose performance mode
NvAR_SetU32(landmarkDetectHandle, NvAR_Parameter_Config(Mode), mode);
//Optional: If desired, set memory for bounding box
NvAR_BBoxes output_bboxes{};
output_bboxes.boxes = new NvAR_Rect[25];
output_bboxes.max_boxes = 25;
NvAR_SetObject(landmarkDetectHandle,
NvAR_Parameter_Output(BoundingBoxes), &output_bboxes,
sizeof(NvAR_BBoxes));
//Optional: If desired, set memory for pose, landmark confidence, or
//even bounding box confidence
NvAR_Run(landmarkDetectHandle);
Eye Contact#
This feature estimates the gaze of a person from an eye patch that was
extracted by using landmarks and redirects the eyes to make the person
look at the camera in a permissible range of eye and head angles. The
feature also supports a mode where the estimation can be obtained
without redirection. The eye contact feature can be invoked by using the
GazeRedirection feature ID.
The Eye Contact feature has the following options:
Gaze Estimation
Gaze Redirection
In this release, gaze estimation and redirection are supported for only one face in the frame.
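Both modes share one feature handle, which is created from the GazeRedirection feature ID and loaded after its configuration properties are set. The following is a minimal sketch of that setup, where modelPath is a placeholder for the directory that contains the model files:
//Create the Eye Contact feature by using the GazeRedirection feature ID
NvAR_FeatureHandle gazeRedirectHandle{};
NvAR_Create(NvAR_Feature_GazeRedirection, &gazeRedirectHandle);
//modelPath is a placeholder for the model directory
NvAR_SetString(gazeRedirectHandle, NvAR_Parameter_Config(ModelDir), modelPath);
//Set the remaining configuration properties, such as GazeRedirect, and then load
NvAR_Load(gazeRedirectHandle);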
Gaze Estimation#
The estimation of gaze requires face detection and landmarks as input.
The inputs to the gaze estimator are an input image buffer and buffers
to hold facial landmarks and confidence scores. The output of gaze
estimation is the gaze vector (pitch, yaw) values in radians. A float
array must be set as the output buffer to hold estimated gaze. The
GazeRedirect parameter must be set to false.
The following example runs Gaze Estimation with an input image buffer and output memory to hold the estimated gaze vector:
bool bGazeRedirect=false;
NvAR_SetU32(gazeRedirectHandle, NvAR_Parameter_Config(GazeRedirect),
bGazeRedirect);
//Set input image buffer
NvAR_SetObject(gazeRedirectHandle, NvAR_Parameter_Input(Image),
&inputImageBuffer, sizeof(NvCVImage));
//Set output memory for gaze vector
float gaze_angles_vector[2];
NvAR_SetF32Array(gazeRedirectHandle,
NvAR_Parameter_Output(OutputGazeVector), gaze_angles_vector, batchSize
* 2);
//Optional: Set output memory for landmarks, head pose, head
//translation, and gaze direction
std::vector<NvAR_Point2f> facial_landmarks;
facial_landmarks.assign(batchSize * OUTPUT_SIZE_KPTS, {0.f, 0.f});
NvAR_SetObject(gazeRedirectHandle, NvAR_Parameter_Output(Landmarks),
facial_landmarks.data(),sizeof(NvAR_Point2f));
NvAR_Quaternion head_pose;
NvAR_SetObject(gazeRedirectHandle, NvAR_Parameter_Output(HeadPose),
&head_pose, sizeof(NvAR_Quaternion));
float head_translation[3] = {0.f};
NvAR_SetF32Array(gazeRedirectHandle,
NvAR_Parameter_Output(OutputHeadTranslation), head_translation,
batchSize * 3);
NvAR_Run(gazeRedirectHandle);
Gaze Redirection#
Gaze Redirection takes the same inputs as Gaze Estimation.
In addition to the outputs of gaze estimation, an output image buffer of
the same size as the input image buffer must be set to store the
gaze-redirected image. The gaze is redirected to look at the camera
within a certain range of gaze angles and head poses. Outside this
range, the feature disengages. Head pose, head translation, and gaze
direction can optionally be set as outputs. The GazeRedirect parameter
must be set to true.
The following example runs Gaze Redirection with an input image buffer, output memory to hold the estimated gaze vector, and an output image buffer to hold the gaze-redirected image:
bool bGazeRedirect=true;
NvAR_SetU32(gazeRedirectHandle, NvAR_Parameter_Config(GazeRedirect),
bGazeRedirect);
//Set input image buffer
NvAR_SetObject(gazeRedirectHandle, NvAR_Parameter_Input(Image),
&inputImageBuffer, sizeof(NvCVImage));
//Set output memory for gaze vector
float gaze_angles_vector[2];
NvAR_SetF32Array(gazeRedirectHandle,
NvAR_Parameter_Output(OutputGazeVector), gaze_angles_vector, batchSize
* 2);
//Set output image buffer
NvAR_SetObject(gazeRedirectHandle, NvAR_Parameter_Output(Image),
&outputImageBuffer, sizeof(NvCVImage));
NvAR_Run(gazeRedirectHandle);
Randomized Look Away#
A continuous redirection of gaze to look at the camera might give a
perception of staring, which some users might find unnatural or
undesired. To occasionally break eye contact, we provide optional
randomized look-aways in gaze redirection. While the gaze is still
redirected toward the camera within the range of operation, enabling
look-away makes the user occasionally break gaze lock with the camera
through a micro-movement of the eyes at randomly chosen time intervals.
The EnableLookAway parameter must be set to true to enable this feature.
Additionally, LookAwayOffsetMax, LookAwayIntervalMin, and
LookAwayIntervalRange are optional parameters that tune the extent and
frequency of the look-away. For a detailed description and default
settings of these parameters, see Configuration Properties for Eye Contact.
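The following is a minimal sketch of enabling and tuning look-away, assuming these properties are set as unsigned integer values before the feature is loaded; the tuning variables are placeholders, and the exact types and defaults are listed in Configuration Properties for Eye Contact:
//Enable occasional randomized look-aways (assumed to be set before NvAR_Load)
NvAR_SetU32(gazeRedirectHandle, NvAR_Parameter_Config(EnableLookAway), 1);
//Optional tuning; lookAwayOffsetMax, lookAwayIntervalMin, and
//lookAwayIntervalRange are placeholder variables
NvAR_SetU32(gazeRedirectHandle, NvAR_Parameter_Config(LookAwayOffsetMax), lookAwayOffsetMax);
NvAR_SetU32(gazeRedirectHandle, NvAR_Parameter_Config(LookAwayIntervalMin), lookAwayIntervalMin);
NvAR_SetU32(gazeRedirectHandle, NvAR_Parameter_Config(LookAwayIntervalRange), lookAwayIntervalRange);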
Range Control#
The gaze redirection feature redirects the eyes to look at the camera within a certain range of head and eye motion in which eye contact is desired and looks natural. Beyond this range, the feature gradually transitions away from looking at the camera toward the estimated gaze and eventually turns off in a seamless manner. To support various use cases and user preferences, range parameters let you control the range of gaze angles and head poses in which gaze redirection occurs and the range in which the transition occurs before redirection is turned off. These are optional parameters.
GazePitchThresholdLow and GazeYawThresholdLow define the
parameters for the pitch and yaw angles of the estimated gaze within
which gaze is redirected toward the camera. Beyond these angles,
redirected gaze transitions away from the camera and toward the
estimated gaze, turning off redirection beyond GazePitchThresholdHigh
and GazeYawThresholdHigh, respectively. Similarly, for head pose,
HeadPitchThresholdLow and HeadYawThresholdLow define the parameters for
pitch and yaw angles of the head pose within which gaze is redirected
toward the camera. Beyond these angles, redirected gaze transitions
away from the camera and toward the estimated gaze, turning off
redirection beyond HeadPitchThresholdHigh and HeadYawThresholdHigh.
For a detailed description and default settings of these parameters, see
Configuration Properties for Eye Contact.
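The following is a minimal sketch of overriding the gaze thresholds, assuming these properties take float values in degrees; the threshold variables are placeholders, and the head-pose thresholds are set the same way:
//Optional: Override the default redirection and transition ranges
//(gazePitchLow, gazePitchHigh, gazeYawLow, and gazeYawHigh are placeholders)
NvAR_SetF32(gazeRedirectHandle, NvAR_Parameter_Config(GazePitchThresholdLow), gazePitchLow);
NvAR_SetF32(gazeRedirectHandle, NvAR_Parameter_Config(GazePitchThresholdHigh), gazePitchHigh);
NvAR_SetF32(gazeRedirectHandle, NvAR_Parameter_Config(GazeYawThresholdLow), gazeYawLow);
NvAR_SetF32(gazeRedirectHandle, NvAR_Parameter_Config(GazeYawThresholdHigh), gazeYawHigh);
//HeadPitchThresholdLow/High and HeadYawThresholdLow/High are set the same way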
3D Body Pose Tracking#
This feature relies on temporal information to track the person in the scene, where the keypoint information from the previous frame is used to estimate the keypoints of the next frame.
3D Body Pose Tracking consists of the following parts:
Body Detection
3D Keypoint Detection
The feature supports single or multiple people in the frame and both full-body and upper-body images and videos.
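The examples below assume that the Body Detection and 3D Keypoint Detection feature handles have already been created and loaded; the following is a minimal sketch of that setup, using the BodyDetection and BodyPoseEstimation feature IDs, where modelPath is a placeholder for the model directory:
//Create and load the Body Detection and 3D Keypoint Detection features
NvAR_FeatureHandle bodyDetectHandle{};
NvAR_FeatureHandle keyPointDetectHandle{};
NvAR_Create(NvAR_Feature_BodyDetection, &bodyDetectHandle);
NvAR_Create(NvAR_Feature_BodyPoseEstimation, &keyPointDetectHandle);
//modelPath is a placeholder for the model directory
NvAR_SetString(bodyDetectHandle, NvAR_Parameter_Config(ModelDir), modelPath);
NvAR_SetString(keyPointDetectHandle, NvAR_Parameter_Config(ModelDir), modelPath);
NvAR_Load(bodyDetectHandle);
NvAR_Load(keyPointDetectHandle);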
3D Body Pose Tracking for Static Frames (Images)#
To obtain the bounding boxes that encapsulate the people in the scene, you can explicitly instantiate and run body detection, passing the image buffer as input.
The following example runs Body Detection with an input image buffer and output memory to hold bounding boxes:
//Set input image buffer
NvAR_SetObject(bodyDetectHandle, NvAR_Parameter_Input(Image),
&inputImageBuffer, sizeof(NvCVImage));
//Set output memory for bounding boxes
NvAR_BBoxes output_bboxes{};
output_bboxes.boxes = new NvAR_Rect[25];
output_bboxes.max_boxes = 25;
NvAR_SetObject(bodyDetectHandle, NvAR_Parameter_Output(BoundingBoxes),
&output_bboxes, sizeof(NvAR_BBoxes));
//Optional: If desired, set memory for bounding-box confidence values
NvAR_Run(bodyDetectHandle);
The input to 3D Keypoint Detection is an input image. It outputs the 2D keypoints, 3D keypoints, keypoint confidence scores, and the bounding box that encapsulates the person.
The following example runs the 3D Body Pose Detection AR feature:
//Set input image buffer
NvAR_SetObject(keyPointDetectHandle, NvAR_Parameter_Input(Image),
&inputImageBuffer, sizeof(NvCVImage));
//Pass output bounding boxes from body detection as an input on which
//keypoint detection is to be run
NvAR_SetObject(keyPointDetectHandle,
NvAR_Parameter_Input(BoundingBoxes), &output_bboxes,
sizeof(NvAR_BBoxes));
//Set output buffer to hold detected keypoints
std::vector<NvAR_Point2f> keypoints;
std::vector<NvAR_Point3f> keypoints3D;
std::vector<NvAR_Point3f> jointAngles;
std::vector<float> keypoints_confidence;
//Get the number of keypoints
unsigned int numKeyPoints;
NvAR_GetU32(keyPointDetectHandle, NvAR_Parameter_Config(NumKeyPoints),
&numKeyPoints);
keypoints.assign(batchSize * numKeyPoints, {0.f, 0.f});
keypoints3D.assign(batchSize * numKeyPoints, {0.f, 0.f, 0.f});
jointAngles.assign(batchSize * numKeyPoints, {0.f, 0.f, 0.f});
keypoints_confidence.assign(batchSize * numKeyPoints, 0.f);
NvAR_SetObject(keyPointDetectHandle, NvAR_Parameter_Output(KeyPoints),
keypoints.data(), sizeof(NvAR_Point2f));
NvAR_SetObject(keyPointDetectHandle,
NvAR_Parameter_Output(KeyPoints3D), keypoints3D.data(),
sizeof(NvAR_Point3f));
NvAR_SetF32Array(keyPointDetectHandle,
NvAR_Parameter_Output(KeyPointsConfidence),
keypoints_confidence.data(), batchSize * numKeyPoints);
NvAR_SetObject(keyPointDetectHandle,
NvAR_Parameter_Output(JointAngles), jointAngles.data(),
sizeof(NvAR_Point3f));
//Set output memory for bounding boxes
NvAR_BBoxes output_bboxes{};
output_bboxes.boxes = new NvAR_Rect[25];
output_bboxes.max_boxes = 25;
NvAR_SetObject(keyPointDetectHandle,
NvAR_Parameter_Output(BoundingBoxes), &output_bboxes,
sizeof(NvAR_BBoxes));
NvAR_Run(keyPointDetectHandle);
3D Body Pose Tracking for Temporal Frames (Videos)#
The feature relies on temporal information to track the person in the scene. The keypoint information from the previous frame is used to estimate the keypoints of the next frame.
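For example, the Temporal configuration property can be enabled on the keypoint detection handle before the feature is loaded; the following is a minimal sketch, assuming the simple on/off usage of the property:
//Enable temporal tracking of keypoints across video frames (set before NvAR_Load)
NvAR_SetU32(keyPointDetectHandle, NvAR_Parameter_Config(Temporal), 1);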
The following example uses the 3D Body Pose Tracking AR feature to obtain 3D Body Pose Keypoints directly from the image:
//Set input image buffer
NvAR_SetObject(keyPointDetectHandle, NvAR_Parameter_Input(Image),
&inputImageBuffer, sizeof(NvCVImage));
//Pass output bounding boxes from body detection as an input on which
//keypoint detection is to be run
NvAR_SetObject(keyPointDetectHandle,
NvAR_Parameter_Input(BoundingBoxes), &output_bboxes,
sizeof(NvAR_BBoxes));
//Set output buffer to hold detected keypoints
std::vector<NvAR_Point2f> keypoints;
std::vector<NvAR_Point3f> keypoints3D;
std::vector<NvAR_Point3f> jointAngles;
std::vector<float> keypoints_confidence;
// Get the number of keypoints
unsigned int numKeyPoints;
NvAR_GetU32(keyPointDetectHandle, NvAR_Parameter_Config(NumKeyPoints),
&numKeyPoints);
keypoints.assign(batchSize * numKeyPoints, {0.f, 0.f});
keypoints3D.assign(batchSize * numKeyPoints, {0.f, 0.f, 0.f});
jointAngles.assign(batchSize * numKeyPoints, {0.f, 0.f, 0.f});
keypoints_confidence.assign(batchSize * numKeyPoints, 0.f);
NvAR_SetObject(keyPointDetectHandle, NvAR_Parameter_Output(KeyPoints),
keypoints.data(), sizeof(NvAR_Point2f));
NvAR_SetObject(keyPointDetectHandle,
NvAR_Parameter_Output(KeyPoints3D), keypoints3D.data(),
sizeof(NvAR_Point3f));
NvAR_SetF32Array(keyPointDetectHandle,
NvAR_Parameter_Output(KeyPointsConfidence),
keypoints_confidence.data(), batchSize * numKeyPoints);
NvAR_SetObject(keyPointDetectHandle,
NvAR_Parameter_Output(JointAngles), jointAngles.data(),
sizeof(NvAR_Point3f));
//Set output memory for bounding boxes
NvAR_BBoxes output_bboxes{};
output_bboxes.boxes = new NvAR_Rect[25];
output_bboxes.max_boxes = 25;
NvAR_SetObject(keyPointDetectHandle,
NvAR_Parameter_Output(BoundingBoxes), &output_bboxes,
sizeof(NvAR_BBoxes));
NvAR_Run(keyPointDetectHandle);
Multi-Person Tracking for 3D Body Pose Tracking#
The feature provides the ability to track multiple people in the following ways:
In the scene across different frames.
When they leave the scene and enter the scene again.
When they are completely occluded by an object or another person and reappear (controlled using Shadow Tracking Age).
Shadow Tracking Age is a parameter that represents the period of time
during which a target is still tracked in the background, even when the
target is not associated with a detector object. For each frame in which
a target is not associated with a detector object, shadowTrackingAge, an
internal variable of the target, is incremented. After the target is
associated with a detector object, shadowTrackingAge is reset to zero.
When shadowTrackingAge reaches the Shadow Tracking Age limit, the target
is discarded and is no longer tracked. This parameter is measured in
frames; the default is 90.
Probation Age is the length of the probationary period. After an object reaches this age, it is considered valid and is assigned an ID. This helps suppress false positives, where spurious objects are detected for only a few frames. This parameter is measured in frames; the default is 10.
Maximum Targets Tracked is the maximum number of targets to be tracked, which includes the targets that are active in the frame and the targets in shadow-tracking mode. When you select this value, keep both active and inactive targets in mind. The minimum is 1, and the default is 30.
Note
Currently, only eight people in the scene are actively tracked. More than eight people can appear over the course of the video, but at most eight people can be tracked in any given frame. Temporal mode is not supported for Multi-Person Tracking. The batch size should be set to 8 when Multi-Person Tracking is enabled.
The following example uses the 3D Body Pose Tracking AR feature to enable multi-person tracking and obtain the tracking ID for each person:
// Set input image buffer
NvAR_SetObject(keyPointDetectHandle, NvAR_Parameter_Input(Image),
&inputImageBuffer, sizeof(NvCVImage));
// Enable Multi-Person Tracking
NvAR_SetU32(keyPointDetectHandle, NvAR_Parameter_Config(TrackPeople),
bEnablePeopleTracking);
// Set Shadow Tracking Age
NvAR_SetU32(keyPointDetectHandle, NvAR_Parameter_Config(ShadowTrackingAge),
shadowTrackingAge);
// Set Probation Age
NvAR_SetU32(keyPointDetectHandle, NvAR_Parameter_Config(ProbationAge),
probationAge);
// Set Maximum Targets to be tracked
NvAR_SetU32(keyPointDetectHandle, NvAR_Parameter_Config(MaxTargetsTracked),
maxTargetsTracked);
// Set output buffer to hold detected keypoints
std::vector<NvAR_Point2f> keypoints;
std::vector<NvAR_Point3f> keypoints3D;
std::vector<NvAR_Point3f> jointAngles;
std::vector<float> keypoints_confidence;
// Get the number of keypoints
unsigned int numKeyPoints;
NvAR_GetU32(keyPointDetectHandle, NvAR_Parameter_Config(NumKeyPoints),
&numKeyPoints);
keypoints.assign(batchSize * numKeyPoints, {0.f, 0.f});
keypoints3D.assign(batchSize * numKeyPoints, {0.f, 0.f, 0.f});
jointAngles.assign(batchSize * numKeyPoints, {0.f, 0.f, 0.f});
keypoints_confidence.assign(batchSize * numKeyPoints, 0.f);
NvAR_SetObject(keyPointDetectHandle, NvAR_Parameter_Output(KeyPoints),
keypoints.data(), sizeof(NvAR_Point2f));
NvAR_SetObject(keyPointDetectHandle, NvAR_Parameter_Output(KeyPoints3D),
keypoints3D.data(), sizeof(NvAR_Point3f));
NvAR_SetF32Array(keyPointDetectHandle,
NvAR_Parameter_Output(KeyPointsConfidence),
keypoints_confidence.data(), batchSize * numKeyPoints);
NvAR_SetObject(keyPointDetectHandle, NvAR_Parameter_Output(JointAngles),
jointAngles.data(), sizeof(NvAR_Point3f));
// Set output memory for tracking bounding boxes
NvAR_TrackingBBoxes output_tracking_bboxes{};
std::vector<NvAR_TrackingBBox> output_tracking_bbox_data;
output_tracking_bbox_data.assign(maxTargetsTracked, {0.f, 0.f, 0.f, 0.f, 0});
output_tracking_bboxes.boxes = output_tracking_bbox_data.data();
output_tracking_bboxes.max_boxes = (uint8_t)maxTargetsTracked;
NvAR_SetObject(keyPointDetectHandle,
NvAR_Parameter_Output(TrackingBoundingBoxes), &output_tracking_bboxes,
sizeof(NvAR_TrackingBBoxes));
NvAR_Run(keyPointDetectHandle);
Facial Expression Estimation#
This section provides information about how to use the Facial Expression Estimation feature.
Facial Expression Estimation for Static Frames (Images)#
Typically, the input to the Facial Expression Estimation feature is an input image and a set of detected landmark points that correspond to the face on which we want to estimate face expression coefficients.
The following example shows the typical usage of this feature, where the detected facial keypoints from the Landmark Detection feature are passed as input to this feature:
//Set facial keypoints from Landmark Detection as an input
err = NvAR_SetObject(faceExpressionHandle,
NvAR_Parameter_Input(Landmarks), facial_landmarks.data(),
sizeof(NvAR_Point2f));
//Set output memory for expression coefficients
unsigned int expressionCount;
err = NvAR_GetU32(faceExpressionHandle,
NvAR_Parameter_Config(ExpressionCount), &expressionCount);
float* expressionCoeffs = new float[expressionCount];
err = NvAR_SetF32Array(faceExpressionHandle,
NvAR_Parameter_Output(ExpressionCoefficients), expressionCoeffs,
expressionCount);
//Set output memory for pose rotation quaternion
NvAR_Quaternion* pose = new NvAR_Quaternion();
err = NvAR_SetObject(faceExpressionHandle, NvAR_Parameter_Output(Pose),
pose, sizeof(NvAR_Quaternion));
//Optional: If desired, set memory for bounding boxes and their confidences
err = NvAR_Run(faceExpressionHandle);
Alternative Usage of the Facial Expression Estimation Feature#
Like the alternative usage of the Landmark Detection feature, the Facial Expression Estimation feature can be used to determine the detected face bounding box, the facial keypoints, and a 3D face mesh and its rendering parameters.
When you provide an input image instead of the facial keypoints of a
face, the face and the facial keypoints are automatically detected and
are used to run the expression estimation. This way, if BoundingBoxes,
Landmarks, or both are set as optional output properties for this
feature, these properties are populated with the bounding box that
contains the face and the detected facial keypoints.
ExpressionCoefficients and Pose are not optional properties for
this feature. To run the feature, these properties must be set with
user-provided output buffers. If this feature is also run without
providing facial keypoints as an input, the path to which the
ModelDir configuration property points must also contain the face
and landmark detection TRT package files. Optionally, CUDAStream and
the Temporal flag can be set for those features.
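The following is a minimal sketch of that configuration, where modelPath is a placeholder and the CUDA stream is assumed to have been created with NvAR_CudaStreamCreate; the Temporal filter bits are covered in the next subsection:
//ModelDir must also contain the face and landmark detection TRT packages
//when facial keypoints are not provided as an input (modelPath is a placeholder)
NvAR_SetString(faceExpressionHandle, NvAR_Parameter_Config(ModelDir), modelPath);
//Optional: Provide a CUDA stream for the feature
NvAR_SetCudaStream(faceExpressionHandle, NvAR_Parameter_Config(CUDAStream), stream);
NvAR_Load(faceExpressionHandle);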
The expression coefficients can be used to drive the expressions of an avatar.
Note
The facial keypoints or the face bounding box that were determined internally can be queried from this feature but are not required for the feature to run.
The following example uses the Facial Expression Estimation feature to obtain the face expression coefficients directly from the image, without explicitly running Landmark Detection or Face Detection:
//Set input image buffer instead of providing facial keypoints
NvAR_SetObject(faceExpressionHandle, NvAR_Parameter_Input(Image),
&inputImageBuffer, sizeof(NvCVImage));
//Set output memory for expression coefficients
unsigned int expressionCount;
err = NvAR_GetU32(faceExpressionHandle,
NvAR_Parameter_Config(ExpressionCount), &expressionCount);
float* expressionCoeffs = new float[expressionCount];
err = NvAR_SetF32Array(faceExpressionHandle,
NvAR_Parameter_Output(ExpressionCoefficients), expressionCoeffs,
expressionCount);
//Set output memory for pose rotation quaternion
NvAR_Quaternion* pose = new NvAR_Quaternion();
err = NvAR_SetObject(faceExpressionHandle, NvAR_Parameter_Output(Pose),
pose, sizeof(NvAR_Quaternion));
//Optional: Set facial keypoints as an output
NvAR_SetObject(faceExpressionHandle, NvAR_Parameter_Output(Landmarks),
facial_landmarks.data(),sizeof(NvAR_Point2f));
//Optional: Set output memory for bounding boxes or other parameters,
//such as pose, bounding box confidence, and landmarks confidence
NvAR_Run(faceExpressionHandle);
Facial Expression Estimation Tracking for Temporal Frames (Videos)#
If the Temporal flag is set and face and landmark detection are
run internally, these features are optimized for temporally related
frames. This means that face and facial keypoints are tracked across
frames and, if requested, only one bounding box is returned as an
output. If the Face Detection and Landmark Detection features are
explicitly used, they need their own Temporal flags to be set.
This flag also affects the Facial Expression Estimation feature through
the NVAR_TEMPORAL_FILTER_FACIAL_EXPRESSIONS,
NVAR_TEMPORAL_FILTER_FACIAL_GAZE, and
NVAR_TEMPORAL_FILTER_ENHANCE_EXPRESSIONS bits.
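For example, a subset of these filter bits can be combined into the Temporal configuration value before the feature is loaded; the following is a minimal sketch:
//Combine the desired temporal filter bits and apply them before NvAR_Load
unsigned int temporalFlags = NVAR_TEMPORAL_FILTER_FACIAL_EXPRESSIONS |
                             NVAR_TEMPORAL_FILTER_FACIAL_GAZE;
NvAR_SetU32(faceExpressionHandle, NvAR_Parameter_Config(Temporal), temporalFlags);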
LipSync#
This section provides information about how to use the LipSync feature. LipSync uses an audio input to modify a video of a person, animating the person’s lips and lower face to match the audio.
LipSync Processing#
The LipSync feature takes synchronized audio samples and video frames as inputs, and produces modified video frames as output. The following example demonstrates how to process video and audio frames with the LipSync feature:
// Allocate source image
NvCVImage source_image;
NvCVImage_Realloc(&source_image, source_width, source_height, NVCV_BGR, NVCV_U8, NVCV_CHUNKY, NVCV_GPU, 1);
// Allocate source audio frame
std::vector<float> audio_frame(audio_frame_length);
// Allocate generated image (same resolution as the source image)
NvCVImage gen_image;
NvCVImage_Realloc(&gen_image, source_width, source_height, NVCV_BGR, NVCV_U8, NVCV_CHUNKY, NVCV_GPU, 1);
// Load is required before setting images
NvAR_Load(lipsync_han);
// Set input and output images
NvAR_SetObject(lipsync_han, NvAR_Parameter_Input(Image), &source_image, sizeof(NvCVImage));
NvAR_SetF32Array(lipsync_han, NvAR_Parameter_Input(AudioFrameBuffer), audio_frame.data(), audio_frame.size());
NvAR_SetObject(lipsync_han, NvAR_Parameter_Output(Image), &gen_image, sizeof(NvCVImage));
// Run the feature
NvAR_Run(lipsync_han);
Input/Output Latency#
There is a fixed amount of latency between the input and output video frames; that is, at the beginning of a video, the feature reads a fixed number of input frames before it generates the first output video frame. At the end of a video, the feature generates the final output frames without requiring new input video frames.
The following example shows how to query the latency between input and output frames.
// Number of initial input video frames to process before retrieving an output frame.
uint32_t latency_frame_cnt = 0;
NvAR_GetU32(lipsync_han, NvAR_Parameter_Config(NumInitialFrames), &latency_frame_cnt);
An output parameter indicates whether the first output frame has been generated.
unsigned int output_ready = 0;
NvAR_SetU32Array(lipsync_han, NvAR_Parameter_Output(Ready), &output_ready, 1);
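The following is a minimal per-frame loop sketch under the setup above; ReadNextFrameAndAudio and ConsumeOutputFrame are hypothetical helpers that supply the next synchronized video frame and audio buffer and consume a generated frame:
// Per-frame processing loop (sketch)
while (ReadNextFrameAndAudio(&source_image, audio_frame)) {  // hypothetical helper
  NvAR_Run(lipsync_han);
  if (output_ready) {
    // The generated frame in gen_image is valid once Ready is nonzero
    ConsumeOutputFrame(&gen_image);  // hypothetical helper
  }
}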