NvDCF parameter tuning guide¶
A general introduction to the NvDCF tracker can be found in the DeepStream SDK Plugin Manual; this guide is intended to help troubleshoot and fine-tune the config parameters with some concrete examples and details. We first review the config parameters for the NvDCF tracker, grouped by the category of issue they affect (e.g., accuracy, performance), and then present a set of troubleshooting guides.
Visual Feature Types and Feature Sizes¶
The NvDCF tracker can use multiple types of visual features such as Histogram of Oriented Gradients (HOG) and ColorNames. HOG features consist of 18 channels based on the number of bins for different orientations, while ColorNames has 10 channels. If both feature types are used (by enabling both useHog and useColorNames), the total number of channels would be 28. Using more feature channels allows the algorithm to track more accurately, but it increases the computational complexity and reduces performance.
We can also configure the precision of the visual features, which are internally represented with either 8-bit or 16-bit values. Setting useHighPrecisionFeature enables the 16-bit representation, which allows more accurate tracking at the cost of increased computational load.
In addition to the types and precision of the visual features, we can configure the number of pixels used to represent an object for each feature channel. The corresponding parameter is featureImgSizeLevel, and its range is from 1 to 5. Each level between 1 and 5 corresponds to 12x12, 18x18, 24x24, 36x36, and 48x48, respectively, for each feature channel. Therefore, if one uses both HOG and ColorNames with featureImgSizeLevel: 5, then the total number of pixels that represents an object would be 28x48x48.
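As an illustration, the feature-related parameters discussed above might be set as follows in the tracker config file (the values here are illustrative, not recommended defaults; consult the config file shipped with your DeepStream version):

```yaml
# Visual feature configuration (illustrative values)
useColorNames: 1            # enable ColorNames features (10 channels)
useHog: 1                   # enable HOG features (18 channels); both enabled => 28 channels
useHighPrecisionFeature: 1  # use 16-bit feature representation for higher accuracy
featureImgSizeLevel: 3      # 1..5 => 12x12, 18x18, 24x24, 36x36, 48x48 pixels per channel
```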
One thing to note is that the visual features for an object are extracted from a region whose size is a bit larger than the object region in order to make sure that the object in the next frame appears within the region where the visual features are extracted even when there is a movement by the object between frames. This region is referred to as the search region, whose size is defined by adding a degree of padding to the object bbox. More details can be found in the section for NvDCF tracker in DeepStream Plugin Manual.
Increasing the search region size lowers the probability of missing the object in the next frame. However, given a fixed feature size (i.e., featureImgSizeLevel), increasing SearchRegionPaddingScale effectively decreases the number of pixels belonging to the object, resulting in a lower-resolution object representation in the visual features and thus potentially lower tracking accuracy. On the other hand, if the movement of an object between two consecutive frames is expected to be small, the object is highly likely to appear within even a smaller search region in the next frame. This is especially the case if a state estimator is enabled (by setting useTrackSmoothing) and its predictions are reasonably accurate, because the search region is then defined at the predicted location in the next frame.
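The search-region settings above might look like the following sketch (illustrative values):

```yaml
# Search region configuration (illustrative values)
SearchRegionPaddingScale: 1  # larger padding => larger search region, but fewer pixels per object
useTrackSmoothing: 1         # enable the state estimator so the search region is centered
                             # at the predicted location in the next frame
```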
To deal with false positives and false negatives from the detector, NvDCF utilizes two strategies called Late Activation and Shadow Tracking (more details can be found in DeepStream SDK Plugin Manual). In addition to the config parameters related to those strategies, there are a few config parameters that affect when a tracker for a new object is created and terminated.
When an object detected by the detector meets the minimum qualifications (i.e., target candidacy) specified by the following parameters, a new tracker is instantiated for the object:
If spurious false detections are observed with low detector confidence values, one can increase the minimum detector confidence (i.e., minDetectorConfidence) to filter them out. If one wishes to start tracking objects only when they are fully visible in the scene, for example, one can set minDetectorBboxVisibilityTobeTracked to a higher value like 1.0.
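The target-candidacy parameters above might be set as follows (illustrative values):

```yaml
# Target candidacy for tracker creation (illustrative values)
minDetectorConfidence: 0.3                 # ignore detections below this confidence
minDetectorBboxVisibilityTobeTracked: 0.0  # set to 1.0 to track only fully visible objects
```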
Once a tracker is instantiated for a new object, it initially tracks the object in a temporary mode until further criteria are met during a period specified by probationAge in terms of the number of frames. During this probationary period, whenever the tracker bbox is not matched with a detector bbox or the tracker confidence falls below minTrackerConfidence, the shadow tracking age (an internal variable) is incremented. If the shadow tracking age exceeds a predefined threshold (i.e., earlyTerminationAge), the tracker is terminated, effectively eliminating false positives. If a tracker survives the probationary period and is matched with a detector bbox in the last frame, it is promoted to Active mode and starts reporting its outputs to the downstream module.
If a higher rate of false detections is expected, one may consider increasing probationAge and/or decreasing earlyTerminationAge for a stricter creation policy. If the detector confidence for false detections is expected to be low while that of true positives is high, one can set minDetectorConfidence accordingly to filter out the false detections.
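A stricter creation policy along these lines might look like this (illustrative values):

```yaml
# Late Activation / probationary period (illustrative values)
probationAge: 5         # frames a new tracker stays in the temporary mode
earlyTerminationAge: 1  # terminate during probation once shadow tracking age exceeds this
```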
In addition to the aforementioned early termination during the probationary period, there are certain criteria that trigger termination of a tracker.
Once a tracker starts tracking in Active mode, its status changes to Inactive mode if:
1. The tracking confidence falls below minTrackerConfidence, or
2. It is not matched with a detector bbox during data association.

The shadow tracking age is incremented every frame while a tracker is in Inactive mode. If the tracker gets matched again with a detector bbox, the shadow tracking age is reset to zero and the tracker changes back to Active mode (meaning that its outputs are again reported to the downstream). However, if the shadow tracking age exceeds a predefined threshold (i.e., maxShadowTrackingAge), the tracker is terminated.
A side note is that even if the tracker is currently in Inactive mode, if the tracker confidence is higher than minTrackingConfidenceDuringInactive, the output of the tracker will still be reported. Thus, if this parameter is set too low, then the tracker may report low-confidence results, which often results in false positives.
For more robust tracking, one may increase the value for maxShadowTrackingAge because it will allow an object to be re-associated even after missed detections over multiple consecutive frames. However, in case that the visual appearance of the object undergoes a significant change during the missed detections (e.g., prolonged occlusions), the learned correlation filter may not yield a high correlation response when the object reappears and is re-associated. In addition, increasing maxShadowTrackingAge would allow a tracker to live longer (i.e., more delayed termination), resulting in an increased number of trackers at a given time, which would in turn increase the computational load.
If a tracker’s bbox size gets smaller than minTargetBboxSize or its visibility gets lower than minVisibiilty4Tracking, it will also be terminated.
NvDCF performs a periodic check with an interval of targetDuplicateRunInterval in terms of frames to see if there are any duplicate trackers. Duplicate trackers may exist if multiple trackers are instantiated for the same object or if multiple objects are merged or self-occluded. If the intersection-over-union (IOU) of the tracker bboxes exceeds the predefined threshold of minIou4TargetDuplicate during the periodic checks, the tracker with the highest tracking age will remain and the others will be terminated.
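Putting the shadow-tracking and termination parameters together, a config sketch might look like this (illustrative values):

```yaml
# Shadow tracking and termination policy (illustrative values)
minTrackerConfidence: 0.6                 # below this, an Active tracker goes Inactive
minTrackingConfidenceDuringInactive: 0.9  # Inactive trackers above this still report outputs
maxShadowTrackingAge: 30                  # terminate after this many unmatched frames
minTargetBboxSize: 10                     # terminate if bbox gets smaller than this
minVisibiilty4Tracking: 0.1               # terminate if visibility drops below this
targetDuplicateRunInterval: 5             # check for duplicate trackers every N frames
minIou4TargetDuplicate: 0.9               # IOU above this marks trackers as duplicates
```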
State Estimation¶
An object tracker in NvDCF maintains a set of states for each target, including:
* Target location (in 2D camera coordinates)
* Target scale
* Velocity of the location and scale states
As the name suggests, the moving-average estimator (MAE) estimates each state by employing an exponential moving average with predefined learning rates:
* trackExponentialSmoothingLr_loc for the location states
* trackExponentialSmoothingLr_scale for the scale state
* trackExponentialSmoothingLr_velocity for the velocity states of location and scale
More specifically, just like the Kalman filter (KF), the MAE also makes predictions based on its estimated states and fuses them with the new measurement with the aforementioned learning rates.
If you trust more on the new measurements than the predictions (similar to the case where the measurements have lower uncertainty in KF), you may set a higher value for the learning rate parameters.
If the objects are expected to have smooth, monotonic motions in the scene, one may set a low value for trackExponentialSmoothingLr_velocity. Similarly, if the scale of the objects is expected to be relatively static, one may set a low value for trackExponentialSmoothingLr_scale as well.
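The MAE learning rates might appear as follows in the tracker config (illustrative values; each state is fused as a standard exponential moving average, state ← (1 − lr) · prediction + lr · measurement):

```yaml
# Moving-average estimator learning rates (illustrative values)
trackExponentialSmoothingLr_loc: 0.5        # location; higher => trust new measurements more
trackExponentialSmoothingLr_scale: 0.3      # scale; low if object scale is mostly static
trackExponentialSmoothingLr_velocity: 0.05  # velocity; low if motions are smooth/monotonic
```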
The Kalman Filter (KF) implementation in NvDCF mostly follows a standard 2D KF approach, where the user defines the process noise and measurement noise based on the expected uncertainty level. One additional consideration is that users can set different measurement noise values for detector bboxes and tracker bboxes. The detector may produce false negatives, or it may not run on every frame; in such cases, each object tracker performs its own localization using the learned correlation filter, and those results are used to update the Kalman filter. From the KF's point of view, the measurements therefore come from two different sources: one from the detector and the other from the tracker. The measurement noises do not have to be equal, which is why users can specify them separately: kfMeasurementNoiseVar4Det for detector measurements and kfMeasurementNoiseVar4Trk for tracker measurements.
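A sketch of the per-source measurement noise settings (illustrative values):

```yaml
# Kalman filter measurement noise (illustrative values)
kfMeasurementNoiseVar4Det: 0.1  # lower variance => trust detector bboxes more
kfMeasurementNoiseVar4Trk: 0.2  # tracker-localized bboxes are typically noisier
```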
Matching Score Weights¶
In the video frames where the detector performs inference (referred to as the inference frames), NvDCF performs data association to match a set of object trackers to a set of detector bboxes. To reduce the matching cost in terms of computation, it is essential to define a small set of good candidates for each object tracker. That is where the criteria for matching candidacy come in. For each tracker bbox, only the detector bboxes that qualify in terms of the minimum size similarity, IOU, and visual similarity are marked as candidates for matching. The visual similarity is computed based on the correlation response of the tracker at the detector bbox location. If one wants to consider only the detector bboxes that have at least some overlap with the tracker bbox, for example, then minMatchingScore4Iou would need to be set to a non-zero value. One can tune the other parameters in a similar manner.
Given a set of candidate detector bboxes for each tracker, the data association matrix is constructed between the detector bbox set and the tracker set, with the matching scores as the values of the matrix elements. The matching score for each element is computed as a weighted sum of (1) the visual similarity, (2) the size similarity, (3) the IOU score, and (4) the inverse of the tracker age. The corresponding weights are, respectively:
* matchingScoreWeight4VisualSimilarity
* matchingScoreWeight4SizeSimilarity
* matchingScoreWeight4Iou
* matchingScoreWeight4Age
The resulting matching score is put into the data association matrix only if it exceeds a predefined threshold (i.e., minMatchingScore4Overall).
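The candidacy thresholds and score weights above might be configured as follows (illustrative values):

```yaml
# Data association (illustrative values)
minMatchingScore4Overall: 0.0          # minimum weighted score to enter the association matrix
minMatchingScore4Iou: 0.1              # non-zero => require some bbox overlap for candidacy
matchingScoreWeight4VisualSimilarity: 0.8
matchingScoreWeight4SizeSimilarity: 0.6
matchingScoreWeight4Iou: 0.4
matchingScoreWeight4Age: 0.2
```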
DCF Core Tuning¶
Apart from the types and sizes of the visual features employed, there are parameters governing how the classifier for each object is learned and updated in the DCF framework, which also affect accuracy.
DCF Filter Learning¶
DCF-based trackers learn a classifier (i.e., discriminative correlation filter) for each object with implicit positive and negative samples. Such learned classifiers are updated on-the-fly for temporal consistency with a predefined learning rate (i.e., filterLr). If the visual appearance of the target objects is expected to vary quickly over time, one may employ a high learning rate for better adaptation of the correlation filter to the changing appearance. However, there is a risk of learning the background quickly as well, resulting in potentially more frequent track drift.
As the NvDCF tracker utilizes multi-channel visual features, a natural question is how to merge those channels into the final correlation response. NvDCF employs an adaptive channel weight approach in which the importance of each channel is examined on-the-fly, and the corresponding channel weights are updated over time with a predefined learning rate (i.e., filterChannelWeightsLr). The tuning strategy for this learning rate is similar to that for filterLr described above.
When a correlation filter is learned, gaussianSigma determines how tightly the resulting filter is fit to the positive sample. A lower value means a tighter fit, but it may result in overfitting. On the other hand, a higher value may result in lower discriminative power in the learned filter.
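The DCF learning parameters above might be set as follows (illustrative values):

```yaml
# DCF filter learning (illustrative values)
filterLr: 0.075              # higher => adapt faster to appearance changes,
                             # but risk learning the background (track drift)
filterChannelWeightsLr: 0.1  # learning rate for the adaptive channel weights
gaussianSigma: 0.75          # lower => tighter fit to the positive sample (risk of overfitting)
```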
Instance-Awareness¶
* lambda_ia
* maxInstanceNum_ia
When similar-looking objects are nearby, the learned correlation filter for an object may produce a high response for the nearby objects as well. To mitigate this issue, the NvDCF tracker can use the nearby objects as a regularization factor when instance-awareness is enabled (by setting useInstanceAwareness). The weight of the regularization term can be configured by lambda_ia, and the maximum number of nearby objects to be considered can be configured by maxInstanceNum_ia.
Theoretically, enabling instance-awareness would increase the discriminative power against nearby objects; however, it may decrease the discriminative power against background in turn. Thus, enabling this feature may not be beneficial to all the cases.
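An instance-awareness configuration along these lines might look like this (illustrative values):

```yaml
# Instance-awareness (illustrative values)
useInstanceAwareness: 1  # regularize filter learning with nearby objects
lambda_ia: 2             # weight of the instance-awareness regularization term
maxInstanceNum_ia: 4     # max number of nearby objects considered
```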