NvMultiObjectTracker Parameter Tuning Guide¶
A general introduction to the NvMultiObjectTracker library can be found in the DeepStream SDK Plugin Manual. This guide is intended to help troubleshoot and fine-tune config parameters with concrete examples and details.
This section explores the configuration parameters and their potential impact on both the performance and the accuracy of multi-object tracking.
Visual Feature Types and Feature Sizes¶
Visual feature types
The NvDCF tracker can use multiple types of visual features, such as Histogram of Oriented Gradients (HOG) and ColorNames. If both features are used (by setting useColorNames: 1 and useHog: 1), the total number of channels is 28. Using more feature channels generally makes tracking more accurate, but it also increases the computational complexity and reduces performance.
In addition to the types of visual features, we can configure the number of pixels used to represent an object in each feature channel. The corresponding parameter is featureImgSizeLevel, which ranges from 1 to 5. Levels 1 through 5 correspond to 12x12, 18x18, 24x24, 36x36, and 48x48 pixels per feature channel, respectively. Therefore, with both HOG and ColorNames and featureImgSizeLevel: 5, the visual features that represent an object have a dimension of 28x48x48.
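To get a feel for how the feature type and size settings drive the amount of data per object, the following minimal Python sketch computes the feature dimensions described above. The level-to-size table and the total of 28 channels come from this guide; the per-type split into 18 HOG channels and 10 ColorNames channels, as well as the function names, are assumptions made for illustration.

```python
# Side length (in pixels) per feature channel for featureImgSizeLevel 1..5,
# as listed in this guide.
FEATURE_SIDE_BY_LEVEL = {1: 12, 2: 18, 3: 24, 4: 36, 5: 48}

# ASSUMPTION: per-type channel counts. The guide only states the combined
# total of 28; the 18/10 split is assumed here for illustration.
HOG_CHANNELS = 18
COLORNAMES_CHANNELS = 10


def feature_shape(use_hog, use_color_names, level):
    """Return (channels, height, width) of the per-object feature tensor."""
    if level not in FEATURE_SIDE_BY_LEVEL:
        raise ValueError("featureImgSizeLevel must be between 1 and 5")
    channels = 0
    if use_hog:
        channels += HOG_CHANNELS
    if use_color_names:
        channels += COLORNAMES_CHANNELS
    side = FEATURE_SIDE_BY_LEVEL[level]
    return channels, side, side


def feature_element_count(use_hog, use_color_names, level):
    """Total number of feature values used to represent one object."""
    channels, height, width = feature_shape(use_hog, use_color_names, level)
    return channels * height * width


if __name__ == "__main__":
    # Compare the per-object feature size across all levels with both types on.
    for level in sorted(FEATURE_SIDE_BY_LEVEL):
        print(level, feature_shape(True, True, level),
              feature_element_count(True, True, level))
```

Because the element count grows with the square of the side length, moving up one level can be considerably more expensive than enabling an additional feature type.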
Note that the visual features for an object are extracted from a region slightly larger than the object region, so that the object still appears within the region in the next frame even if it moves between frames. This region is referred to as the search region, and its size is defined by adding padding to the object bbox. More details can be found in the NvDCF tracker section of the DeepStream Plugin Manual.
Increasing the search region size lowers the probability of missing the object in the next frame. However, given a fixed feature size (i.e., featureImgSizeLevel), increasing searchRegionPaddingScale effectively decreases the number of pixels that belong to the object, which lowers the resolution of the object representation in the visual features and may reduce tracking accuracy. Conversely, if the movement of an object between two consecutive frames is expected to be small, the object is highly likely to appear in the search region in the next frame even with a smaller search region. This is especially the case if a state estimator is enabled and its prediction is reasonably accurate, because the search region is defined at the predicted location in the next frame.
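The trade-off between padding and object resolution can be made concrete with a small calculation. The sketch below estimates how many feature-map pixels are left for the object itself once the search region is resampled to a fixed feature size. It assumes that the padding added to each bbox dimension is searchRegionPaddingScale times the square root of the bbox area; that formula and the function name are assumptions for illustration, so the Plugin Manual should be consulted for the exact definition.

```python
import math


def object_pixels_in_feature_map(bbox_w, bbox_h, padding_scale, feature_side):
    """Approximate object extent (width, height) in feature-map pixels.

    ASSUMPTION: the search region adds padding_scale * sqrt(w * h) to each
    bbox dimension, and the whole region is resampled to
    feature_side x feature_side.
    """
    pad = padding_scale * math.sqrt(bbox_w * bbox_h)
    region_w = bbox_w + pad
    region_h = bbox_h + pad
    obj_w = feature_side * bbox_w / region_w
    obj_h = feature_side * bbox_h / region_h
    return obj_w, obj_h


if __name__ == "__main__":
    # Same 100x100 object and 24x24 feature size, increasing padding.
    for scale in (1, 2, 3):
        print(scale, object_pixels_in_feature_map(100, 100, scale, 24))
```

Under these assumptions, a 100x100 object at a 24x24 feature size spans about 12 feature pixels per side with a padding scale of 1, but only about 6 with a padding scale of 3.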
Rather than reducing the visual feature types and sizes, and thereby lowering the accuracy setting of the NvDCF tracker, users can explore increasing the detection interval (i.e., interval in the PGIE config). Thanks to its accuracy and robustness, the NvDCF tracker allows users to increase the detection interval without sacrificing much accuracy. The performance gain from a longer detection interval is higher when a heavier neural network model is used for object detection.
Video Frame Size for Tracker¶
The video frame size configured in the tracker plugin has some impact on performance, as a higher-resolution video frame takes longer to transfer between memories. However, if the frame resolution is set lower in the hope of achieving higher performance, the negative impact on accuracy may outweigh the performance gain. It is therefore recommended to use at least 960x544 resolution (for 1080p source resolution) to minimize accuracy degradation.
To deal with false positives and false negatives from the detector, the NvMultiObjectTracker library utilizes two strategies called Late Activation and Shadow Tracking (more details can be found in DeepStream SDK Plugin Manual). In addition to the config parameters related to those strategies, there are a few config parameters that affect when a tracker for a new object is created and terminated.
Target Creation Policy¶
If an object detected by a detector meets the minimum qualifications (i.e., target candidacy) specified by the following, a new tracker is instantiated for the object:
If spurious false detections are observed with lower detector confidence values, one can increase the minimum detector confidence (i.e., minDetectorConfidence) to filter them out. If the maximum IOU score of a newly detected object against all existing targets is lower than minIouDiff4NewTarget, a new target tracker is created to track the object. Thus, to further suppress the creation of duplicate bboxes on the same target that may have slightly different bbox sizes, minIouDiff4NewTarget can be set lower.
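The two checks described above can be sketched in a few lines of Python. This is only an illustration of the candidacy logic as described in this guide, not the library's implementation; the function names, the (left, top, width, height) bbox layout, and the use of a greater-or-equal comparison for the confidence are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (left, top, width, height)."""
    inter_w = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    inter_h = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union


def is_new_target_candidate(det_box, det_conf, existing_boxes,
                            min_detector_confidence,
                            min_iou_diff_4_new_target):
    """Return True if a detection qualifies for a new tracker (hypothetical helper)."""
    # Reject low-confidence detections first.
    if det_conf < min_detector_confidence:
        return False
    # A new tracker is created only if the detection does not overlap
    # any existing target too much.
    max_iou = max((iou(det_box, box) for box in existing_boxes), default=0.0)
    return max_iou < min_iou_diff_4_new_target


if __name__ == "__main__":
    targets = [(0, 0, 100, 100)]
    # Near-duplicate of an existing target: suppressed.
    print(is_new_target_candidate((5, 5, 100, 100), 0.9, targets, 0.5, 0.5))
    # Far from every existing target: accepted.
    print(is_new_target_candidate((300, 300, 50, 50), 0.9, targets, 0.5, 0.5))
```

With this logic, lowering the IOU threshold makes the overlap test harder to pass, which is why a lower minIouDiff4NewTarget suppresses more near-duplicate bboxes.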
Once a tracker is instantiated for a new object, it initially tracks the object in a temporary mode (i.e., Tentative mode) until further criteria are met during a period specified by probationAge, expressed as a number of frames. During this probationary period, whenever the tracker bbox is not matched with a detector bbox or the tracker confidence drops below minTrackerConfidence, the shadow tracking age (an internal variable) is incremented. If the shadow tracking age reaches a predefined threshold (i.e., earlyTerminationAge), the tracker is terminated prematurely, effectively eliminating the false positive.
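The probationary period can be pictured as a small state machine. The sketch below is a simplified model of the behavior described above, not the library's actual logic. Two points are assumptions: the tracker is promoted to Active mode once its age reaches probationAge without early termination, and the shadow tracking age is not reset during probation. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TentativeTracker:
    age: int = 0                    # frames since the tracker was created
    shadow_tracking_age: int = 0    # frames with a miss or low confidence
    mode: str = "Tentative"


def step_probation(trk, matched, tracker_confidence, *,
                   probation_age, early_termination_age,
                   min_tracker_confidence):
    """Advance a tentative tracker by one frame and return its mode."""
    if trk.mode != "Tentative":
        return trk.mode
    trk.age += 1
    # A missed match or a low tracker confidence counts against the tracker.
    if not matched or tracker_confidence < min_tracker_confidence:
        trk.shadow_tracking_age += 1
    if trk.shadow_tracking_age >= early_termination_age:
        trk.mode = "Terminated"     # treated as a false positive
    elif trk.age >= probation_age:
        trk.mode = "Active"         # ASSUMPTION: survived the probation
    return trk.mode


if __name__ == "__main__":
    params = dict(probation_age=3, early_termination_age=2,
                  min_tracker_confidence=0.5)
    trk = TentativeTracker()
    for matched in (True, False, False):
        print(step_probation(trk, matched, 0.9, **params))
```

In this model, a larger probationAge gives a spurious detection more frames in which to accumulate misses, and a smaller earlyTerminationAge terminates it after fewer misses.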
If a higher rate of false detections is expected, one may consider increasing probationAge and/or decreasing earlyTerminationAge for a stricter creation policy. If the expected detector confidence for the false detections is low while that of the true positives is high, one can set minDetectorConfidence accordingly to filter out the false detections.
Target Termination Policy¶
In addition to the aforementioned early termination policy during the probationary period, there are certain criteria to be met when a tracker is terminated. Once a tracker starts tracking in Active mode, its status changes to Inactive mode if:
The tracker confidence is lower than
It is not matched with a detector bbox during data association.
The shadow tracking age is incremented in every frame in which a target is not associated with a detector object. If the tracker is matched again with a detector bbox, the shadow tracking age is reset to zero, and a tracker in Inactive mode returns to Active mode (meaning that the tracker outputs are reported downstream). However, if the shadow tracking age exceeds a predefined threshold (i.e., maxShadowTrackingAge), the tracker is terminated.
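The Active/Inactive transitions and the shadow tracking age can be summarized in a short sketch. This is a simplified model based only on the description above. The two Inactive conditions are assumed to be combined with a logical OR, and the confidence threshold is passed in as a hypothetical confidence_threshold argument because the text above does not name the parameter at that point.

```python
def step_tracker(shadow_age, matched, tracker_confidence, *,
                 confidence_threshold, max_shadow_tracking_age):
    """Advance one frame; return (mode, shadow_age) for an established tracker."""
    if matched:
        shadow_age = 0              # re-association resets the counter
    else:
        shadow_age += 1             # one more frame without a detector match
    if shadow_age > max_shadow_tracking_age:
        return "Terminated", shadow_age
    # ASSUMPTION: either condition alone puts the tracker in Inactive mode.
    if not matched or tracker_confidence < confidence_threshold:
        return "Inactive", shadow_age
    return "Active", shadow_age


if __name__ == "__main__":
    age = 0
    # Two missed detections, a re-association, then three more misses.
    for matched in (False, False, True, False, False, False):
        mode, age = step_tracker(age, matched, 0.9,
                                 confidence_threshold=0.5,
                                 max_shadow_tracking_age=2)
        print(matched, mode, age)
```

In this model the tracker survives up to maxShadowTrackingAge consecutive unmatched frames and is terminated on the next one.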
For more robust tracking, one may increase maxShadowTrackingAge, because it allows an object to be re-associated even after missed detections over multiple consecutive frames. However, if the visual appearance of the object changes significantly during the missed detections (e.g., prolonged occlusions), the learned correlation filter may not yield a high correlation response when the object reappears. In addition, increasing maxShadowTrackingAge lets a tracker live longer (i.e., delays its termination), which increases the number of trackers held in memory at a given time and, in turn, the computational load.
An object tracker in NvMultiObjectTracker library maintains a set of states for a target like below:
Target location (in 2D camera coordinates)
The Kalman Filter (KF) implementation in the NvMultiObjectTracker library mostly follows a standard 2D KF approach in which the user defines the process noise and measurement noise based on the expected level of uncertainty. If the object has relatively simple, linear motion, one may set the process noise lower than the measurement noise, effectively putting more trust in the prediction. If the object is expected to have more dynamic motion or abrupt state changes, it is advisable to set the measurement noise lower; otherwise, the estimate may lag when the prediction is not correct.
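The effect of the two noise settings can be seen in the Kalman gain, which decides how far each measurement pulls the estimate away from the prediction. The sketch below uses a scalar random-walk Kalman filter, which is a deliberate simplification of the library's 2D filter; the function name and the numeric values are illustrative assumptions.

```python
def steady_state_gain(process_noise_var, measurement_noise_var, iterations=200):
    """Kalman gain a scalar random-walk filter settles to after many frames."""
    p = 1.0      # initial estimate variance (arbitrary starting point)
    gain = 0.0
    for _ in range(iterations):
        p_pred = p + process_noise_var                    # predict: uncertainty grows
        gain = p_pred / (p_pred + measurement_noise_var)  # weight of the measurement
        p = (1.0 - gain) * p_pred                         # update: uncertainty shrinks
    return gain


if __name__ == "__main__":
    # Process noise much lower than measurement noise: trust the prediction.
    print(steady_state_gain(0.01, 1.0))
    # Measurement noise much lower than process noise: trust the measurement.
    print(steady_state_gain(1.0, 0.01))
```

A gain close to 0 means the filter mostly follows its own prediction, which gives smooth output but can lag behind abrupt motion. A gain close to 1 means the filter follows the measurements closely.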
An additional design choice is to let users set different measurement noise for the detector bbox and the tracker bbox when a visual tracker module (i.e., NvDCF) is enabled. The detector can always produce false negatives, and there can be video frames where the inference for object detection is skipped. In such cases, each object tracker performs its own localization using the learned correlation filter, and the results are used to update the Kalman filter. Thus, from the KF’s point of view, the measurements come from two different sources: the detector and the tracker. When measurements come from multiple sources, they need to be fused with appropriate measurement models (i.e., uncertainty modeling for the measurements) to estimate the target states properly.
The measurement noises should be configured according to the accuracy characteristics of the detector and the tracker. When a highly accurate model is used for object detection, one may set measurementNoiseVar4Detector lower than measurementNoiseVar4Tracker, effectively putting more trust in the detector’s measurement than in the tracker’s prediction/localization.
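As a rough illustration of source-dependent measurement noise, the following sketch applies a scalar Kalman update with a variance chosen according to where the measurement came from. It is a one-dimensional simplification with hypothetical function names and made-up numbers, meant only to show how a lower variance for one source makes the estimate follow that source more closely.

```python
def kf_update(x_pred, p_pred, z, measurement_noise_var):
    """Scalar Kalman measurement update; returns (estimate, variance)."""
    gain = p_pred / (p_pred + measurement_noise_var)
    x = x_pred + gain * (z - x_pred)
    p = (1.0 - gain) * p_pred
    return x, p


def update_from_source(x_pred, p_pred, z, source,
                       noise_var_detector, noise_var_tracker):
    """Pick the measurement variance according to the measurement source."""
    if source == "detector":
        r = noise_var_detector   # plays the role of measurementNoiseVar4Detector
    else:
        r = noise_var_tracker    # plays the role of measurementNoiseVar4Tracker
    return kf_update(x_pred, p_pred, z, r)


if __name__ == "__main__":
    # Predicted position 100 with variance 4; both sources report 110.
    print(update_from_source(100.0, 4.0, 110.0, "detector", 1.0, 16.0))
    print(update_from_source(100.0, 4.0, 110.0, "tracker", 1.0, 16.0))
```

With these example values, the same measurement moves the estimate most of the way to 110 when it comes from the detector, but only a small part of the way when it comes from the tracker.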
Matching Score Weights
In the video frames where the detector performs inference (referred to as inference frames), the NvDCF tracker performs data association to match a set of detector objects to a set of existing targets. To reduce the computational cost of matching, it is essential to define a small set of good candidates for each object tracker. That is where the criteria for matching candidacy come in. For each tracker bbox, only the detector bboxes that qualify in terms of minimum size similarity, IOU, and visual similarity are marked as candidates for matching. The visual similarity is computed from the correlation response of the tracker at the detector bbox location. For example, to consider only the detector bboxes that have at least some overlap with the tracker bbox, minMatchingScore4Iou needs to be set to a non-zero value. The other parameters can be tuned in a similar manner.
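The candidacy test described above amounts to a per-criterion threshold check. The sketch below is a hypothetical illustration: min_iou stands in for minMatchingScore4Iou, while the other two threshold names and the greater-or-equal comparison are assumptions and not confirmed library behavior.

```python
def is_match_candidate(iou_score, size_similarity, visual_similarity, *,
                       min_iou, min_size_similarity, min_visual_similarity):
    """Return True if a detector bbox is a matching candidate for a tracker bbox."""
    return (iou_score >= min_iou
            and size_similarity >= min_size_similarity
            and visual_similarity >= min_visual_similarity)


if __name__ == "__main__":
    # A detector bbox with no overlap but similar size and appearance.
    scores = dict(iou_score=0.0, size_similarity=0.9, visual_similarity=0.8)
    # With a zero IOU threshold it is still a candidate.
    print(is_match_candidate(**scores, min_iou=0.0,
                             min_size_similarity=0.5, min_visual_similarity=0.5))
    # With a non-zero IOU threshold it is excluded.
    print(is_match_candidate(**scores, min_iou=0.01,
                             min_size_similarity=0.5, min_visual_similarity=0.5))
```

Stricter thresholds shrink the candidate set for each tracker, which reduces the matching cost but risks excluding the correct detection after fast motion or a large change in bbox size.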
Given a set of candidate detector bboxes for each tracker, the data association matrix is constructed between the detector bbox set and the tracker set with the matching scores as the value for the elements in the matrix. The matching score for each element is computed as a weighted sum of:
The visual similarity
The size similarity, and
IOU score with the corresponding weights in
The resulting matching score is put into the data association matrix only if the score exceeds a predefined threshold (i.e.,
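The weighted-sum scoring and the overall threshold can be sketched as follows. All names here (the weight arguments, min_overall_score, and the dictionary layout of the candidates) are hypothetical and chosen for illustration, as are the strict greater-than comparison and the use of 0.0 for entries that are not filled in.

```python
def matching_score(visual, size, iou, w_visual, w_size, w_iou):
    """Weighted sum of the three similarity terms."""
    return w_visual * visual + w_size * size + w_iou * iou


def build_association_matrix(candidates, num_trackers, num_detections,
                             weights, min_overall_score):
    """Fill a trackers x detections matrix with scores above the threshold.

    candidates maps (tracker_index, detection_index) to a tuple of
    (visual_similarity, size_similarity, iou).
    """
    matrix = [[0.0] * num_detections for _ in range(num_trackers)]
    for (t, d), (visual, size, iou) in candidates.items():
        score = matching_score(visual, size, iou, *weights)
        if score > min_overall_score:
            matrix[t][d] = score
    return matrix


if __name__ == "__main__":
    candidates = {
        (0, 0): (0.8, 1.0, 0.6),   # strong candidate
        (0, 1): (0.1, 0.2, 0.0),   # weak candidate, falls below the threshold
    }
    weights = (0.5, 0.25, 0.25)    # visual, size, IOU
    print(build_association_matrix(candidates, 1, 2, weights, 0.3))
```

Raising the weight of one term makes the association lean on that cue. For instance, a larger visual-similarity weight may help when bboxes overlap heavily in crowded scenes, provided the visual features are reliable.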
DCF Core Tuning¶
Apart from the types and sizes of the visual features employed, there are parameters related to how to learn and update the classifier for each object in DCF frameworks, which would affect the accuracy.
DCF Filter Learning¶
DCF-based trackers learn a classifier (i.e., a discriminative correlation filter) for each object with implicit positive and negative samples. The learned classifiers are updated on the fly for temporal consistency with a predefined learning rate (i.e., filterLr). If the visual appearance of the target objects is expected to vary quickly over time, one may employ a high learning rate so that the correlation filter adapts better to the changing appearance. However, a high learning rate also risks learning the background quickly, potentially causing more frequent track drift.
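A common way for DCF trackers to apply such a learning rate is an exponential moving average between the current filter and the filter learned from the newest frame. The sketch below assumes that filterLr plays this role; this is an assumption about the update form, not a statement about the library's internals, and the function names are hypothetical.

```python
def update_filter(current_filter, new_filter, filter_lr):
    """Blend the newly learned filter into the current one (moving average)."""
    return [(1.0 - filter_lr) * old + filter_lr * new
            for old, new in zip(current_filter, new_filter)]


def frames_to_adapt(filter_lr, fraction=0.9):
    """Number of updates until the old appearance's weight drops below 1 - fraction."""
    remaining = 1.0
    frames = 0
    while remaining > 1.0 - fraction:
        remaining *= 1.0 - filter_lr
        frames += 1
    return frames


if __name__ == "__main__":
    print(update_filter([1.0, 0.0], [0.0, 1.0], 0.25))
    for lr in (0.1, 0.5):
        print(lr, frames_to_adapt(lr))
```

Under this model, a high learning rate replaces most of the old appearance within a few frames. That is helpful when the target really changes, but it is also the mechanism by which a few frames of background or occluder can overwrite the filter and cause drift.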
Because the NvDCF tracker uses multi-channel visual features, the question arises of how to merge those channels into the final correlation response. NvDCF employs an adaptive channel weight approach in which the importance of each channel is examined on the fly, and the corresponding channel weights are updated over time with a predefined learning rate (i.e., filterChannelWeightsLr). The tuning strategy for this learning rate is similar to that for filterLr described above.
When a correlation filter is learned, gaussianSigma determines how tightly the resulting filter is fit to the positive sample. A lower value means a tighter fit, but it may result in overfitting. A higher value, on the other hand, may lower the discriminative power of the learned filter.
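In typical DCF formulations, the filter is trained to reproduce a Gaussian-shaped desired response centered on the target, and the standard deviation of that Gaussian controls how sharply the peak falls off. The sketch below assumes gaussianSigma plays this role and expresses sigma in feature-map pixels along one dimension. Both are assumptions made for illustration, and the units of the actual parameter may differ.

```python
import math


def gaussian_response(size, sigma):
    """1D Gaussian-shaped desired response centered in a window of `size` samples."""
    center = (size - 1) / 2.0
    return [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2))
            for i in range(size)]


def peak_width(response, level=0.5):
    """Number of samples whose desired response is at least `level`."""
    return sum(1 for value in response if value >= level)


if __name__ == "__main__":
    for sigma in (1.0, 3.0):
        print(sigma, peak_width(gaussian_response(25, sigma)))
```

A narrow peak asks the filter to respond strongly only at the exact target location, which is a tighter fit. A wide peak tolerates nearby offsets, but it separates the target less sharply from its surroundings.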
See also the Troubleshooting in NvDCF Parameter Tuning section for solutions to common problems in tracker behavior and tuning.