ReIdentificationNet_Transformer

Like ReIdentificationNet, the ReIdentificationNet Transformer models generate embeddings to identify objects captured in different scenes.

The models have Swin-based backbones that take cropped images of objects as input and produce feature embeddings as output.
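As a minimal sketch of how such embeddings are typically consumed downstream (the embedding size, threshold, and random vectors below are illustrative assumptions, not the actual TAO or DeepStream API):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings standing in for model outputs: a real backbone
# maps a 256x128x3 crop to a fixed-length feature vector.
rng = np.random.default_rng(0)
emb_query = rng.standard_normal(768)    # 768-D is an assumption, not a spec
emb_gallery = rng.standard_normal(768)

# Two crops are treated as the same identity if their similarity exceeds
# a threshold tuned on validation data (0.7 here is purely illustrative).
if cosine_similarity(emb_query, emb_gallery) > 0.7:
    print("Likely the same identity")
else:
    print("Likely different identities")
```

In practice, a query embedding is compared against an entire gallery of embeddings and the closest matches are ranked, which is what the accuracy metrics described below measure.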

The ReIdentificationNet Transformer models were first pre-trained on NVIDIA’s NV-PeopleCrops dataset of ~3.2M unlabeled images from Market-1501, Open Images, etc. Pseudo-label bounding boxes and body keypoints were generated using state-of-the-art models such as YOLOv8 and HRNet. We then ran supervised training using the same dataset as ReIdentificationNet (751 identities from Market-1501 and 156 identities from the MTMC people tracking dataset of the 2023 AI City Challenge).

Model Card

The datasheet for the models is captured in the model card hosted at NGC, which includes the detailed instructions to deploy the models with DeepStream.

TAO Fine-Tuning

ReIdentificationNet Transformer uses the same training backend as the ResNet-based ReIdentificationNet. Refer to the ReIdentificationNet fine-tuning guide to retrain or fine-tune your network. Also refer to the TAO tutorial notebook and the TAO documentation for more details.

TAO Toolkit is supported on discrete GPUs such as H100, A100, A40, A30, A2, A16, A100x, A30x, V100, T4, Titan-RTX, and Quadro-RTX. Refer to the TAO Toolkit documentation for more details on the recommended hardware requirements. The expected time to fine-tune the ReIdentificationNet Transformer model is as follows:

| Backbone Type | GPU Type | No. of Training Images | Image Size | No. of Identities | Batch Size | Total Epochs | Total Training Time |
|---------------|----------|------------------------|------------|-------------------|------------|--------------|---------------------|
| Swin Tiny | 1 x NVIDIA A100 - 80GB PCIe | 13,000 | 256x128x3 | 751 | 128 | 120 | ~1.5 hours |
| Swin Tiny | 1 x NVIDIA Quadro GV100 - 32GB | 13,000 | 256x128x3 | 751 | 64 | 120 | ~3 hours |

Refer to the ReID GT generation section if you want to collect labeled ReID ground truth (GT) data for objects captured in real video scenes.

Accuracy

The goal of re-identification is to identify, for each query, the test samples that share its identity.

The key performance indicators are the ranked accuracy of re-identification and the mean average precision (mAP).

Rank-K accuracy: A method of computing accuracy in which the top-K highest-confidence predicted labels are matched against the ground truth label. If the ground truth label falls within these top-K labels, the prediction is counted as correct. This gives an overall accuracy measurement while remaining lenient when the number of classes is high and the classes are similar. In our case, we compute rank-1, rank-5, and rank-10 accuracies. For rank-10, this means that a sample is counted as correct if its ground truth label appears among the top-10 highest-confidence predicted labels.
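As an illustration, the following sketch computes rank-K accuracy from a query-to-gallery distance matrix on toy data. It is simplified relative to the full Market-1501 evaluation protocol, which additionally filters out gallery samples from the same camera as the query:

```python
import numpy as np

def rank_k_accuracy(dist: np.ndarray, q_ids: np.ndarray,
                    g_ids: np.ndarray, k: int) -> float:
    """Fraction of queries whose true identity appears among the k
    gallery entries with the smallest embedding distance.

    dist:  (num_queries, num_gallery) distance matrix
    q_ids: (num_queries,) ground truth identity per query
    g_ids: (num_gallery,) ground truth identity per gallery sample
    """
    # Indices of the k nearest gallery samples for each query.
    top_k = np.argsort(dist, axis=1)[:, :k]
    # A query is a hit if any of its top-k gallery samples share its identity.
    hits = (g_ids[top_k] == q_ids[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 queries, 5 gallery samples.
dist = np.array([[0.1, 0.9, 0.4, 0.8, 0.7],
                 [0.6, 0.2, 0.9, 0.3, 0.8],
                 [0.5, 0.7, 0.6, 0.1, 0.9]])
q_ids = np.array([0, 1, 2])
g_ids = np.array([0, 1, 1, 3, 2])
print(rank_k_accuracy(dist, q_ids, g_ids, k=1))  # 2/3: the third query misses at rank 1
print(rank_k_accuracy(dist, q_ids, g_ids, k=5))  # 1.0: every query hits within the top 5
```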

Mean average precision: Precision measures how accurate the predictions are, in our case the predicted identity of an object; in other words, it measures the percentage of predictions that are correct. mAP is the mean of the average precision (AP), where AP is computed per class, in our case per identity.
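A correspondingly simplified mAP sketch for the retrieval setting, again on toy data: for each query, AP averages the precision at every rank at which a correct gallery match is retrieved, and mAP averages AP over queries (the same-camera filtering of the full protocol is omitted):

```python
import numpy as np

def mean_average_precision(dist: np.ndarray, q_ids: np.ndarray,
                           g_ids: np.ndarray) -> float:
    """mAP over queries for a (num_queries, num_gallery) distance matrix."""
    aps = []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])            # gallery sorted by distance
        matches = g_ids[order] == q_ids[i]     # boolean relevance per rank
        if not matches.any():
            continue                           # skip queries with no valid match
        ranks = np.flatnonzero(matches) + 1    # 1-based ranks of true matches
        # Precision at each hit = (number of hits so far) / rank.
        precision_at_hits = np.arange(1, len(ranks) + 1) / ranks
        aps.append(precision_at_hits.mean())
    return float(np.mean(aps))

# Same toy arrays as in the rank-K sketch above.
dist = np.array([[0.1, 0.9, 0.4, 0.8, 0.7],
                 [0.6, 0.2, 0.9, 0.3, 0.8],
                 [0.5, 0.7, 0.6, 0.1, 0.9]])
q_ids = np.array([0, 1, 2])
g_ids = np.array([0, 1, 1, 3, 2])
print(round(mean_average_precision(dist, q_ids, g_ids), 4))  # 0.6333
```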

The experimental results on the test set of Market-1501 are listed as follows.

| Backbone | mAP | Rank-1 Accuracy |
|----------|-----|-----------------|
| Swin Base | 94.3% | 96.0% |
| Swin Tiny | 93.8% | 95.6% |