ReIdentificationNet#

ReIdentificationNet takes cropped images of a person from different perspectives as network input and outputs the embedding features for that person. The embeddings are used to perform similarity matching to re-identify the same person. The model supported in the current version is based on ResNet, which is the most commonly used baseline for re-identification due to its high accuracy.
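
To illustrate the idea (this snippet is not part of the TAO API; it is a minimal NumPy sketch with a hypothetical helper name), embeddings are typically L2-normalized and compared with cosine similarity, and the gallery crop with the highest score is taken as the re-identified match:

import numpy as np

def best_match(query_emb, gallery_embs):
    """Return the index and score of the gallery embedding most similar to the query.

    query_emb:    1-D array, e.g. the 256-dim feature from ReIdentificationNet
    gallery_embs: 2-D array of shape (num_gallery, feat_dim)
    """
    # L2-normalize so the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    scores = g @ q                      # one cosine score per gallery image
    return int(np.argmax(scores)), float(np.max(scores))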

The expected time to train ReIdentificationNet is as follows:

| Backbone Type | GPU Type | No. of Training Images | Image Size | No. of Identities | Batch Size | Total Epochs | Total Training Time |
|---|---|---|---|---|---|---|---|
| ResNet50 | 1 x NVIDIA A100 - 80GB PCIe | 13,000 | 256x128x3 | 751 | 128 | 120 | ~1.25 hours |
| ResNet50 | 1 x NVIDIA Quadro GV100 - 32GB | 13,000 | 256x128x3 | 751 | 64 | 120 | ~2.5 hours |

Data Input for ReIdentificationNet#

The ReIdentificationNet apps in TAO expect data in Market-1501 format for training and evaluation.

See the Data Annotation Format page for more information about the Market-1501 data format.
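
As a rough orientation only (the Data Annotation Format page is the authoritative reference), a Market-1501-style dataset is typically laid out as separate training, gallery (test), and query folders, with the person ID encoded at the start of each file name:

/path/to/dataset/
    bounding_box_train/              # -> dataset.train_dataset_dir
        0002_c1s1_000451_03.jpg      # <person_id>_<camera+sequence>_<frame>_<bbox>.jpg
        0002_c1s1_000551_01.jpg
        ...
    bounding_box_test/               # -> dataset.test_dataset_dir (gallery)
        ...
    query/                           # -> dataset.query_dataset_dir
        ...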

Creating an Experiment Spec File#

The spec file for ReIdentificationNet includes model, dataset, re_ranking, and train parameters. Here is an example spec for training a ResNet model on Market-1501 that contains 751 identities in the training set.

results_dir: "/path/to/experiment_results"
encryption_key: nvidia_tao
model:
  backbone: resnet_50
  last_stride: 1
  pretrain_choice: imagenet
  pretrained_model_path: "/path/to/pretrained_model.pth"
  input_channels: 3
  input_width: 128
  input_height: 256
  neck: bnneck
  feat_dim: 256
  neck_feat: after
  metric_loss_type: triplet
  with_center_loss: False
  with_flip_feature: False
  label_smooth: True
dataset:
  train_dataset_dir: "/path/to/train_dataset_dir"
  test_dataset_dir: "/path/to/test_dataset_dir"
  query_dataset_dir: "/path/to/query_dataset_dir"
  num_classes: 751
  batch_size: 64
  val_batch_size: 128
  num_workers: 1
  pixel_mean: [0.485, 0.456, 0.406]
  pixel_std: [0.226, 0.226, 0.226]
  padding: 10
  prob: 0.5
  re_prob: 0.5
  sampler: softmax_triplet
  num_instances: 4
re_ranking:
  re_ranking: True
  k1: 20
  k2: 6
  lambda_value: 0.3
train:
  results_dir: "${results_dir}/train"
  optim:
    name: Adam
    lr_monitor: val_loss
    steps: [40, 70]
    gamma: 0.1
    bias_lr_factor: 1
    weight_decay: 0.0005
    weight_decay_bias: 0.0005
    warmup_factor: 0.01
    warmup_iters: 10
    warmup_method: linear
    base_lr: 0.00035
    momentum: 0.9
    center_loss_weight: 0.0005
    center_lr: 0.5
    triplet_loss_margin: 0.3
  num_epochs: 10
  checkpoint_interval: 5
  validation_interval: 5
  seed: 1234

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| model | dict config | | The configuration of the model architecture | |
| dataset | dict config | | The configuration of the dataset | |
| train | dict config | | The configuration of the training task | |
| evaluate | dict config | | The configuration of the evaluation task | |
| inference | dict config | | The configuration of the inference task | |
| encryption_key | string | None | The encryption key to encrypt and decrypt model files | |
| results_dir | string | /results | The directory where experiment results are saved | |
| export | dict config | | The configuration of the ONNX export task | |
| re_ranking | dict config | | The configuration for the re-ranking module | |

model#

The model parameter provides options to change the ReIdentificationNet architecture.

model:
  backbone: resnet_50
  last_stride: 1
  pretrain_choice: imagenet
  pretrained_model_path: "/path/to/pretrained_model.pth"
  input_channels: 3
  input_width: 128
  input_height: 256
  neck: bnneck
  feat_dim: 256
  neck_feat: after
  metric_loss_type: triplet
  with_center_loss: False
  with_flip_feature: False
  label_smooth: True

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| backbone | string | resnet_50 | The type of model, which can be resnet_50 or a Swin-based architecture (refer to ReIdentificationNet Transformer for more details) | resnet_50, swin_base_patch4_window7_224, swin_small_patch4_window7_224, swin_tiny_patch4_window7_224 |
| last_stride | unsigned int | 1 | The number of strides during convolution | >0 |
| pretrain_choice | string | imagenet | The pre-trained network | imagenet/self/"" |
| pretrained_model_path | string | | The path to the pre-trained model | |
| input_channels | unsigned int | 3 | The number of input channels | >0 |
| input_width | int | 128 | The width of the input images | >0 |
| input_height | int | 256 | The height of the input images | >0 |
| neck | string | bnneck | Specifies whether to train with BNNeck | bnneck/"" |
| feat_dim | unsigned int | 256 | The output size of the feature embeddings | >0 |
| neck_feat | string | after | Specifies which feature of BNNeck to use for testing | before/after |
| metric_loss_type | string | triplet | The type of metric loss | triplet/center/triplet_center |
| with_center_loss | bool | False | Specifies whether to enable center loss | True/False |
| with_flip_feature | bool | False | Specifies whether to enable image flipping | True/False |
| label_smooth | bool | True | Specifies whether to enable label smoothing | True/False |

dataset#

The dataset parameter defines the dataset source, training batch size, and augmentation.

dataset:
  train_dataset_dir: "/path/to/train_dataset_dir"
  test_dataset_dir: "/path/to/test_dataset_dir"
  query_dataset_dir: "/path/to/query_dataset_dir"
  num_classes: 751
  batch_size: 64
  val_batch_size: 128
  num_workers: 1
  pixel_mean: [0.485, 0.456, 0.406]
  pixel_std: [0.226, 0.226, 0.226]
  padding: 10
  prob: 0.5
  re_prob: 0.5
  sampler: softmax_triplet
  num_instances: 4

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| train_dataset_dir | string | | The path to the train images | |
| test_dataset_dir | string | | The path to the test images | |
| query_dataset_dir | string | | The path to the query images | |
| num_classes | unsigned int | 751 | The number of unique person IDs | >0 |
| batch_size | unsigned int | 64 | The batch size for training | >0 |
| val_batch_size | unsigned int | 128 | The batch size for validation | >0 |
| num_workers | unsigned int | 1 | The number of parallel workers processing data | >0 |
| pixel_mean | float list | [0.485, 0.456, 0.406] | The pixel mean for image normalization | float list |
| pixel_std | float list | [0.226, 0.226, 0.226] | The pixel standard deviation for image normalization | float list |
| padding | unsigned int | 10 | The pixel padding size around images for image augmentation | >=1 |
| prob | float | 0.5 | The random horizontal flipping probability for image augmentation | >0 |
| re_prob | float | 0.5 | The random erasing probability for image augmentation | >0 |
| sampler | string | softmax_triplet | The type of sampler for data loading | softmax/triplet/softmax_triplet |
| num_instances | unsigned int | 4 | The number of image instances of the same person in a batch | >0 |

re_ranking#

The re_ranking parameter defines the settings for the re-ranking module.

re_ranking:
  re_ranking: True
  k1: 20
  k2: 6
  lambda_value: 0.3

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| re_ranking | bool | True | A flag that enables the re-ranking module | True/False |
| k1 | unsigned int | 20 | The k used for k-reciprocal nearest neighbors | >0 |
| k2 | unsigned int | 6 | The k used for local query expansion | >0 |
| lambda_value | float | 0.3 | The weight of the original distance in the combination with the Jaccard distance | >0.0 |
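
Conceptually, these parameters follow the k-reciprocal re-ranking formulation: k1 and k2 control the neighborhood sizes used to build the k-reciprocal sets, and lambda_value blends the original feature distance with the resulting Jaccard distance, roughly as:

final_distance = lambda_value * original_distance + (1 - lambda_value) * jaccard_distance

With lambda_value: 0.3, the re-ranked distance therefore leans mostly on the Jaccard term.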

train#

The train parameter defines the hyperparameters of the training process.

train:
  optim:
    name: Adam
    lr_monitor: val_loss
    steps: [40, 70]
    gamma: 0.1
    bias_lr_factor: 1
    weight_decay: 0.0005
    weight_decay_bias: 0.0005
    warmup_factor: 0.01
    warmup_iters: 10
    warmup_method: linear
    base_lr: 0.00035
    momentum: 0.9
    center_loss_weight: 0.0005
    center_lr: 0.5
    triplet_loss_margin: 0.3
  num_epochs: 10
  checkpoint_interval: 5
  validation_interval: 5
  seed: 1234

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| num_gpus | unsigned int | 1 | The number of GPUs to use for distributed training | >0 |
| gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed training | |
| seed | unsigned int | 1234 | The random seed for random, NumPy, and torch | >0 |
| num_epochs | unsigned int | 10 | The total number of epochs to run the experiment | >0 |
| checkpoint_interval | unsigned int | 1 | The epoch interval at which checkpoints are saved | >0 |
| validation_interval | unsigned int | 1 | The epoch interval at which validation is run | >0 |
| resume_training_checkpoint_path | string | | The intermediate PyTorch Lightning checkpoint to resume training from | |
| results_dir | string | /results/train | The directory to save training results | |
| optim | dict config | | The configuration for the optimizer, including the learning rate, learning rate scheduler, and weight decay | |
| clip_grad_norm | float | 0.0 | The amount to clip the gradient by the L2 norm (a value of 0.0 disables clipping) | >=0 |

optim#

The optim parameter defines the configuration of the optimizer used during training, including the learning rate, learning rate scheduler, and weight decay.

optim:
  name: Adam
  lr_monitor: val_loss
  lr_steps: [40, 70]
  gamma: 0.1
  bias_lr_factor: 1
  weight_decay: 0.0005
  weight_decay_bias: 0.0005
  warmup_factor: 0.01
  warmup_iters: 10
  warmup_method: linear
  base_lr: 0.00035
  momentum: 0.9
  center_loss_weight: 0.0005
  center_lr: 0.5
  triplet_loss_margin: 0.3

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| name | string | Adam | The name of the optimizer | Adam/SGD/Adamax/… |
| lr_monitor | string | val_loss | The monitored value for the AutoReduce scheduler | val_loss/train_loss |
| lr_steps | int list | [40, 70] | The steps at which to decrease the learning rate for the MultiStep scheduler | int list |
| gamma | float | 0.1 | The decay rate for the WarmupMultiStepLR scheduler | >0.0 |
| bias_lr_factor | float | 1 | The bias learning rate factor for the WarmupMultiStepLR scheduler | >=1 |
| weight_decay | float | 0.0005 | The weight decay coefficient for the optimizer | >0.0 |
| weight_decay_bias | float | 0.0005 | The weight decay coefficient for bias parameters | >0.0 |
| warmup_factor | float | 0.01 | The warmup factor for the WarmupMultiStepLR scheduler | >0.0 |
| warmup_iters | unsigned int | 10 | The number of warmup iterations for the WarmupMultiStepLR scheduler | >0 |
| warmup_method | string | linear | The warmup method for the WarmupMultiStepLR scheduler | linear/cosine |
| base_lr | float | 0.00035 | The initial learning rate for the training | >0.0 |
| momentum | float | 0.9 | The momentum for the optimizer | >0.0 |
| center_loss_weight | float | 0.0005 | The balancing weight of the center loss | >0.0 |
| center_lr | float | 0.5 | The learning rate of SGD for learning the centers of the center loss | >0.0 |
| triplet_loss_margin | float | 0.3 | The margin value for the triplet loss | >0.0 |
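
Taken together, these values describe a linear warmup followed by step decays: the learning rate ramps up from base_lr * warmup_factor over the first warmup_iters steps, then is multiplied by gamma at each milestone in lr_steps. A minimal sketch, assuming the common warmup multi-step formulation (not the exact TAO implementation):

from bisect import bisect_right

def warmup_multistep_lr(epoch, base_lr=0.00035, warmup_factor=0.01,
                        warmup_iters=10, lr_steps=(40, 70), gamma=0.1):
    """Approximate learning rate at a given epoch under a warmup multi-step schedule."""
    if epoch < warmup_iters:
        # Linear ramp from base_lr * warmup_factor up to base_lr
        alpha = epoch / warmup_iters
        scale = warmup_factor * (1.0 - alpha) + alpha
    else:
        scale = 1.0
    # Multiply by gamma once for every milestone already passed
    return base_lr * scale * gamma ** bisect_right(list(lr_steps), epoch)

Under these assumptions and the defaults above, the learning rate ramps from 3.5e-6 toward 3.5e-4 over the first 10 steps, drops to 3.5e-5 at step 40, and to 3.5e-6 at step 70.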

Training the Model#

Use the following command to run ReIdentificationNet training:

tao model re_identification train [-h] -e <experiment_spec>
                            [results_dir=<global_results_dir>]
                            [model.<model_option>=<model_option_value>]
                            [dataset.<dataset_option>=<dataset_option_value>]
                            [train.<train_option>=<train_option_value>]
                            [train.gpu_ids=<gpu indices>]
                            [train.num_gpus=<number of gpus>]

Required Arguments#

  • -e, --experiment_spec_file: The path to the experiment spec file.

Optional Arguments#

You can set optional arguments to override the option values in the experiment spec file.

Note

For training, evaluation, and inference, each task exposes two variables: num_gpus and gpu_ids, which default to 1 and [0], respectively. If both are passed but are inconsistent (for example, num_gpus = 1 with gpu_ids = [0, 1]), they are reconciled to the setting that uses more GPUs (in this example, num_gpus is raised to 2).
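
For example, a hypothetical invocation that trains on two GPUs and overrides a few spec values from the command line could look like this (the spec path and values are placeholders):

tao model re_identification train -e /path/to/experiment_spec.yaml \
                            train.num_gpus=2 \
                            train.gpu_ids=[0,1] \
                            train.num_epochs=120 \
                            dataset.batch_size=128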

Checkpointing and Resuming Training#

A PyTorch Lightning checkpoint named model_epoch_<epoch_num>.pth is saved every train.checkpoint_interval epochs. Checkpoints are written to train.results_dir, for example:

$ ls /results/train

'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'

The latest checkpoint is saved as reid_model_latest.pth. Training automatically resumes from reid_model_latest.pth, if it exists in train.results_dir. This is superseded by train.resume_training_checkpoint_path, if it is provided.
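
For example, to resume explicitly from a specific checkpoint instead of reid_model_latest.pth (the paths are placeholders):

tao model re_identification train -e /path/to/experiment_spec.yaml \
                            train.resume_training_checkpoint_path=/results/train/model_epoch_004.pth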

The major implication of this logic is that, if you wish to trigger fresh training from scratch, either:

  • Specify a new, empty results directory (Recommended)

  • Remove the latest checkpoint from the results directory

Evaluating the Model#

The evaluation metrics of ReIdentificationNet are mean average precision (mAP) and ranked accuracy. The plots of sampled matches and the cumulative matching characteristic (CMC) curve can be obtained using the evaluate.output_sampled_matches_plot and evaluate.output_cmc_curve_plot parameters, respectively.

Use the following command to run ReIdentificationNet evaluation:

tao model re_identification evaluate [-h] -e <experiment_spec_file>
                            evaluate.checkpoint=<model to be evaluated>
                            evaluate.output_sampled_matches_plot=<path to the output sampled matches plot>
                            evaluate.output_cmc_curve_plot=<path to the output CMC curve plot>
                            evaluate.test_dataset=<path to test data>
                            evaluate.query_dataset=<path to query data>
                            [evaluate.<evaluate_option>=<evaluate_option_value>]
                            [evaluate.gpu_ids=<gpu indices>]
                            [evaluate.num_gpus=<number of gpus>]

Multi-GPU evaluation is not supported for Re-Identification.

Required Arguments#

  • -e, --experiment_spec_file: The experiment spec file to set up the evaluation experiment

  • evaluate.checkpoint: The .pth model

  • evaluate.output_sampled_matches_plot: The path to the plotted file of sampled matches

  • evaluate.output_cmc_curve_plot: The path to the plotted file of the CMC curve

  • evaluate.test_dataset: The path to the test data

  • evaluate.query_dataset: The path to the query data

Optional Arguments#

  • evaluate.gpu_ids: The GPU indices to run evaluation. Defaults to [0].

  • evaluate.num_gpus: The number of GPUs to run evaluation. Defaults to 1.

  • evaluate.results_dir: The directory to save the evaluation results. Defaults to /results/evaluate.

Running Inference on the Model#

Use the following command to run inference on ReIdentificationNet with a .pth model.

tao model re_identification inference [-h] -e <experiment_spec>
                            inference.checkpoint=<inference model>
                            inference.output_file=<path to output file>
                            inference.test_dataset=<path to gallery data>
                            inference.query_dataset=<path to query data>
                            [inference.<infer_option>=<infer_option_value>]
                            [inference.gpu_ids=<gpu indices>]
                            [inference.num_gpus=<number of gpus>]

The output is a JSON file that contains the feature embeddings of all the test and query data.

Multi-GPU inference is currently not supported for Re-Identification.

Required Arguments#

  • -e, --experiment_spec: The experiment spec file to set up inference

  • inference.checkpoint: The .pth model to perform inference with

  • inference.output_file: The path to the output JSON file

  • inference.test_dataset: The path to the test data

  • inference.query_dataset: The path to the query data

Optional Arguments#

  • inference.gpu_ids: The GPU indices to run inference. Defaults to [0].

  • inference.num_gpus: The number of GPUs to run inference. Defaults to 1.

  • inference.results_dir: The directory to save the inference results. Defaults to /results/inference.

The expected output would be as follows:

[
  {
    "img_path": "/path/to/img1.jpg",
    "embedding": [-0.30, 0.12, 0.13,...]
  },
  {
    "img_path": "/path/to/img2.jpg",
    "embedding": [-0.10, -0.06, -1.85,...]
  },
  ...
  {
    "img_path": "/path/to/imgN.jpg",
    "embedding": [1.41, 0.63, -0.15,...]
  }
]
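
Because the query and gallery embeddings land in a single JSON file, you can post-process it with a few lines of Python. The sketch below is illustrative only; the folder names used to split query from gallery records are assumptions, and it ranks gallery images by Euclidean distance to each query embedding:

import json

import numpy as np

# Load the embeddings produced by "tao model re_identification inference"
with open("/path/to/output_file.json") as f:
    records = json.load(f)

# Assumption (for illustration only): query images live under a "query" folder
# and gallery images under a separate test folder.
queries = [r for r in records if "/query/" in r["img_path"]]
gallery = [r for r in records if "/query/" not in r["img_path"]]

gallery_feats = np.array([r["embedding"] for r in gallery])
for q in queries:
    dists = np.linalg.norm(gallery_feats - np.array(q["embedding"]), axis=1)
    top5 = np.argsort(dists)[:5]                       # closest gallery images
    print(q["img_path"], "->", [gallery[i]["img_path"] for i in top5])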

Exporting the Model#

Use the following command to export ReIdentificationNet to .onnx format for deployment:

tao model re_identification export -e <experiment_spec>
                            export.checkpoint=<pth checkpoint to be exported>
                            export.onnx_file=<path to exported file>
                            [export.gpu_id=<gpu index>]

Required Arguments#

  • -e, --experiment_spec: The experiment spec file to set up export.

  • export.checkpoint: The .pth model to be exported.

  • export.onnx_file: The path to save the exported model to. The default path is in the same directory as the *.pth model.

Optional Arguments#

  • export.gpu_id: The index of the GPU that will be used to run the export. You can specify this value when the machine has multiple GPUs installed. Note that export can only run on a single GPU.

Here’s an example of using the ReIdentificationNet export command (the paths shown are placeholders):
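
tao model re_identification export -e /path/to/experiment_spec.yaml \
                            export.checkpoint=/results/train/reid_model_latest.pth \
                            export.onnx_file=/results/export/reid_model.onnx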

Deploying the Model#

You can deploy the trained deep-learning and computer-vision models on edge devices, such as a Jetson Xavier, Jetson Nano, or Tesla, or in the cloud with NVIDIA GPUs. The exported *.onnx model can also be used with TAO Triton Apps.

Running ReIdentificationNet Inference on the Triton Sample#

The TAO Triton Apps provide an inference sample for ReIdentificationNet. It consumes a TensorRT engine and supports running with a directory of query (probe) images and a directory of test (gallery) images containing the same identities.

To use this sample, you need to generate the TensorRT engine from an *.onnx model using trtexec.

Generating TensorRT Engine Using trtexec#

For instructions on generating a TensorRT engine using the trtexec command, refer to the trtexec guide for ReIdentificationNet.
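
As a rough sketch only (the input tensor name, shape ranges, and paths below are assumptions; refer to the linked guide for the authoritative command), engine generation for a 3x256x128 input typically looks like:

trtexec --onnx=/path/to/reid_model.onnx \
        --minShapes=input:1x3x256x128 \
        --optShapes=input:16x3x256x128 \
        --maxShapes=input:64x3x256x128 \
        --saveEngine=/path/to/reid_model.engine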

Running the Triton Inference Sample#

You can generate the TensorRT engine when starting the Triton server using the following command:

bash scripts/start_server.sh

When the server is running, you can get results from a directory of query images and a directory of test images using the following command with a client:

python tao_client.py <path_to_query_directory> \
                    --test_dir <path_to_test_directory> \
                    -m re_identification_tao \
                    -x 1 \
                    -b 16 \
                    --mode Re_identification \
                    -i https \
                    -u localhost:8000 \
                    --async \
                    --output_path <path_to_output_directory>

Note

The server will perform inference on the input image directories. The results are saved as a JSON file. The following is a sample of the JSON output:

[
  ...,
  {
    "img_path": "/localhome/Data/market1501/query/1121_c3s2_156744_00.jpg",
    "embedding": [-1.1530249118804932, -1.8521332740783691,..., 0.380886435508728]
  },...
  {
    "img_path": "/localhome/Data/market1501/bounding_box_test/1377_c2s3_038007_05.jpg",
    "embedding": [0.09496910870075226, 0.26107653975486755,..., 0.2835155725479126]
  },...
]

End-to-End Inference Using Triton#

The TAO Triton Apps provide a sample for end-to-end inference from a directory of query images and a directory of test images. The sample downloads the Market-1501 dataset and randomly samples a subset of 100 identities. The client implicitly converts the image samples into arrays and sends them to the Triton server. The feature embedding for each image is returned and saved to the JSON output. An image of sampled matches and a figure of the CMC curve are also generated for visualization.

You can start the Triton server using the following command (only the ReIdentificationNet model will be downloaded and converted into a TensorRT engine):

bash scripts/re_id_e2e_inference/start_server.sh

Once the Triton server has started, open another terminal and use the following command to run re-identification on the query and test images using the Triton server instance that you have previously spun up:

bash scripts/re_id_e2e_inference/start_client.sh