Metric Learning Recognition#

Metric Learning Recognition (MLRecogNet) is a classifier that encodes the input image to embedding vectors and predicts their labels based on the embedding vectors in the reference space. MLRecogNet consists of two parts:

Trunk: A backbone network that encodes the input image to a feature vector.
Embedder: A fully connected layer that maps the feature vector to the embedding space.

The embedding space is a high-dimensional space where the distance between the embedding vectors of the same class is small and the distance between the embedding vectors of different classes is large. The embedder is trained to minimize the distance between the embedding vectors of the same class and maximize the distance between the embedding vectors of different classes. The embedding vectors of the query images are compared with the embedding vectors of the reference images to predict the labels of the query images.

The currently supported trunk is ResNet, a commonly used baseline for vision classification. The currently supported embedder is a one-layer MLP.

During training, evaluation, and inference, MLRecogNet requires a reference set, plus a query set for validation or testing. The reference set is a collection of labeled images, while the query set is a group of unlabeled images. The goal is to predict the labels of the query images by comparing their embedding vectors with the embedding vectors of the reference set, both generated by the trained MLRecogNet.
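The comparison between query and reference embeddings can be sketched as a nearest-neighbor lookup. This is a minimal illustration, not the MLRecogNet implementation: the Euclidean distance, the majority vote, and the names `euclidean` and `predict_labels` are assumptions made for the example, and the toy 2-D vectors stand in for real embeddings.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_labels(query_embeddings, reference_embeddings, reference_labels, k=1):
    """Label each query by a majority vote over its k nearest reference embeddings."""
    predictions = []
    for q in query_embeddings:
        # Rank reference indices by distance to the query embedding.
        ranked = sorted(
            range(len(reference_embeddings)),
            key=lambda i: euclidean(q, reference_embeddings[i]),
        )
        votes = Counter(reference_labels[i] for i in ranked[:k])
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Toy 2-D embeddings standing in for the trunk + embedder output (hypothetical data).
reference = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
labels = ["class1", "class1", "class2", "class2"]
queries = [[0.2, 0.1], [4.8, 5.1]]
print(predict_labels(queries, reference, labels))  # prints ['class1', 'class2']
```

Because prediction only looks up labels in the reference set, a query whose true class has no reference images can never be labeled correctly.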

Preparing the Dataset#

MLRecogNet requires cropped images from the detection set or classification set as input. These images are resized to 224x224 by default for model input. Augmentation is applied to each image during training.

The data should be organized in the following structure:

/Dataset
    /reference
      /class1
            0001.jpg
            0002.jpg
            ...
            0100.jpg
      /class2
            0001.jpg
            0002.jpg
            ...
            0100.jpg
      ...
    /train
      /class1
            0101.jpg
            0102.jpg
            ...
            0200.jpg
      /class2
            0101.jpg
            0102.jpg
            ...
            0200.jpg
    /val
      /class1
            0201.jpg
            0202.jpg
            ...
            0220.jpg
      /class2
            0201.jpg
            0202.jpg
            ...
            0220.jpg
    /test
      /class1
            0301.jpg
            0302.jpg
            ...
            0400.jpg
      /class2
            0301.jpg
            0302.jpg
            ...
            0400.jpg

The root directory of the dataset contains sub-directories for reference, training, validation, and test. Each of these sub-directories must follow the ImageNet structure, as demonstrated above: one folder per class, containing only images of that class. If a class in the test set is not present in the reference set, query images of that class cannot be recognized correctly.
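A quick way to catch the last problem before training is to compare the class folders of each query split against the reference set. The helper below is a hypothetical sketch, not part of MLRecogNet; it assumes the folder names `reference`, `val`, and `test` from the layout above.

```python
from pathlib import Path


def list_classes(split_dir):
    """Return the set of class folder names directly under a split directory."""
    return {p.name for p in Path(split_dir).iterdir() if p.is_dir()}


def missing_from_reference(dataset_root, query_splits=("val", "test")):
    """Map each query split to the classes it contains that the reference set lacks."""
    root = Path(dataset_root)
    reference_classes = list_classes(root / "reference")
    missing = {}
    for split in query_splits:
        split_dir = root / split
        if not split_dir.is_dir():
            continue  # Skip splits that are not present in this dataset.
        extra = list_classes(split_dir) - reference_classes
        if extra:
            missing[split] = sorted(extra)
    return missing
```

Calling `missing_from_reference("/path/to/dataset")` returns an empty dict when every query class also appears in the reference set.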

Creating an Experiment Specification File#

The specification file for MLRecogNet includes the model, dataset, and train parameters, along with the parameters for the evaluate, inference, export, and gen_trt_engine tasks.

Parameter	Data Type	Default	Description	Supported Values
`model`	dict config	–	The configuration of the model architecture
`dataset`	dict config	–	The configuration of the dataset
`train`	dict config	–	The configuration of the training task
`evaluate`	dict config	–	The configuration of the evaluation task
`inference`	dict config	–	The configuration of the inference task
`encryption_key`	string	None	The encryption key to encrypt and decrypt model files
`results_dir`	string	/results	The directory where experiment results are saved
`export`	dict config	–	The configuration of the ONNX export task
`gen_trt_engine`	dict config	–	The configuration of the TensorRT generation task

model#

The model parameter provides options to change the MetricLearningRecognition architecture.

model:
  backbone: resnet_50
  pretrained_model_path: "/path/to/pretrained_model.pth"
  pretrained_embedder_path: null
  pretrained_trunk_path: null
  input_channels: 3
  input_width: 224
  input_height: 224
  feat_dim: 256

Parameter	Datatype	Default	Description	Supported Values
`backbone`	string	resnet_50	Backbone (trunk) model type.	resnet_50, resnet_101, fan_small, fan_base, fan_large, fan_tiny, nvdinov2_vit_large_legacy
`pretrained_model_path`	string		The path to the pretrained model. The weights are loaded into the full model (trunk and embedder).
`pretrained_trunk_path`	string		The path to the pretrained trunk. The weights are only loaded to the trunk part.
`pretrained_embedder_path`	string		The path to the pretrained embedder. The weights are only loaded to the embedder part.
`input_channels`	unsigned int	3	The number of input channels	>0
`input_width`	int	224	The input width of the images	int
`input_height`	int	224	The input height of the images	int
`feat_dim`	unsigned int	256	The output size of the feature embeddings	>0

train#

The train parameter defines the hyperparameters of the training process.

train:
  optim:
    name: Adam
    steps: [40, 70]
    gamma: 0.1
    warmup_factor: 0.01
    warmup_iters: 10
    warmup_method: 'linear'
    triplet_loss_margin: 0.3
    miner_function_margin: 0.1
    embedder:
      bias_lr_factor: 1
      base_lr: 0.000001
      momentum: 0.9
      weight_decay: 0.0001
      weight_decay_bias: 0.0005
    trunk:
      bias_lr_factor: 1
      base_lr: 0.00001
      momentum: 0.9
      weight_decay: 0.0001
      weight_decay_bias: 0.0005
  num_epochs: 10
  checkpoint_interval: 5
  validation_interval: 5
  clip_grad_norm: 0.0
  resume_training_checkpoint_path: null
  report_accuracy_per_class: True
  smooth_loss: True
  batch_size: 64
  val_batch_size: 64
  train_trunk: false
  train_embedder: true
  results_dir: null
  seed: 1234

Parameter	Datatype	Default	Description	Supported Values
`num_gpus`	unsigned int	1	The number of GPUs to use for distributed training	>0
`gpu_ids`	List[int]	[0]	The indices of the GPUs to use for distributed training
`seed`	unsigned int	1234	The random seed for random, NumPy, and torch	>0
`num_epochs`	unsigned int	10	The total number of epochs to run the experiment	>0
`checkpoint_interval`	unsigned int	1	The epoch interval at which the checkpoints are saved	>0
`validation_interval`	unsigned int	1	The epoch interval at which the validation is run	>0
`resume_training_checkpoint_path`	string		The intermediate PyTorch Lightning checkpoint to resume training from
`results_dir`	string	/results/train	The directory to save training results
`optim`	dict config	–	The configuration for the torch optimizer (Optim Config), including the learning rate, learning rate scheduler, and weight decay
`clip_grad_norm`	float	0.0	The maximum L2 norm to which gradients are clipped. A value of 0.0 disables clipping.	>=0
`report_accuracy_per_class`	bool	True	If True, the top-1 precision of each class is reported.	True/False
`smooth_loss`	bool	True	If True, the log-exp version of the triplet loss will be used.	True/False
`batch_size`	unsigned int	64	The batch size for training	>0
`val_batch_size`	unsigned int	64	The batch size for validation	>0
`train_trunk`	bool	True	If False, the trunk part of the model is frozen during training	True/False
`train_embedder`	bool	True	If False, the embedder part of the model is frozen during training	True/False

optim#

The optim parameter defines the configuration for the Torch optimizer in training, including the learning rate, learning rate scheduler, and weight decay.

optim:
  name: Adam
  steps: [40, 70]
  gamma: 0.1
  warmup_factor: 0.01
  warmup_iters: 10
  warmup_method: 'linear'
  triplet_loss_margin: 0.3
  miner_function_margin: 0.1
  embedder:
    bias_lr_factor: 1
    base_lr: 0.00035
    momentum: 0.9
    weight_decay: 0.0005
    weight_decay_bias: 0.0005
  trunk:
    bias_lr_factor: 1
    base_lr: 0.00035
    momentum: 0.9
    weight_decay: 0.0005
    weight_decay_bias: 0.0005

Parameter	Datatype	Default	Description	Supported Values
`name`	string	Adam	The name of the optimizer. The algorithms in `torch.optim` are supported.	Adam/SGD/Adamax/…
`steps`	int list	[40, 70]	The steps at which the `WarmupMultiStepLR` scheduler decreases the learning rate
`gamma`	float	0.1	The decay rate for the `WarmupMultiStepLR` scheduler	>0.0
`warmup_factor`	float	0.01	The warmup factor for the `WarmupMultiStepLR` scheduler	>0.0
`warmup_iters`	unsigned int	10	The number of warmup iterations for the `WarmupMultiStepLR` scheduler	>0
`warmup_method`	string	linear	The warmup method for the optimizer	constant/linear
`triplet_loss_margin`	float	0.3	The desired difference between the anchor-positive distance and the anchor-negative distance	>0.0
`miner_function_margin`	float	0.1	Negative pairs are chosen if they have similarity greater than the hardest positive pair, minus this margin; positive pairs are chosen if they have similarity less than the hardest negative pair, plus the margin	>0.0
`embedder`	dict config	–	The learning rate configurations (LR Config) for the MLRecogNet embedder
`trunk`	dict config	–	The learning rate configurations (LR Config) for the MLRecogNet trunk
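To make `triplet_loss_margin` concrete, the sketch below computes the loss for a single triplet from its anchor-positive distance `d_ap` and anchor-negative distance `d_an`. The hinge form is the standard triplet margin loss. The `smooth` branch is an assumption: this document only says that `smooth_loss` selects a log-exp version, so the softplus shown here is one common reading rather than the confirmed implementation. The function name is hypothetical.

```python
import math


def triplet_loss(d_ap, d_an, margin=0.3, smooth=False):
    """Triplet loss for one (anchor, positive, negative) triplet given its two distances."""
    # Positive when the negative is not at least `margin` farther away than the positive.
    violation = d_ap - d_an + margin
    if smooth:
        # Assumed smooth variant: log(1 + exp(violation)), written to avoid overflow.
        return max(violation, 0.0) + math.log1p(math.exp(-abs(violation)))
    return max(violation, 0.0)
```

With the default margin of 0.3, a triplet with `d_ap = 0.2` and `d_an = 0.9` already satisfies the margin and contributes zero hinge loss, while `d_ap = 0.8` and `d_an = 0.6` is penalized. The smooth variant stays slightly positive even for satisfied triplets.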

LR Config#

Parameter	Datatype	Default	Description	Supported Values
`base_lr`	float	0.00035	The initial learning rate for the training	>0.0
`bias_lr_factor`	float	1	The learning rate multiplier applied to bias parameters	>=1
`momentum`	float	0.9	The momentum for the optimizer	>0.0
`weight_decay`	float	0.0005	The weight decay coefficient for the optimizer	>0.0
`weight_decay_bias`	float	0.0005	The weight decay coefficient for bias parameters	>0.0
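The interaction between `base_lr`, `steps`, `gamma`, and the warmup options can be illustrated with the widely used warmup multi-step formulation. The exact formula behind `WarmupMultiStepLR` is not given in this document, so treat the following as a sketch under that assumption. The function name `lr_at` is hypothetical, and whether the counter is an epoch or an iteration is also assumed.

```python
from bisect import bisect_right


def lr_at(epoch, base_lr=0.00035, steps=(40, 70), gamma=0.1,
          warmup_factor=0.01, warmup_iters=10, warmup_method="linear"):
    """Learning rate at a given epoch under a warmup + multi-step decay schedule."""
    warmup = 1.0
    if epoch < warmup_iters:
        if warmup_method == "constant":
            warmup = warmup_factor
        elif warmup_method == "linear":
            # Ramp linearly from warmup_factor up to 1 over warmup_iters.
            alpha = epoch / warmup_iters
            warmup = warmup_factor * (1 - alpha) + alpha
        else:
            raise ValueError(f"unsupported warmup_method: {warmup_method}")
    # Multiply by gamma once for every milestone already reached.
    return base_lr * warmup * gamma ** bisect_right(list(steps), epoch)
```

Under these assumptions and the defaults above, the rate starts at 1% of `base_lr`, reaches `base_lr` after 10 warmup steps, and is divided by 10 at steps 40 and 70.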

dataset#

The dataset parameter defines the dataset source, data loading, and augmentation.

dataset:
  train_dataset: /path/to/dataset/train
  val_dataset:
    reference: /path/to/dataset/reference
    query: /path/to/dataset/val
  workers: 8
  pixel_mean: [0.485, 0.456, 0.406]
  pixel_std: [0.226, 0.226, 0.226]
  padding: 10
  prob: 0.5
  re_prob: 0.5
  sampler: softmax_triplet
  num_instance: 4
  gaussian_blur:
    enabled: True
    kernel: [15, 15]
    sigma: [0.3, 0.7]
  color_augmentation:
    enabled: True
    brightness: 0.5
    contrast: 0.3
    saturation: 0.1
    hue: 0.1

Parameter	Datatype	Default	Description	Supported Values
`train_dataset`	string		The path to the train dataset. This field is only required for the train task.
`val_dataset`	dict		The paths to the reference set and the query set. For training and evaluation, both fields are required. For inference, only the reference set path is needed.	{"reference": /path/to/reference/set, "query": ""}
`workers`	unsigned int	8	The number of parallel workers processing data	>0
`class_map`	string		A YAML file mapping dataset class names to desired class names. If not specified, the reported class names are the folder names in the dataset folder.
`pixel_mean`	float list	[0.485, 0.456, 0.406]	The pixel mean for image normalization	float list
`pixel_std`	float list	[0.226, 0.226, 0.226]	The pixel standard deviation for image normalization	float list
`num_instance`	unsigned int	4	The number of image instances of the same class in a batch	>0
`prob`	float	0.5	The random horizontal flipping probability for image augmentation	>0
`re_prob`	float	0.5	The random erasing probability for image augmentation	>0
`random_rotation`	bool	True	If True, applies random rotations between 0 and 180 degrees to the input data	True/False
`gaussian_blur`	dict config	–	The configuration of the Gaussian blur augmentation on input samples
`color_augmentation`	dict config	–	The configuration of the color augmentation on input samples
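As an illustration of how `pixel_mean` and `pixel_std` are typically applied, the sketch below normalizes a single RGB pixel channel by channel. It assumes 8-bit pixel values are first scaled to the range [0, 1], which matches the magnitude of the default means but is not stated in this document. The function name is hypothetical.

```python
def normalize_pixel(rgb, pixel_mean=(0.485, 0.456, 0.406),
                    pixel_std=(0.226, 0.226, 0.226)):
    """Normalize one 8-bit RGB pixel channel-wise: (value / 255 - mean) / std."""
    return [(v / 255.0 - m) / s for v, m, s in zip(rgb, pixel_mean, pixel_std)]
```

The same mean and standard deviation must be used for training, evaluation, and inference, otherwise the embeddings of query and reference images are not comparable.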

Gaussian Blur Config#

Parameter	Datatype	Default	Description	Supported Values
`enabled`	bool	True	If True, applies Gaussian blur augmentation to input samples	True/False
`kernel`	unsigned int list	[15, 15]	The kernel size for the Gaussian blur
`sigma`	float list	[0.3, 0.7]	The sigma value range for the Gaussian blur

Color Augmentation Config#

Parameter	Datatype	Default	Description	Supported Values
`enabled`	bool	True	If True, applies color augmentation to input samples	True/False
`brightness`	float	0.5	The value of jittering brightness	>=0
`contrast`	float	0.3	The value of jittering contrast	>=0
`saturation`	float	0.1	The value of jittering saturation	>=0
`hue`	float	0.1	The value of jittering hue	>=0, <=0.5

Evaluating the Model#

Here is an example spec $EVAL_SPEC for evaluating an MLRecogNet model on a test dataset.

results_dir: /path/to/root/results/dir
model:
  backbone: resnet_50
  input_width: 224
  input_height: 224
  feat_dim: 256
dataset:
  workers: 8
  val_dataset:
    reference: /path/to/dataset/reference
    query: /path/to/dataset/val
evaluate:
  checkpoint: /path/to/checkpoint
  batch_size: 128
  results_dir: /path/to/results

Parameter	Datatype	Default	Description	Supported Values
`checkpoint`	string	None	The path to the .pth Torch model to be evaluated
`results_dir`	string	/results/evaluate	The directory to save evaluation results
`num_gpus`	unsigned int	1	The number of GPUs to use for distributed evaluation	>0
`gpu_ids`	List[int]	[0]	The indices of the GPUs to use for distributed evaluation
`trt_engine`	string	None	The path to the TensorRT (TRT) engine to be evaluated
`topk`	int	1	If greater than 1, the accuracy will be top-k precision	>0
`batch_size`	int	64	The batch size for the evaluation task	>0
`report_accuracy_per_class`	bool	True	If True, the top-1 precision of each class will be reported	True/False

The following are evaluation metrics for MLRecogNet:

Adjusted Mutual Information (AMI): A measure used in statistics and information theory to quantify the agreement between two assignments, such as cluster assignments, which is adjusted for chance and therefore provides a more accurate depiction of the similarity between the two compared to raw mutual information.
Normalized Mutual Information (NMI): A normalization of the Mutual Information (MI) score to scale the results between 0 (no mutual information) and 1 (perfect correlation).
Mean Average Precision: The average precision achieved by a model across different recall levels, providing a comprehensive evaluation of its performance on information retrieval.
Mean Average Precision at R: A model’s average precision for the top-R ranked results, offering insight into retrieval effectiveness when only a limited number of results is considered.
Mean Reciprocal Rank: The average of the inverse ranks of the first relevant result for a set of queries, emphasizing the importance of retrieving relevant information as early as possible.
Precision at 1: The accuracy of the nearest neighbor retrievals.
R Precision: An evaluation metric for information retrieval systems that measures the proportion of relevant documents among the top-R ranked results, where R corresponds to the total number of relevant documents for a given query.

When evaluate.report_accuracy_per_class is set to True, the per-class accuracy is also reported.
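Three of the retrieval metrics listed above can be sketched for a single query as follows. Each function takes `ranked_labels`, the labels of all reference images sorted from nearest to farthest, and the query's true label. This is a hedged illustration of the definitions, not the evaluation code used by MLRecogNet, and all function names are hypothetical.

```python
def precision_at_1(ranked_labels, true_label):
    """1.0 if the nearest reference has the query's label, else 0.0."""
    return 1.0 if ranked_labels and ranked_labels[0] == true_label else 0.0


def reciprocal_rank(ranked_labels, true_label):
    """Inverse rank of the first correct reference, or 0.0 if none is correct."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label == true_label:
            return 1.0 / rank
    return 0.0


def r_precision(ranked_labels, true_label):
    """Fraction of correct labels among the top R results, R = number of relevant references."""
    r = sum(1 for label in ranked_labels if label == true_label)
    if r == 0:
        return 0.0
    return sum(1 for label in ranked_labels[:r] if label == true_label) / r


def mean_over_queries(metric, rankings, true_labels):
    """Average a per-query metric over all queries."""
    return sum(metric(r, t) for r, t in zip(rankings, true_labels)) / len(true_labels)
```

Averaging `reciprocal_rank` with `mean_over_queries` gives Mean Reciprocal Rank, and averaging `precision_at_1` gives Precision at 1.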

Running Inference on the Model#

Here is an example spec $INFERENCE_SPEC for running MLRecogNet model inference on an inference set:

results_dir: /path/to/root/results/dir
model:
  backbone: resnet_50
  input_width: 224
  input_height: 224
  feat_dim: 256
dataset:
  workers: 8
  val_dataset:
    reference: /path/to/dataset/reference
    query: ""
inference:
  input_path: /path/to/dataset/test
  inference_input_type: classification_folder
  checkpoint: /path/to/model/checkpoint
  results_dir: /path/to/results/dir
  batch_size: 128

Parameter	Datatype	Default	Description	Supported Values
`checkpoint`	string	None	The path to the .pth Torch model to run inference on
`results_dir`	string	/results/inference	The directory to save inference results
`num_gpus`	unsigned int	1	The number of GPUs to use for distributed inference	>0
`gpu_ids`	List[int]	[0]	The indices of the GPUs to use for distributed inference
`trt_engine`	string	None	The path to the TensorRT (TRT) engine to run inference
`input_path`	string		The path to the data to run inference on
`inference_input_type`	string	"image_folder"	Three options are supported: `image_folder`: Used when `input_path` is a folder of images. `classification_folder`: Used when `input_path` is an `ImageNet` structured folder. `image`: Used when `input_path` is an image file	"image_folder"/"classification_folder"/"image"
`batch_size`	int	64	The batch size for the inference task	>0
`topk`	int	1	The number of top results to be returned	>0

Exporting the Model#

Here is an example spec $EXPORT_SPEC for exporting the MLRecogNet model:

results_dir: /path/to/root/results/dir
model:
  backbone: resnet_50
  input_width: 224
  input_height: 224
  feat_dim: 256
export:
  checkpoint: /path/to/checkpoint
  onnx_file: /path/to/results/model.onnx
  results_dir:  /path/to/results
  batch_size: -1
  on_cpu: false
  verbose: true

Parameter	Datatype	Default	Description	Supported Values
`checkpoint`	string	None	The path to the .pth Torch model to be exported
`onnx_file`	string	None	The path to the exported ONNX file. If this value is not specified, it defaults to `model.onnx` in `export.results_dir`
`batch_size`	int	-1	The batch size of the exported ONNX model. If `batch_size` is -1, the exported ONNX model has a dynamic batch size.	>0; -1
`gpu_id`	unsigned int	0	The GPU ID for Torch-to-ONNX export. Currently, the export task only supports running on a single GPU	>=0
`on_cpu`	bool	False	If True, the Torch-to-ONNX export will be performed on CPU	True/False
`opset_version`	unsigned int	14	The version of the default (ai.onnx) opset to target	>= 7 and <= 16.
`verbose`	bool	True	If True, prints a description of the model being exported to `stdout`.	True/False
`results_dir`	string	None	The path to the results directory of the export task