Date: 2026-03-03

Metric Learning Recognition#

Metric Learning Recognition (MLRecogNet) is a classifier that encodes the input image to embedding vectors and predicts their labels based on the embedding vectors in the reference space. MLRecogNet consists of two parts:

Trunk: A backbone network that encodes the input image to a feature vector.

Embedder: A fully connected layer that maps the feature vector to the embedding space.

The embedding space is a high-dimensional space where the distance between the embedding vectors of the same class is small and the distance between the embedding vectors of different classes is large. The embedder is trained to minimize the distance between the embedding vectors of the same class and maximize the distance between the embedding vectors of different classes. The embedding vectors of the query images are compared with the embedding vectors of the reference images to predict the labels of the query images.
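As a rough illustration of this objective, the sketch below computes a hinge-style triplet loss in plain Python. It assumes Euclidean distance and the form `max(d(a, p) - d(a, n) + margin, 0)`; the function names are hypothetical, and the loss and mining strategy actually used by the trainer may differ. The default margin of 0.3 mirrors the `triplet_loss_margin` value used in this page's example specs.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss (assumed form, not taken from the trainer).

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(gap + margin, 0.0)


anchor = [0.0, 0.0]
positive = [0.1, 0.0]  # same class, close to the anchor
negative = [1.0, 0.0]  # different class, far from the anchor

# Well-separated triplet: nothing left to optimize.
well_separated = triplet_margin_loss(anchor, positive, negative)
# Swapping the roles simulates a badly embedded triplet with a large loss.
violating = triplet_margin_loss(anchor, negative, positive)
print(well_separated, violating)
```

A triplet contributes to the gradient only while it violates the margin, which is why hard-example mining matters for this kind of loss.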

The currently supported trunk is ResNet, a commonly used baseline for vision classification, and the currently supported embedder is a one-layer MLP.

During training, evaluation, and inference, MLRecogNet requires a reference set and a query set for validation or test. The reference set is a collection of labeled images, while the query set is a group of unlabeled images. The goal is to predict the labels of the query images by comparing their embedding vectors with the embedding vectors of the reference set, both generated by the trained MLRecogNet.
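The comparison between query and reference embeddings can be sketched as a nearest-neighbor lookup. This is a minimal illustration that assumes Euclidean distance and a single nearest neighbor; `predict_label` and the toy two-dimensional embeddings are hypothetical, and the real pipeline works on model-generated embeddings and may use a different distance or a top-k vote.

```python
import math


def predict_label(query, reference):
    """Return (label, distance) of the reference embedding closest to `query`.

    `reference` is a list of (label, embedding) pairs.
    """
    best_label, best_dist = None, math.inf
    for label, embedding in reference:
        dist = math.dist(query, embedding)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist


# Toy reference set: labeled embeddings (hypothetical values).
reference = [
    ("class1", [0.9, 0.1]),
    ("class1", [0.8, 0.2]),
    ("class2", [0.1, 0.9]),
]
# Toy query set: unlabeled embeddings.
queries = [[0.85, 0.15], [0.2, 0.8]]

predictions = [predict_label(q, reference)[0] for q in queries]
print(predictions)
```

A query can only ever receive a label that exists in the reference set, which is why the reference set must cover every class you expect to query.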

Preparing the Dataset#

MLRecogNet requires cropped images from the detection set or classification set as input. These images are resized to 224x224 by default for model input. Augmentation is applied to each image during training. The data should be organized in the following structure:

    /Dataset
        /reference
            /class1
                0001.jpg
                0002.jpg
                ...
                0100.jpg
            /class2
                0001.jpg
                0002.jpg
                ...
                0100.jpg
            ...
        /train
            /class1
                0101.jpg
                0102.jpg
                ...
                0200.jpg
            /class2
                0101.jpg
                0102.jpg
                ...
                0200.jpg
        /val
            /class1
                0201.jpg
                0202.jpg
                ...
                0220.jpg
            /class2
                0201.jpg
                0202.jpg
                ...
                0220.jpg
        /test
            /class1
                0301.jpg
                0302.jpg
                ...
                0400.jpg
            /class2
                0301.jpg
                0302.jpg
                ...
                0400.jpg

The root directory of the dataset contains sub-directories for reference, training, validation, and test. The sub-directories are required to be in ImageNet structure, as demonstrated above. Each class sub-directory contains images of a single class. If classes in the test set are not in the reference set, the queried images cannot be correctly recognized.
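Because a query class that is missing from the reference set cannot be recognized, it can help to check the folder layout before training. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and it assumes class names are simply the immediate sub-directory names of each split, as in the layout above.

```python
import tempfile
from pathlib import Path


def class_names(split_dir):
    """Class names are taken to be the immediate sub-directories of a split."""
    return {p.name for p in Path(split_dir).iterdir() if p.is_dir()}


def missing_from_reference(root, split):
    """Return classes present in `split` but absent from the reference set."""
    root = Path(root)
    return sorted(class_names(root / split) - class_names(root / "reference"))


# Build a tiny throwaway dataset so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    layout = {"reference": ["class1", "class2"], "test": ["class1", "class3"]}
    for split, classes in layout.items():
        for name in classes:
            (Path(tmp) / split / name).mkdir(parents=True)
    missing = missing_from_reference(tmp, "test")

print(missing)
```

An empty result means every class in the checked split has reference images; any name listed would need to be added to the reference set or removed from the split.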

TAO Client (v2 API)

    SPECS=$(tao metric_learning_recognition get-job-schema --action train --base-experiment-id $BASE_EXPERIMENT_ID | jq -r '.default')

TAO Launcher

Here is an example spec $TRAIN_SPEC for training an MLRecogNet model on a target dataset.

    results_dir: "???"
    model:
      backbone: resnet_101
      pretrained_model_path: /path/to/resnet101_pretrained_mlrecog.pth.tar
      input_width: 224
      input_height: 224
      feat_dim: 2048
    train:
      optim:
        name: Adam
        steps: [40, 70]
        gamma: 0.1
        embedder:
          bias_lr_factor: 1
          weight_decay: 0.0001
          weight_decay_bias: 0.0005
          base_lr: 0.000001
          momentum: 0.9
        trunk:
          bias_lr_factor: 1
          weight_decay: 0.0001
          weight_decay_bias: 0.0005
          base_lr: 0.00001
          momentum: 0.9
        warmup_factor: 0.01
        warmup_iters: 10
        warmup_method: linear
        triplet_loss_margin: 0.3
        miner_function_margin: 0.1
      num_epochs: 10
      resume_training_checkpoint_path: null
      checkpoint_interval: 5
      validation_interval: 5
      smooth_loss: False
      batch_size: 16
      val_batch_size: 16
      seed: 1234
    dataset:
      train_dataset: /path/to/dataset/train
      val_dataset:
        reference: /path/to/dataset/reference
        query: /path/to/dataset/val
      workers: 12
      pixel_mean: [0.485, 0.456, 0.406]
      pixel_std: [0.226, 0.226, 0.226]
      prob: 0.5
      re_prob: 0.5
      num_instance: 4
      color_augmentation:
        enabled: True
        brightness: 0.5
        contrast: 0.3
        saturation: 0.1
        hue: 0.1
      gaussian_blur:
        enabled: True
        kernel: [15, 15]
        sigma: [0.3, 0.7]
      random_rotation: True
      class_map: /path/to/class_map.yaml

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| model | dict config | – | The configuration of the model architecture | |
| dataset | dict config | – | The configuration of the dataset | |
| train | dict config | – | The configuration of the training task | |
| evaluate | dict config | – | The configuration of the evaluation task | |
| inference | dict config | – | The configuration of the inference task | |
| encryption_key | string | None | The encryption key to encrypt and decrypt model files | |
| results_dir | string | /results | The directory where experiment results are saved | |
| export | dict config | – | The configuration of the ONNX export task | |
| gen_trt_engine | dict config | – | The configuration of the TensorRT generation task. Only used in tao deploy | |

model#

The model parameter provides options to change the MetricLearningRecognition architecture.

    model:
      backbone: resnet_50
      pretrained_model_path: "/path/to/pretrained_model.pth"
      pretrained_embedder_path: null
      pretrained_trunk_path: null
      input_channels: 3
      input_width: 224
      input_height: 224
      feat_dim: 256

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| backbone | string | resnet_50 | Backbone (trunk) model type. | resnet_50, resnet_101, fan_small, fan_base, fan_large, fan_tiny, nvdinov2_vit_large_legacy |
| pretrained_model_path | string | | The path to the pretrained model. The weights are only loaded to the full model | |
| pretrained_trunk_path | string | | The path to the pretrained trunk. The weights are only loaded to the trunk part. | |
| pretrained_embedder_path | string | | The path to the pretrained embedder. The weights are only loaded to the embedder part. | |
| input_channels | unsigned int | 3 | The number of input channels | >0 |
| input_width | int | 224 | The input width of the images | int |
| input_height | int | 224 | The input height of the images | int |
| feat_dim | unsigned int | 256 | The output size of the feature embeddings | >0 |

train#

The train parameter defines the hyperparameters of the training process.
    train:
      optim:
        name: Adam
        steps: [40, 70]
        gamma: 0.1
        warmup_factor: 0.01
        warmup_iters: 10
        warmup_method: 'linear'
        triplet_loss_margin: 0.3
        miner_function_margin: 0.1
        embedder:
          bias_lr_factor: 1
          base_lr: 0.000001
          momentum: 0.9
          weight_decay: 0.0001
          weight_decay_bias: 0.0005
        trunk:
          bias_lr_factor: 1
          base_lr: 0.00001
          momentum: 0.9
          weight_decay: 0.0001
          weight_decay_bias: 0.0005
      num_epochs: 10
      checkpoint_interval: 5
      validation_interval: 5
      clip_grad_norm: 0.0
      resume_training_checkpoint_path: null
      report_accuracy_per_class: True
      smooth_loss: True
      batch_size: 64
      val_batch_size: 64
      train_trunk: false
      train_embedder: true
      results_dir: null
      seed: 1234

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| num_gpus | unsigned int | 1 | The number of GPUs to use for distributed training | >0 |
| gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed training | |
| seed | unsigned int | 1234 | The random seed for random, NumPy, and torch | >0 |
| num_epochs | unsigned int | 10 | The total number of epochs to run the experiment | >0 |
| checkpoint_interval | unsigned int | 1 | The epoch interval at which the checkpoints are saved | >0 |
| validation_interval | unsigned int | 1 | The epoch interval at which the validation is run | >0 |
| resume_training_checkpoint_path | string | | The intermediate PyTorch Lightning checkpoint to resume training from | |
| results_dir | string | /results/train | The directory to save training results | |
| optim | dict config | – | The configuration for the torch optimizer (Optim Config), including the learning rate, learning scheduler, weight decay, etc. | |
| clip_grad_norm | float | 0.0 | The amount to clip the gradient by the L2 norm. A value of 0.0 specifies no clipping. | >=0 |
| report_accuracy_per_class | bool | True | If True, the top1 precision of each class will be reported. | True/False |
| smooth_loss | bool | True | If True, the log-exp version of the triplet loss will be used. | True/False |
| batch_size | unsigned int | 64 | The batch size for training | >0 |
| val_batch_size | unsigned int | 64 | The batch size for validation | >0 |
| train_trunk | bool | True | If False, the trunk part of the model is frozen during training | True/False |
| train_embedder | bool | True | If False, the embedder part of the model is frozen during training | True/False |

optim#

The optim parameter defines the configuration for the Torch optimizer in training, including the learning rate, learning scheduler, and weight decay.

    optim:
      name: Adam
      steps: [40, 70]
      gamma: 0.1
      warmup_factor: 0.01
      warmup_iters: 10
      warmup_method: 'linear'
      triplet_loss_margin: 0.3
      miner_function_margin: 0.1
      embedder:
        bias_lr_factor: 1
        base_lr: 0.00035
        momentum: 0.9
        weight_decay: 0.0005
        weight_decay_bias: 0.0005
      trunk:
        bias_lr_factor: 1
        base_lr: 0.00035
        momentum: 0.9
        weight_decay: 0.0005
        weight_decay_bias: 0.0005

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| name | string | Adam | The name of the optimizer. The algorithms in torch.optim are supported. | Adam/SGD/Adamax/… |
| steps | int list | [40, 70] | The steps to decrease the learning rate for the MultiStep scheduler | |
| gamma | float | 0.1 | The decay rate for the WarmupMultiStepLR scheduler | >0.0 |
| warmup_factor | float | 0.01 | The warmup factor for the WarmupMultiStepLR scheduler | >0.0 |
| warmup_iters | unsigned int | 10 | The number of warmup iterations for the WarmupMultiStepLR scheduler | >0 |
| warmup_method | string | linear | The warmup method for the optimizer | constant/linear |
| triplet_loss_margin | float | 0.3 | The desired difference between the anchor-positive distance and the anchor-negative distance | >0.0 |
| miner_function_margin | float | 0.1 | Negative pairs are chosen if they have similarity greater than the hardest positive pair, minus this margin; positive pairs are chosen if they have similarity less than the hardest negative pair, plus the margin | >0.0 |
| embedder | dict config | – | The learning rate configurations (LR Config) for the MLRecogNet embedder | |
| trunk | dict config | – | The learning rate configurations (LR Config) for the MLRecogNet trunk | |

LR Config#

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| base_lr | float | 0.00035 | The initial learning rate for the training | >0.0 |
| bias_lr_factor | float | 1 | The bias learning rate factor for the WarmupMultiStepLR | >=1 |
| momentum | float | 0.9 | The momentum for the WarmupMultiStepLR optimizer | >0.0 |
| weight_decay | float | 0.0005 | The weight decay coefficient for the optimizer | >0.0 |
| weight_decay_bias | float | 0.0005 | The weight decay bias for the optimizer | >0.0 |

dataset#

The dataset parameter defines the dataset source, training batch size, and augmentation.
    dataset:
      train_dataset: /path/to/dataset/train
      val_dataset:
        reference: /path/to/dataset/reference
        query: /path/to/dataset/val
      workers: 8
      pixel_mean: [0.485, 0.456, 0.406]
      pixel_std: [0.226, 0.226, 0.226]
      padding: 10
      prob: 0.5
      re_prob: 0.5
      sampler: softmax_triplet
      num_instance: 4
      gaussian_blur:
        enabled: True
        kernel: [15, 15]
        sigma: [0.3, 0.7]
      color_augmentation:
        enabled: True
        brightness: 0.5
        contrast: 0.3
        saturation: 0.1
        hue: 0.1

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| train_dataset | string | | The path to the train dataset. This field is only required for the train task. | |
| val_dataset | dict | | The map of reference set and query set addresses. For training and evaluation, both fields are required. For inference, only the reference set address is needed. | {“reference”: /path/to/reference/set, “query”: “”} |
| workers | unsigned int | 8 | The number of parallel workers processing data | >0 |
| class_map | string | | In tao model, the class_map is a YAML file mapping dataset class names to desired class names. If it is not specified, the reported class names default to the folder names in the dataset folder. In tao deploy, the class_map is a TXT file listing the class names line by line, where the line index is the class index. By default, the class names are the folder names and the classes are ordered alphanumerically. | |
| pixel_mean | float list | [0.485, 0.456, 0.406] | The pixel mean for image normalization | float list |
| pixel_std | float list | [0.226, 0.226, 0.226] | The pixel standard deviation for image normalization | float list |
| num_instance | unsigned int | 4 | The number of image instances of the same class in a batch | >0 |
| prob | float | 0.5 | The random horizontal flipping probability for image augmentation | >0 |
| re_prob | float | 0.5 | The random erasing probability for image augmentation | >0 |
| random_rotation | bool | True | If True, random rotations of 0 to 180 degrees are applied to the input data | True/False |
| gaussian_blur | dict config | – | The configuration of the Gaussian blur augmentation on input samples | |
| color_augmentation | dict config | – | The configuration of the color augmentation on input samples | |

Gaussian Blur Config#

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| enabled | bool | True | If True, applies Gaussian blur augmentation to input samples | True/False |
| kernel | unsigned int list | [15, 15] | The kernel size for the Gaussian blur | |
| sigma | float list | [0.3, 0.7] | The sigma value range for the Gaussian blur | |

Color Augmentation Config#

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| enabled | bool | True | If True, applies color augmentation to input samples | True/False |
| brightness | float | 0.5 | The value of jittering brightness | >=0 |
| contrast | float | 0.3 | The value of jittering contrast | >=0 |
| saturation | float | 0.1 | The value of jittering saturation | >=0 |
| hue | float | 0.1 | The value of jittering hue | >=0, <=0.5 |
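To make the interaction of `steps`, `gamma`, `warmup_factor`, `warmup_iters`, and `warmup_method` in the optim configuration more concrete, here is a minimal sketch of a warmup multi-step schedule. It follows a commonly used formulation of this kind of scheduler and is an assumption, not the toolkit's verified implementation; in particular, whether `warmup_iters` and `steps` count epochs or iterations is not stated here, so the sketch treats both as epoch indices. The function name is hypothetical.

```python
from bisect import bisect_right


def warmup_multistep_lr(epoch, base_lr, steps=(40, 70), gamma=0.1,
                        warmup_factor=0.01, warmup_iters=10,
                        warmup_method="linear"):
    """Learning rate at `epoch` under an assumed warmup multi-step schedule."""
    factor = 1.0
    if epoch < warmup_iters:
        if warmup_method == "constant":
            factor = warmup_factor
        elif warmup_method == "linear":
            # Ramp linearly from warmup_factor up to 1.0 over warmup_iters.
            alpha = epoch / warmup_iters
            factor = warmup_factor * (1 - alpha) + alpha
        else:
            raise ValueError(f"unsupported warmup_method: {warmup_method}")
    # Multiply by gamma once for every milestone already reached.
    return base_lr * factor * gamma ** bisect_right(steps, epoch)


schedule = {e: warmup_multistep_lr(e, base_lr=0.00035) for e in (0, 5, 10, 40, 70)}
for epoch, lr in schedule.items():
    print(f"epoch {epoch:>2}: lr = {lr:.3e}")
```

Under these assumptions the rate starts at `base_lr * warmup_factor`, reaches `base_lr` when warmup ends, and is then scaled by `gamma` at each entry in `steps`.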

Training the Model#

Use the following command to run MLRecogNet training:

TAO Client (v2 API)

    TRAIN_JOB_ID=$(tao metric_learning_recognition create-job \
      --kind experiment \
      --name "metric_learning_recognition_train" \
      --action train \
      --workspace-id $WORKSPACE_ID \
      --specs "$TRAIN_SPECS" \
      --train-datasets '["'$DATASET_ID'"]' \
      --eval-dataset "$DATASET_ID" \
      --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
      --encryption-key "nvidia_tlt" | jq -r '.id')

See also: For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.

TAO Launcher

    tao model ml_recog train -e <experiment_spec_file>
                             [results_dir=<global_results_dir>]
                             [model.<model_option>=<model_option_value>]
                             [dataset.<dataset_option>=<dataset_option_value>]
                             [train.<train_option>=<train_option_value>]
                             [train.gpu_ids=<gpu indices>]
                             [train.num_gpus=<number of gpus>]

Required Arguments

The only required argument is the path to the experiment spec:

-e, --experiment_spec: The experiment specification file to set up the training experiment

Optional Arguments

You can set optional arguments to override the option values in the experiment spec file.

-h, --help: Show this help message and exit.

model.<model_option>: The model options.

dataset.<dataset_option>: The dataset options.

train.<train_option>: The train options.

train.optim.<optim_option>: The optimizer options.

Note

For training, evaluation, and inference, we expose two variables for each task: num_gpus and gpu_ids, which default to 1 and [0], respectively. If both are passed but are inconsistent, for example num_gpus = 1 and gpu_ids = [0, 1], then they are modified to follow the setting that implies more GPUs; in the same example, num_gpus is modified from 1 to 2.

In some cases, multi-GPU training may result in a segmentation fault. You can circumvent this by setting the environment variable OMP_NUM_THREADS to 1. Depending on your mode of execution, you may use the following methods to set this variable:

CLI Launcher: You may set the environment variable by adding the following fields to the Envs field of your ~/.tao_mounts.json file, as mentioned in bullet 3 in the section Running the launcher.

    {
        "Envs": [
            {
                "variable": "OMP_NUM_THREADS",
                "value": "1"
            }
        ]
    }

Docker: You may set environment variables in Docker by setting the -e flag in the Docker command line.

    docker run -it --rm --gpus all \
        -e OMP_NUM_THREADS=1 \
        -v /path/to/local/mount:/path/to/docker/mount nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt <model> train -e

Checkpointing and Resuming Training

At every train.checkpoint_interval, a PyTorch Lightning checkpoint is saved. It is called model_epoch_<epoch_num>.pth. Checkpoints are saved in train.results_dir, like this:

    $ ls /results/train
    'model_epoch_000.pth'
    'model_epoch_001.pth'
    'model_epoch_002.pth'
    'model_epoch_003.pth'
    'model_epoch_004.pth'

The latest checkpoint is also saved as ml_model_latest.pth. Training automatically resumes from ml_model_latest.pth if it exists in train.results_dir. This is superseded by train.resume_training_checkpoint_path, if it is provided.

The major implication of this logic is that, if you wish to trigger fresh training from scratch, you must do one of the following:

Specify a new, empty results directory (Recommended)

Remove the latest checkpoint from the results directory

Here’s an example of output $RESULTS_DIR/train/status.json:

    {"date": "6/20/2023", "time": "23:11:2", "status": "STARTED", "verbosity": "INFO", "message": "Starting Training Loop."}
    ...
    {"date": "6/20/2023", "time": "23:11:22", "status": "SUCCESS", "verbosity": "INFO", "message": "Train finished successfully."}
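The checkpoint resolution order described in this section can be sketched as follows. This is only an illustration of the stated precedence (an explicit resume path wins, then ml_model_latest.pth in the results directory, otherwise a fresh start); `resolve_resume_checkpoint` is a hypothetical helper, not a function exposed by the toolkit.

```python
import tempfile
from pathlib import Path

LATEST_NAME = "ml_model_latest.pth"  # file name taken from the text above


def resolve_resume_checkpoint(results_dir, resume_training_checkpoint_path=None):
    """Return the checkpoint to resume from, or None for fresh training."""
    # An explicitly provided path supersedes the automatically saved one.
    if resume_training_checkpoint_path:
        return Path(resume_training_checkpoint_path)
    latest = Path(results_dir) / LATEST_NAME
    return latest if latest.exists() else None


with tempfile.TemporaryDirectory() as tmp:
    # Empty results directory: training starts from scratch.
    fresh = resolve_resume_checkpoint(tmp)
    # Once a latest checkpoint exists, it is picked up automatically.
    (Path(tmp) / LATEST_NAME).touch()
    auto_name = resolve_resume_checkpoint(tmp).name
    # An explicit path overrides the latest checkpoint.
    explicit = resolve_resume_checkpoint(tmp, "/path/to/other.pth")

print(fresh, auto_name, explicit.name)
```

This makes the two recommended ways of forcing a fresh run visible: both an empty results directory and a removed latest checkpoint lead to the `None` branch.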

Evaluating the Model#

Here is an example spec $EVAL_SPEC for evaluating an MLRecogNet model on a test dataset.

    results_dir: /path/to/root/results/dir
    model:
      backbone: resnet_50
      input_width: 224
      input_height: 224
      feat_dim: 256
    dataset:
      workers: 8
      val_dataset:
        reference: /path/to/dataset/reference
        query: /path/to/dataset/val
    evaluate:
      checkpoint: /path/to/checkpoint
      batch_size: 128
      results_dir: /path/to/results

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| checkpoint | string | None | The path to the .pth Torch model to be evaluated | |
| results_dir | string | /results/evaluate | The directory to save evaluation results | |
| num_gpus | unsigned int | 1 | The number of GPUs to use for distributed evaluation | >0 |
| gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed evaluation | |
| trt_engine | string | None | The path to the TensorRT (TRT) engine to be evaluated. Currently, trt_engine is only supported in TAO Deploy | |
| topk | int | 1 | If greater than 1, the accuracy will be top-k precision. Currently, evaluate.topk is only supported in TAO Deploy | >0 |
| batch_size | int | 64 | The batch size for the evaluation task | >0 |
| report_accuracy_per_class | bool | True | If True, the top-1 precision of each class will be reported | True/False |

The following are evaluation metrics for MLRecogNet:

Adjusted Mutual Information (AMI): A measure used in statistics and information theory to quantify the agreement between two assignments, such as cluster assignments. It is adjusted for chance and therefore gives a more accurate picture of the similarity between the two assignments than raw mutual information.

Normalized Mutual Information (NMI): A normalization of the Mutual Information (MI) score to scale the results between 0 (no mutual information) and 1 (perfect correlation).

Mean Average Precision: The average precision achieved by a model across different recall levels, providing a comprehensive evaluation of its performance on information retrieval.

Mean Average Precision at r: A model’s average precision for the top-R ranked results, offering insight into the effectiveness of the retrieval or object detection performance of the model when considering a limited number of results.

Mean Reciprocal Rank: The average of the inverse ranks of the first relevant result for a set of queries, emphasizing the importance of retrieving relevant information as early as possible.

Precision at 1: The accuracy of the nearest neighbor retrievals.
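Two of the retrieval metrics above are simple enough to compute by hand. The sketch below is a plain-Python illustration with hypothetical helper names and toy data; it assumes each query comes with a list of reference labels ranked from nearest to farthest, and it is not the evaluation code used by the toolkit.

```python
def precision_at_1(ranked_labels, true_labels):
    """Fraction of queries whose nearest reference has the correct label."""
    hits = sum(
        1 for ranked, truth in zip(ranked_labels, true_labels)
        if ranked and ranked[0] == truth
    )
    return hits / len(true_labels)


def mean_reciprocal_rank(ranked_labels, true_labels):
    """Average of 1/rank of the first correct label (0 if never retrieved)."""
    total = 0.0
    for ranked, truth in zip(ranked_labels, true_labels):
        for rank, label in enumerate(ranked, start=1):
            if label == truth:
                total += 1.0 / rank
                break
    return total / len(true_labels)


# Toy retrieval results: labels of the three nearest references per query.
ranked = [["a", "b", "a"], ["b", "a", "a"], ["c", "c", "a"]]
truth = ["a", "a", "a"]

p_at_1 = precision_at_1(ranked, truth)
mrr = mean_reciprocal_rank(ranked, truth)
print(f"Precision at 1: {p_at_1:.4f}, Mean Reciprocal Rank: {mrr:.4f}")
```

Here only the first query is correct at rank 1, while the other two find the right label at ranks 2 and 3, so Mean Reciprocal Rank is higher than Precision at 1.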

Use the following command to run MLRecogNet evaluation:

TAO Client (v2 API)

    EVAL_JOB_ID=$(tao metric_learning_recognition create-job \
      --kind experiment \
      --name "metric_learning_recognition_evaluate" \
      --action evaluate \
      --workspace-id $WORKSPACE_ID \
      --parent-job-id $TRAIN_JOB_ID \
      --eval-dataset "$DATASET_ID" \
      --specs "$EVALUATE_SPECS" \
      --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
      --encryption-key "nvidia_tlt" | jq -r '.id')

TAO Launcher

    tao model ml_recog evaluate -e <experiment_spec_file>
                                evaluate.checkpoint=<model to be evaluated>
                                dataset.val_dataset.reference=<path to test reference set>
                                dataset.val_dataset.query=<path to test query set>
                                [evaluate.<evaluate_option>=<evaluate_option_value>]
                                [evaluate.gpu_ids=<gpu indices>]
                                [evaluate.num_gpus=<number of gpus>]

Required Arguments

The following arguments are required.

-e, --experiment_spec_file: The experiment spec file to set up the evaluation experiment

evaluate.checkpoint: The path to the .pth model to be evaluated

dataset.val_dataset.reference: The path to the test reference set

dataset.val_dataset.query: The path to the test query set

Optional Arguments

evaluate.<evaluate_option>: The evaluate options.

Here’s an example of output $RESULTS_DIR/evaluate/status.json:

    {"date": "6/2/2023", "time": "6:12:16", "status": "STARTED", "verbosity": "INFO", "message": "Starting Metric Learning Recognition evaluate."}
    {"date": "6/2/2023", "time": "6:12:17", "status": "STARTED", "verbosity": "INFO", "message": "Loading checkpoint: $RESULTS_DIR/train/ml_model_epoch=000.pth"}
    {"date": "6/2/2023", "time": "6:12:17", "status": "RUNNING", "verbosity": "INFO", "message": "Constructing model graph..."}
    {"date": "6/2/2023", "time": "6:12:17", "status": "SKIPPED", "verbosity": "INFO", "message": "Skipped loading pretrained model as checkpoint is to load."}
    {"date": "6/2/2023", "time": "6:12:23", "status": "SUCCESS", "verbosity": "INFO", "message": "Evaluate finished successfully.", "kpi": {"AMI": 0.8074901483322209, "NMI": 0.8118350536509751, "Mean Average Precision": 0.6876838920302153, "Mean Reciprocal Rank": 0.992727267742157, "r-Precision": 0.666027864375903, "Precision at Rank 1": 0.989090909090909}}

The following is an example of the printouts:

    Starting Metric Learning Recognition evaluate.
    Experiment configuration:
    ...
    results_dir: $RESULTS_DIR
    Loading checkpoint: $RESULTS_DIR/train/ml_model_epoch=000.pth
    Constructing model graph...
    Skipped loading pretrained model as checkpoint is to load.
    Evaluating epoch eval mode
    ...
    Computing accuracy for the query split w.r.t ['gallery']
    running k-nn with k=106
    embedding dimensionality is 256
    /usr/local/lib/python3.8/dist-packages/torch/storage.py:315: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
      warnings.warn(message, UserWarning)
    running k-means clustering with k=5
    embedding dimensionality is 256
    ******************* Evaluation results **********************
    AMI: 0.8075
    NMI: 0.8118
    Mean Average Precision: 0.7560
    Mean Reciprocal Rank: 0.9922
    r-Precision: 0.7421
    Precision at Rank 1: 0.9882
    *************************************************************
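If you want to pick up the evaluation KPIs programmatically, one option is to read them from status.json. The sketch below assumes the file holds one JSON object per line, as the example output above suggests; that layout and the `final_kpi` helper are assumptions for illustration, and the sample records are abridged from the example.

```python
import json


def final_kpi(status_text):
    """Return the `kpi` dict of the last SUCCESS record that carries one."""
    kpi = None
    for line in status_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("status") == "SUCCESS" and "kpi" in record:
            kpi = record["kpi"]
    return kpi


# Abridged sample in the assumed one-object-per-line layout.
sample = """
{"date": "6/2/2023", "time": "6:12:16", "status": "STARTED", "verbosity": "INFO", "message": "Starting Metric Learning Recognition evaluate."}
{"date": "6/2/2023", "time": "6:12:23", "status": "SUCCESS", "verbosity": "INFO", "message": "Evaluate finished successfully.", "kpi": {"AMI": 0.8074901483322209, "Precision at Rank 1": 0.989090909090909}}
"""

kpi = final_kpi(sample)
for name, value in kpi.items():
    print(f"{name}: {value:.4f}")
```

In practice you would pass the contents of $RESULTS_DIR/evaluate/status.json instead of the inline sample.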

Running Inference on the Model#

Here is an example spec $INFERENCE_SPEC for running MLRecogNet model inference on an inference set:

    results_dir: /path/to/root/results/dir
    model:
      backbone: resnet_50
      input_width: 224
      input_height: 224
      feat_dim: 256
    dataset:
      workers: 8
      val_dataset:
        reference: /path/to/dataset/reference
        query: ""
    inference:
      input_path: /path/to/dataset/test
      inference_input_type: classification_folder
      checkpoint: /path/to/model/checkpoint
      results_dir: /path/to/results/dir
      batch_size: 128

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| checkpoint | string | None | The path to the .pth torch model to run inference | |
| results_dir | string | /results/inference | The directory to save inference results | |
| num_gpus | unsigned int | 1 | The number of GPUs to use for distributed inference | >0 |
| gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed inference | |
| trt_engine | string | None | The path to the TensorRT (TRT) engine to run inference. Currently, trt_engine is only supported in TAO Deploy. | |
| input_path | string | | The path to the data to run inference on | |
| inference_input_type | string | “image_folder” | The type of the input data. See the supported values. | image_folder: Used when input_path is a folder of images. classification_folder: Used when input_path is an ImageNet structured folder. |

Use the following command to run MLRecogNet inference:

TAO Client (v2 API)

    INFERENCE_JOB_ID=$(tao metric_learning_recognition create-job \
      --kind experiment \
      --name "metric_learning_recognition_inference" \
      --action inference \
      --workspace-id $WORKSPACE_ID \
      --parent-job-id $TRAIN_JOB_ID \
      --inference-dataset "$DATASET_ID" \
      --specs "$INFERENCE_SPECS" \
      --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
      --encryption-key "nvidia_tlt" | jq -r '.id')

TAO Launcher

    tao model ml_recog inference -e <experiment_spec>
                                 inference.checkpoint=<inference model>
                                 dataset.val_dataset.reference=<path to gallery data>
                                 inference.input_path=<path to query data>
                                 [inference.<inference_option>=<inference_option_value>]
                                 [inference.gpu_ids=<gpu indices>]
                                 [inference.num_gpus=<number of gpus>]

The output is a CSV file that contains the feature embeddings of all the query data and their predicted labels.

Required Arguments

The following arguments are required.

-e, --experiment_spec: The experiment spec file to set up inference

inference.checkpoint: The .pth model to perform inference with

dataset.val_dataset.reference: The path to the reference set

inference.input_path: The path to the data to run inference on

Optional Arguments

inference.<inference_option>: The inference options.

The expected output is as follows:

    /path/to/images/c000001_10.png,"['c000001', 'c000005', 'c000001', 'c000005']","[5.0030694183078595e-06, 5.5495906963187736e-06, 5.976316515443614e-06, 6.004379429214168e-06]"
    /path/to/images/c000001_11.png,"['c000001', 'c000005', 'c000001', 'c000001']","[3.968068540416425e-06, 5.043690180173144e-06, 5.885293830942828e-06, 6.030047643434955e-06]"
    /path/to/images/c000001_120.png,"['c000001', 'c000001', 'c000005', 'c000003']","[1.9612791675172048e-06, 4.112744136364199e-06, 4.603011802828405e-06, 5.8091877690458205e-06]"

The first column contains the inference image paths, the second column contains the top-k predicted labels, and the third column contains the embedding vector distances of the top-k results.

Here’s an example of output $RESULTS_DIR/inference/status.json:

    {"date": "6/2/2023", "time": "6:13:47", "status": "STARTED", "verbosity": "INFO", "message": "Starting Metric Learning Recognition inference."}
    {"date": "6/2/2023", "time": "6:13:47", "status": "STARTED", "verbosity": "INFO", "message": "Loading checkpoint: $RESULTS_DIR/train/ml_model_epoch=001.pth"}
    {"date": "6/2/2023", "time": "6:13:47", "status": "RUNNING", "verbosity": "INFO", "message": "Constructing model graph..."}
    {"date": "6/2/2023", "time": "6:13:48", "status": "SKIPPED", "verbosity": "INFO", "message": "Skipped loading pretrained model as checkpoint is to load."}
    {"date": "6/2/2023", "time": "6:14:6", "status": "SUCCESS", "verbosity": "INFO", "message": "result saved at $RESULTS_DIR/inference/result.csv"}
    {"date": "6/2/2023", "time": "6:14:6", "status": "SUCCESS", "verbosity": "INFO", "message": "Inference finished successfully."}

The following is an example of the printouts:

    Starting Metric Learning Recognition inference.
    Experiment configuration:
    ...
    Loading checkpoint: $RESULTS_DIR/train/ml_model_epoch=001.pth
    Constructing model graph...
    Skipped loading pretrained model as checkpoint is to load.
    /usr/local/lib/python3.8/dist-packages/torch/storage.py:315: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
      warnings.warn(message, UserWarning)
    ...
    result saved at $RESULTS_DIR/inference/result.csv
    Inference finished successfully.
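The inference CSV can be post-processed with the Python standard library. The sketch below assumes three columns per row (image path, a quoted list of top-k labels, a quoted list of top-k distances) and no header row, matching the example output; `parse_results` is a hypothetical helper, and the exact quoting of the real file may differ.

```python
import ast
import csv
import io


def parse_results(csv_text):
    """Parse rows of (path, top-k labels, top-k distances) from CSV text."""
    rows = []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for path, labels, distances in reader:
        # The list columns are stored as Python-style list literals.
        rows.append((path.strip(), ast.literal_eval(labels), ast.literal_eval(distances)))
    return rows


# First row of the example output shown in this section.
sample = """/path/to/images/c000001_10.png,"['c000001', 'c000005', 'c000001', 'c000005']","[5.0030694183078595e-06, 5.5495906963187736e-06, 5.976316515443614e-06, 6.004379429214168e-06]"
"""

results = parse_results(sample)
for path, labels, distances in results:
    # The first entry is the nearest reference, i.e. the top-1 prediction.
    print(path, labels[0], distances[0])
```

`ast.literal_eval` is used instead of `eval` so that only plain literals in the list columns are accepted.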

Exporting the Model#

Use the following command to export the MLRecogNet model:

TAO Client (v2 API)

    EXPORT_JOB_ID=$(tao metric_learning_recognition create-job \
      --kind experiment \
      --name "metric_learning_recognition_export" \
      --action export \
      --workspace-id $WORKSPACE_ID \
      --parent-job-id $TRAIN_JOB_ID \
      --specs "$EXPORT_SPECS" \
      --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
      --encryption-key "nvidia_tlt" | jq -r '.id')

TAO Launcher

    tao model ml_recog export -e <experiment_spec>
                              export.checkpoint=<.pth checkpoint to be exported>
                              [export.onnx_file=<path to exported ONNX file>]
                              [export.<export_option>=<export_option_value>]

Required Arguments

The following arguments are required.

-e, --experiment_spec: The experiment spec file to set up export

export.checkpoint: The .pth model to be exported

Optional Arguments

The following arguments are optional to run the command.

export.onnx_file: The path to save the exported model to. The default path is in the same directory as the export.results_dir (if any) or results_dir.

export.<export_option>: The export options.

Here’s an example of output $RESULTS_DIR/export/status.json:

    {"date": "6/2/2023", "time": "6:17:45", "status": "STARTED", "verbosity": "INFO", "message": "Starting Metric Learning Recognition export."}
    {"date": "6/2/2023", "time": "6:17:45", "status": "STARTED", "verbosity": "INFO", "message": "Loading checkpoint: $RESULTS_DIR/train/ml_model_epoch=001.pth"}
    {"date": "6/2/2023", "time": "6:17:45", "status": "RUNNING", "verbosity": "INFO", "message": "Constructing model graph..."}
    {"date": "6/2/2023", "time": "6:17:46", "status": "SKIPPED", "verbosity": "INFO", "message": "Skipped loading pretrained model as checkpoint is to load."}
    {"date": "6/2/2023", "time": "6:17:46", "status": "STARTED", "verbosity": "INFO", "message": "Exporting model to ONNX"}
    {"date": "6/2/2023", "time": "6:17:48", "status": "STARTED", "verbosity": "INFO", "message": "Simplifying ONNX model"}
    {"date": "6/2/2023", "time": "6:17:50", "status": "SUCCESS", "verbosity": "INFO", "message": "ONNX model saved at $RESULTS_DIR/export/ml_model_epoch=001.onnx"}
    {"date": "6/2/2023", "time": "6:17:50", "status": "SUCCESS", "verbosity": "INFO", "message": "Export finished successfully."}

The following is an example of the printouts:

    Starting Metric Learning Recognition export.
    Experiment configuration:
    ...
    Loading checkpoint: $RESULTS_DIR/train/ml_model_epoch=001.pth
    Constructing model graph...
    Skipped loading pretrained model as checkpoint is to load.
    Exporting model to ONNX
    Exported graph: graph(%input : Float(*, 3, 224, 224, strides=[150528, 50176, 224, 1], requires_grad=0, device=cuda:0),
    ...
    ========== Diagnostic Run torch.onnx.export version 1.14.0a0+44dac51 ===========
    verbose: False, log level: Level.ERROR
    ======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
    Simplifying ONNX model
    Checking 0/3...
    Checking 1/3...
    Checking 2/3...
    ONNX model saved at $RESULTS_DIR/export/ml_model_epoch=001.onnx
    Export finished successfully.