Metric Learning Recognition
Metric Learning Recognition (MLRecogNet) is a classifier that encodes input images into embedding vectors and predicts their labels based on their similarity to the embedding vectors in a reference space. MLRecogNet consists of two parts:
Trunk: A backbone network that encodes the input image to a feature vector.
Embedder: A fully connected layer that maps the feature vector to the embedding space.
The embedding space is a high-dimensional space in which embedding vectors of the same class lie close together and embedding vectors of different classes lie far apart. The embedder is trained to minimize intra-class distances and maximize inter-class distances. To predict the labels of query images, their embedding vectors are compared with those of the reference images.
The currently supported trunks are ResNet (the most commonly used baseline for vision classification), FAN, and NVDINOv2 backbones; the currently supported embedder is a one-layer MLP.
During training, evaluation, and inference, MLRecogNet requires a reference set and a query set (the latter used for validation or test). The reference set is a collection of labeled images, while the query set is a group of unlabeled images. The goal is to predict the labels of the query images by comparing their embedding vectors, generated by the trained MLRecogNet, with those of the reference set.
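To make the pipeline concrete, the following is a minimal PyTorch sketch, not the TAO source, of a ResNet-50 trunk feeding a one-layer MLP embedder, followed by nearest-neighbor label prediction against a reference set; all variable names here are illustrative.

import torch
import torch.nn as nn
from torchvision import models

# Trunk: ResNet-50 with its classification head removed, exposing the
# 2048-d pooled feature. Embedder: one linear layer to feat_dim = 256.
trunk = models.resnet50(weights=None)
trunk.fc = nn.Identity()
trunk.eval()
embedder = nn.Linear(2048, 256)

def embed(images):                      # images: (N, 3, 224, 224)
    with torch.no_grad():
        return embedder(trunk(images))  # (N, 256)

ref_emb = embed(torch.randn(10, 3, 224, 224))    # labeled reference images
query_emb = embed(torch.randn(4, 3, 224, 224))   # unlabeled query images

# Each query is assigned the label of its nearest reference embedding
dists = torch.cdist(query_emb, ref_emb)          # (4, 10) pairwise distances
nearest_ref = dists.argmin(dim=1)                # index into the reference set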
MLRecogNet requires cropped images from the detection set or classification set as input. These images are resized to 224x224 by default for model input. Augmentation is applied to each image during training.
The data should be organized in the following structure:
/Dataset
    /reference
        /class1
            0001.jpg
            0002.jpg
            ...
            0100.jpg
        /class2
            0001.jpg
            0002.jpg
            ...
            0100.jpg
        ...
    /train
        /class1
            0101.jpg
            0102.jpg
            ...
            0200.jpg
        /class2
            0101.jpg
            0102.jpg
            ...
            0200.jpg
    /val
        /class1
            0201.jpg
            0202.jpg
            ...
            0220.jpg
        /class2
            0201.jpg
            0202.jpg
            ...
            0220.jpg
    /test
        /class1
            0301.jpg
            0302.jpg
            ...
            0400.jpg
        /class2
            0301.jpg
            0302.jpg
            ...
            0400.jpg
The root directory of the dataset contains sub-directories for the reference, training, validation, and test sets. These sub-directories must follow the ImageNet structure demonstrated above: each class has its own sub-directory containing only images of that class. If a class in the test set is not present in the reference set, its query images cannot be correctly recognized.
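As a quick sanity check for this requirement, the short Python sketch below (not part of TAO) verifies that every class folder in the val and test splits also appears in the reference split; the dataset root path is a placeholder.

import os

dataset_root = "/Dataset"  # placeholder: the dataset root shown above
reference_classes = set(os.listdir(os.path.join(dataset_root, "reference")))

for split in ("val", "test"):
    split_classes = set(os.listdir(os.path.join(dataset_root, split)))
    missing = sorted(split_classes - reference_classes)
    if missing:
        print(f"{split}: classes missing from the reference set: {missing}")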
The spec file for MLRecogNet includes model, train, and dataset parameters. Here is an example spec, $TRAIN_SPEC, for training an MLRecogNet model on a target dataset:
results_dir: "???"
model:
  backbone: resnet_101
  pretrained_model_path: /path/to/resnet101_pretrained_mlrecog.pth.tar
  input_width: 224
  input_height: 224
  feat_dim: 2048
train:
  optim:
    name: Adam
    steps: [40, 70]
    gamma: 0.1
    embedder:
      bias_lr_factor: 1
      weight_decay: 0.0001
      weight_decay_bias: 0.0005
      base_lr: 0.000001
      momentum: 0.9
    trunk:
      bias_lr_factor: 1
      weight_decay: 0.0001
      weight_decay_bias: 0.0005
      base_lr: 0.00001
      momentum: 0.9
    warmup_factor: 0.01
    warmup_iters: 10
    warmup_method: linear
    triplet_loss_margin: 0.3
    miner_function_margin: 0.1
  num_epochs: 10
  resume_training_checkpoint_path: null
  checkpoint_interval: 5
  validation_interval: 5
  smooth_loss: False
  batch_size: 16
  val_batch_size: 16
  seed: 1234
dataset:
  train_dataset: /path/to/dataset/train
  val_dataset:
    reference: /path/to/dataset/reference
    query: /path/to/dataset/val
  workers: 12
  pixel_mean: [0.485, 0.456, 0.406]
  pixel_std: [0.226, 0.226, 0.226]
  prob: 0.5
  re_prob: 0.5
  num_instance: 4
  color_augmentation:
    enabled: True
    brightness: 0.5
    contrast: 0.3
    saturation: 0.1
    hue: 0.1
  gaussian_blur:
    enabled: True
    kernel: [15, 15]
    sigma: [0.3, 0.7]
  random_rotation: True
  class_map: /path/to/class_map.yaml
Parameter | Data Type | Default | Description | Supported Values
---|---|---|---|---
model | dict config | – | The configuration of the model architecture |
dataset | dict config | – | The configuration of the dataset |
train | dict config | – | The configuration of the training task |
evaluate | dict config | – | The configuration of the evaluation task |
inference | dict config | – | The configuration of the inference task |
encryption_key | string | None | The encryption key to encrypt and decrypt model files |
results_dir | string | /results | The directory where experiment results are saved |
export | dict config | – | The configuration of the ONNX export task |
gen_trt_engine | dict config | – | The configuration of the TensorRT engine generation task. Only used in TAO Deploy |
model
The model parameter provides options to change the MetricLearningRecognition architecture.
model:
  backbone: resnet_50
  pretrained_model_path: "/path/to/pretrained_model.pth"
  pretrained_embedder_path: null
  pretrained_trunk_path: null
  input_channels: 3
  input_width: 224
  input_height: 224
  feat_dim: 256
Parameter | Datatype | Default | Description | Supported Values
---|---|---|---|---
backbone | string | resnet_50 | The type of the backbone (trunk) model | resnet_50, resnet_101, fan_small, fan_base, fan_large, fan_tiny, nvdinov2_vit_large_legacy
pretrained_model_path | string | | The path to the pretrained model. The weights are only loaded to the full model |
pretrained_trunk_path | string | | The path to the pretrained trunk. The weights are only loaded to the trunk part |
pretrained_embedder_path | string | | The path to the pretrained embedder. The weights are only loaded to the embedder part |
input_channels | unsigned int | 3 | The number of input channels | >0
input_width | int | 224 | The input width of the images | int
input_height | int | 224 | The input height of the images | int
feat_dim | unsigned int | 256 | The output size of the feature embeddings | >0
train
The train parameter defines the hyperparameters of the training process.
train:
  optim:
    name: Adam
    steps: [40, 70]
    gamma: 0.1
    warmup_factor: 0.01
    warmup_iters: 10
    warmup_method: 'linear'
    triplet_loss_margin: 0.3
    miner_function_margin: 0.1
    embedder:
      bias_lr_factor: 1
      base_lr: 0.000001
      momentum: 0.9
      weight_decay: 0.0001
      weight_decay_bias: 0.0005
    trunk:
      bias_lr_factor: 1
      base_lr: 0.00001
      momentum: 0.9
      weight_decay: 0.0001
      weight_decay_bias: 0.0005
  num_epochs: 10
  checkpoint_interval: 5
  validation_interval: 5
  clip_grad_norm: 0.0
  resume_training_checkpoint_path: null
  report_accuracy_per_class: True
  smooth_loss: True
  batch_size: 64
  val_batch_size: 64
  train_trunk: false
  train_embedder: true
  results_dir: null
  seed: 1234
Parameter | Datatype | Default | Description | Supported Values
---|---|---|---|---
num_gpus | unsigned int | 1 | The number of GPUs to use for distributed training | >0
gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed training |
seed | unsigned int | 1234 | The random seed for random, NumPy, and torch | >0
num_epochs | unsigned int | 10 | The total number of epochs to run the experiment | >0
checkpoint_interval | unsigned int | 1 | The epoch interval at which checkpoints are saved | >0
validation_interval | unsigned int | 1 | The epoch interval at which validation is run | >0
resume_training_checkpoint_path | string | | The intermediate PyTorch Lightning checkpoint to resume training from |
results_dir | string | /results/train | The directory to save training results |
optim | dict config | – | The configuration for the torch optimizer (see Optim Config), including the learning rate, learning scheduler, and weight decay |
clip_grad_norm | float | 0.0 | The amount to clip the gradient by the L2 norm. A value of 0.0 specifies no clipping | >=0
report_accuracy_per_class | bool | True | If True, the top-1 precision of each class is reported | True/False
smooth_loss | bool | True | If True, the log-exp version of the triplet loss is used | True/False
batch_size | unsigned int | 64 | The batch size for training | >0
val_batch_size | unsigned int | 64 | The batch size for validation | >0
train_trunk | bool | True | If False, the trunk part of the model is frozen during training | True/False
train_embedder | bool | True | If False, the embedder part of the model is frozen during training | True/False
optim
The optim parameter defines the configuration for the Torch optimizer during training, including the learning rate, learning-rate scheduler, and weight decay.
optim:
  name: Adam
  steps: [40, 70]
  gamma: 0.1
  warmup_factor: 0.01
  warmup_iters: 10
  warmup_method: 'linear'
  triplet_loss_margin: 0.3
  miner_function_margin: 0.1
  embedder:
    bias_lr_factor: 1
    base_lr: 0.00035
    momentum: 0.9
    weight_decay: 0.0005
    weight_decay_bias: 0.0005
  trunk:
    bias_lr_factor: 1
    base_lr: 0.00035
    momentum: 0.9
    weight_decay: 0.0005
    weight_decay_bias: 0.0005
Parameter | Datatype | Default | Description | Supported Values
---|---|---|---|---
name | string | Adam | The name of the optimizer. The algorithms in torch.optim are supported | Adam/SGD/Adamax/...
steps | int list | [40, 70] | The steps to decrease the learning rate for the MultiStep scheduler |
gamma | float | 0.1 | The decay rate for the WarmupMultiStepLR scheduler | >0.0
warmup_factor | float | 0.01 | The warmup factor for the WarmupMultiStepLR scheduler | >0.0
warmup_iters | unsigned int | 10 | The number of warmup iterations for the WarmupMultiStepLR scheduler | >0
warmup_method | string | linear | The warmup method for the optimizer | constant/linear
triplet_loss_margin | float | 0.3 | The desired difference between the anchor-positive distance and the anchor-negative distance | >0.0
miner_function_margin | float | 0.1 | Negative pairs are chosen if they have similarity greater than the hardest positive pair, minus this margin; positive pairs are chosen if they have similarity less than the hardest negative pair, plus this margin | >0.0
embedder | dict config | – | The learning rate configuration (see LR Config) for the MLRecogNet embedder |
trunk | dict config | – | The learning rate configuration (see LR Config) for the MLRecogNet trunk |
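To illustrate what triplet_loss_margin controls, here is a minimal PyTorch sketch of the triplet loss in both its standard hinge form and the log-exp (softplus) form that train.smooth_loss enables; this illustrates the formula and is not the exact TAO implementation.

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.3, smooth=False):
    """anchor/positive/negative: (batch, feat_dim) embedding tensors."""
    d_ap = F.pairwise_distance(anchor, positive)  # anchor-positive distance
    d_an = F.pairwise_distance(anchor, negative)  # anchor-negative distance
    violation = d_ap - d_an + margin              # how badly the margin is violated
    if smooth:
        return F.softplus(violation).mean()       # log(1 + exp(x)): smooth hinge
    return F.relu(violation).mean()               # standard hinge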
LR Config
Parameter | Datatype | Default | Description | Supported Values
---|---|---|---|---
base_lr | float | 0.00035 | The initial learning rate for training | >0.0
bias_lr_factor | float | 1 | The bias learning rate factor for the WarmupMultiStepLR | >=1
momentum | float | 0.9 | The momentum for the optimizer | >0.0
weight_decay | float | 0.0005 | The weight decay coefficient for the optimizer | >0.0
weight_decay_bias | float | 0.0005 | The weight decay coefficient applied to bias parameters | >0.0
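As an illustration of how these values interact, the sketch below computes the learning rate that a WarmupMultiStepLR-style schedule produces at a given epoch, following the common definition of this scheduler; treat it as an approximation, not the exact TAO implementation.

def warmup_multistep_lr(epoch, base_lr=0.00035, steps=(40, 70), gamma=0.1,
                        warmup_factor=0.01, warmup_iters=10, warmup_method="linear"):
    """Approximate learning rate at a given epoch."""
    if epoch < warmup_iters:
        if warmup_method == "constant":
            alpha = warmup_factor
        else:  # "linear": ramp from warmup_factor up to 1.0
            alpha = warmup_factor + (1.0 - warmup_factor) * epoch / warmup_iters
    else:
        alpha = 1.0
    num_decays = sum(1 for s in steps if epoch >= s)  # milestones already passed
    return base_lr * alpha * gamma ** num_decays

# e.g. warmup_multistep_lr(0) -> 3.5e-06, warmup_multistep_lr(45) -> 3.5e-05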
dataset
The dataset parameter defines the dataset source, training batch size, and augmentation.
dataset:
  train_dataset: /path/to/dataset/train
  val_dataset:
    reference: /path/to/dataset/reference
    query: /path/to/dataset/val
  workers: 8
  pixel_mean: [0.485, 0.456, 0.406]
  pixel_std: [0.226, 0.226, 0.226]
  padding: 10
  prob: 0.5
  re_prob: 0.5
  sampler: softmax_triplet
  num_instance: 4
  gaussian_blur:
    enabled: True
    kernel: [15, 15]
    sigma: [0.3, 0.7]
  color_augmentation:
    enabled: True
    brightness: 0.5
    contrast: 0.3
    saturation: 0.1
    hue: 0.1
Parameter | Datatype | Default | Description | Supported Values
---|---|---|---|---
train_dataset | string | | The path to the train dataset. This field is only required for the train task |
val_dataset | dict | | The map of the reference set and query set addresses. For training and evaluation, both fields are required. For inference, only the reference set address is needed | {"reference": "/path/to/reference/set", "query": ""}
workers | unsigned int | 8 | The number of parallel workers processing data | >0
class_map | string | | The path to the class map YAML file |
pixel_mean | float list | [0.485, 0.456, 0.406] | The pixel mean for image normalization | float list
pixel_std | float list | [0.226, 0.226, 0.226] | The pixel standard deviation for image normalization | float list
num_instance | unsigned int | 4 | The number of image instances of the same class in a batch | >0
prob | float | 0.5 | The random horizontal flipping probability for image augmentation | >=0.0, <=1.0
re_prob | float | 0.5 | The random erasing probability for image augmentation | >=0.0, <=1.0
random_rotation | bool | True | If True, random rotations of 0 ~ 180 degrees are applied to the input images | True/False
gaussian_blur | dict config | – | The configuration of the Gaussian blur augmentation on input samples |
color_augmentation | dict config | – | The configuration of the color augmentation on input samples |
Gaussian Blur Config
Parameter | Datatype | Default | Description | Supported Values
---|---|---|---|---
enabled | bool | True | If True, applies Gaussian blur augmentation to input samples | True/False
kernel | unsigned int list | [15, 15] | The kernel size for the Gaussian blur |
sigma | float list | [0.3, 0.7] | The sigma value range for the Gaussian blur |
Color Augmentation Config
Parameter | Datatype | Default | Description | Supported Values
---|---|---|---|---
enabled | bool | True | If True, applies color augmentation to input samples | True/False
brightness | float | 0.5 | The amount of brightness jittering | >=0
contrast | float | 0.3 | The amount of contrast jittering | >=0
saturation | float | 0.1 | The amount of saturation jittering | >=0
hue | float | 0.1 | The amount of hue jittering | >=0, <=0.5
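For intuition, the augmentation options above correspond roughly to the following torchvision transforms; this is an approximation of the training pipeline implied by the config, not the exact TAO implementation (ordering and internals may differ).

from torchvision import transforms

# Approximate training-time augmentation implied by the dataset config above
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),                      # default model input size
    transforms.RandomHorizontalFlip(p=0.5),             # prob
    transforms.ColorJitter(brightness=0.5, contrast=0.3,
                           saturation=0.1, hue=0.1),    # color_augmentation
    transforms.GaussianBlur(kernel_size=15,
                            sigma=(0.3, 0.7)),          # gaussian_blur
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.226, 0.226, 0.226]),    # pixel_mean / pixel_std
    transforms.RandomErasing(p=0.5),                    # re_prob
])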
Use the following command to run MLRecogNet training:
tao model ml_recog train -e <experiment_spec_file>
[results_dir=<global_results_dir>]
[model.<model_option>=<model_option_value>]
[dataset.<dataset_option>=<dataset_option_value>]
[train.<train_option>=<train_option_value>]
[train.gpu_ids=<gpu indices>]
[train.num_gpus=<number of gpus>]
Required Arguments
The only required argument is the path to the experiment spec:
-e, --experiment_spec: The experiment specification file to set up the training experiment
Optional Arguments
You can set optional arguments to override the option values in the experiment spec file.
-h, --help: Show this help message and exit.
model.<model_option>: The model options.
dataset.<dataset_option>: The dataset options.
train.<train_option>: The train options.
train.optim.<optim_option>: The optimizer options.
For training, evaluation, and inference, we expose two variables for each respective task: num_gpus and gpu_ids, which default to 1 and [0], respectively. If both are passed but are inconsistent (for example, num_gpus = 1 with gpu_ids = [0, 1]), they are modified to follow the setting with more GPUs (in this example, num_gpus becomes 2).
Checkpointing and Resuming Training
At every train.checkpoint_interval, a PyTorch Lightning checkpoint named model_epoch_<epoch_num>.pth is saved. These checkpoints are stored in train.results_dir, like so:
$ ls /results/train
'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'
The latest checkpoint is also saved as ml_model_latest.pth. Training automatically resumes from ml_model_latest.pth if it exists in train.results_dir. This behavior is superseded by train.resume_training_checkpoint_path if it is provided.
The major implication of this logic is that, if you wish to trigger fresh training from scratch, either:
Specify a new, empty results directory (Recommended)
Remove the latest checkpoint from the results directory
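For example, assuming $TRAIN_SPEC and the checkpoint names shown above, a fresh run and an explicit resume could look like the following:

# Fresh training: point results_dir at a new, empty directory
tao model ml_recog train -e $TRAIN_SPEC results_dir=/results/new_run

# Explicit resume from an intermediate checkpoint
tao model ml_recog train -e $TRAIN_SPEC \
  train.resume_training_checkpoint_path=/results/train/model_epoch_004.pth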
Here’s an example of the output $RESULTS_DIR/train/status.json:
{"date": "6/20/2023", "time": "23:11:2", "status": "STARTED", "verbosity": "INFO", "message": "Starting Training Loop."}
...
{"date": "6/20/2023", "time": "23:11:22", "status": "SUCCESS", "verbosity": "INFO", "message": "Train finished successfully."}
Here is an example spec, $EVAL_SPEC, for evaluating an MLRecogNet model on a test dataset:
results_dir: /path/to/root/results/dir
model:
  backbone: resnet_50
  input_width: 224
  input_height: 224
  feat_dim: 256
dataset:
  workers: 8
  val_dataset:
    reference: /path/to/dataset/reference
    query: /path/to/dataset/val
evaluate:
  checkpoint: /path/to/checkpoint
  batch_size: 128
  results_dir: /path/to/results
Parameter | Datatype | Default | Description | Supported Values
---|---|---|---|---
checkpoint | string | None | The path to the .pth Torch model to be evaluated |
results_dir | string | /results/evaluate | The directory to save evaluation results |
num_gpus | unsigned int | 1 | The number of GPUs to use for distributed evaluation | >0
gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed evaluation |
trt_engine | string | None | The path to the TensorRT (TRT) engine to be evaluated. Currently, only trt_engine is supported in TAO Deploy |
topk | int | 1 | If greater than 1, the reported accuracy is top-k precision. Currently, only evaluate.topk is supported in TAO Deploy | >0
batch_size | int | 64 | The batch size for the evaluation task | >0
report_accuracy_per_class | bool | True | If True, the top-1 precision of each class is reported | True/False
The following are evaluation metrics for MLRecogNet:
Adjusted Mutual Information (AMI): A measure used in statistics and information theory to quantify the agreement between two assignments (such as cluster assignments). It is adjusted for chance, so it provides a more accurate depiction of the similarity between the two assignments than raw mutual information.
Normalized Mutual Information (NMI): A normalization of the Mutual Information (MI) score that scales the results between 0 (no mutual information) and 1 (perfect correlation).
Mean Average Precision: The average precision achieved by the model across different recall levels, providing a comprehensive evaluation of its information-retrieval performance.
Mean Average Precision at r: The model’s average precision for the top-r ranked results, offering insight into retrieval effectiveness when only a limited number of results is considered.
Mean Reciprocal Rank: The average of the inverse ranks of the first relevant result for a set of queries, emphasizing the importance of retrieving relevant information as early as possible.
Precision at 1: The accuracy of the nearest-neighbor retrievals.
R Precision: An evaluation metric for information retrieval systems that measures the proportion of relevant documents among the top-R ranked results, where R corresponds to the total number of relevant documents for a given query.
When evaluate.report_accuracy_per_class is set to True, the accuracy of each class is also reported.
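As a concrete example of the simplest of these metrics, the following sketch computes Precision at 1 (nearest-neighbor accuracy) from query and reference embeddings; it is illustrative only, and the array names are assumptions.

import numpy as np

def precision_at_1(query_emb, query_labels, ref_emb, ref_labels):
    """Fraction of queries whose nearest reference image has the correct label."""
    # Pairwise Euclidean distances between query and reference embeddings
    dists = np.linalg.norm(query_emb[:, None, :] - ref_emb[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)                 # closest reference per query
    hits = np.asarray(ref_labels)[nearest] == np.asarray(query_labels)
    return float(hits.mean())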
Use the following command to run MLRecogNet evaluation:
tao model ml_recog evaluate -e <experiment_spec_file>
evaluate.checkpoint=<model to be evaluated>
dataset.val_dataset.reference=<path to test reference set>
dataset.val_dataset.query=<path to test query set>
[evaluate.<evaluate_option>=<evaluate_option_value>]
[evaluate.gpu_ids=<gpu indices>]
[evaluate.num_gpus=<number of gpus>]
Required Arguments
-e, --experiment_spec_file: The experiment spec file to set up the evaluation experiment
evaluate.checkpoint: The path to the .pth model to be evaluated
dataset.val_dataset.reference: The path to the test reference set
dataset.val_dataset.query: The path to the test query set
Optional Argument
evaluate.<evaluate_option>: The evaluate options.
Here’s an example of the output $RESULTS_DIR/evaluate/status.json:
{"date": "6/2/2023", "time": "6:12:16", "status": "STARTED", "verbosity": "INFO", "message": "Starting Metric Learning Recognition evaluate."}
{"date": "6/2/2023", "time": "6:12:17", "status": "STARTED", "verbosity": "INFO", "message": "Loading checkpoint:$RESULTS_DIR/train/ml_model_epoch=000.pth"}
{"date": "6/2/2023", "time": "6:12:17", "status": "RUNNING", "verbosity": "INFO", "message": "Constructing model graph..."}
{"date": "6/2/2023", "time": "6:12:17", "status": "SKIPPED", "verbosity": "INFO", "message": "Skipped loading pretrained model as checkpoint is to load."}
{"date": "6/2/2023", "time": "6:12:23", "status": "SUCCESS", "verbosity": "INFO", "message": "Evaluate finished successfully.", "kpi": {"AMI": 0.8074901483322209, "NMI": 0.8118350536509751, "Mean Average Precision": 0.6876838920302153, "Mean Reciprocal Rank": 0.992727267742157, "r-Precision": 0.666027864375903, "Precision at Rank 1": 0.989090909090909}}
The following is an example of the printouts:
Starting Metric Learning Recognition evaluate.
Experiment configuration:
...
results_dir: $RESULTS_DIR
Loading checkpoint: $RESULTS_DIR/train/ml_model_epoch=000.pth
Constructing model graph...
Skipped loading pretrained model as checkpoint is to load.
Evaluating epoch eval mode
...
Computing accuracy for the query split w.r.t ['gallery']
running k-nn with k=106
embedding dimensionality is 256
/usr/local/lib/python3.8/dist-packages/torch/storage.py:315: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
warnings.warn(message, UserWarning)
running k-means clustering with k=5
embedding dimensionality is 256
******************* Evaluation results **********************
AMI: 0.8075
NMI: 0.8118
Mean Average Precision: 0.7560
Mean Reciprocal Rank: 0.9922
r-Precision: 0.7421
Precision at Rank 1: 0.9882
*************************************************************
Here is an example spec, $INFERENCE_SPEC, for running MLRecogNet model inference on an inference set:
results_dir: /path/to/root/results/dir
model:
  backbone: resnet_50
  input_width: 224
  input_height: 224
  feat_dim: 256
dataset:
  workers: 8
  val_dataset:
    reference: /path/to/dataset/reference
    query: ""
inference:
  input_path: /path/to/dataset/test
  inference_input_type: classification_folder
  checkpoint: /path/to/model/checkpoint
  results_dir: /path/to/results/dir
  batch_size: 128
Parameter | Datatype | Default | Description | Supported Values
---|---|---|---|---
checkpoint | string | None | The path to the .pth Torch model to run inference with |
results_dir | string | /results/inference | The directory to save inference results |
num_gpus | unsigned int | 1 | The number of GPUs to use for distributed inference | >0
gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed inference |
trt_engine | string | None | The path to the TensorRT (TRT) engine to run inference. Currently, only trt_engine is supported in TAO Deploy |
input_path | string | | The path to the data to run inference on |
inference_input_type | string | "image_folder" | The type of input specified in input_path: a single image ("image"), a folder of images ("image_folder"), or a folder of class sub-directories ("classification_folder") | "image_folder"/"classification_folder"/"image"
batch_size | int | 64 | The batch size for the inference task | >0
topk | int | 1 | The number of top results to be returned | >0
Use the following command to run inference on MLRecogNet with the .pth model:
tao model ml_recog inference -e <experiment_spec>
inference.checkpoint=<inference model>
dataset.val_dataset.reference=<path to gallery data>
inference.input_path=<path to query data>
[inference.<inference_option>=<inference_option_value>]
[inference.gpu_ids=<gpu indices>]
[inference.num_gpus=<number of gpus>]
The output is a CSV file that contains the feature embeddings of all the query data and their predicted labels.
Required Arguments
-e, --experiment_spec: The experiment spec file to set up inference
inference.checkpoint: The .pth model to perform inference with
dataset.val_dataset.reference: The path to the reference set
inference.input_path: The path to the data to run inference on
Optional Argument
inference.<inference_option>: The inference options.
The expected output is as follows:
/path/to/images/c000001_10.png,"['c000001', 'c000005', 'c000001', 'c000005']","[5.0030694183078595e-06, 5.5495906963187736e-06, 5.976316515443614e-06, 6.004379429214168e-06]"
/path/to/images/c000001_11.png,"['c000001', 'c000005', 'c000001', 'c000001']","[3.968068540416425e-06, 5.043690180173144e-06, 5.885293830942828e-06, 6.030047643434955e-06]"
/path/to/images/c000001_120.png,"['c000001', 'c000001', 'c000005', 'c000003']","[1.9612791675172048e-06, 4.112744136364199e-06, 4.603011802828405e-06, 5.8091877690458205e-06]"
The first column contains the inference image paths, the second column contains the top-k predicted labels, and the third column contains the embedding-vector distances of the top-k results.
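The result file can be post-processed with standard tools. For example, here is a minimal sketch that reads the CSV shown above and prints the top-1 prediction per image; the file path is a placeholder.

import ast
import csv

with open("/results/inference/result.csv") as f:       # placeholder path
    for image_path, labels, distances in csv.reader(f):
        topk_labels = ast.literal_eval(labels)          # e.g. ['c000001', ...]
        topk_dists = ast.literal_eval(distances)        # matching distances
        print(image_path, topk_labels[0], topk_dists[0])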
Here’s an example of the output $RESULTS_DIR/inference/status.json:
{"date": "6/2/2023", "time": "6:13:47", "status": "STARTED", "verbosity": "INFO", "message": "Starting Metric Learning Recognition inference."}
{"date": "6/2/2023", "time": "6:13:47", "status": "STARTED", "verbosity": "INFO", "message": "Loading checkpoint:$RESULTS_DIR/train/ml_model_epoch=001.pth"}
{"date": "6/2/2023", "time": "6:13:47", "status": "RUNNING", "verbosity": "INFO", "message": "Constructing model graph..."}
{"date": "6/2/2023", "time": "6:13:48", "status": "SKIPPED", "verbosity": "INFO", "message": "Skipped loading pretrained model as checkpoint is to load."}
{"date": "6/2/2023", "time": "6:14:6", "status": "SUCCESS", "verbosity": "INFO", "message": "result saved at$RESULTS_DIR/inference/result.csv"}
{"date": "6/2/2023", "time": "6:14:6", "status": "SUCCESS", "verbosity": "INFO", "message": "Inference finished successfully."}
The following is an example of the printouts:
Starting Metric Learning Recognition inference.
Experiment configuration:
...
Loading checkpoint: $RESULTS_DIR/train/ml_model_epoch=001.pth
Constructing model graph...
Skipped loading pretrained model as checkpoint is to load.
/usr/local/lib/python3.8/dist-packages/torch/storage.py:315: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
warnings.warn(message, UserWarning)
...
result saved at $RESULTS_DIR/inference/result.csv
Inference finished successfully.
Here is an example spec, $EXPORT_SPEC, for exporting the MLRecogNet model:
results_dir: /path/to/root/results/dir
model:
  backbone: resnet_50
  input_width: 224
  input_height: 224
  feat_dim: 256
export:
  checkpoint: /path/to/checkpoint
  onnx_file: /path/to/results/model.onnx
  results_dir: /path/to/results
  batch_size: -1
  on_cpu: false
  verbose: true
Parameter | Datatype | Default | Description | Supported Values
---|---|---|---|---
checkpoint | string | None | The path to the .pth Torch model to be exported |
onnx_file | string | None | The path to the exported ONNX file. If this value is not specified, it defaults to model.onnx in export.results_dir |
batch_size | int | -1 | The batch size of the exported ONNX model. If batch_size is -1, the exported ONNX model has a dynamic batch size | >0; -1
gpu_id | unsigned int | 0 | The GPU ID for the Torch-to-ONNX export. Currently, the export task only supports running on a single GPU | >=0
on_cpu | bool | False | If True, the Torch-to-ONNX export is performed on CPU | True/False
opset_version | unsigned int | 14 | The version of the default (ai.onnx) opset to target | >=7, <=16
verbose | bool | True | If True, prints a description of the model being exported to stdout | True/False
results_dir | string | None | The path to the results directory of the export task |
Use the following command to export MLRecogNet to the .onnx format for deployment:
tao model ml_recog export -e <experiment_spec>
export.checkpoint=<.pth checkpoint to be exported>
[export.onnx_file=<path to exported ONNX file>]
[export.<export_option>=<export_option_value>]
Required Arguments
-e, --experiment_spec: The experiment spec file to set up the export
export.checkpoint: The .pth model to be exported
Optional Arguments
export.onnx_file: The path to save the exported model to. The default path is in the same directory as export.results_dir (if set) or results_dir.
export.<export_option>: The export options.
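After export, the ONNX model can be sanity-checked outside of TAO. Here is a minimal onnxruntime sketch, assuming the model was exported with a dynamic batch size and that the input tensor is named "input" (as in the exported graph printed below); the file path is a placeholder.

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("/results/export/model.onnx",    # placeholder path
                               providers=["CPUExecutionProvider"])

# With batch_size = -1 at export time, any batch size is accepted
batch = np.random.rand(8, 3, 224, 224).astype(np.float32)
(embeddings,) = session.run(None, {"input": batch})
print(embeddings.shape)  # (8, feat_dim), e.g. (8, 256)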
Here’s an example of the output $RESULTS_DIR/export/status.json:
{"date": "6/2/2023", "time": "6:17:45", "status": "STARTED", "verbosity": "INFO", "message": "Starting Metric Learning Recognition export."}
{"date": "6/2/2023", "time": "6:17:45", "status": "STARTED", "verbosity": "INFO", "message": "Loading checkpoint:$RESULTS_DIR/train/ml_model_epoch=001.pth"}
{"date": "6/2/2023", "time": "6:17:45", "status": "RUNNING", "verbosity": "INFO", "message": "Constructing model graph..."}
{"date": "6/2/2023", "time": "6:17:46", "status": "SKIPPED", "verbosity": "INFO", "message": "Skipped loading pretrained model as checkpoint is to load."}
{"date": "6/2/2023", "time": "6:17:46", "status": "STARTED", "verbosity": "INFO", "message": "Exporting model to ONNX"}
{"date": "6/2/2023", "time": "6:17:48", "status": "STARTED", "verbosity": "INFO", "message": "Simplifying ONNX model"}
{"date": "6/2/2023", "time": "6:17:50", "status": "SUCCESS", "verbosity": "INFO", "message": "ONNX model saved at$RESULTS_DIR/export/ml_model_epoch=001.onnx"}
{"date": "6/2/2023", "time": "6:17:50", "status": "SUCCESS", "verbosity": "INFO", "message": "Export finished successfully."}
The following is an example of the printouts:
Starting Metric Learning Recognition export.
Experiment configuration:
...
Loading checkpoint: $RESULTS_DIR/train/ml_model_epoch=001.pth
Constructing model graph...
Skipped loading pretrained model as checkpoint is to load.
Exporting model to ONNX
Exported graph: graph(%input : Float(*, 3, 224, 224, strides=[150528, 50176, 224, 1], requires_grad=0, device=cuda:0),
...
========== Diagnostic Run torch.onnx.export version 1.14.0a0+44dac51 ===========
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
Simplifying ONNX model
Checking 0/3...
Checking 1/3...
Checking 2/3...
ONNX model saved at $RESULTS_DIR/export/ml_model_epoch=001.onnx
Export finished successfully.
You can use TAO Deploy to deploy trained deep-learning and computer-vision models on edge devices (such as a Jetson Xavier, Jetson Nano, or Tesla GPU) or in the cloud with NVIDIA GPUs. TAO Deploy is an application in TAO that converts an ONNX model to a TensorRT engine and runs inference through that engine.
Running MLRecogNet Inference on TAO Deploy
The MLRecogNet ONNX file generated from export is taken as input to TAO Deploy to generate an optimized TensorRT engine. For more information about using TAO Deploy to run inference on an MLRecogNet TensorRT engine, refer to the TAO Deploy documentation.