Mask Auto Labeler#
Mask Auto Labeler (MAL) is a high-quality, transformer-based mask auto-labeling framework for instance segmentation using only box annotations. It supports the following tasks:
train
evaluate
inference
These tasks may be invoked from the TAO Launcher using the following convention on the command line:
tao model mal <sub_task> <args_per_subtask>
Where args_per_subtask are the command-line arguments required for a given subtask. Each of these subtasks is explained in detail below.
Creating a Configuration File#
Below is a sample MAL spec file. It has five components (model, inference, evaluate, dataset, and train) as well as several global parameters, which are described below. The spec file is in YAML format.
strategy: 'fsdp'
results_dir: '/path/to/result/dir'
dataset:
  train_ann_path: '/datasets/coco/annotations/instances_train2017.json'
  train_img_dir: '/datasets/coco/raw-data/train2017'
  val_ann_path: '/coco/annotations/instances_val2017.json'
  val_img_dir: '/datasets/coco/raw-data/val2017'
  load_mask: True
  crop_size: 512
inference:
  ann_path: '/dataset/sample.json'
  img_dir: '/dataset/sample_dir'
  label_dump_path: '/dataset/sample_output.json'
model:
  arch: 'vit-mae-base/16'
train:
  num_epochs: 10
  checkpoint_interval: 5
  validation_interval: 5
  batch_size: 4
  seed: 1234
  num_gpus: 1
  gpu_ids: [0]
  use_amp: True
  optim_momentum: 0.9
  lr: 0.0000015
  min_lr_rate: 0.2
  wd: 0.0005
  warmup_epochs: 1
  crf_kernel_size: 3
  crf_num_iter: 100
  loss_mil_weight: 4
  loss_crf_weight: 0.5
| Parameter | Datatype | Default | Description | Supported Values |
| --- | --- | --- | --- | --- |
| model | dict config | – | The configuration of the model architecture | |
| dataset | dict config | – | The configuration of the dataset | |
| train | dict config | – | The configuration of the training task | |
| evaluate | dict config | – | The configuration of the evaluation task | |
| inference | dict config | – | The configuration of the inference task | |
| | string | None | The encryption key to encrypt and decrypt model files | |
| results_dir | string | /results | The directory where experiment results are saved | |
| strategy | string | ‘ddp’ | The distributed training strategy | ‘ddp’, ‘fsdp’ |
Dataset Config#
The dataset configuration (dataset) defines the data source and input size.
| Field | Datatype | Default | Description | Supported Values |
| --- | --- | --- | --- | --- |
| train_ann_path | string | – | The path to the training annotation JSON file | |
| val_ann_path | string | – | The path to the validation annotation JSON file | |
| train_img_dir | string | – | The path to the training image directory | |
| val_img_dir | string | – | The path to the validation image directory | |
| crop_size | unsigned int | 512 | The effective input size of the model | |
| load_mask | boolean | True | A flag specifying whether to load the segmentation mask from the JSON file | |
| | float | 2048 | The minimum object size for training | |
| | float | 1e10 | The maximum object size for training | |
| | unsigned int | | The number of workers to load data for each GPU | |
Model Config#
The model configuration (model) defines the model architecture.
| Field | Datatype | Default | Description | Supported Values |
| --- | --- | --- | --- | --- |
| arch | string | vit-mae-base/16 | The backbone architecture | |
| | List[int] | [-1] | The indices of the frozen blocks | |
| | unsigned int | 4 | The number of conv layers in the mask head | |
| | unsigned int | 256 | The number of conv channels in the mask head | |
| | unsigned int | 256 | The number of output channels in the mask head | |
| | float | 0.996 | The momentum of the teacher model | |
Train Config#
The training configuration (train) specifies the parameters for the training process.
| Parameter | Datatype | Default | Description | Supported Values |
| --- | --- | --- | --- | --- |
| num_gpus | unsigned int | 1 | The number of GPUs to use for distributed training | >0 |
| gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed training | |
| seed | unsigned int | 1234 | The random seed for random, numpy, and torch | >0 |
| num_epochs | unsigned int | 10 | The total number of epochs to run the experiment | >0 |
| checkpoint_interval | unsigned int | 1 | The epoch interval at which the checkpoints are saved | >0 |
| validation_interval | unsigned int | 1 | The epoch interval at which the validation is run | >0 |
| resume_training_checkpoint_path | string | | The intermediate PyTorch Lightning checkpoint to resume training from | |
| results_dir | string | /results/train | The directory to save training results | |
| batch_size | unsigned int | | The training batch size | |
| use_amp | boolean | True | A flag specifying whether to use mixed precision | |
| optim_momentum | float | 0.9 | The momentum of the AdamW optimizer | |
| lr | float | 0.0000015 | The learning rate | |
| min_lr_rate | float | 0.2 | The minimum learning rate ratio | |
| wd | float | 0.0005 | The weight decay | |
| warmup_epochs | unsigned int | 1 | The number of epochs for warmup | |
| crf_kernel_size | unsigned int | 3 | The kernel size of the mean field approximation | |
| crf_num_iter | unsigned int | 100 | The number of iterations to run mask refinement | |
| loss_mil_weight | float | 4 | The weight of multiple instance learning loss | |
| loss_crf_weight | float | 0.5 | The weight of conditional random field loss | |
Evaluation Config#
The evaluation configuration (evaluate) specifies the parameters for validation during training as well as for standalone evaluation.
| Field | Datatype | Default | Description | Supported Values |
| --- | --- | --- | --- | --- |
| checkpoint | string | | The path to the PyTorch model to evaluate | |
| results_dir | string | /results/evaluate | The directory to save evaluation results | |
| num_gpus | unsigned int | 1 | The number of GPUs to use for distributed evaluation | >0 |
| gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed evaluation | |
| batch_size | unsigned int | | The evaluation batch size | |
| | boolean | False | A flag specifying whether to evaluate with a mixed model | |
| | boolean | False | A flag specifying whether to evaluate with the teacher model | |
Inference Config#
The inference configuration (inference) specifies the parameters for generating pseudo masks given the ground-truth bounding boxes in COCO format.
| Field | Datatype | Default | Description | Supported Values |
| --- | --- | --- | --- | --- |
| checkpoint | string | | The path to the PyTorch model to use for inference | |
| results_dir | string | /results/inference | The directory to save inference results | |
| num_gpus | unsigned int | 1 | The number of GPUs to use for distributed inference | >0 |
| gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed inference | |
| ann_path | string | | The path to the annotation JSON file | |
| img_dir | string | | The image directory | |
| label_dump_path | string | | The path to save the output JSON file with pseudo masks | |
| batch_size | unsigned int | | The inference batch size | |
| load_mask | boolean | False | A flag specifying whether to load masks if the annotation file has them | |
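Since inference consumes ground-truth boxes in COCO format, a box-only annotation file for inference.ann_path could be assembled roughly as follows. This is a minimal sketch: the helper name build_coco_boxes, the sample file name, and the category are hypothetical, and the exact set of fields that MAL reads is an assumption. The only format fact relied on is that a COCO bbox is [x, y, width, height].

```python
import json
from typing import Dict, List, Tuple


def build_coco_boxes(
    images: List[Dict],
    boxes: List[Tuple[int, int, List[float]]],
    categories: List[Dict],
) -> Dict:
    """Build a COCO-style dict that carries only bounding boxes (no masks).

    Each entry of `boxes` is (image_id, category_id, [x, y, width, height]).
    """
    annotations = []
    for ann_id, (image_id, category_id, bbox) in enumerate(boxes, start=1):
        x, y, w, h = bbox
        annotations.append(
            {
                "id": ann_id,
                "image_id": image_id,
                "category_id": category_id,
                "bbox": [x, y, w, h],
                "area": w * h,  # box area, since no mask is available
                "iscrowd": 0,
            }
        )
    return {"images": images, "annotations": annotations, "categories": categories}


# Hypothetical sample data, for illustration only.
coco = build_coco_boxes(
    images=[{"id": 1, "file_name": "sample.jpg", "width": 640, "height": 480}],
    boxes=[(1, 1, [10, 20, 100, 50])],
    categories=[{"id": 1, "name": "object"}],
)
coco_json = json.dumps(coco, indent=2)  # text to write to the annotation file
```

The resulting JSON text can be written to the file that inference.ann_path points to, with the images placed under inference.img_dir.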
Training the Model#
Use the following command to run MAL training:
tao model mal train [-h] -e <experiment_spec>
[results_dir=<global_results_dir>]
[model.<model_option>=<model_option_value>]
[dataset.<dataset_option>=<dataset_option_value>]
[train.<train_option>=<train_option_value>]
[train.gpu_ids=<gpu indices>]
[train.num_gpus=<number of gpus>]
Required Arguments#
The only required argument is the path to the experiment spec:
-e, --experiment_spec: The experiment specification file to set up the training experiment.
Optional Arguments#
You can set optional arguments to override the option values in the experiment spec file.
-h, --help: Show this help message and exit.
model.<model_option>: The model options.
dataset.<dataset_option>: The dataset options.
train.<train_option>: The train options.
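When overrides are generated by a script rather than typed by hand, the dotted key=value arguments can be derived from a nested dictionary. The following is a minimal sketch of that idea; the helper names flatten_overrides and build_train_command and the spec path are hypothetical, and the bracketed list syntax simply mirrors the gpu_ids values shown in this document.

```python
from typing import Any, Dict, List


def flatten_overrides(options: Dict[str, Any], prefix: str = "") -> List[str]:
    """Turn a nested dict into dotted key=value override strings."""
    overrides: List[str] = []
    for key, value in options.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into sub-sections such as train or dataset.
            overrides.extend(flatten_overrides(value, dotted))
        elif isinstance(value, (list, tuple)):
            # Write lists without spaces so each override stays one argument.
            items = ",".join(str(item) for item in value)
            overrides.append(f"{dotted}=[{items}]")
        else:
            overrides.append(f"{dotted}={value}")
    return overrides


def build_train_command(spec_path: str, options: Dict[str, Any]) -> List[str]:
    """Assemble the argument list for a MAL training run (not executed here)."""
    return ["tao", "model", "mal", "train", "-e", spec_path] + flatten_overrides(options)


# Hypothetical spec path and override values, for illustration only.
command = build_train_command(
    "/specs/mal.yaml",
    {"results_dir": "/results/exp1", "train": {"num_gpus": 2, "gpu_ids": [0, 1]}},
)
```

The list form can be passed directly to subprocess.run, which avoids shell quoting issues with the bracketed list values.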
Note
For training, evaluation, and inference, we expose 2 variables for each respective task: num_gpus and gpu_ids, which default to 1 and [0], respectively. If both are passed but are inconsistent (for example, num_gpus = 1 and gpu_ids = [0, 1]), they are modified to follow the setting with more GPUs (for example, num_gpus = 1 -> num_gpus = 2).
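The reconciliation rule in this note can be sketched as a small function. This is an illustration and not the TAO implementation: the name reconcile_gpus is hypothetical, and the branch where num_gpus exceeds the length of gpu_ids is an assumption, because the note only documents the opposite case.

```python
from typing import List, Optional, Tuple


def reconcile_gpus(
    num_gpus: int = 1, gpu_ids: Optional[List[int]] = None
) -> Tuple[int, List[int]]:
    """Return a consistent (num_gpus, gpu_ids) pair, following the larger setting."""
    ids = [0] if gpu_ids is None else list(gpu_ids)
    if num_gpus < len(ids):
        # Documented case: num_gpus = 1 with gpu_ids = [0, 1] becomes num_gpus = 2.
        num_gpus = len(ids)
    elif num_gpus > len(ids):
        # ASSUMPTION: not spelled out in the note. Here the first num_gpus
        # device indices are used when num_gpus asks for more GPUs.
        ids = list(range(num_gpus))
    return num_gpus, ids
```

Calling reconcile_gpus(1, [0, 1]) reproduces the documented example and yields two GPUs.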
Checkpointing and Resuming Training#
A PyTorch Lightning checkpoint is saved every train.checkpoint_interval epochs. Each checkpoint is named model_epoch_<epoch_num>.pth and is saved in train.results_dir, like so:
$ ls /results/train
'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'
The latest checkpoint is also saved as mal_model_latest.pth.
Training automatically resumes from mal_model_latest.pth if it exists in train.results_dir.
This is superseded by train.resume_training_checkpoint_path, if it is provided.
The major implication of this logic is that, if you wish to trigger fresh training from scratch, either:
Specify a new, empty results directory (Recommended)
Remove the latest checkpoint from the results directory
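The resume precedence described above can be summarized in a short sketch. It is an illustration of the documented behavior rather than the TAO source: the function name select_resume_checkpoint is hypothetical, and only the file name mal_model_latest.pth and the two settings come from this section.

```python
import os
from typing import Optional

LATEST_CHECKPOINT_NAME = "mal_model_latest.pth"


def select_resume_checkpoint(
    results_dir: str, resume_training_checkpoint_path: Optional[str] = None
) -> Optional[str]:
    """Pick the checkpoint that training would resume from, or None for a fresh run."""
    # An explicit resume path supersedes the automatic lookup.
    if resume_training_checkpoint_path:
        return resume_training_checkpoint_path
    # Otherwise, resume from the latest checkpoint if one exists in results_dir.
    latest = os.path.join(results_dir, LATEST_CHECKPOINT_NAME)
    if os.path.isfile(latest):
        return latest
    # New, empty results directory: training starts from scratch.
    return None
```

A return value of None corresponds to the fresh-training case, which is why pointing train.results_dir at a new, empty directory is the simplest way to start over.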
Evaluating the Model#
To run evaluation for a MAL model, use this command:
tao model mal evaluate [-h] -e <experiment_spec_file>
evaluate.checkpoint=<model to be evaluated>
[evaluate.<evaluate_option>=<evaluate_option_value>]
[evaluate.gpu_ids=<gpu indices>]
[evaluate.num_gpus=<number of gpus>]
Required Arguments#
-e, --experiment_spec: The experiment spec file to set up the evaluation experiment.
evaluate.checkpoint: The .pth model to be evaluated.
Optional Arguments#
evaluate.<evaluate_option>: The evaluate options.
Running Inference#
The inference tool for MAL networks can be used to generate pseudo masks.
Here’s an example of using this tool:
tao model mal inference [-h] -e <experiment spec file>
inference.checkpoint=<model to be inferenced>
[inference.<inference_option>=<inference_option_value>]
[inference.gpu_ids=<gpu indices>]
[inference.num_gpus=<number of gpus>]
Required Arguments#
-e, --experiment_spec: The experiment spec file to set up the inference experiment.
inference.checkpoint: The .pth model to use for inference.
Optional Arguments#
inference.<inference_option>: The inference options.