Mask Auto Labeler#
Mask Auto Labeler (MAL) is a high-quality, transformer-based mask auto-labeling framework for instance segmentation using only box annotations. It supports the following tasks:
train
evaluate
inference
These tasks may be invoked from the TAO Launcher using the following convention on the command line:
tao mal <sub_task> <args_per_subtask>
Here, args_per_subtask refers to the command-line arguments required for a given subtask. Each of these subtasks is explained in detail below.
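For example, following this convention, a training job that reads its configuration from a spec file would be launched as follows (the spec file path is illustrative):

tao mal train -e /workspace/specs/mal_train.yaml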
Creating a Configuration File#
SPECS=$(tao-client mal get-spec --action train --job_type experiment --id $EXPERIMENT_ID)
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
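The returned spec can be adjusted before it is submitted with a run action. As a minimal sketch, assuming the spec is returned as JSON and that the jq utility is available on your system (jq is not part of the TAO client), a field such as the training batch size could be overridden like this:

# Override the training batch size in the retrieved spec (illustrative field and value)
SPECS=$(echo "$SPECS" | jq '.train.batch_size = 4')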
Below is a sample MAL spec file. It has five components (model, inference, evaluate, dataset, and train) as well as several global parameters, which are described below. The spec file is in YAML format.
strategy: 'fsdp'
results_dir: '/path/to/result/dir'
dataset:
  train_ann_path: '/datasets/coco/annotations/instances_train2017.json'
  train_img_dir: '/datasets/coco/raw-data/train2017'
  val_ann_path: '/coco/annotations/instances_val2017.json'
  val_img_dir: '/datasets/coco/raw-data/val2017'
  load_mask: True
  crop_size: 512
inference:
  ann_path: '/dataset/sample.json'
  img_dir: '/dataset/sample_dir'
  label_dump_path: '/dataset/sample_output.json'
model:
  arch: 'vit-mae-base/16'
train:
  num_epochs: 10
  checkpoint_interval: 5
  validation_interval: 5
  batch_size: 4
  seed: 1234
  num_gpus: 1
  gpu_ids: [0]
  use_amp: True
  optim_momentum: 0.9
  lr: 0.0000015
  min_lr_rate: 0.2
  wd: 0.0005
  warmup_epochs: 1
  crf_kernel_size: 3
  crf_num_iter: 100
  loss_mil_weight: 4
  loss_crf_weight: 0.5
Parameter | Datatype | Default | Description | Supported Values
--------- | -------- | ------- | ----------- | ----------------
model | dict config | – | The configuration of the model architecture |
dataset | dict config | – | The configuration of the dataset |
train | dict config | – | The configuration of the training task |
evaluate | dict config | – | The configuration of the evaluation task |
inference | dict config | – | The configuration of the inference task |
encryption_key | string | None | The encryption key to encrypt and decrypt model files |
results_dir | string | /results | The directory where experiment results are saved |
strategy | string | 'ddp' | The distributed training strategy | 'ddp', 'fsdp'
Dataset Config#
The dataset configuration (dataset) defines the data source and input size.
Field | Datatype | Default | Description | Supported Values
----- | -------- | ------- | ----------- | ----------------
train_ann_path | string | – | The path to the training annotation JSON file |
val_ann_path | string | – | The path to the validation annotation JSON file |
train_img_dir | string | – | The path to the training image directory |
val_img_dir | string | – | The path to the validation image directory |
crop_size | unsigned int | 512 | The effective input size of the model |
load_mask | boolean | True | A flag specifying whether to load the segmentation mask from the JSON file |
min_obj_size | float | 2048 | The minimum object size for training |
max_obj_size | float | 1e10 | The maximum object size for training |
num_workers_per_gpu | unsigned int | – | The number of workers to load data for each GPU |
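Putting the table together, a dataset section that also sets the optional object-size filters and loader workers might look like the following sketch. The field names min_obj_size, max_obj_size, and num_workers_per_gpu follow the table above; the values shown for them are illustrative, so verify them against your TAO version:

dataset:
  train_ann_path: '/datasets/coco/annotations/instances_train2017.json'
  train_img_dir: '/datasets/coco/raw-data/train2017'
  val_ann_path: '/datasets/coco/annotations/instances_val2017.json'
  val_img_dir: '/datasets/coco/raw-data/val2017'
  crop_size: 512
  load_mask: True
  min_obj_size: 2048        # skip objects smaller than this during training
  max_obj_size: 1e10        # skip objects larger than this during training
  num_workers_per_gpu: 2    # illustrative worker count per GPU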
Model Config#
The model configuration (model) defines the model architecture.
Field | Datatype | Default | Description | Supported Values
----- | -------- | ------- | ----------- | ----------------
arch | string | vit-mae-base/16 | The backbone architecture |
frozen_stages | List[int] | [-1] | The indices of the frozen blocks |
mask_head_num_convs | unsigned int | 4 | The number of conv layers in the mask head |
mask_head_hidden_channel | unsigned int | 256 | The number of conv channels in the mask head |
mask_head_out_channel | unsigned int | 256 | The number of output channels in the mask head |
teacher_momentum | float | 0.996 | The momentum of the teacher model |
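As a sketch, a model section that spells out the defaults from the table above would look like this (the field names other than arch come from the reconstructed table, so verify them against your TAO version):

model:
  arch: 'vit-mae-base/16'
  frozen_stages: [-1]            # indices of frozen blocks (default)
  mask_head_num_convs: 4
  mask_head_hidden_channel: 256
  mask_head_out_channel: 256
  teacher_momentum: 0.996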
Train Config#
The training configuration (train) specifies the parameters for the training process.
Parameter | Datatype | Default | Description | Supported Values
--------- | -------- | ------- | ----------- | ----------------
num_gpus | unsigned int | 1 | The number of GPUs to use for distributed training | >0
gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed training |
seed | unsigned int | 1234 | The random seed for random, numpy, and torch | >0
num_epochs | unsigned int | 10 | The total number of epochs to run the experiment | >0
checkpoint_interval | unsigned int | 1 | The epoch interval at which the checkpoints are saved | >0
validation_interval | unsigned int | 1 | The epoch interval at which the validation is run | >0
resume_training_checkpoint_path | string | – | The intermediate PyTorch Lightning checkpoint to resume training from |
results_dir | string | /results/train | The directory to save training results |
batch_size | unsigned int | – | The training batch size | >0
use_amp | boolean | True | A flag specifying whether to use mixed precision |
optim_momentum | float | 0.9 | The momentum of the AdamW optimizer |
lr | float | 0.0000015 | The learning rate |
min_lr_rate | float | 0.2 | The minimum learning rate ratio |
wd | float | 0.0005 | The weight decay |
warmup_epochs | unsigned int | 1 | The number of epochs for warmup |
crf_kernel_size | unsigned int | 3 | The kernel size of the mean field approximation |
crf_num_iter | unsigned int | 100 | The number of iterations to run mask refinement |
loss_mil_weight | float | 4 | The weight of the multiple instance learning loss |
loss_crf_weight | float | 0.5 | The weight of the conditional random field loss |
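For example, to resume an interrupted run from a specific checkpoint rather than from the auto-detected latest one (see Checkpointing and Resuming Training below), point resume_training_checkpoint_path at a saved checkpoint. The checkpoint naming follows the pattern described later on this page; the path is illustrative:

train:
  resume_training_checkpoint_path: '/results/train/model_epoch_002.pth'
  num_epochs: 10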
Evaluation Config#
The evaluation configuration (evaluate) specifies the parameters for validation during training as well as for standalone evaluation.
Field | Datatype | Default | Description | Supported Values
----- | -------- | ------- | ----------- | ----------------
checkpoint | string | – | The path to the PyTorch model to evaluate |
results_dir | string | /results/evaluate | The directory to save evaluation results |
num_gpus | unsigned int | 1 | The number of GPUs to use for distributed evaluation | >0
gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed evaluation |
batch_size | unsigned int | – | The evaluation batch size | >0
use_mixed_model_test | boolean | False | A flag specifying whether to evaluate with the mixed model |
use_teacher_test | boolean | False | A flag specifying whether to evaluate with the teacher model |
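A sketch of an evaluate section, using the field names from the reconstructed table above (the checkpoint path is illustrative, and the flag names should be verified against your TAO version):

evaluate:
  checkpoint: '/results/train/model_epoch_004.pth'
  num_gpus: 1
  gpu_ids: [0]
  batch_size: 4
  use_mixed_model_test: False
  use_teacher_test: False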
Inference Config#
The inference configuration (inference) specifies the parameters for generating pseudo masks given the ground-truth bounding boxes in COCO format.
Field | Datatype | Default | Description | Supported Values
----- | -------- | ------- | ----------- | ----------------
checkpoint | string | – | The path to the PyTorch model to run inference with |
results_dir | string | /results/inference | The directory to save inference results |
num_gpus | unsigned int | 1 | The number of GPUs to use for distributed inference | >0
gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed inference |
ann_path | string | – | The path to the annotation JSON file |
img_dir | string | – | The image directory |
label_dump_path | string | – | The path to save the output JSON file with pseudo masks |
batch_size | unsigned int | – | The inference batch size | >0
load_mask | boolean | False | A flag specifying whether to load masks if the annotation file has them |
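A sketch of an inference section that extends the sample spec above with a checkpoint and batch size (field names follow the reconstructed table; paths and values are illustrative):

inference:
  checkpoint: '/results/train/model_epoch_004.pth'
  ann_path: '/dataset/sample.json'
  img_dir: '/dataset/sample_dir'
  label_dump_path: '/dataset/sample_output.json'
  batch_size: 4
  load_mask: False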
Training the Model#
Use the following command to run MAL training:
TRAIN_JOB_ID=$(tao-client mal experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
tao model mal train [-h] -e <experiment_spec>
                    [results_dir=<global_results_dir>]
                    [model.<model_option>=<model_option_value>]
                    [dataset.<dataset_option>=<dataset_option_value>]
                    [train.<train_option>=<train_option_value>]
                    [train.gpu_ids=<gpu indices>]
                    [train.num_gpus=<number of gpus>]
Required Arguments
The only required argument is the path to the experiment spec:
-e, --experiment_spec: The experiment specification file to set up the training experiment
Optional Arguments
You can set optional arguments to override the option values in the experiment spec file, as shown in the example after this list.
-h, --help: Show this help message and exit.
model.<model_option>: The model options.
dataset.<dataset_option>: The dataset options.
train.<train_option>: The train options.
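For example, the following invocation (the spec path and values are illustrative) overrides the epoch count and the GPU settings without editing the spec file:

tao model mal train -e /workspace/specs/mal_train.yaml \
    train.num_epochs=20 \
    train.num_gpus=2 \
    train.gpu_ids=[0,1]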
Note
For training, evaluation, and inference, we expose two variables for each respective task: num_gpus and gpu_ids, which default to 1 and [0], respectively. If both are passed but are inconsistent, for example num_gpus = 1 and gpu_ids = [0, 1], then they are modified to follow the setting with more GPUs, for example num_gpus = 1 -> num_gpus = 2.
In some cases, you may encounter an issue with multi-GPU training resulting in a segmentation fault. You can circumvent this by setting the OMP_NUM_THREADS environment variable to 1. Depending on your mode of execution, you can use one of the following methods to set this variable:
CLI Launcher
You can set this environment variable by adding the following fields to the Envs field of your ~/.tao_mounts.json file, as mentioned in bullet 3 in this section:
{
    "Envs": [
        {
            "variable": "OMP_NUM_THREADS",
            "value": "1"
        }
    ]
}
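For context, a minimal sketch of a complete ~/.tao_mounts.json that combines a mount point with the environment variable might look like the following (the mount paths are illustrative):

{
    "Mounts": [
        {
            "source": "/path/to/local/workspace",
            "destination": "/workspace"
        }
    ],
    "Envs": [
        {
            "variable": "OMP_NUM_THREADS",
            "value": "1"
        }
    ]
}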
Docker
You can set environment variables in the container by passing the -e flag on the docker command line, as shown below.
docker run -it --rm --gpus all \
    -e OMP_NUM_THREADS=1 \
    -v /path/to/local/mount:/path/to/docker/mount \
    nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt <model> train -e <path/to/experiment/spec>
Checkpointing and Resuming Training
At every train.checkpoint_interval, a PyTorch Lightning checkpoint is saved. It is called model_epoch_<epoch_num>.pth. These checkpoints are saved in train.results_dir, like so:
$ ls /results/train
'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'
The latest checkpoint is also saved as mal_model_latest.pth. Training automatically resumes from mal_model_latest.pth if it exists in train.results_dir. This is superseded by train.resume_training_checkpoint_path if it is provided.
The major implication of this logic is that if you wish to trigger fresh training from scratch, you should do one of the following, as shown in the sketch after this list:
Specify a new, empty results directory (recommended)
Remove the latest checkpoint from the results directory
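For example, assuming the default training results directory, either of the following gives a fresh start (paths are illustrative):

# Option 1 (recommended): point training at a new, empty results directory
tao model mal train -e /workspace/specs/mal_train.yaml train.results_dir=/results/train_run2

# Option 2: remove the auto-resume checkpoint from the existing directory
rm /results/train/mal_model_latest.pth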
Evaluating the Model#
To run evaluation for a MAL model, use this command:
EVAL_JOB_ID=$(tao-client mal experiment-run-action --action evaluate --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
tao model mal evaluate [-h] -e <experiment_spec_file>
                       evaluate.checkpoint=<model to be evaluated>
                       [evaluate.<evaluate_option>=<evaluate_option_value>]
                       [evaluate.gpu_ids=<gpu indices>]
                       [evaluate.num_gpus=<number of gpus>]
Required Arguments
The following arguments are required:
-e, --experiment_spec: The experiment spec file to set up the evaluation experiment.
evaluate.checkpoint: The .pth model to be evaluated.
Optional Arguments
The following arguments are optional to run the command:
evaluate.<evaluate_option>: The evaluate options, as shown in the example below.
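For example, a full evaluation invocation against one of the checkpoints produced during training might look like this (paths are illustrative):

tao model mal evaluate -e /workspace/specs/mal_train.yaml \
    evaluate.checkpoint=/results/train/model_epoch_004.pth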
Running Inference#
The inference tool for MAL networks can be used to generate pseudo masks. Here's an example of using this tool:
INFERENCE_JOB_ID=$(tao-client mal experiment-run-action --action inference --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
tao model mal inference [-h] -e <experiment_spec_file>
                        inference.checkpoint=<model to run inference with>
                        [inference.<inference_option>=<inference_option_value>]
                        [inference.gpu_ids=<gpu indices>]
                        [inference.num_gpus=<number of gpus>]
Required Arguments
The following arguments are required to run the command:
-e, --experiment_spec: The experiment spec file to set up the inference experiment.
inference.checkpoint: The .pth model to run inference with.
Optional Arguments
The following arguments are optional to run the command:
inference.<inference_option>: The inference options, as shown in the example below.
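For example, a full inference invocation that writes the pseudo masks to a custom location might look like this (paths are illustrative):

tao model mal inference -e /workspace/specs/mal_train.yaml \
    inference.checkpoint=/results/train/model_epoch_004.pth \
    inference.label_dump_path=/dataset/sample_output.json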