Mask Auto Labeler#
Mask Auto Labeler (MAL) is a high-quality, transformer-based mask auto-labeling framework for instance segmentation using only box annotations. It supports the following tasks:
train
evaluate
inference
These tasks may be invoked from the TAO Launcher using the following convention on the command line:
tao mal <sub_task> <args_per_subtask>
Where args_per_subtask are the command-line arguments required for a given subtask. Each of
these subtasks is explained in detail below.
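For example, a training run with a spec file at /path/to/spec.yaml (a placeholder path) can be launched as follows; the full usage for each subtask appears in the sections below:
tao model mal train -e /path/to/spec.yaml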
Creating a Configuration File#
BASE_EXPERIMENT_ID=$(tao mal list-base-experiments | jq -r '.[0].id')
SPECS=$(tao mal get-job-schema --action train --base-experiment-id $BASE_EXPERIMENT_ID | jq -r '.default')
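The returned defaults can be adjusted before a job is created. Below is a minimal sketch using jq; the .train.num_epochs key mirrors the sample spec file below, and the resulting $TRAIN_SPECS variable is what the training job created later on this page consumes:
TRAIN_SPECS=$(echo "$SPECS" | jq '.train.num_epochs = 10')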
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
Below is a sample MAL spec file. It has five components (model, inference,
evaluate, dataset, and train) as well as several global parameters,
all of which are described below. The spec file is in YAML format.
strategy: 'fsdp'
results_dir: '/path/to/result/dir'
dataset:
  train_ann_path: '/datasets/coco/annotations/instances_train2017.json'
  train_img_dir: '/datasets/coco/raw-data/train2017'
  val_ann_path: '/datasets/coco/annotations/instances_val2017.json'
  val_img_dir: '/datasets/coco/raw-data/val2017'
  load_mask: True
  crop_size: 512
inference:
  ann_path: '/dataset/sample.json'
  img_dir: '/dataset/sample_dir'
  label_dump_path: '/dataset/sample_output.json'
model:
  arch: 'vit-mae-base/16'
train:
  num_epochs: 10
  checkpoint_interval: 5
  validation_interval: 5
  batch_size: 4
  seed: 1234
  num_gpus: 1
  gpu_ids: [0]
  use_amp: True
  optim_momentum: 0.9
  lr: 0.0000015
  min_lr_rate: 0.2
  wd: 0.0005
  warmup_epochs: 1
  crf_kernel_size: 3
  crf_num_iter: 100
  loss_mil_weight: 4
  loss_crf_weight: 0.5
Parameter | Datatype | Default | Description | Supported Values
--------- | -------- | ------- | ----------- | ----------------
model | dict config | – | The configuration of the model architecture |
dataset | dict config | – | The configuration of the dataset |
train | dict config | – | The configuration of the training task |
evaluate | dict config | – | The configuration of the evaluation task |
inference | dict config | – | The configuration of the inference task |
encryption_key | string | None | The encryption key to encrypt and decrypt model files |
results_dir | string | /results | The directory where experiment results are saved |
strategy | string | 'ddp' | The distributed training strategy | 'ddp', 'fsdp'
Dataset Config#
The dataset configuration (dataset) defines the data source and input size.
Field | Datatype | Default | Description | Supported Values
----- | -------- | ------- | ----------- | ----------------
train_ann_path | string | – | The path to the training annotation JSON file |
val_ann_path | string | – | The path to the validation annotation JSON file |
train_img_dir | string | – | The path to the training image directory |
val_img_dir | string | – | The path to the validation image directory |
crop_size | Unsigned int | 512 | The effective input size of the model |
load_mask | boolean | True | A flag specifying whether to load the segmentation mask from the JSON file |
min_obj_size | float | 2048 | The minimum object size for training |
max_obj_size | float | 1e10 | The maximum object size for training |
num_workers_per_gpu | Unsigned int | – | The number of workers to load data for each GPU |
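Any of these fields can also be overridden on the command line when launching training. A minimal sketch using fields from the sample spec file above (the spec path is a placeholder):
tao model mal train -e /path/to/spec.yaml \
  dataset.crop_size=512 \
  dataset.load_mask=True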
Model Config#
The model configuration (model) defines the model architecture.
Field | Datatype | Default | Description | Supported Values
----- | -------- | ------- | ----------- | ----------------
arch | string | vit-mae-base/16 | The backbone architecture |
frozen_stages | List[int] | [-1] | The indices of the frozen blocks |
mask_head_num_convs | Unsigned int | 4 | The number of conv layers in the mask head |
mask_head_hidden_channel | Unsigned int | 256 | The number of conv channels in the mask head |
mask_head_out_channel | Unsigned int | 256 | The number of output channels in the mask head |
teacher_momentum | float | 0.996 | The momentum of the teacher model |
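Model options follow the same command-line override convention as the other components. A sketch that pins the default backbone explicitly (the spec path is a placeholder):
tao model mal train -e /path/to/spec.yaml \
  model.arch=vit-mae-base/16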
Train Config#
The training configuration (train) specifies the parameters for the training process.
Parameter | Datatype | Default | Description | Supported Values
--------- | -------- | ------- | ----------- | ----------------
num_gpus | unsigned int | 1 | The number of GPUs to use for distributed training | >0
gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed training |
seed | unsigned int | 1234 | The random seed for random, numpy, and torch | >0
num_epochs | unsigned int | 10 | The total number of epochs to run the experiment | >0
checkpoint_interval | unsigned int | 1 | The epoch interval at which checkpoints are saved | >0
validation_interval | unsigned int | 1 | The epoch interval at which validation is run | >0
resume_training_checkpoint_path | string | – | The intermediate PyTorch Lightning checkpoint to resume training from |
results_dir | string | /results/train | The directory to save training results |
batch_size | Unsigned int | – | The training batch size |
use_amp | boolean | True | A flag specifying whether to use mixed precision |
optim_momentum | float | 0.9 | The momentum of the AdamW optimizer |
lr | float | 0.0000015 | The learning rate |
min_lr_rate | float | 0.2 | The minimum learning rate ratio |
wd | float | 0.0005 | The weight decay |
warmup_epochs | Unsigned int | 1 | The number of epochs for warmup |
crf_kernel_size | Unsigned int | 3 | The kernel size of the mean field approximation |
crf_num_iter | Unsigned int | 100 | The number of iterations to run mask refinement |
loss_mil_weight | float | 4 | The weight of the multiple instance learning loss |
loss_crf_weight | float | 0.5 | The weight of the conditional random field loss |
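Training options can likewise be overridden at launch. A sketch that applies the schedule from the sample spec file (the spec path is a placeholder):
tao model mal train -e /path/to/spec.yaml \
  train.num_epochs=10 \
  train.checkpoint_interval=5 \
  train.validation_interval=5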
Evaluation Config#
The evaluation configuration (evaluate) specifies the parameters for the validation during training as well as the standalone evaluation.
Field | Datatype | Default | Description | Supported Values
----- | -------- | ------- | ----------- | ----------------
checkpoint | string | – | The path to the PyTorch model to evaluate |
results_dir | string | /results/evaluate | The directory to save evaluation results |
num_gpus | unsigned int | 1 | The number of GPUs to use for distributed evaluation | >0
gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed evaluation |
batch_size | Unsigned int | – | The evaluation batch size |
use_mixed_model_test | boolean | False | A flag specifying whether to evaluate with the mixed model |
use_teacher_test | boolean | False | A flag specifying whether to evaluate with the teacher model |
Inference Config#
The inference configuration (inference) specifies the parameters for generating pseudo masks given the groundtruth bounding boxes in COCO format.
Field | Datatype | Default | Description | Supported Values
----- | -------- | ------- | ----------- | ----------------
checkpoint | string | – | The path to the PyTorch model to run inference with |
results_dir | string | /results/inference | The directory to save inference results |
num_gpus | unsigned int | 1 | The number of GPUs to use for distributed inference | >0
gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed inference |
ann_path | string | – | The path to the annotation JSON file |
img_dir | string | – | The image directory |
label_dump_path | string | – | The path to save the output JSON file with pseudo masks |
batch_size | Unsigned int | – | The inference batch size |
load_mask | boolean | False | A flag specifying whether to load masks if the annotation file has them |
Training the Model#
Use the following command to run MAL training:
TRAIN_JOB_ID=$(tao mal create-job \
--kind experiment \
--name "mal_train" \
--action train \
--workspace-id $WORKSPACE_ID \
--specs "$TRAIN_SPECS" \
--train-datasets '["'$DATASET_ID'"]' \
--eval-dataset "$DATASET_ID" \
--base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
--encryption-key "nvidia_tlt" | jq -r '.id')
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
tao model mal train [-h] -e <experiment_spec>
[results_dir=<global_results_dir>]
[model.<model_option>=<model_option_value>]
[dataset.<dataset_option>=<dataset_option_value>]
[train.<train_option>=<train_option_value>]
[train.gpu_ids=<gpu indices>]
[train.num_gpus=<number of gpus>]
Required Arguments
The only required argument is the path to the experiment spec:
-e, --experiment_spec: The experiment specification file to set up the training experiment
Optional Arguments
You can set optional arguments to override the option values in the experiment spec file.
-h, --help: Show this help message and exit.
model.<model_option>: The model options.
dataset.<dataset_option>: The dataset options.
train.<train_option>: The train options.
Note
For training, evaluation, and inference, we expose two variables for each task: num_gpus and gpu_ids, which
default to 1 and [0], respectively. If both are passed but are inconsistent (for example, num_gpus = 1 with
gpu_ids = [0, 1]), they are modified to follow the setting that implies more GPUs; in this example, num_gpus is changed from 1 to 2.
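For example, to run training on two GPUs, pass consistent values for both variables (the spec path is a placeholder):
tao model mal train -e /path/to/spec.yaml \
  train.num_gpus=2 \
  train.gpu_ids=[0,1]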
In some cases, multi-GPU training may result in a segmentation fault. You can circumvent this by
setting the environment variable OMP_NUM_THREADS to 1. Depending on your mode of execution, you may use the following methods to set
this variable:
CLI Launcher:
You may set the environment variable by adding the following to the Envs field of your ~/.tao_mounts.json file, as mentioned in bullet 3 of the section Running the launcher:
{
  "Envs": [
    {
      "variable": "OMP_NUM_THREADS",
      "value": "1"
    }
  ]
}
Docker:
You may set environment variables in Docker by passing the -e flag on the Docker command line:
docker run -it --rm --gpus all \
  -e OMP_NUM_THREADS=1 \
  -v /path/to/local/mount:/path/to/docker/mount \
  nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt \
  <model> train -e <experiment_spec_file>
Checkpointing and Resuming Training
At every train.checkpoint_interval, a PyTorch Lightning checkpoint is saved. It is called model_epoch_<epoch_num>.pth.
Checkpoints are saved in train.results_dir, like this:
$ ls /results/train
'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'
The latest checkpoint is also saved as mal_model_latest.pth.
Training automatically resumes from mal_model_latest.pth, if it exists in train.results_dir.
This is superseded by train.resume_training_checkpoint_path, if it is provided.
The major implication of this logic is that, if you wish to start training from scratch, you should either:
Specify a new, empty results directory (Recommended)
Remove the latest checkpoint from the results directory
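To resume from a specific intermediate checkpoint rather than the latest one, point train.resume_training_checkpoint_path at it. A sketch using a checkpoint from the listing above (the spec path is a placeholder):
tao model mal train -e /path/to/spec.yaml \
  train.resume_training_checkpoint_path=/results/train/model_epoch_004.pth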
Evaluating the Model#
To run evaluation for a MAL model, use this command:
EVAL_JOB_ID=$(tao mal create-job \
--kind experiment \
--name "mal_evaluate" \
--action evaluate \
--workspace-id $WORKSPACE_ID \
--parent-job-id $TRAIN_JOB_ID \
--eval-dataset "$DATASET_ID" \
--specs "$EVALUATE_SPECS" \
--base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
--encryption-key "nvidia_tlt" | jq -r '.id')
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
tao model mal evaluate [-h] -e <experiment_spec_file>
evaluate.checkpoint=<model to be evaluated>
[evaluate.<evaluate_option>=<evaluate_option_value>]
[evaluate.gpu_ids=<gpu indices>]
[evaluate.num_gpus=<number of gpus>]
Required Arguments
The following arguments are required.
-e, --experiment_spec: The experiment spec file to set up the evaluation experiment.
evaluate.checkpoint: The .pth model to be evaluated.
Optional Arguments
The following arguments are optional.
evaluate.<evaluate_option>: The evaluate options.
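Putting these together, a sketch that evaluates the latest training checkpoint (paths are placeholders):
tao model mal evaluate -e /path/to/spec.yaml \
  evaluate.checkpoint=/results/train/mal_model_latest.pth \
  evaluate.num_gpus=1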
Running Inference#
The inference tool for MAL networks can be used to generate pseudo masks.
Here’s an example of using this tool:
INFERENCE_JOB_ID=$(tao mal create-job \
--kind experiment \
--name "mal_inference" \
--action inference \
--workspace-id $WORKSPACE_ID \
--parent-job-id $TRAIN_JOB_ID \
--inference-dataset "$DATASET_ID" \
--specs "$INFERENCE_SPECS" \
--base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
--encryption-key "nvidia_tlt" | jq -r '.id')
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
tao model mal inference [-h] -e <experiment spec file>
inference.checkpoint=<model to run inference with>
[inference.<inference_option>=<inference_option_value>]
[inference.gpu_ids=<gpu indices>]
[inference.num_gpus=<number of gpus>]
Required Arguments
The following arguments are required.
-e, --experiment_spec: The experiment spec file to set up the inference experiment.
inference.checkpoint: The .pth model to run inference with.
Optional Arguments
The following arguments are optional.
inference.<inference_option>: The inference options.
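Putting these together, a sketch that writes pseudo masks for the sample annotations from the spec file above (paths are placeholders):
tao model mal inference -e /path/to/spec.yaml \
  inference.checkpoint=/results/train/mal_model_latest.pth \
  inference.ann_path=/dataset/sample.json \
  inference.img_dir=/dataset/sample_dir \
  inference.label_dump_path=/dataset/sample_output.json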