Mask Auto Labeler#
Mask Auto Labeler (MAL) is a high-quality, transformer-based mask auto-labeling framework for instance segmentation using only box annotations. It supports the following tasks:
train
evaluate
inference
These tasks may be invoked from the TAO Launcher using the following convention on the command line:
tao mal <sub_task> <args_per_subtask>
Where args_per_subtask are the command-line arguments required for a given subtask. Each of
these subtasks is explained in detail below.
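For example, a training run with a spec file at /path/to/spec.yaml (a placeholder path) can be launched as follows; the full usage for each subtask appears in the sections below:
tao model mal train -e /path/to/spec.yaml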
Creating a Configuration File#
BASE_EXPERIMENT_ID=$(tao mal list-base-experiments | jq -r '.[0].id')
SPECS=$(tao mal get-job-schema --action train --base-experiment-id $BASE_EXPERIMENT_ID | jq -r '.default')
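The returned defaults can be adjusted before a job is created. Below is a minimal sketch using jq; the .train.num_epochs key mirrors the sample spec file below, and the resulting $TRAIN_SPECS variable is what the training job created later on this page consumes:
TRAIN_SPECS=$(echo "$SPECS" | jq '.train.num_epochs = 10')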
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
Below is a sample MAL spec file. It has five components (model, inference,
evaluate, dataset, and train) as well as several global parameters,
all of which are described below. The spec file is in YAML format.
strategy: 'fsdp'
results_dir: '/path/to/result/dir'
dataset:
  train_ann_path: '/datasets/coco/annotations/instances_train2017.json'
  train_img_dir: '/datasets/coco/raw-data/train2017'
  val_ann_path: '/datasets/coco/annotations/instances_val2017.json'
  val_img_dir: '/datasets/coco/raw-data/val2017'
  load_mask: True
  crop_size: 512
inference:
  ann_path: '/dataset/sample.json'
  img_dir: '/dataset/sample_dir'
  label_dump_path: '/dataset/sample_output.json'
model:
  arch: 'vit-mae-base/16'
train:
  num_epochs: 10
  checkpoint_interval: 5
  validation_interval: 5
  batch_size: 4
  seed: 1234
  num_gpus: 1
  gpu_ids: [0]
  use_amp: True
  optim_momentum: 0.9
  lr: 0.0000015
  min_lr_rate: 0.2
  wd: 0.0005
  warmup_epochs: 1
  crf_kernel_size: 3
  crf_num_iter: 100
  loss_mil_weight: 4
  loss_crf_weight: 0.5
Parameter | Datatype | Default | Description | Supported Values
--------- | -------- | ------- | ----------- | ----------------
model | dict config | – | The configuration of the model architecture |
dataset | dict config | – | The configuration of the dataset |
train | dict config | – | The configuration of the training task |
evaluate | dict config | – | The configuration of the evaluation task |
inference | dict config | – | The configuration of the inference task |
encryption_key | string | None | The encryption key to encrypt and decrypt model files |
results_dir | string | /results | The directory where experiment results are saved |
strategy | string | 'ddp' | The distributed training strategy | 'ddp', 'fsdp'
Dataset Config#
The dataset configuration (dataset) defines the data source and input size.
Field | Datatype | Default | Description | Supported Values
----- | -------- | ------- | ----------- | ----------------
train_ann_path | string | – | The path to the training annotation JSON file |
val_ann_path | string | – | The path to the validation annotation JSON file |
train_img_dir | string | – | The path to the training image directory |
val_img_dir | string | – | The path to the validation image directory |
crop_size | Unsigned int | 512 | The effective input size of the model |
load_mask | boolean | True | A flag specifying whether to load the segmentation mask from the JSON file |
min_obj_size | float | 2048 | The minimum object size for training |
max_obj_size | float | 1e10 | The maximum object size for training |
num_workers_per_gpu | Unsigned int | – | The number of workers to load data for each GPU |
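Any of these fields can also be overridden on the command line when launching training. A minimal sketch using fields from the sample spec file above (the spec path is a placeholder):
tao model mal train -e /path/to/spec.yaml \
  dataset.crop_size=512 \
  dataset.load_mask=True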
Model Config#
The model configuration (model) defines the model architecture.
Field | Datatype | Default | Description | Supported Values
----- | -------- | ------- | ----------- | ----------------
arch | string | vit-mae-base/16 | The backbone architecture |
frozen_stages | List[int] | [-1] | The indices of the frozen blocks |
mask_head_num_convs | Unsigned int | 4 | The number of conv layers in the mask head |
mask_head_hidden_channel | Unsigned int | 256 | The number of conv channels in the mask head |
mask_head_out_channel | Unsigned int | 256 | The number of output channels in the mask head |
teacher_momentum | float | 0.996 | The momentum of the teacher model |
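Model options follow the same command-line override convention as the other components. A sketch that pins the default backbone explicitly (the spec path is a placeholder):
tao model mal train -e /path/to/spec.yaml \
  model.arch=vit-mae-base/16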
Train Config#
The training configuration (train) specifies the parameters for the training process.
Parameter | Datatype | Default | Description | Supported Values
--------- | -------- | ------- | ----------- | ----------------
num_gpus | unsigned int | 1 | The number of GPUs to use for distributed training | >0
gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed training |
seed | unsigned int | 1234 | The random seed for random, numpy, and torch | >0
num_epochs | unsigned int | 10 | The total number of epochs to run the experiment | >0
checkpoint_interval | unsigned int | 1 | The epoch interval at which checkpoints are saved | >0
validation_interval | unsigned int | 1 | The epoch interval at which validation is run | >0
resume_training_checkpoint_path | string | – | The intermediate PyTorch Lightning checkpoint to resume training from |
results_dir | string | /results/train | The directory to save training results |
batch_size | Unsigned int | – | The training batch size |
use_amp | boolean | True | A flag specifying whether to use mixed precision |
optim_momentum | float | 0.9 | The momentum of the AdamW optimizer |
lr | float | 0.0000015 | The learning rate |
min_lr_rate | float | 0.2 | The minimum learning rate ratio |
wd | float | 0.0005 | The weight decay |
warmup_epochs | Unsigned int | 1 | The number of epochs for warmup |
crf_kernel_size | Unsigned int | 3 | The kernel size of the mean field approximation |
crf_num_iter | Unsigned int | 100 | The number of iterations to run mask refinement |
loss_mil_weight | float | 4 | The weight of the multiple instance learning loss |
loss_crf_weight | float | 0.5 | The weight of the conditional random field loss |
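Training options can likewise be overridden at launch. A sketch that applies the schedule from the sample spec file (the spec path is a placeholder):
tao model mal train -e /path/to/spec.yaml \
  train.num_epochs=10 \
  train.checkpoint_interval=5 \
  train.validation_interval=5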
Evaluation Config#
The evaluation configuration (evaluate) specifies the parameters for the validation during training as well as the standalone evaluation.
Field | Datatype | Default | Description | Supported Values
----- | -------- | ------- | ----------- | ----------------
checkpoint | string | – | The path to the PyTorch model to evaluate |
results_dir | string | /results/evaluate | The directory to save evaluation results |
num_gpus | unsigned int | 1 | The number of GPUs to use for distributed evaluation | >0
gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed evaluation |
batch_size | Unsigned int | – | The evaluation batch size |
use_mixed_model_test | boolean | False | A flag specifying whether to evaluate with the mixed model |
use_teacher_test | boolean | False | A flag specifying whether to evaluate with the teacher model |
Inference Config#
The inference configuration (inference) specifies the parameters for generating pseudo masks given the groundtruth bounding boxes in COCO format.
Field | Datatype | Default | Description | Supported Values
----- | -------- | ------- | ----------- | ----------------
checkpoint | string | – | The path to the PyTorch model to run inference with |
results_dir | string | /results/inference | The directory to save inference results |
num_gpus | unsigned int | 1 | The number of GPUs to use for distributed inference | >0
gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed inference |
ann_path | string | – | The path to the annotation JSON file |
img_dir | string | – | The image directory |
label_dump_path | string | – | The path to save the output JSON file with pseudo masks |
batch_size | Unsigned int | – | The inference batch size |
load_mask | boolean | False | A flag specifying whether to load masks if the annotation file has them |
Training the Model#
Use the following command to run MAL training:
TRAIN_JOB_ID=$(tao mal create-job \
--kind experiment \
--name "mal_train" \
--action train \
--workspace-id $WORKSPACE_ID \
--specs "$TRAIN_SPECS" \
--train-datasets '["'$DATASET_ID'"]' \
--eval-dataset "$DATASET_ID" \
--base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
--encryption-key "nvidia_tlt" | jq -r '.id')
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
tao model mal train [-h] -e <experiment_spec>
[results_dir=<global_results_dir>]
[model.<model_option>=<model_option_value>]
[dataset.<dataset_option>=<dataset_option_value>]
[train.<train_option>=<train_option_value>]
[train.gpu_ids=<gpu indices>]
[train.num_gpus=<number of gpus>]
Required Arguments
The only required argument is the path to the experiment spec:
-e, --experiment_spec: The experiment specification file to set up the training experiment
Optional Arguments
You can set optional arguments to override the option values in the experiment spec file.
-h, --help: Show this help message and exit.
model.<model_option>: The model options.
dataset.<dataset_option>: The dataset options.
train.<train_option>: The train options.
Note
For training, evaluation, and inference, we expose two variables for each task: num_gpus and gpu_ids, which
default to 1 and [0], respectively. If both are passed but are inconsistent (for example, num_gpus = 1 with
gpu_ids = [0, 1]), they are modified to follow the setting that implies more GPUs; in this example, num_gpus is changed from 1 to 2.
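For example, to run training on two GPUs, pass consistent values for both variables (the spec path is a placeholder):
tao model mal train -e /path/to/spec.yaml \
  train.num_gpus=2 \
  train.gpu_ids=[0,1]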
In some cases, multi-GPU training may result in a segmentation fault. You can circumvent this by
setting the environment variable OMP_NUM_THREADS to 1. Depending on your mode of execution, you may use the following methods to set
this variable:
CLI Launcher:
You may set the environment variable by adding the following to the Envs field of your ~/.tao_mounts.json file, as mentioned in bullet 3 of the section Running the launcher:
{
  "Envs": [
    {
      "variable": "OMP_NUM_THREADS",
      "value": "1"
    }
  ]
}
Docker:
You may set environment variables in Docker by passing the -e flag on the Docker command line:
docker run -it --rm --gpus all \
  -e OMP_NUM_THREADS=1 \
  -v /path/to/local/mount:/path/to/docker/mount \
  nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt \
  <model> train -e <experiment_spec_file>
Checkpointing and Resuming Training
At every train.checkpoint_interval, a PyTorch Lightning checkpoint is saved. It is called model_epoch_<epoch_num>.pth.
Checkpoints are saved in train.results_dir, like this:
$ ls /results/train
'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'
The latest checkpoint is also saved as mal_model_latest.pth.
Training automatically resumes from mal_model_latest.pth, if it exists in train.results_dir.
This is superseded by train.resume_training_checkpoint_path, if it is provided.
The major implication of this logic is that, if you wish to start training from scratch, you should either:
Specify a new, empty results directory (Recommended)
Remove the latest checkpoint from the results directory
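To resume from a specific intermediate checkpoint rather than the latest one, point train.resume_training_checkpoint_path at it. A sketch using a checkpoint from the listing above (the spec path is a placeholder):
tao model mal train -e /path/to/spec.yaml \
  train.resume_training_checkpoint_path=/results/train/model_epoch_004.pth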
Evaluating the Model#
To run evaluation for a MAL model, use this command:
EVAL_JOB_ID=$(tao mal create-job \
--kind experiment \
--name "mal_evaluate" \
--action evaluate \
--workspace-id $WORKSPACE_ID \
--parent-job-id $TRAIN_JOB_ID \
--eval-dataset "$DATASET_ID" \
--specs "$EVALUATE_SPECS" \
--base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
--encryption-key "nvidia_tlt" | jq -r '.id')
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
tao model mal evaluate [-h] -e <experiment_spec_file>
evaluate.checkpoint=<model to be evaluated>
[evaluate.<evaluate_option>=<evaluate_option_value>]
[evaluate.gpu_ids=<gpu indices>]
[evaluate.num_gpus=<number of gpus>]
Required Arguments
The following arguments are required.
-e, --experiment_spec: The experiment spec file to set up the evaluation experiment.
evaluate.checkpoint: The .pth model to be evaluated.
Optional Arguments
The following arguments are optional.
evaluate.<evaluate_option>: The evaluate options.
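Putting these together, a sketch that evaluates the latest training checkpoint (paths are placeholders):
tao model mal evaluate -e /path/to/spec.yaml \
  evaluate.checkpoint=/results/train/mal_model_latest.pth \
  evaluate.num_gpus=1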
Running Inference#
The inference tool for MAL networks can be used to generate pseudo masks.
Here’s an example of using this tool:
INFERENCE_JOB_ID=$(tao mal create-job \
--kind experiment \
--name "mal_inference" \
--action inference \
--workspace-id $WORKSPACE_ID \
--parent-job-id $TRAIN_JOB_ID \
--inference-dataset "$DATASET_ID" \
--specs "$INFERENCE_SPECS" \
--base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
--encryption-key "nvidia_tlt" | jq -r '.id')
See also
For information on how to create an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
tao model mal inference [-h] -e <experiment spec file>
inference.checkpoint=<model to run inference with>
[inference.<inference_option>=<inference_option_value>]
[inference.gpu_ids=<gpu indices>]
[inference.num_gpus=<number of gpus>]
Required Arguments
The following arguments are required.
-e, --experiment_spec: The experiment spec file to set up the inference experiment.
inference.checkpoint: The .pth model to run inference with.
Optional Arguments
The following arguments are optional.
inference.<inference_option>: The inference options.
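Putting these together, a sketch that writes pseudo masks for the sample annotations from the spec file above (paths are placeholders):
tao model mal inference -e /path/to/spec.yaml \
  inference.checkpoint=/results/train/mal_model_latest.pth \
  inference.ann_path=/dataset/sample.json \
  inference.img_dir=/dataset/sample_dir \
  inference.label_dump_path=/dataset/sample_output.json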