NVIDIA TAO Toolkit v4.0
NVIDIA TAO Release tlt.40

AutoML

AutoML is a TAO Toolkit API service that automatically selects deep learning hyperparameters for a chosen model and dataset. The TAO Toolkit API provides a Jupyter notebook interface to try the AutoML feature.

Each AutoML run contains multiple training experiments. At the end of an AutoML run, you can access the config containing the hyperparameters of the best performing model amongst the multiple experiments, along with the binary weight file for deploying the model to your application.

Object Detection

  • DetectNet V2

  • EfficientDet

  • FasterRCNN

  • YoloV3

  • YoloV4

  • YoloV4_Tiny

  • SSD

  • Retinanet

Image Segmentation

  • Mask RCNN (Instance Segmentation)

  • UNet (Semantic Segmentation)

Classification

  • Multi-class classification

  • Multi-task classification

Special Use Case Models

  • License Plate recognition

More tasks and models supported by TAO Toolkit, such as Conversational AI applications and Emotion-Detection, will be added in future releases.

  • Before using AutoML, you need to deploy the TAO REST API service using the TAO Toolkit API Setup steps.

    • The Hardware and Software requirements mentioned in the TAO Toolkit API Setup section are also applicable to AutoML.

  • After deploying the TAO Toolkit API, you can use the following commands on the host machine to obtain the node ip_address and port_number for use in the notebooks:

    • ip_address: hostname -i

    • port_number: kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'
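As a minimal sketch, the two commands above can also be run from Python and combined into a single address. The helper names `run`, `build_service_url`, and `discover_service_url`, as well as the `http://<ip>:<port>` layout, are assumptions for illustration; the notebooks define the exact URL they expect.

```python
import subprocess


def run(cmd):
    """Run a command and return its stripped standard output."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def build_service_url(ip_address, port_number):
    """Join host and port; the http scheme and layout are assumptions."""
    return f"http://{ip_address}:{int(port_number)}"


def discover_service_url():
    """Collect ip_address and port_number with the commands listed above."""
    # hostname -i may print several addresses; take the first one.
    ip_address = run(["hostname", "-i"]).split()[0]
    port_number = run([
        "kubectl", "get", "service", "ingress-nginx-controller",
        "-o", "jsonpath={.spec.ports[0].nodePort}",
    ])
    return build_service_url(ip_address, port_number)


# Example with made-up values; call discover_service_url() on the host machine.
print(build_service_url("10.0.0.5", "32080"))
```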

The TAO Toolkit API provides two types of notebook to interact with the deployed AutoML service:

  • TAO-API Normal AutoML notebook

  • TAO-API Client AutoML notebook, which allows users to run commands from the command line interface

Once a notebook is available on your computer, you can run the notebook with the following:


    cd <path_to_your_notebook>
    jupyter-lab --ip 0.0.0.0 --allow-root --port 8888

Note

We recommend running only one AutoML notebook at a time. Each experiment within AutoML is a separate Kubernetes job, and Kubernetes performs scheduling based on jobs submitted to the queue. Hence, running multiple models across notebooks will fill the queue with AutoML experiments of different models, extending the execution time to complete a single AutoML experiment.

[Mandatory] User Inputs

For each notebook, you need to provide/modify the following parameters:

  • model_name: The list of supported network models will be present in the notebook cell.

  • dataset path: For each task, the dataset should be in a certain folder structure (provided in the dataset description in the notebooks).

  • automl_algorithm: The AutoML algorithm options to use. A brief explanation of the algorithms can be found in the AutoML Algorithm Explanation section.

    • Bayesian: Adaptively proposes new hyperparameters based on past experiments. This algorithm generally provides better accuracy results, but with a longer execution time for all experiments, when compared to Hyperband.

    • Hyperband: Hyperband accelerates the expensive hyperparameter optimization using random search with adaptive resource allocation.

[Optional] User Inputs

For each model, TAO has set a default set of parameters to run the AutoML search. You can add new valid parameters via additional_automl_parameters or remove some parameters from the default list via remove_default_automl_parameters.

Each notebook contains hyperlinks in the Set AutoML related configurations cell to view the valid list of parameters of a particular model:

  • additional_automl_parameters: Add additional parameters to the AutoML search algorithm using this list of strings (e.g. additional_automl_parameters = ['parameter1','parameter2']).

    Any parameter in the hyperlink table whose automl_enabled column is set to neither True nor False can be added to the AutoML search space.

    For example, consider the list of DetectNet_v2 parameters:

    • dataset_config.target_class_mapping can’t be added to the additional_automl_parameters list as it is not eligible to be included in the search space.

    • training_config.regularizer.weight does not need to be added to this list, as it is already enabled by default.

    • augmentation_config.preprocessing.output_image_width can be added, as the automl_enabled column is set to neither True nor False.

  • remove_default_automl_parameters: Remove parameters that are enabled by default for AutoML search (e.g. remove_default_automl_parameters = ['parameter1','parameter2']).

    Any parameter in the hyperlink table that has the automl_enabled column set to True can be removed from the AutoML search space.

    For example, in the list of DetectNet_v2 parameters, training_config.regularizer.weight can be removed from the AutoML search space.
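The two lists can be pictured as edits to a default search space. The sketch below uses the DetectNet_v2 parameter names from the examples above; the one-entry `default_automl_parameters` list and the `effective_search_space` helper are hypothetical, and how the service actually merges the lists is an assumption.

```python
# Hypothetical default search space; the real defaults are listed in the
# parameter table linked from each notebook.
default_automl_parameters = [
    "training_config.regularizer.weight",
]

# Values you would set in the Set AutoML related configurations cell.
additional_automl_parameters = [
    "augmentation_config.preprocessing.output_image_width",
]
remove_default_automl_parameters = [
    "training_config.regularizer.weight",
]


def effective_search_space(defaults, additional, removed):
    """Return defaults plus additions, minus removals, keeping order."""
    space = []
    for name in list(defaults) + list(additional):
        if name not in removed and name not in space:
            space.append(name)
    return space


search_space = effective_search_space(
    default_automl_parameters,
    additional_automl_parameters,
    remove_default_automl_parameters,
)
print(search_space)
```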

[Optional] AutoML Algorithm-Specific Parameters

There are some algorithm-specific parameters set to default values that determine the AutoML experiment schedules. You can optionally modify them as follows:

Bayesian

  • max_recommendations: The maximum number of full-scale training experiments to run. The default value is 20.

    For example, setting this value to 10 will run 10 training experiments in sequential order, as the training config file of the nth experiment is computed from the (n-1)th experiment. At the end of the 10 experiments, the algorithm returns the training config file and binary weights of the experiment that achieved the best accuracy.

    • In the API notebook, enable this as follows:

      • automl_max_recommendations: The number of recommendations you want to set (integer), inside the automl_information dictionary variable in the Set AutoML related configurations cell.

    • In the CLI notebook, enable this as follows:

      • metadata["automl_max_recommendations"]: The number of recommendations you want to set (integer), in the Set AutoML related configurations cell.

Hyperband

  • R: The maximum resources (i.e. the number of recommendations and the maximum number of epochs).

    • The default value is set to 27 and is adjusted as explained in the Hyperband Parameter Auto-adjustment Mechanism section.

    • In the API notebook, enable this as follows:

      • automl_R: integer value inside automl_information dictionary variable in the Set AutoML related configurations cell

    • In the CLI notebook, enable this as follows:

      • metadata["automl_R"]: integer value in the Set AutoML related configurations cell.

  • Nu: Controls the proportion of recommendations discarded in each use of the SuccessiveHalving algorithm: only the best 1/nu of the recommendations are kept. The default value is 3.

    • In the API notebook, enable this as follows:

      • automl_nu: The value of nu you want to set (integer), inside the automl_information dictionary variable in the Set AutoML related configurations cell.

    • In the CLI notebook, enable this as follows:

      • metadata["automl_nu"]: The value of nu you want to set (integer), in the Set AutoML related configurations cell.

  • epoch_multiplier: The number of epochs for each experiment is determined by multiplying this value by the resources (r) allotted to that experiment. The default value is 10 and can be adjusted as explained in the Hyperband Parameter Auto-adjustment Mechanism section.

    • In the API notebook, enable this as follows:

      • epoch_multiplier: The value of epoch_multiplier you want to set (integer), inside the automl_information dictionary variable in the Set AutoML related configurations cell.

    • In the CLI notebook, enable this as follows:

      • metadata["epoch_multiplier"]: The value of epoch_multiplier you want to set (integer), in the Set AutoML related configurations cell.

    The values of R and nu determine the number of experiments to run within each stage of the Hyperband run and the corresponding number of epochs for each experiment. For example, setting "R=27, Nu=3, epoch_multiplier=10" will run three stages of experiments, as described in Table 3 of the AutoML Algorithm Explanation section:

    • The first stage proposes 27 new recommendations to run for 10 epochs. Then, Hyperband keeps the 9 (27/3) best performing recommendations to run another 30 (10*3) epochs, and repeats until one recommendation remains.

    • The second and third stages respectively propose 9 and 6 new recommendations (based on Table 3) and follow the same procedure.

    • After all the experiments conclude, Hyperband returns the training config file and binary weights of the experiment that achieved the best accuracy.
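The first-stage arithmetic in the example above can be sketched in a few lines. The dictionary keys are the ones named in this section; the helper `first_stage_rounds` is hypothetical and assumes the first stage starts with R recommendations at one unit of resource, as in Table 3.

```python
# Keys follow the names given above; the real notebook cell may contain
# additional entries that are not shown here.
automl_information = {
    "automl_R": 27,
    "automl_nu": 3,
    "epoch_multiplier": 10,
}


def first_stage_rounds(info):
    """List (recommendations, epoch budget) for each round of the first stage."""
    n = info["automl_R"]  # assumption: the first stage starts with R recommendations
    r = 1                 # and one unit of resource each
    rounds = []
    while n >= 1 and r <= info["automl_R"]:
        rounds.append((n, r * info["epoch_multiplier"]))
        n //= info["automl_nu"]  # keep the best 1/nu recommendations
        r *= info["automl_nu"]   # survivors get nu times more resource
    return rounds


for n, epochs in first_stage_rounds(automl_information):
    print(f"{n:2d} recommendations, epoch budget {epochs}")
```

With the values above, the rounds are 27, 9, 3, and 1 recommendations with epoch budgets of 10, 30, 90, and 270.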

Finally, you can change any training-related spec parameters (such as batch size, learning rate, weight regularizer, and frequency of checkpoints) to be reflected during each experiment. This step is necessary because these kinds of parameters often depend on the computing hardware specifications (e.g. GPU memory size).

After changing the above parameters, you can run the cells of the notebook. While running the notebook, some cells take time to complete execution. In the API notebook, the cells poll the status of the job every 15 seconds. The cell execution is completed only when the status switches to Done or Error.
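The polling behavior described above can be sketched as a small loop. Here `get_status` is a hypothetical callable standing in for the status request the notebook sends, and the intermediate status names in the usage example are made up; only Done and Error come from the text above.

```python
import time


def wait_for_job(get_status, interval_seconds=15, sleep=time.sleep):
    """Poll get_status() until it reports Done or Error, then return it."""
    while True:
        status = get_status()
        if status in ("Done", "Error"):
            return status
        sleep(interval_seconds)


# Usage with a canned sequence of statuses instead of a live service;
# the sleep function is replaced so that the example finishes immediately.
statuses = iter(["Pending", "Running", "Running", "Done"])
final_status = wait_for_job(lambda: next(statuses), sleep=lambda seconds: None)
print(final_status)
```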

For the AutoML train cell, you can see the following status indicators during cell execution. For the TAO-API Client AutoML notebook, you can either view these indicators or view the training logs of the experiments. By default, viewing status indicators is enabled; you can switch to viewing logs by setting poll_automl_stats = True.

  • The current experiment AutoML is executing

  • The total number of epochs across different experiments still pending

  • The approximate time remaining until the AutoML experiments complete. Note that this estimate covers training time only; the following times, which depend on the config values you set, come in addition:

    • The time for evaluation at every n epochs

    • The time for miscellaneous actions like saving a checkpoint at every n epochs

    • The time for loading the model from disk for resuming training in Hyperband

  • The best accuracy value until the current experiment

Resultant Files

After downloading the job contents of AutoML experiments, the folder uses the following structure:


├── automl_metadata.json
├── brain.json
├── controller.json
│   ...............
├── best_model
│   ├── log.txt
│   ├── recommendation_5.kitti
│   ├── status.json
│   ├── .....
│   └── weights
│      └── weights.tlt
│   ...............
├── experiment_0
│   ├── log.txt
│   ├── status.json
│   ├── .....
│   ...............
├── recommendation_0.kitti
├── recommendation_1.kitti
├── recommendation_2.kitti
..............

  • controller.json: Summarizes all experiment details regarding the results, target hyperparameter values, and status (success or failure).

  • best_model: Contains the best performing model outcomes (specs, logs, weights) out of all experiment sets in the AutoML runs. Note that the weight file (.tlt format) is located in either the best_model or best_model/weights folder, depending on the network model.

  • experiment_n: Contains the contents (specs, logs) of the nth experiment model outcomes in AutoML.

  • recommendation_n.kitti: Contains the entire spec information for the nth experiment in AutoML.
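Because the weight file can sit in either of two folders, a small lookup helper can be convenient after downloading the job contents. This is a minimal sketch; `find_best_weights` is a hypothetical helper, and the temporary folder below only imitates the layout shown above.

```python
import tempfile
from pathlib import Path


def find_best_weights(job_dir):
    """Return the .tlt files found in best_model or best_model/weights."""
    best_model = Path(job_dir) / "best_model"
    found = []
    for folder in (best_model, best_model / "weights"):
        if folder.is_dir():
            found.extend(sorted(folder.glob("*.tlt")))
    return found


# Usage on a throwaway folder that mimics the downloaded structure.
with tempfile.TemporaryDirectory() as tmp:
    weights_dir = Path(tmp) / "best_model" / "weights"
    weights_dir.mkdir(parents=True)
    (weights_dir / "weights.tlt").touch()
    found_names = [path.name for path in find_best_weights(tmp)]

print(found_names)
```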

Results of AutoML experiments

The table below provides experimental results for network models on different AI applications.

  • These numbers are profiled with the default AutoML-specific parameters: max_recommendations=20 for Bayesian and R=27, nu=3 for Hyperband. For some network models (efficientdet, mask_rcnn, multitask_classification), the default AutoML parameters are adjusted by the Hyperband Parameter Auto-adjustment Mechanism to R=9, nu=3 for Hyperband.

  • All experiments are executed on a single GPU (Tesla V100) with Intel Xeon CPU E5-2698 v4.

  • When using multi-GPU mode, the time taken is expected to scale accordingly; similarly, using more powerful GPUs will reduce the execution time.

Table 1. The estimated time for an AutoML (Bayesian/Hyperband) experiment for each network model

| model | epoch | dataset | single experiment | Bayesian | Hyperband |
|---|---|---|---|---|---|
| detectnet_v2 | 80 | FLIR20 | 45.7min | 913.2min | 534.2min |
| efficientdet | 6 | FLIR20 | 82min | 1640min | 410min |
| faster_rcnn | 80 | FLIR20 | 252min | 5040min | 2948.4min |
| retinanet | 100 | FLIR20 | 75min | 1500min | 702min |
| ssd | 80 | FLIR20 | 174min | 3480min | 2035min |
| yolo3 | 80 | FLIR20 | 60min | 1200min | 702min |
| yolo4 | 80 | FLIR20 | 110min | 2200min | 1287min |
| yolo4_tiny | 80 | FLIR20 | 87min | 1740min | 1017.9min |
| mask_rcnn | 5 | FLIR20 | 66min | 1320min | 237.6min |
| lprnet | 24 | OpenALPR | 1.1min | 22min | 39.2min |
| multiclass_classification | 80 | Pascal VOC | 80min | 1600min | 1279.7min |
| multitask_classification | 10 | Fashion Product | 12.5min | 250min | 799.8min |
| unet | 50 | ISBI | 4.5min | 90min | 98.5min |
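In the table above, the Bayesian column is close to the single-experiment time multiplied by the default max_recommendations of 20, which fits the sequential execution of Bayesian experiments. As a rough planning aid (an observation from these numbers, not a documented formula), the estimate can be sketched as follows; `estimate_bayesian_minutes` is a hypothetical helper.

```python
def estimate_bayesian_minutes(single_experiment_minutes, max_recommendations=20):
    """Rough total for sequential experiments: one full training per recommendation."""
    return single_experiment_minutes * max_recommendations


# efficientdet: 82 min per experiment, 1640 min reported for Bayesian.
print(estimate_bayesian_minutes(82))

# detectnet_v2: 45.7 min per experiment, 913.2 min reported for Bayesian.
print(round(estimate_bayesian_minutes(45.7), 1))
```

The Hyperband column does not follow such a simple rule, because its experiments run for different numbers of epochs.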

AutoML Algorithm Explanation

This section gives additional details on the two AutoML algorithms.

Bayesian Optimization

Bayesian optimization aims to identify optimal configurations more quickly than standard baselines (such as standard random search) by adaptively selecting hyperparameters based on past experimental information.

  • A Gaussian Process (GP) is fitted as a surrogate model to the existing data (X,y), where X is a vector of recommendations and y is the observed validation mAP.

  • Based on the fitted GP, Bayesian optimization adaptively proposes the new recommendation that produces the best improvement in expectation, and repeats until the number of max_recommendations is reached.
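"Best improvement in expectation" is commonly formalized as the expected improvement (EI) acquisition function. Whether the service uses exactly this form is not stated here, so the following is the standard textbook formulation rather than a description of the implementation. Here y* is the best validation score observed so far, μ(x) and σ(x) are the GP posterior mean and standard deviation at a candidate recommendation x (with σ(x) > 0), and Φ and φ are the standard normal CDF and PDF.

```latex
\mathrm{EI}(x)
  = \mathbb{E}\left[\max\bigl(f(x) - y^{*},\, 0\bigr)\right]
  = \bigl(\mu(x) - y^{*}\bigr)\,\Phi(z) + \sigma(x)\,\varphi(z),
\qquad
z = \frac{\mu(x) - y^{*}}{\sigma(x)}
```

The next recommendation is the candidate x that maximizes EI(x); the first term favors candidates predicted to score well, and the second favors candidates where the GP is still uncertain.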

Hyperband

Hyperband addresses the cost of hyperparameter optimization by speeding up random search through adaptive resource allocation. Hyperband uses the SuccessiveHalving algorithm, which uniformly allocates a budget to a set of hyperparameter recommendations. During AutoML runs, Hyperband evaluates the performance of all recommendations, discards the worst performers (the worst half in the original SuccessiveHalving algorithm; all but the best 1/nu here), and repeats this process until one recommendation remains (see this research paper for more details). Three Hyperband-related parameters (R, nu, epoch_multiplier) pre-define the Hyperband sequence of trials: how many trials to perform in each round and how many resources to give to each trial in that round.

  • The R parameter determines the maximum resources (i.e. the number of epochs and recommendations).

  • The Nu parameter controls the proportion of recommendations discarded in each use of the SuccessiveHalving algorithm.

  • The epoch_multiplier is multiplied by the resources (r) of each SuccessiveHalving iteration (i) to determine the number of epochs for each trial. Each run of SuccessiveHalving is referred to as a stage (s), and each stage contains several SuccessiveHalving iterations (i).
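One stage of SuccessiveHalving can be sketched as follows. This is an illustration under stated assumptions, not the service's implementation: `evaluate` is a hypothetical callable that trains a recommendation with resource r and returns a score where higher is better, and the toy scoring function below simply prefers values close to 13.

```python
def successive_halving(recommendations, evaluate, nu=3, r0=1, max_r=27):
    """Run one stage: score every survivor, keep the best 1/nu, repeat."""
    survivors = list(recommendations)
    if not survivors:
        raise ValueError("at least one recommendation is required")
    r = r0
    history = []  # (number of recommendations, resource) per iteration
    ranked = survivors
    while survivors and r <= max_r:
        history.append((len(survivors), r))
        ranked = sorted(survivors, key=lambda rec: evaluate(rec, r), reverse=True)
        survivors = ranked[: len(ranked) // nu]  # discard all but the best 1/nu
        r *= nu                                  # survivors get nu times more resource
    return ranked[0], history


# Toy usage: 27 integer "recommendations", scored by closeness to 13.
best, history = successive_halving(
    range(27), evaluate=lambda rec, r: -abs(rec - 13), nu=3, r0=1, max_r=27
)
print(best)
print(history)
```

With 27 starting recommendations, nu = 3, and a maximum resource of 27, the iterations use 27, 9, 3, and 1 recommendations at resources 1, 3, 9, and 27.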

The tables below list, for each stage (s) and SuccessiveHalving iteration (i), the number of hyperparameter recommendations (n) and the resources (r) for the given R and nu.

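Table 2. The pre-defined Hyperband experiment schedule when set with R = 81, nu = 3.

| i | s=0: n | s=0: r | s=1: n | s=1: r | s=2: n | s=2: r | s=3: n | s=3: r |
|---|---|---|---|---|---|---|---|---|
| 0 | 81 | 1 | 27 | 3 | 9 | 9 | 6 | 27 |
| 1 | 27 | 3 | 9 | 9 | 3 | 27 | 2 | 81 |
| 2 | 9 | 9 | 3 | 27 | 1 | 81 | | |
| 3 | 3 | 27 | 1 | 81 | | | | |
| 4 | 1 | 81 | | | | | | |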

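Table 3. The pre-defined Hyperband experiment schedule when set with R = 27, nu = 3.

| i | s=0: n | s=0: r | s=1: n | s=1: r | s=2: n | s=2: r |
|---|---|---|---|---|---|---|
| 0 | 27 | 1 | 9 | 3 | 6 | 9 |
| 1 | 9 | 3 | 3 | 9 | 2 | 27 |
| 2 | 3 | 9 | 1 | 27 | | |
| 3 | 1 | 27 | | | | |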

Hyperband Parameter Auto-adjustment Mechanism

By default, the Hyperband parameter R is set to 27 and nu is set to 3. Based on these values and the training max_epoch of each experiment, the Hyperband parameter auto-adjustment mechanism calculates the epoch_multiplier as max_epoch^(log(nu)/log(R)). If the calculated epoch_multiplier is less than 3, R is reduced by 1 and the epoch_multiplier is recalculated; this repeats until the value is equal to or greater than 3.
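A literal reading of this rule can be sketched as follows. The function name `adjust_hyperband` and the lower bound of R = 2 (added so that log(R) never reaches zero) are assumptions; the service may apply rounding or further constraints that are not described here, so its adjusted values can differ from this sketch.

```python
import math


def adjust_hyperband(max_epoch, R=27, nu=3, minimum_multiplier=3):
    """Return (R, epoch_multiplier) after applying the rule described above."""

    def multiplier(r_value):
        # epoch_multiplier = max_epoch ^ (log(nu) / log(R))
        return max_epoch ** (math.log(nu) / math.log(r_value))

    epoch_multiplier = multiplier(R)
    # Stop at R = 2 so that log(R) stays positive (an added safeguard).
    while epoch_multiplier < minimum_multiplier and R > 2:
        R -= 1
        epoch_multiplier = multiplier(R)
    return R, epoch_multiplier


# With max_epoch = 80 the default R = 27 already satisfies the rule.
R, epoch_multiplier = adjust_hyperband(max_epoch=80)
print(R, round(epoch_multiplier, 2))  # 27 4.31
```

For small max_epoch values the loop lowers R until the multiplier reaches 3, which shortens the Hyperband schedule.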

© Copyright 2022, NVIDIA. Last updated on Mar 23, 2023.