AutoML
AutoML is a TAO Toolkit API service that automatically selects deep learning hyperparameters for a chosen model and dataset. The TAO Toolkit API provides a Jupyter notebook interface to try the AutoML feature.
Each AutoML run contains multiple training experiments. At the end of an AutoML run, you can access the config containing the hyperparameters of the best performing model amongst the multiple experiments, along with the binary weight file for deploying the model to your application.
AutoML currently supports the following models:

Object Detection
DetectNet V2
EfficientDet
FasterRCNN
YoloV3
YoloV4
YoloV4_Tiny
SSD
Retinanet
Image Segmentation
Mask RCNN (Instance Segmentation)
UNet (Semantic Segmentation)
Classification
Multi-class classification
Multi-task classification
Special Use Case Models
License Plate Recognition
Support for more TAO Toolkit tasks and models, such as Conversational AI applications and emotion detection, will be added in future releases.
Before using AutoML, you need to deploy the TAO REST API service using the TAO Toolkit API Setup steps.
The Hardware and Software requirements mentioned in the TAO Toolkit API Setup section are also applicable to AutoML.
After deploying the TAO Toolkit API, you can use the following commands on the host machine to obtain the node ip_address and port_number for use in the notebooks:

ip_address:

```
hostname -i
```

port_number:

```
kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'
```
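If you prefer to capture both values programmatically, a minimal Python sketch is shown below (the variable names and the printed endpoint are illustrative; the notebooks prompt for these values separately):

```python
import subprocess

# Node IP of the host machine (same as `hostname -i`).
ip_address = subprocess.check_output(["hostname", "-i"], text=True).split()[0]

# NodePort of the ingress controller (same as the kubectl command above).
port_number = subprocess.check_output(
    ["kubectl", "get", "service", "ingress-nginx-controller",
     "-o", "jsonpath={.spec.ports[0].nodePort}"],
    text=True,
).strip()

# Illustrative: combine both values into the service endpoint.
print(f"http://{ip_address}:{port_number}")
```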
The TAO Toolkit API provides two types of notebooks for interacting with the deployed AutoML service:
TAO-API Normal AutoML notebook
TAO-API Client AutoML notebook, which allows users to run commands from the command line interface
Once a notebook is available on your computer, you can run it as follows:
cd <path_to_your_notebook>
jupyter-lab --ip 0.0.0.0 --allow-root --port 8888
We recommend running only one AutoML notebook at a time. Each experiment within AutoML is a separate Kubernetes job, and Kubernetes performs scheduling based on jobs submitted to the queue. Hence, running multiple models across notebooks will fill the queue with AutoML experiments of different models, extending the execution time to complete a single AutoML experiment.
[Mandatory] User Inputs
For each notebook, you need to provide or modify the following parameters:

model_name: The list of supported network models is provided in the notebook cell.

dataset path: For each task, the dataset must follow a specific folder structure (described in the dataset section of the notebooks).

automl_algorithm: The AutoML algorithm to use. A brief explanation of the algorithms can be found in the AutoML Algorithm Explanation section.

  - Bayesian: Adaptively proposes new hyperparameters based on past experiments. This algorithm generally provides better accuracy results than Hyperband, but with a longer total execution time across all experiments.
  - Hyperband: Accelerates the expensive hyperparameter optimization by speeding up random search with adaptive resource allocation.
[Optional] User Inputs
For each model, TAO provides a default set of parameters for the AutoML search. You can add new valid parameters via additional_automl_parameters or remove parameters from the default list via remove_default_automl_parameters. Each notebook contains hyperlinks in the Set AutoML related configurations cell to view the valid list of parameters for a particular model:
additional_automl_parameters: Adds parameters to the AutoML search algorithm using a list of strings (e.g. additional_automl_parameters = ['parameter1','parameter2']). Any parameter in the hyperlink table that does not have the automl_enabled column set to True/False can be added to the AutoML search space. For example, consider the list of DetectNet_v2 parameters:

  - dataset_config.target_class_mapping can't be added to the additional_automl_parameters list, as it is not eligible to be included in the search space.
  - training_config.regularizer.weight doesn't make sense to add to this list, as it is already enabled.
  - augmentation_config.preprocessing.output_image_width can be added, as its automl_enabled column is set to neither True nor False.
remove_default_automl_parameters: Removes parameters that are enabled by default for the AutoML search (e.g. remove_default_automl_parameters = ['parameter1','parameter2']). Any parameter in the hyperlink table that has the automl_enabled column set to True can be removed from the AutoML search space. For example, in the list of DetectNet_v2 parameters, training_config.regularizer.weight can be removed from the AutoML search space.
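For instance, using the DetectNet_v2 parameters discussed above, the two lists could be set in the notebook cell as follows (a minimal sketch; which parameters make sense depends on your model and search budget):

```python
# Add a parameter whose automl_enabled column is neither True nor False.
additional_automl_parameters = [
    "augmentation_config.preprocessing.output_image_width",
]

# Remove a parameter that is enabled for the AutoML search by default.
remove_default_automl_parameters = [
    "training_config.regularizer.weight",
]
```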
[Optional] AutoML Algorithm-Specific Parameters
There are some algorithm-specific parameters set to default values that determine the AutoML experiment schedules. You can optionally modify them as follows:
Bayesian
max_recommendations: The maximum number of full-scale training experiments to run. The default value is 20. For example, setting this value to 10 runs 10 training experiments in sequential order, as the training config file of the nth experiment is computed from the (n-1)th experiment. At the end of the 10 experiments, the algorithm returns the training config file and binary weights for the experiment that achieved the best accuracy.

In the API notebook, set automl_max_recommendations: <number of recommendations (integer)> inside the automl_information dictionary variable in the Set AutoML related configurations cell.

In the CLI notebook, set metadata["automl_max_recommendations"]: <number of recommendations (integer)> in the Set AutoML related configurations cell.
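For example, to cap the run at 10 experiments (the dictionary context shown here is illustrative; the key name is the one given above):

```python
# API notebook: inside the automl_information dictionary in the
# "Set AutoML related configurations" cell.
automl_information = {
    # ... other AutoML settings ...
    "automl_max_recommendations": 10,
}

# CLI notebook: on the metadata dictionary in the same cell.
metadata["automl_max_recommendations"] = 10
```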
Hyperband
R: The maximum resources (i.e. the number of recommendations and maximum epochs). The default value is 27 and is adjusted as explained in the Hyperband Parameter Auto-adjustment Mechanism section.

In the API notebook, set automl_R: <integer value> inside the automl_information dictionary variable in the Set AutoML related configurations cell.

In the CLI notebook, set metadata["automl_R"]: <integer value> in the Set AutoML related configurations cell.

Nu: The proportion of recommendations discarded in each use of the SuccessiveHalving algorithm. The default value is 3.

In the API notebook, set automl_nu: <value of nu (integer)> inside the automl_information dictionary variable in the Set AutoML related configurations cell.

In the CLI notebook, set metadata["automl_nu"]: <value of nu (integer)> in the Set AutoML related configurations cell.

epoch_multiplier: The number of epochs, determined by multiplying this value with the resources (r). The default value is 10 and can be adjusted as explained in the Hyperband Parameter Auto-adjustment Mechanism section.

In the API notebook, set epoch_multiplier: <value of epoch_multiplier (integer)> inside the automl_information dictionary variable in the Set AutoML related configurations cell.

In the CLI notebook, set metadata["epoch_multiplier"]: <value of epoch_multiplier (integer)> in the Set AutoML related configurations cell.
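For example, the three parameters might be set together like this (the surrounding dictionary contents are illustrative; the key names are the ones given above):

```python
# API notebook: inside the automl_information dictionary in the
# "Set AutoML related configurations" cell.
automl_information = {
    # ... other AutoML settings ...
    "automl_R": 27,          # maximum resources
    "automl_nu": 3,          # fraction kept per SuccessiveHalving round is 1/nu
    "epoch_multiplier": 10,  # epochs per trial = epoch_multiplier * r
}

# CLI notebook: on the metadata dictionary in the same cell.
metadata["automl_R"] = 27
metadata["automl_nu"] = 3
metadata["epoch_multiplier"] = 10
```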
The values of R and nu determine the number of experiments to run within each stage of the Hyperband run and the corresponding number of epochs for each experiment. For example, setting R=27, nu=3, epoch_multiplier=10 runs three stages of experiments, as described in Table 3 of the AutoML Algorithm Explanation section:

The first stage proposes 27 new recommendations to run for 10 epochs. Hyperband then keeps the 9 (27/3) best performing recommendations and runs them for another 30 (10*3) epochs, repeating until one recommendation remains.

The second and third stages respectively propose 9 and 3 new recommendations (based on Table 3) and follow the same procedure.

After all the experiments conclude, Hyperband returns the training config file and binary weights for the experiment that achieved the best accuracy.
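The first-stage arithmetic can be reproduced with a plain SuccessiveHalving loop (a sketch of the schedule, not TAO's internal code):

```python
R, nu, epoch_multiplier = 27, 3, 10

# Stage s=0 of Table 3: start with R recommendations at r = 1 resource unit.
n, r = R, 1
while n >= 1:
    print(f"{n} recommendation(s) trained for {epoch_multiplier * r} epochs")
    n //= nu  # keep the best 1/nu of the recommendations
    r *= nu   # give each survivor nu times more epochs

# Output: 27 -> 10 epochs, 9 -> 30 epochs, 3 -> 90 epochs, 1 -> 270 epochs
```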
Finally, you can change any training-related spec parameters, such as batch size, learning rate, weight regularizer, and checkpoint frequency, to be reflected in each experiment. This step is necessary because these parameters often depend on the computing hardware specifications (e.g. GPU memory size).
After changing the above parameters, you can run the cells of the notebook. Some cells take time to complete execution. In the API notebook, the cells poll the status of the job every 15 seconds, and cell execution completes only when the status switches to Done or Error.
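A polling loop of this kind looks roughly like the following (the endpoint path is a placeholder; the actual notebook cell wraps this for you):

```python
import time
import requests

# Placeholders: the notebook supplies the real endpoint, IDs, and headers.
base_url = "http://<ip_address>:<port_number>"   # from the setup steps above
job_status_endpoint = f"{base_url}/api/v1/..."   # hypothetical path

while True:
    status = requests.get(job_status_endpoint).json().get("status")
    if status in ("Done", "Error"):
        break             # cell execution completes on Done or Error
    time.sleep(15)        # the API notebook polls every 15 seconds
```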
For the AutoML train cell, you can see the following status indicators during cell execution. For the TAO-Client AutoML notebook, you can either view these indicators or view the training logs of the experiments. By default, viewing status indicators is enabled; you can switch to viewing logs by setting poll_automl_stats = False.

- The experiment AutoML is currently executing
- The total number of epochs still pending across the different experiments
- An approximate estimate of the time remaining for AutoML completion. Note that this covers only the training time; the following times, which depend on the config values you set, come in addition:
  - The time for evaluation every n epochs
  - The time for miscellaneous actions, such as saving a checkpoint every n epochs
  - The time for loading the model from disk when resuming training in Hyperband
- The best accuracy value achieved up to the current experiment
Resultant Files
After downloading the job contents of an AutoML experiment, the folder has the following structure:
├── automl_metadata.json
├── brain.json
├── controller.json
│ ...............
├── best_model
│ ├── log.txt
│ ├── recommendation_5.kitti
│ ├── status.json
│ ├── .....
│ └── weights
│ └── weights.tlt
│ ...............
├── experiment_0
│ ├── log.txt
│ ├── status.json
│ ├── .....
│ ...............
├── recommendation_0.kitti
├── recommendation_1.kitti
├── recommendation_2.kitti
..............
controller.json: Summarizes all experiment details, including the results, target hyperparameter values, and status (success or failure).

best_model: Contains the best performing model outcomes (specs, logs, weights) out of all experiment sets in the AutoML run. Note that the weight file (.tlt format) is located in either the best_model or best_model/weights folder, depending on the network model.

experiment_n: Contains the contents (specs, logs) of the nth experiment's model outcomes in AutoML.

recommendation_n.kitti: Contains the entire spec information for the nth experiment in AutoML.
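Because the .tlt file may sit in either best_model or best_model/weights, a small helper like this hypothetical one can locate the weights after download:

```python
from pathlib import Path

def find_best_weights(automl_dir: str) -> Path:
    """Return the first .tlt weight file found under best_model/."""
    best_model = Path(automl_dir) / "best_model"
    matches = sorted(best_model.rglob("*.tlt"))  # covers best_model/weights too
    if not matches:
        raise FileNotFoundError(f"no .tlt weights under {best_model}")
    return matches[0]

print(find_best_weights("downloaded_automl_job"))  # hypothetical folder name
```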
Results of AutoML experiments
The table below provides experimental results for several network models on different AI applications. These numbers are profiled with the default AutoML-specific parameters: max_recommendations=20 for Bayesian and R=27, nu=3 for Hyperband. For some network models (efficientdet, mask_rcnn, multitask_classification), the default Hyperband parameters are adjusted by the Hyperband Parameter Auto-adjustment Mechanism to R=9, nu=3.

All experiments are executed on a single GPU (Tesla V100) with an Intel Xeon CPU E5-2698 v4. When using multi-GPU mode, the time taken is expected to scale accordingly; similarly, using more powerful GPUs will reduce the execution time.
Table 1. The estimated time for an AutoML (Bayesian/Hyperband) experiment for each network model

| model | epochs | dataset | single experiment | Bayesian | Hyperband |
|---|---|---|---|---|---|
| detectnet_v2 | 80 | FLIR20 | 45.7 min | 913.2 min | 534.2 min |
| efficientdet | 6 | FLIR20 | 82 min | 1640 min | 410 min |
| faster_rcnn | 80 | FLIR20 | 252 min | 5040 min | 2948.4 min |
| retinanet | 100 | FLIR20 | 75 min | 1500 min | 702 min |
| ssd | 80 | FLIR20 | 174 min | 3480 min | 2035 min |
| yolo3 | 80 | FLIR20 | 60 min | 1200 min | 702 min |
| yolo4 | 80 | FLIR20 | 110 min | 2200 min | 1287 min |
| yolo4_tiny | 80 | FLIR20 | 87 min | 1740 min | 1017.9 min |
| mask_rcnn | 5 | FLIR20 | 66 min | 1320 min | 237.6 min |
| lprnet | 24 | OpenALPR | 1.1 min | 22 min | 39.2 min |
| multiclass_classification | 80 | Pascal VOC | 80 min | 1600 min | 1279.7 min |
| multitask_classification | 10 | Fashion Product | 12.5 min | 250 min | 799.8 min |
| unet | 50 | ISBI | 4.5 min | 90 min | 98.5 min |
AutoML Algorithm Explanation
This section gives additional details of the two AutoML algorithms.
Bayesian Optimization
Bayesian optimization aims to identify optimal configurations more quickly than standard baselines (such as plain random search) by adaptively selecting hyperparameters based on past experimental information:

- It uses a surrogate model to fit a Gaussian Process (GP) to the existing data (X, y), where X is a vector of recommendations and y is the observed validation metric (e.g. mAP).
- It adaptively proposes the new recommendation that produces the best improvement in expectation under the fitted GP, until reaching max_recommendations.
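A toy version of this loop, using scikit-learn's Gaussian process and an expected-improvement acquisition over a fixed candidate set, is sketched below (an illustrative implementation, not TAO's; train_and_eval and candidates are placeholders you supply):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(gp, candidates, best_y):
    """EI acquisition for maximizing the validation metric."""
    mu, sigma = gp.predict(candidates, return_std=True)
    z = (mu - best_y) / (sigma + 1e-9)
    return (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)

def bayesian_search(train_and_eval, candidates, max_recommendations=20):
    """train_and_eval(x) runs one experiment and returns its validation metric."""
    X, y = [], []
    for _ in range(max_recommendations):
        if X:
            # Surrogate model fitted to the (recommendation, metric) history.
            gp = GaussianProcessRegressor().fit(np.array(X), np.array(y))
            x = candidates[np.argmax(expected_improvement(gp, candidates, max(y)))]
        else:
            x = candidates[0]  # seed with any initial recommendation
        X.append(x)
        y.append(train_and_eval(x))
    return X[int(np.argmax(y))]  # hyperparameters of the best experiment
```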
Hyperband
Hyperband addresses the issue of expensive hyperparameter optimization by speeding up random search through adaptive resource allocation. Hyperband follows the SuccessiveHalving algorithm to uniformly allocate a budget to a set of hyperparameter recommendations. During AutoML runs, Hyperband evaluates the performance of all recommendations, discards the worst-performing ones (keeping the top 1/nu), and repeats this process until one recommendation remains (see this research paper for more details). Three Hyperband-related parameters (R, nu, epoch_multiplier) pre-define a Hyperband sequence of trials based on how many trials to perform in each round and how many resources to give to each trial in that round.
- The R parameter determines the maximum resources (i.e. the number of epochs and recommendations).
- The Nu parameter controls the proportion of recommendations discarded in each use of the SuccessiveHalving algorithm.
- The epoch_multiplier is multiplied with the resources (r) in each SuccessiveHalving iteration (i) to determine the number of epochs for each trial. Each run of SuccessiveHalving is referred to as a stage, and each stage contains SuccessiveHalving iterations (i).

The tables below describe the SuccessiveHalving iterations (i), the number of hyperparameter recommendations (n), and the resources (r) for a given R and nu.
Table 2. The pre-defined Hyperband experiment schedule when set with R=81, nu=3.

| i | s=0: n | s=0: r | s=1: n | s=1: r | s=2: n | s=2: r | s=3: n | s=3: r |
|---|---|---|---|---|---|---|---|---|
| 0 | 81 | 1 | 27 | 3 | 9 | 9 | 6 | 27 |
| 1 | 27 | 3 | 9 | 9 | 3 | 27 | 2 | 81 |
| 2 | 9 | 9 | 3 | 27 | 1 | 81 | | |
| 3 | 3 | 27 | 1 | 81 | | | | |
| 4 | 1 | 81 | | | | | | |
Table 3. The pre-defined Hyperband experiment schedule when set with R=27, nu=3.

| i | s=0: n | s=0: r | s=1: n | s=1: r | s=2: n | s=2: r |
|---|---|---|---|---|---|---|
| 0 | 27 | 1 | 9 | 3 | 6 | 9 |
| 1 | 9 | 3 | 3 | 9 | 2 | 27 |
| 2 | 3 | 9 | 1 | 27 | | |
| 3 | 1 | 27 | | | | |
Hyperband Parameter Auto-adjustment Mechanism
By default, the Hyperband parameter R is set to 27 and nu is set to 3. Based on these values and the training max_epoch of each experiment, the Hyperband parameter auto-adjustment mechanism calculates the epoch_multiplier as max_epoch^(log(nu)/log(R)). If the calculated epoch_multiplier is less than 3, R is reduced by 1 and the epoch_multiplier is recalculated until the value is equal to or greater than 3.