
AutoML

AutoML is a TAO Toolkit API service that automatically selects deep learning hyperparameters for a chosen model and dataset. The TAO Toolkit API provides a Jupyter notebook interface to try the AutoML feature.

Each AutoML run contains multiple training experiments. At the end of an AutoML run, you can access the config containing the hyperparameters of the best performing model amongst the multiple experiments, along with the binary weight file for deploying the model to your application.

AutoML supports all TAO models except the MAL - Auto Labeling model.

Prerequisites

Before using AutoML, you need to deploy the TAO REST API service using the TAO Toolkit API Setup steps.
- The Hardware and Software requirements mentioned in the TAO Toolkit API Setup section are also applicable to AutoML.
After deploying the TAO Toolkit API, you can use the following commands on the host machine to obtain the node ip_address and port_number for use in the notebooks:
- ip_address: hostname -i
- port_number: kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'

AutoML Notebooks

AutoML supports all notebooks that support the training functionality, except for MAL - Auto Labeling.

To enable AutoML, set the automl_enabled variable to True at the beginning of the notebook, then run through the notebook.

Once a notebook is available on your computer, you can run the notebook with the following:

cd <path_to_your_notebook>
jupyter-lab --ip 0.0.0.0 --allow-root --port 8888

Note

We recommend running only one AutoML notebook at a time. Each experiment within AutoML is a separate Kubernetes job, and Kubernetes performs scheduling based on jobs submitted to the queue. Hence, running multiple models across notebooks will fill the queue with AutoML experiments of different models, extending the execution time to complete a single AutoML experiment.

Getting Started

[Mandatory] User Inputs

For each notebook, you need to provide/modify the following parameters:

model_name: The list of supported network models will be present in the notebook cell.
dataset path: For each task, the dataset should be in a certain folder structure (provided in the dataset description in the notebooks).
automl_algorithm: The AutoML algorithm options to use. A brief explanation of the algorithms can be found in the AutoML Algorithm Explanation section.
- Bayesian: Adaptively proposes new hyperparameters based on past experiments. This algorithm generally provides better accuracy results, but with a longer execution time for all experiments, when compared to Hyperband.
- Hyperband: Hyperband accelerates the expensive hyperparameter optimization using random search with adaptive resource allocation.

[Optional] User Inputs

For each model, TAO has set a default set of parameters to run the AutoML search. You can add new valid parameters via additional_automl_parameters or remove some parameters from the default list via remove_default_automl_parameters.

Each notebook contains hyperlinks in the Set AutoML related configurations cell to view the valid list of parameters of a particular model:

additional_automl_parameters: Add additional parameters to the AutoML search algorithm using this list of strings (e.g. additional_automl_parameters = ['parameter1','parameter2']).

Any parameter in the hyperlink table whose automl_enabled column is set to neither True nor False can be added to the AutoML search space.

For example, for DetectNet_v2:
- dataset_config.target_class_mapping can’t be added to the additional_automl_parameters list as it is not eligible to be included in the search space.
- training_config.regularizer.weight doesn’t make sense to add to this list, as it is already enabled.
- augmentation_config.preprocessing.output_image_width can be added, as the automl_enabled column is set to neither True nor False.
remove_default_automl_parameters: Remove parameters that are enabled by default for AutoML search (e.g. remove_default_automl_parameters = ['parameter1','parameter2']).

Any parameter in the hyperlink table that has the automl_enabled column set to True can be removed from the AutoML search space.

For example, for DetectNet_v2, training_config.regularizer.weight can be removed from the AutoML search space.
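The effect of the two optional lists can be pictured as simple set arithmetic. The following is a minimal sketch, not the service's implementation: build_search_space is a hypothetical helper, and the second entry in default_params is a placeholder name, since the real default list depends on the model and is shown in the notebook's hyperlink table.

```python
def build_search_space(default_params, additional=(), removed=()):
    """Return the sorted parameter names AutoML would search over.

    Hypothetical helper: (defaults - removed) + additional.
    """
    unknown = set(removed) - set(default_params)
    if unknown:
        # Only parameters enabled by default can be removed.
        raise ValueError(f"Not in the default list: {sorted(unknown)}")
    return sorted((set(default_params) - set(removed)) | set(additional))


# Placeholder default list for illustration only.
default_params = [
    "training_config.regularizer.weight",
    "some_config.hypothetical_default_parameter",
]

additional_automl_parameters = ["augmentation_config.preprocessing.output_image_width"]
remove_default_automl_parameters = ["training_config.regularizer.weight"]

search_space = build_search_space(
    default_params,
    additional=additional_automl_parameters,
    removed=remove_default_automl_parameters,
)
print(search_space)
```

In the notebooks you only set the two lists; the service resolves the final search space itself.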

[Optional] AutoML Algorithm-Specific Parameters

There are some algorithm-specific parameters set to default values that determine the AutoML experiment schedules. You can optionally modify them as follows:

Bayesian

max_recommendations: The maximum number of full-scale training experiments to run. The default value is 20.

For example, setting this value to 10 runs 10 training experiments sequentially, because the training config file of the nth experiment is computed from the (n-1)th experiment. After the 10 experiments finish, the algorithm returns the training config file and binary weights of the experiment that achieved the best accuracy.
- In the API notebook, set this as follows:
  - automl_max_recommendations: the number of recommendations you want to set (integer), inside the automl_information dictionary variable in the Set AutoML related configurations cell.
- In the CLI notebook, set this as follows:
  - metadata["automl_max_recommendations"]: the number of recommendations you want to set (integer), in the Set AutoML related configurations cell.
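As a rough illustration, the two settings might look like the following in the Set AutoML related configurations cell. The key names come from the text above; the surrounding dictionaries are shown empty here as an assumption, because the real notebooks populate them with other entries that are omitted.

```python
# API notebook: the key goes inside the automl_information dictionary.
# (Other entries the notebook defines in this dictionary are omitted here.)
automl_information = {
    "automl_max_recommendations": 10,  # run at most 10 Bayesian experiments
}

# CLI notebook: the same value is set on the metadata dictionary.
# (metadata is created empty here only to keep the sketch self-contained.)
metadata = {}
metadata["automl_max_recommendations"] = 10
```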

Hyperband

R: The maximum resources (i.e. the number of recommendations and maximum epochs).
- The default value is set to 27 and is adjusted as explained in the Hyperband Parameter Auto-adjustment Mechanism section.
- In the API notebook, set this as follows:
  - automl_R: integer value inside the automl_information dictionary variable in the Set AutoML related configurations cell.
- In the CLI notebook, set this as follows:
  - metadata["automl_R"]: integer value in the Set AutoML related configurations cell.
Nu: The proportion of recommendations discarded in each use of the SuccessiveHalving algorithm. The default value is 3.
- In the API notebook, set this as follows:
  - automl_nu: the value of nu you want to set (integer), inside the automl_information dictionary variable in the Set AutoML related configurations cell.
- In the CLI notebook, set this as follows:
  - metadata["automl_nu"]: the value of nu you want to set (integer), in the Set AutoML related configurations cell.
epoch_multiplier: The multiplier applied to the resources (r) to determine the number of epochs. The default value is 10 and can be adjusted as explained in the Hyperband Parameter Auto-adjustment Mechanism section.
- In the API notebook, set this as follows:
  - epoch_multiplier: the value of epoch_multiplier you want to set (integer), inside the automl_information dictionary variable in the Set AutoML related configurations cell.
- In the CLI notebook, set this as follows:
  - metadata["epoch_multiplier"]: the value of epoch_multiplier you want to set (integer), in the Set AutoML related configurations cell.
The values of R and nu are used to determine the number of experiments to run within each stage of the Hyperband run and the corresponding number of epochs for each experiment. For example, setting R=27, nu=3, and epoch_multiplier=10 runs three stages of experiments, as described in Table 3 of the AutoML Algorithm Explanation section:
- The first stage proposes 27 new recommendations to run for 10 epochs. Then, Hyperband keeps the 9 (27/3) best performing recommendations to run another 30 (10*3) epochs, and repeats until one recommendation remains.
- The second and third stages respectively propose 9 and 3 new recommendations (based on Table 3) and follow the same procedure.
- After all the experiments conclude, Hyperband returns the training config file and binary weights of the experiment that achieved the best accuracy.
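The stage-by-stage schedule in this example can be reproduced with a short sketch. It is an illustration under stated assumptions, not the service's code: successive_halving_stage is a hypothetical helper, the starting (n, r) pair of each stage is copied from the first row of Table 3 rather than derived, and the epoch count is taken to be r multiplied by epoch_multiplier.

```python
def successive_halving_stage(n, r, R, nu, epoch_multiplier):
    """Expand one Hyperband stage into its SuccessiveHalving iterations.

    n: number of recommendations the stage starts with
    r: resources given to each recommendation at the start
    Returns a list of (iteration, recommendations, resources, epochs).
    """
    rows = []
    i = 0
    while n >= 1 and r <= R:
        rows.append((i, n, r, r * epoch_multiplier))
        n //= nu   # keep the best 1/nu of the recommendations
        r *= nu    # give each survivor nu times more resources
        i += 1
    return rows


R, nu, epoch_multiplier = 27, 3, 10
stage_starts = [(27, 1), (9, 3), (6, 9)]  # first row of Table 3 (s=0, s=1, s=2)

schedule = [
    successive_halving_stage(n, r, R, nu, epoch_multiplier)
    for n, r in stage_starts
]
for s, rows in enumerate(schedule):
    for i, n, r, epochs in rows:
        print(f"s={s} i={i}: {n} recommendations, r={r}, epochs={epochs}")
```

Changing R, nu, or epoch_multiplier in the sketch shows how quickly the total number of experiments and epochs grows, which is useful when estimating run time before launching AutoML.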

Finally, you can change any training-related spec parameters (such as batch size, learning rate, weight regularizer, and frequency of checkpoints) to be reflected during each experiment. This step is necessary because these kinds of parameters often depend on the computing hardware specifications (e.g. GPU memory size).

After changing the above parameters, you can run the cells of the notebook. While running the notebook, some cells take time to complete execution. In the API notebook, the cells poll the status of the job every 15 seconds. The cell execution is completed only when the status switches to Done or Error.
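The polling behavior described above follows a common pattern, sketched below. This is a minimal illustration, not the notebook's code: poll_job_status and the get_status callable are hypothetical stand-ins for whatever request the notebook makes, and the intermediate status names in the usage example are assumptions.

```python
import time


def poll_job_status(get_status, interval_seconds=15, sleep=time.sleep):
    """Call get_status() until it returns "Done" or "Error", then return it."""
    while True:
        status = get_status()
        if status in ("Done", "Error"):
            return status
        sleep(interval_seconds)  # wait before asking again


# Usage with a stand-in status source and a no-op sleep so the demo returns at once.
fake_statuses = iter(["Pending", "Running", "Done"])
final_status = poll_job_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)  # Done
```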

For the AutoML train cell, you can see the following status indicators during cell execution. For the TAO-Client AutoML notebook, you can either view these indicators or view the training logs of the experiments. By default, viewing status indicators is enabled; you can switch to viewing logs by setting poll_automl_stats = True.

The current experiment AutoML is executing
The total number of epochs across different experiments still pending
An approximate estimated time remaining for AutoML experiment completion. Note that this is just the training time, in addition to the following times, which are dependent on the config values you set:
- The time for evaluation at every n epochs
- The time for miscellaneous actions like saving a checkpoint at every n epochs
- The time for loading the model from disk for resuming training in Hyperband
The best accuracy value until the current experiment

AutoML Outcomes

Resultant Files

After downloading the job contents of AutoML experiments, the folder uses the following structure:

├── automl_metadata.json
├── brain.json
├── controller.json
│ ...............
├── best_model
│   ├── log.txt
│   ├── recommendation_5.yaml/protobuf
│   ├── status.json
│   ├── .....
│   └── weights
│       └── weights.tlt/.pth/.hdf5
│   ...............
├── experiment_0
│   ├── log.txt
│   ├── status.json
│   ├── .....
│   ...............
├── recommendation_0.yaml/protobuf
├── recommendation_1.yaml/protobuf
├── recommendation_2.yaml/protobuf
..............

controller.json: Summarizes all experiment details regarding the results, target hyperparam values, and status (success or failure).
best_model: Contains the best performing model outcomes (specs, logs, weights) out of all experiment sets in the AutoML runs. The weight file (.tlt/.pth/.hdf5 format) is located in the best_model, best_model/train, or best_model/weights folder, depending on the network model.
experiment_n: Contains the outcomes (specs, logs) of the nth experiment in AutoML.
recommendation_n.yaml/protobuf: Contains the entire spec information for the nth experiment in AutoML.
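Because the weight file location varies by network model, a small helper can search the three documented folders after you download the job contents. This is a sketch under the folder layout described above; find_best_weights is a hypothetical name and is not part of the TAO API.

```python
from pathlib import Path

# Weight file extensions and folders named in the description above.
WEIGHT_SUFFIXES = {".tlt", ".pth", ".hdf5"}


def find_best_weights(job_dir):
    """Return weight files found in best_model, best_model/train, or best_model/weights."""
    best_model = Path(job_dir) / "best_model"
    candidate_dirs = [best_model, best_model / "train", best_model / "weights"]
    found = []
    for folder in candidate_dirs:
        if folder.is_dir():
            found.extend(
                sorted(
                    p for p in folder.iterdir()
                    if p.is_file() and p.suffix in WEIGHT_SUFFIXES
                )
            )
    return found
```

Calling find_best_weights on the downloaded job folder returns an empty list if no weight file is present, which can indicate that the download is incomplete.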

Results of AutoML experiments

The following table provides experimental results for network models on different AI applications.

These numbers are profiled with the default AutoML-specific parameters: max_recommendations=20 for Bayesian and R=27, nu=3 for Hyperband. For some network models (efficientdet, mask_rcnn, multitask_classification), the default AutoML parameters are adjusted by the Hyperband Parameter Auto-adjustment Mechanism to R=9, nu=3 for Hyperband.
All experiments are executed on a single GPU (Tesla V100) with Intel Xeon CPU E5-2698 v4.
When using multi-GPU mode, the time taken is expected to scale accordingly; similarly, using more powerful GPUs will reduce the execution time.

Table 1. The estimated time for an AutoML (Bayesian/Hyperband) experiment for each network model

model	epoch	dataset	single experiment	Bayesian	Hyperband
detectnet_v2	80	FLIR20	45.7min	913.2min	534.2min
efficientdet	6	FLIR20	82min	1640min	410min
faster_rcnn	80	FLIR20	252min	5040min	2948.4min
retinanet	100	FLIR20	75min	1500min	702min
ssd	80	FLIR20	174min	3480min	2035min
yolo3	80	FLIR20	60min	1200min	702min
yolo4	80	FLIR20	110min	2200min	1287min
yolo4_tiny	80	FLIR20	87min	1740min	1017.9min
mask_rcnn	5	FLIR20	66min	1320min	237.6min
lprnet	24	OpenALPR	1.1min	22min	39.2min
multiclass_classification	80	Pascal VOC	80min	1600min	1279.7min
multitask_classification	10	Fashion Product	12.5min	250min	799.8min
unet	50	ISBI	4.5min	90min	98.5min

AutoML Algorithm Explanation

This section gives additional details of two AutoML algorithms.

Bayesian Optimization

Bayesian optimization aims to identify optimal configurations more quickly than standard baselines (such as standard random search) by adaptively selecting hyperparameters based on past experimental information.

A Gaussian Process (GP) surrogate model is fit to the existing data (X, y), where X is a vector of recommendations and y is the observed validation mAP.
Bayesian Optimization adaptively proposes new recommendations based on the fitted GP that produces the best improvement in expectation until reaching the number of max_recommendations.
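One common way to pick the recommendation with "the best improvement in expectation" is the Expected Improvement criterion. The sketch below assumes that criterion; the text above does not name the exact acquisition function TAO uses. The GP predictions (mean and standard deviation per candidate) are made-up values for illustration, and expected_improvement is a hypothetical helper.

```python
import math


def expected_improvement(mean, std, best_so_far):
    """Expected improvement of a candidate over best_so_far (higher metric is better)."""
    if std <= 0.0:
        # No predictive uncertainty: improvement is deterministic.
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return (mean - best_so_far) * cdf + std * pdf


# Made-up GP predictions (mean, std) for three candidate recommendations.
candidates = {"rec_a": (0.62, 0.01), "rec_b": (0.60, 0.08), "rec_c": (0.55, 0.02)}
best_so_far = 0.61  # best validation metric observed in past experiments

scores = {
    name: expected_improvement(mean, std, best_so_far)
    for name, (mean, std) in candidates.items()
}
next_recommendation = max(scores, key=scores.get)
print(next_recommendation)  # rec_b
```

Here rec_b is chosen even though its predicted mean is below the current best, because its larger uncertainty leaves more room for improvement. This trade-off between exploring uncertain regions and exploiting promising ones is what distinguishes Bayesian optimization from random search.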

Hyperband

Hyperband addresses the issue of expensive hyperparameter optimization by speeding up random search through adaptive resource allocation. Hyperband follows the SuccessiveHalving algorithm to uniformly allocate a budget to a set of hyperparameter recommendations. During AutoML runs, Hyperband evaluates the performance of all recommendations, throws out the worst half, and repeats this process until one recommendation remains (see this research paper for more details). Three Hyperband-related parameters (R, nu, epoch_multiplier) pre-define a Hyperband sequence of trials: how many trials to perform in each round and how many resources to give to each trial in that round.

The R parameter determines the maximum resources (i.e. the number of epochs and recommendations).
The Nu parameter controls the proportion of recommendations discarded in each use of the SuccessiveHalving algorithm.
The epoch_multiplier and the resources (r) are multiplied in each SuccessiveHalving iteration (i) to determine the number of epochs of each trial. Each run of SuccessiveHalving is referred to as a stage (s), and each stage contains several SuccessiveHalving iterations (i).

The following tables describe the SuccessiveHalving iterations (i), the number of hyperparameter recommendations (n), and the resources (r) for the given R and nu.

Table 2. The pre-defined Hyperband experiment schedule when set with R=81, nu=3.

	s=0		s=1		s=2		s=3
i	n	r	n	r	n	r	n	r
0	81	1	27	3	9	9	6	27
1	27	3	9	9	3	27	2	81
2	9	9	3	27	1	81
3	3	27	1	81
4	1	81

Table 3. The pre-defined Hyperband experiment schedule when set with R=27, nu=3.

	s=0		s=1		s=2
i	n	r	n	r	n	r
0	27	1	9	3	6	9
1	9	3	3	9	2	27
2	3	9	1	27
3	1	27

Hyperband Parameter Auto-adjustment Mechanism

By default, the Hyperband parameter R is set to 27 and nu is set to 3. Based on these values and the training max_epoch of each experiment, the Hyperband parameter auto-adjustment mechanism calculates the epoch_multiplier as max_epoch^(log(nu)/log(R)). If the calculated epoch_multiplier is less than 3, then R is reduced by 1 and the epoch_multiplier is recalculated, repeating until this value is equal to or greater than 3.
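A literal reading of this mechanism can be sketched as follows. This is an illustration only: adjust_hyperband_parameters is a hypothetical helper, the lower bound of R=2 is an added safeguard to avoid dividing by log(1), and the service may apply additional rounding, so the adjusted values it reports can differ from what this sketch computes.

```python
import math


def adjust_hyperband_parameters(max_epoch, R=27, nu=3, min_multiplier=3):
    """Shrink R until max_epoch^(log(nu)/log(R)) reaches min_multiplier.

    Returns the adjusted R and the resulting (unrounded) epoch_multiplier.
    """
    def multiplier(r):
        return max_epoch ** (math.log(nu) / math.log(r))

    epoch_multiplier = multiplier(R)
    # R > 2 guard is an assumption of this sketch: log(1) = 0 would divide by zero.
    while epoch_multiplier < min_multiplier and R > 2:
        R -= 1
        epoch_multiplier = multiplier(R)
    return R, epoch_multiplier


# With max_epoch=80 the first calculation is already above 3, so R stays at 27.
R, epoch_multiplier = adjust_hyperband_parameters(max_epoch=80)
print(R, round(epoch_multiplier, 2))
```

For models trained with only a few epochs, the loop lowers R, which shortens the Hyperband schedule.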
