TAO Toolkit v5.3.0

AutoML

AutoML is a TAO Toolkit API service that automatically selects deep learning hyperparameters for a chosen model and dataset. The TAO Toolkit API provides a Jupyter notebook interface to try the AutoML feature.

Each AutoML run contains multiple training experiments. At the end of an AutoML run, you can access the config containing the hyperparameters of the best performing model amongst the multiple experiments, along with the binary weight file for deploying the model to your application.

AutoML supports all TAO models except the MAL - Auto Labeling model.

  • Before using AutoML, you need to deploy the TAO REST API service using the TAO Toolkit API Setup steps.

    • The Hardware and Software requirements mentioned in the TAO Toolkit API Setup section are also applicable to AutoML.

  • After deploying the TAO Toolkit API, you can use the following commands on the host machine to obtain the node ip_address and port_number for use in the notebooks:

    • ip_address: hostname -i

    • port_number: kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'
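As a minimal sketch, the two lookups above can also be scripted from Python on the host machine. The helper names (`run`, `parse_ip`, `get_host_url`) and the `ip:port` string format are illustrative assumptions, not part of the TAO notebooks; only the two shell commands come from this page.

```python
import subprocess


def run(cmd):
    """Run a command on the host and return its stripped standard output."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def parse_ip(hostname_output):
    # `hostname -i` may print several addresses; this sketch keeps the first one.
    return hostname_output.split()[0]


def get_host_url():
    """Return "<ip_address>:<port_number>" (the format is an assumption)."""
    ip_address = parse_ip(run(["hostname", "-i"]))
    # Same query as the kubectl command above; no shell quoting is needed here
    # because the arguments are passed as a list.
    port_number = run([
        "kubectl", "get", "service", "ingress-nginx-controller",
        "-o", "jsonpath={.spec.ports[0].nodePort}",
    ])
    return f"{ip_address}:{port_number}"
```

Calling `get_host_url()` requires `kubectl` to be configured on the host; the resulting values are the ones to paste into the notebook.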

AutoML supports all the notebooks that support the training functionality, except for MAL - Auto Labeling.

To use AutoML, set the automl_enabled variable to True at the beginning of the notebook, then run through the notebook.

Once a notebook is available on your computer, you can run the notebook with the following:

cd <path_to_your_notebook>
jupyter-lab --ip 0.0.0.0 --allow-root --port 8888

Note

We recommend running only one AutoML notebook at a time. Each experiment within AutoML is a separate Kubernetes job, and Kubernetes performs scheduling based on jobs submitted to the queue. Hence, running multiple models across notebooks will fill the queue with AutoML experiments of different models, extending the execution time to complete a single AutoML experiment.

[Mandatory] User Inputs

For each notebook, you need to provide/modify the following parameters:

  • model_name: The list of supported network models will be present in the notebook cell.

  • dataset path: For each task, the dataset should be in a certain folder structure (provided in the dataset description in the notebooks).

  • automl_algorithm: The AutoML algorithm options to use. A brief explanation of the algorithms can be found in the AutoML Algorithm Explanation section.

    • Bayesian: Adaptively proposes new hyperparameters based on past experiments. This algorithm generally provides better accuracy results, but with a longer execution time for all experiments, when compared to Hyperband.

    • Hyperband: Hyperband accelerates the expensive hyperparameter optimization using random search with adaptive resource allocation.

[Optional] User Inputs

For each model, TAO has set a default set of parameters to run the AutoML search. You can add new valid parameters via additional_automl_parameters or remove some parameters from the default list via remove_default_automl_parameters.

Each notebook contains hyperlinks in the Set AutoML related configurations cell to view the valid list of parameters of a particular model:

  • additional_automl_parameters: Add additional parameters to the AutoML search algorithm using this list of strings (e.g. additional_automl_parameters = ['parameter1','parameter2']).

    Any parameter in the hyperlink table whose automl_enabled column is set to neither True nor False can be added to the AutoML search space.

    For example, for DetectNet_v2:

    • dataset_config.target_class_mapping can’t be added to the additional_automl_parameters list as it is not eligible to be included in the search space.

    • training_config.regularizer.weight doesn’t make sense to add to this list, as it is already enabled.

    • augmentation_config.preprocessing.output_image_width can be added, as the automl_enabled column is set to neither True nor False.

  • remove_default_automl_parameters: Remove parameters that are enabled by default for AutoML search (e.g. remove_default_automl_parameters = ['parameter1','parameter2']).

    Any parameter in the hyperlink table that has the automl_enabled column set to True can be removed from the AutoML search space.

    For example, for DetectNet_v2, training_config.regularizer.weight can be removed from the AutoML search space.
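As an illustration, the inputs described above might be set as follows for DetectNet_v2. The parameter names are the ones used in the examples on this page; the exact spelling of the algorithm string and the `check_disjoint` helper are assumptions made for this sketch and are not part of the notebooks.

```python
# Mandatory input: which search algorithm to use.
# The exact string expected by the notebook is an assumption here.
automl_algorithm = "Bayesian"  # or "Hyperband"

# Optional inputs: extend or shrink the default search space.
additional_automl_parameters = [
    "augmentation_config.preprocessing.output_image_width",
]
remove_default_automl_parameters = [
    "training_config.regularizer.weight",
]


def check_disjoint(additional, removed):
    """Hypothetical sanity check: a parameter should not be both added and removed."""
    overlap = set(additional) & set(removed)
    if overlap:
        raise ValueError(f"Parameters both added and removed: {sorted(overlap)}")


check_disjoint(additional_automl_parameters, remove_default_automl_parameters)
```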

[Optional] AutoML Algorithm-Specific Parameters

There are some algorithm-specific parameters set to default values that determine the AutoML experiment schedules. You can optionally modify them as follows:

Bayesian

  • max_recommendations: The maximum number of full-scale training experiments to run. The default value is 20.

    For example, setting this value to 10 runs 10 training experiments in sequential order, because the training config file of the nth experiment is computed from the (n-1)th experiment. At the end of the 10 experiments, the algorithm returns the training config file and binary weights of the experiment that achieved the best accuracy.

    • In the API notebook, enable this as follows:

      • Set automl_max_recommendations to the number of recommendations you want (an integer) inside the automl_information dictionary variable in the Set AutoML related configurations cell.

    • In the CLI notebook, enable this as follows:

      • Set metadata["automl_max_recommendations"] to the number of recommendations you want (an integer) in the Set AutoML related configurations cell.
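A minimal sketch of the two settings described above, assuming a value of 10. In the actual notebooks, automl_information and metadata already exist and carry other entries; only the key shown here is taken from this page.

```python
# API notebook: entry inside the automl_information dictionary.
automl_information = {
    "automl_max_recommendations": 10,  # run 10 sequential Bayesian experiments
}

# CLI notebook: the same setting on the metadata dictionary.
metadata = {}  # placeholder; the notebook defines the real dictionary
metadata["automl_max_recommendations"] = 10
```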

Hyperband

  • R: The maximum resources (i.e. the number of recommendations and the maximum number of epochs).

    • The default value is set to 27 and is adjusted as explained in the Hyperband Parameter Auto-adjustment Mechanism section.

    • In the API notebook, enable this as follows:

      • Set automl_R to an integer value inside the automl_information dictionary variable in the Set AutoML related configurations cell.

    • In the CLI notebook, enable this as follows:

      • Set metadata["automl_R"] to an integer value in the Set AutoML related configurations cell.

  • Nu: The proportion of recommendations discarded in each use of the SuccessiveHalving algorithm (only the best 1/Nu of the recommendations are kept). The default value is 3.

    • In the API notebook, enable this as follows:

      • Set automl_nu to the value of nu you want (an integer) inside the automl_information dictionary variable in the Set AutoML related configurations cell.

    • In the CLI notebook, enable this as follows:

      • Set metadata["automl_nu"] to the value of nu you want (an integer) in the Set AutoML related configurations cell.

  • epoch_multiplier: The value multiplied by the resources (r) allotted to an experiment to determine its number of epochs. The default value is 10 and can be adjusted as explained in the Hyperband Parameter Auto-adjustment Mechanism section.

    • In the API notebook, enable this as follows:

      • Set epoch_multiplier to the value you want (an integer) inside the automl_information dictionary variable in the Set AutoML related configurations cell.

    • In the CLI notebook, enable this as follows:

      • Set metadata["epoch_multiplier"] to the value you want (an integer) in the Set AutoML related configurations cell.

    The values of R and nu determine the number of experiments to run within each stage of the Hyperband run and the corresponding number of epochs for each experiment. For example, setting R=27, Nu=3, and epoch_multiplier=10 runs three stages of experiments, as described in Table 3 of the AutoML Algorithm Explanation section:

    • The first stage proposes 27 new recommendations to run for 10 epochs. Then, Hyperband keeps the 9 (27/3) best performing recommendations to run another 30 (10*3) epochs, and repeats until one recommendation remains.

    • The second and third stages respectively propose 9 and 6 new recommendations (based on Table 3) and follow the same procedure.

    • After all the experiments conclude, Hyperband returns the training config file and binary weights of the experiment that achieved the best accuracy.
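The first-stage arithmetic in the example above can be reproduced with a short sketch. The dictionary keys are the ones named in this section; the `first_stage_plan` helper is hypothetical and only illustrates the rule "keep the best 1/nu, multiply the resources by nu", with epochs computed as r × epoch_multiplier.

```python
# API notebook: Hyperband entries inside the automl_information dictionary.
automl_information = {
    "automl_R": 27,
    "automl_nu": 3,
    "epoch_multiplier": 10,
}


def first_stage_plan(R, nu, epoch_multiplier):
    """Return (recommendations, epochs) per SuccessiveHalving iteration of the first stage."""
    plan = []
    n, r = R, 1  # the first stage starts with R recommendations at resource 1
    while n >= 1:
        plan.append((n, r * epoch_multiplier))
        n //= nu  # keep the best 1/nu of the recommendations
        r *= nu   # give the survivors nu times more resources
    return plan


plan = first_stage_plan(
    automl_information["automl_R"],
    automl_information["automl_nu"],
    automl_information["epoch_multiplier"],
)
print(plan)
```

With the values above, the plan starts with 27 recommendations at 10 epochs and 9 recommendations at 30 epochs, matching the description of the first stage.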

Finally, you can change any training-related spec parameters (such as batch size, learning rate, weight regularizer, and checkpoint frequency) to be applied in each experiment. This step is necessary because these parameters often depend on the computing hardware specifications (e.g. GPU memory size).

After changing the above parameters, you can run the cells of the notebook. While running the notebook, some cells take time to complete execution. In the API notebook, the cells poll the status of the job every 15 seconds. The cell execution is completed only when the status switches to Done or Error.
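The polling behavior described above can be pictured with the following sketch. It is not the notebook's actual code: `get_status` is a hypothetical callable standing in for whatever request the notebook makes to read the job status, and only the 15-second interval and the Done/Error terminal states come from this page.

```python
import time


def wait_for_job(get_status, poll_interval=15, sleep=time.sleep):
    """Poll until the job status is Done or Error, then return that status."""
    while True:
        status = get_status()
        if status in ("Done", "Error"):
            return status
        sleep(poll_interval)  # wait before asking again
```

Passing `sleep` as a parameter keeps the loop easy to exercise without actually waiting, for example `wait_for_job(my_status_function, sleep=lambda seconds: None)`.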

For the AutoML train cell, the following status indicators are shown during cell execution. For the TAO-Client AutoML notebook, you can view either these indicators or the training logs of the experiments. By default, viewing status indicators is enabled; you can switch to viewing logs by setting poll_automl_stats = True.

  • The current experiment AutoML is executing

  • The total number of epochs across different experiments still pending

  • An approximate estimate of the time remaining until the AutoML experiment completes. Note that this estimate covers only the training time; it does not include the following times, which depend on the config values you set:

    • The time for evaluation at every n epochs

    • The time for miscellaneous actions like saving a checkpoint at every n epochs

    • The time for loading the model from disk for resuming training in Hyperband

  • The best accuracy value until the current experiment

Resultant Files

After downloading the job contents of AutoML experiments, the folder uses the following structure:

├── automl_metadata.json
├── brain.json
├── controller.json
│   ...............
├── best_model
│   ├── log.txt
│   ├── recommendation_5.yaml/protobuf
│   ├── status.json
│   ├── .....
│   └── weights
│      └── weights.tlt/.pth/.hdf5
│   ...............
├── experiment_0
│   ├── log.txt
│   ├── status.json
│   ├── .....
│   ...............
├── recommendation_0.yaml/protobuf
├── recommendation_1.yaml/protobuf
├── recommendation_2.yaml/protobuf
..............

  • controller.json: Summarizes the details of all experiments, including results, target hyperparameter values, and status (success or failure).

  • best_model: Contains the best performing model outcomes (specs, logs, weights) out of all experiment sets in the AutoML runs. The weight file (.tlt/.pth/.hdf5 format) is located in the best_model, best_model/train, or best_model/weights folder, depending on the network model.

  • experiment_n: Contains the outcomes (specs, logs) of the Nth experiment in the AutoML run.

  • recommendation_n.yaml/protobuf: Contains the entire spec information for the Nth experiment in the AutoML run.
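Because the weight file location depends on the network model, a small helper can search the three candidate folders listed above. This is a sketch under the assumption that the downloaded job contents follow the layout shown in this section; `find_best_weights` is a hypothetical name, not a TAO function.

```python
from pathlib import Path

WEIGHT_SUFFIXES = {".tlt", ".pth", ".hdf5"}
# best_model itself, best_model/train, and best_model/weights
CANDIDATE_SUBDIRS = ("", "train", "weights")


def find_best_weights(job_dir):
    """Return the weight files found under <job_dir>/best_model."""
    best_model = Path(job_dir) / "best_model"
    found = []
    for sub in CANDIDATE_SUBDIRS:
        folder = best_model / sub
        if folder.is_dir():
            found.extend(
                sorted(
                    p for p in folder.iterdir()
                    if p.is_file() and p.suffix in WEIGHT_SUFFIXES
                )
            )
    return found
```

For example, `find_best_weights("<path_to_downloaded_job>")` returns a list of paths, which is empty if no weight file is present.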

Results of AutoML experiments

The following table provides experimental results for network models on different AI applications.

  • These numbers are profiled with the default AutoML-specific parameters: max_recommendations=20 for Bayesian and R=27, nu=3 for Hyperband. For some network models (efficientdet, mask_rcnn, multitask_classification), the default AutoML parameters are adjusted by the Hyperband Parameter Auto-adjustment Mechanism to R=9, nu=3 for Hyperband.

  • All experiments are executed on a single GPU (Tesla V100) with Intel Xeon CPU E5-2698 v4.

  • When using multi-GPU mode, the time taken is expected to scale accordingly; similarly, using more powerful GPUs will reduce the execution time.

Table 1. The estimated time for an AutoML (Bayesian/Hyperband) experiment for each network model

model                      epoch  dataset          single experiment  Bayesian   Hyperband
detectnet_v2               80     FLIR20           45.7min            913.2min   534.2min
efficientdet               6      FLIR20           82min              1640min    410min
faster_rcnn                80     FLIR20           252min             5040min    2948.4min
retinanet                  100    FLIR20           75min              1500min    702min
ssd                        80     FLIR20           174min             3480min    2035min
yolo3                      80     FLIR20           60min              1200min    702min
yolo4                      80     FLIR20           110min             2200min    1287min
yolo4_tiny                 80     FLIR20           87min              1740min    1017.9min
mask_rcnn                  5      FLIR20           66min              1320min    237.6min
lprnet                     24     OpenALPR         1.1min             22min      39.2min
multiclass_classification  80     Pascal VOC       80min              1600min    1279.7min
multitask_classification   10     Fashion Product  12.5min            250min     799.8min
unet                       50     ISBI             4.5min             90min      98.5min
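Since Bayesian experiments run one after another, the Bayesian column in the table is approximately the single-experiment time multiplied by max_recommendations (20 by default). The sketch below uses that observation as a rough estimate; the function name is hypothetical, and the estimate ignores scheduling and other overheads.

```python
def estimate_bayesian_minutes(single_experiment_minutes, max_recommendations=20):
    """Rough total time for a Bayesian AutoML run with sequential experiments."""
    return single_experiment_minutes * max_recommendations


# efficientdet row of the table: 82 min per experiment
print(estimate_bayesian_minutes(82))
```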

AutoML Algorithm Explanation

This section gives additional details on the two AutoML algorithms.

Bayesian Optimization

Bayesian optimization aims to identify optimal configurations more quickly than standard baselines (such as standard random search) by adaptively selecting hyperparameters based on past experimental information.

  • A Gaussian Process (GP) is used as a surrogate model and is fitted to the existing data (X, y), where X is a vector of recommendations and y is the observed validation mAP.

  • Bayesian optimization then adaptively proposes the new recommendation that, according to the fitted GP, produces the best improvement in expectation, and repeats until the number of recommendations reaches max_recommendations.
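The two steps above can be illustrated with a toy, dependency-free sketch. It is not the TAO implementation: it assumes a single hyperparameter normalized to [0, 1], a zero-mean GP with an RBF kernel, and the expected-improvement acquisition function, none of which are specified on this page.

```python
import math


def rbf(a, b, length=0.3):
    """RBF kernel between two scalar hyperparameter values."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def gp_posterior(X, y, x_new, noise=1e-6):
    """Posterior mean and standard deviation of the GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    k_star = [rbf(a, x_new) for a in X]
    mean = sum(k * w for k, w in zip(k_star, solve(K, y)))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, math.sqrt(max(var, 1e-12))


def expected_improvement(mean, std, best):
    """Expected improvement over the best observed value (maximization)."""
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (mean - best) * cdf + std * pdf


def propose(X, y, candidates):
    """Pick the candidate with the highest expected improvement."""
    best = max(y)
    return max(candidates,
               key=lambda c: expected_improvement(*gp_posterior(X, y, c), best))
```

Each new recommendation would be evaluated by a full training experiment, appended to (X, y), and the loop repeated until max_recommendations is reached.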

Hyperband

Hyperband addresses the issue of expensive hyperparameter optimization by speeding up random search through adaptive resource allocation. Hyperband follows the SuccessiveHalving algorithm to uniformly allocate a budget to a set of hyperparameter recommendations. During AutoML runs, Hyperband evaluates the performance of all recommendations, discards the worst-performing ones (keeping the best 1/nu), and repeats this process until one recommendation remains (see this research paper for more details). Three Hyperband-related parameters (R, nu, epoch_multiplier) pre-define a Hyperband sequence of trials, based on how many trials to perform in each round and how many resources to give to each trial in that round.

  • The R parameter determines the maximum resources (i.e. the number of epochs and recommendations).

  • The Nu parameter controls the proportion of recommendations discarded in each use of the SuccessiveHalving algorithm.

  • The epoch_multiplier is multiplied by the resources (r) of each SuccessiveHalving iteration (i) to determine the number of epochs of each trial. Each run of SuccessiveHalving is referred to as a stage, and each stage contains several SuccessiveHalving iterations (i).

The following tables list the SuccessiveHalving iterations (i), the number of hyperparameter recommendations (n), and the resources (r) for given values of R and nu.

Table 2. The pre-defined Hyperband experiment schedule when set with R = 81, nu = 3.

     s=0       s=1       s=2       s=3
i    n    r    n    r    n    r    n    r
0    81   1    27   3    9    9    6    27
1    27   3    9    9    3    27   2    81
2    9    9    3    27   1    81
3    3    27   1    81
4    1    81

Table 3. The pre-defined Hyperband experiment schedule when set with R = 27, nu = 3.

     s=0       s=1       s=2
i    n    r    n    r    n    r
0    27   1    9    3    6    9
1    9    3    3    9    2    27
2    3    9    1    27
3    1    27
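The schedules in Tables 2 and 3 can be regenerated with the sketch below. The formula for the initial number of recommendations per stage is inferred from the tables themselves and is an assumption; it is not taken from the TAO source code and may not hold for other values of R and nu.

```python
def hyperband_schedule(R, nu):
    """Return one list of (n, r) pairs per stage, ordered as in the tables (s=0 first)."""
    # Largest s_max such that nu**s_max <= R, using integer arithmetic only.
    s_max = 0
    while nu ** (s_max + 1) <= R:
        s_max += 1

    stages = []
    for stage in range(s_max):  # the tables list s_max stages
        halvings = s_max - stage  # SuccessiveHalving steps after iteration 0
        n0 = ((s_max + 1) // (halvings + 1)) * nu ** halvings
        r0 = R // nu ** halvings
        stages.append(
            [(n0 // nu ** i, r0 * nu ** i) for i in range(halvings + 1)]
        )
    return stages


for s, stage in enumerate(hyperband_schedule(27, 3)):
    print(f"s={s}: {stage}")
```

Each (n, r) pair gives the number of recommendations and the resources for one SuccessiveHalving iteration; multiplying r by epoch_multiplier gives the number of epochs.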

Hyperband Parameter Auto-adjustment Mechanism

By default, the Hyperband parameter R is set to 27 and nu is set to 3. Based on these values and the training max_epoch of each experiment, the Hyperband parameter auto-adjustment mechanism calculates the epoch_multiplier as max_epoch^(log(nu)/log(R)). If the calculated epoch_multiplier is less than 3, R is reduced by 1 and the epoch_multiplier is recalculated; this repeats until the value is equal to or greater than 3.
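A literal reading of the mechanism above can be sketched as follows. The function name, the lower bound on R that guarantees termination, and the absence of any rounding are assumptions of this sketch; the actual TAO implementation may round or constrain the values differently.

```python
import math


def adjust_hyperband(max_epoch, R=27, nu=3, min_multiplier=3):
    """Return (R, epoch_multiplier) after the auto-adjustment described above."""

    def multiplier(r):
        # epoch_multiplier = max_epoch ^ (log(nu) / log(R))
        return max_epoch ** (math.log(nu) / math.log(r))

    m = multiplier(R)
    # Reduce R by 1 until the multiplier reaches the minimum.
    # The R > 2 guard is an assumption added so the loop always terminates.
    while m < min_multiplier and R > 2:
        R -= 1
        m = multiplier(R)
    return R, m


print(adjust_hyperband(80))  # R stays at 27 because 80^(1/3) is already above 3
```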

© Copyright 2023, NVIDIA. Last updated on Aug 26, 2024.