AutoML#
AutoML is a TAO API service that automatically selects deep learning hyperparameters for a chosen model and dataset. The TAO API provides a Jupyter notebook interface to try the AutoML feature.
Each AutoML run contains multiple training experiments. At the end of an AutoML run, you can access the config containing the hyperparameters of the best performing model amongst the multiple experiments, along with the binary weight file for deploying the model to your application.
AutoML supports all TAO models except the MAL - Auto Labeling model.
AutoML Algorithm Explanation#
This section gives additional details of two AutoML algorithms.
Bayesian Optimization#
Bayesian optimization aims to identify optimal configurations more quickly than standard baselines (such as random search) by adaptively selecting hyperparameters based on past experimental information.
A Gaussian Process (GP) surrogate model is fitted to the existing data (X, y), where X is a vector of recommendations and y is the observed validation map.
Bayesian optimization then adaptively proposes new recommendations, choosing those that produce the best improvement in expectation under the fitted GP, until the number of recommendations reaches max_recommendations.
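As a rough illustration of "best improvement in expectation", the sketch below scores candidates with the expected improvement (EI) criterion. The text does not name the acquisition function TAO uses, so EI is an assumption here, and `expected_improvement`, `pick_next`, and the candidate tuples are hypothetical names. The inputs stand in for the mean and standard deviation a fitted GP would predict for each candidate.

```python
import math

def expected_improvement(mu: float, sigma: float, best_so_far: float) -> float:
    """Expected improvement of a candidate over the best observed score (higher is better)."""
    if sigma <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(mu - best_so_far, 0.0)
    z = (mu - best_so_far) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return (mu - best_so_far) * cdf + sigma * pdf

def pick_next(candidates, best_so_far):
    """candidates: (name, mu, sigma) tuples (assumed GP predictions). Returns the name with the highest EI."""
    return max(candidates, key=lambda c: expected_improvement(c[1], c[2], best_so_far))[0]

# Two hypothetical candidates with the same predicted mean; the uncertain one scores higher.
print(pick_next([("config_a", 0.50, 0.0), ("config_b", 0.50, 0.1)], best_so_far=0.50))
```

With equal predicted means, the candidate with more uncertainty is preferred, which is how this criterion balances exploring new regions against refining known good ones.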
Hyperband#
Hyperband addresses the issue of expensive hyperparameter optimization by speeding up random search through adaptive resource allocation. Hyperband follows the SuccessiveHalving algorithm to uniformly allocate a budget to a set of hyperparameter recommendations. During AutoML runs, Hyperband evaluates the performance of all recommendations, discards the worst-performing ones (the proportion is controlled by nu), and repeats this process until one recommendation remains (see this research paper for more details). Three Hyperband-related parameters (R, nu, epoch_multiplier) pre-define a Hyperband sequence of trials based on how many trials to perform in each round and how many resources to give to each trial in that round.
The R parameter determines the maximum resources (i.e. the number of epochs and recommendations).
The nu parameter controls the proportion of recommendations discarded in each use of the SuccessiveHalving algorithm.
The epoch_multiplier and the resources (r) are multiplied in each SuccessiveHalving iteration (i) to determine the epoch number of each trial. Each run of SuccessiveHalving is referred to as a stage (s), and each stage contains one or more SuccessiveHalving iterations (i).
The tables below list the SuccessiveHalving iteration (i), the number of hyperparameter recommendations (n), and the resources (r) for given values of R and nu.
Table 2. The pre-defined Hyperband experiment schedule when set with R = 81, nu = 3.
| i | s=0: n | s=0: r | s=1: n | s=1: r | s=2: n | s=2: r | s=3: n | s=3: r |
|---|---|---|---|---|---|---|---|---|
| 0 | 81 | 1 | 27 | 3 | 9 | 9 | 6 | 27 |
| 1 | 27 | 3 | 9 | 9 | 3 | 27 | 2 | 81 |
| 2 | 9 | 9 | 3 | 27 | 1 | 81 | | |
| 3 | 3 | 27 | 1 | 81 | | | | |
| 4 | 1 | 81 | | | | | | |
Table 3. The pre-defined Hyperband experiment schedule when set with R = 27, nu = 3.
| i | s=0: n | s=0: r | s=1: n | s=1: r | s=2: n | s=2: r |
|---|---|---|---|---|---|---|
| 0 | 27 | 1 | 9 | 3 | 6 | 9 |
| 1 | 9 | 3 | 3 | 9 | 2 | 27 |
| 2 | 3 | 9 | 1 | 27 | | |
| 3 | 1 | 27 | | | | |
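The rows within each stage of these schedules follow a simple pattern: the number of recommendations is divided by nu and the resources are multiplied by nu until the resources would exceed R. The sketch below reproduces the R = 27, nu = 3 schedule from the starting (n, r) pair of each stage. The function name `successive_halving_rounds` is hypothetical, and the starting pairs are copied from the schedule rather than derived, since the text does not give a formula for them.

```python
def successive_halving_rounds(n0: int, r0: int, nu: int, R: int):
    """List the (n, r) pairs of one stage, starting from n0 recommendations with r0 resources each."""
    if nu < 2:
        raise ValueError("nu must be at least 2")
    rounds = []
    n, r = n0, r0
    while n >= 1 and r <= R:
        rounds.append((n, r))
        n //= nu  # keep the best 1/nu of the recommendations (rounded down)
        r *= nu   # give each survivor nu times more resources
    return rounds

# Starting (n, r) of each stage, read off the R = 27, nu = 3 schedule.
stage_starts = [(27, 1), (9, 3), (6, 9)]
for s, (n0, r0) in enumerate(stage_starts):
    print(f"s={s}:", successive_halving_rounds(n0, r0, nu=3, R=27))
```

The same function applied to the starting pairs of the R = 81 schedule, for example (81, 1) or (6, 27), yields the corresponding columns of that schedule.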
Hyperband Parameter Auto-adjustment Mechanism#
By default, the Hyperband parameter R is set to 27 and nu is set to 3. Based on these values and the training max_epoch of each experiment, the Hyperband parameter auto-adjustment mechanism calculates the epoch_multiplier as max_epoch^(log(nu)/log(R)).
If the calculated epoch_multiplier is less than 3, R is reduced by 1 and the epoch_multiplier is recalculated. This repeats until the value is equal to or greater than 3.
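Read literally, the rule above is a small loop. The following sketch follows the prose word for word; it is not taken from the TAO source, so the service may round or step differently. The function name `adjust_hyperband` and the lower bound of 2 on R (added only to keep the logarithm well defined) are assumptions.

```python
import math

def adjust_hyperband(max_epoch: int, R: int = 27, nu: int = 3, min_multiplier: float = 3.0):
    """Return (R, epoch_multiplier) after applying the described auto-adjustment rule."""
    def multiplier(r_value: int) -> float:
        # epoch_multiplier = max_epoch ^ (log(nu) / log(R))
        return max_epoch ** (math.log(nu) / math.log(r_value))

    m = multiplier(R)
    # Assumption: stop at R = 2 so that log(R) never reaches zero.
    while m < min_multiplier and R > 2:
        R -= 1
        m = multiplier(R)
    return R, m

print(adjust_hyperband(max_epoch=80))  # R stays at the default because the multiplier is already above 3
print(adjust_hyperband(max_epoch=6))   # R is reduced until the multiplier reaches 3
```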
Prerequisites#
Before using AutoML, you need to deploy the TAO REST API service using the TAO API Setup steps.
The Hardware and Software requirements mentioned in the TAO API Setup section are also applicable to AutoML.
After deploying the TAO API, you can use the following commands on the host machine to obtain the node ip_address and port_number for use in the notebooks:
ip_address:
hostname -i
port_number:
kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'
AutoML Notebooks#
AutoML supports all the notebooks that support the training functionality–except for MAL - Auto Labeling.
Set the automl_enabled variable to True at the beginning of the notebook, then run through the notebook.
Once a notebook is available on your computer, you can launch it with the following commands:
cd <path_to_your_notebook>
jupyter-lab --ip 0.0.0.0 --allow-root --port 8888
Note
We recommend running only one AutoML notebook at a time. Each experiment within AutoML is a separate Kubernetes job, and Kubernetes performs scheduling based on jobs submitted to the queue. Hence, running multiple models across notebooks will fill the queue with AutoML experiments of different models, extending the execution time to complete a single AutoML experiment.
Getting Started#
[Mandatory] User Inputs#
For each notebook, you need to provide/modify the following parameters:
model_name: The list of supported network models will be present in the notebook cell.
dataset path: For each task, the dataset should be in a certain folder structure (provided in the dataset description in the notebooks).
automl_algorithm: The AutoML algorithm options to use. A brief explanation of the algorithms can be found in the AutoML Algorithm Explanation section.
Bayesian: Adaptively proposes new hyperparameters based on past experiments. This algorithm generally provides better accuracy results, but with a longer execution time for all experiments, when compared to Hyperband.
Hyperband: Hyperband accelerates the expensive hyperparameter optimization using random search with adaptive resource allocation.
[Optional] User Inputs#
For each model, TAO sets a default list of parameters for the AutoML search. You can add new valid parameters via additional_automl_parameters or remove parameters from the default list via remove_default_automl_parameters.
Each notebook contains hyperlinks in the Set AutoML related configurations cell to view the valid list of parameters for a particular model:
additional_automl_parameters: Add parameters to the AutoML search space using this list of strings (e.g. additional_automl_parameters = ['parameter1','parameter2']).
Any parameter in the hyperlink table whose automl_enabled column is set to neither True nor False can be added to the AutoML search space.
For example, for DetectNet_v2:
dataset_config.target_class_mapping can’t be added to the additional_automl_parameters list, as it is not eligible to be included in the search space.
training_config.regularizer.weight doesn’t need to be added to this list, as it is already enabled.
augmentation_config.preprocessing.output_image_width can be added, as the automl_enabled column is set to neither True nor False.
remove_default_automl_parameters: Remove parameters that are enabled by default for AutoML search (e.g. remove_default_automl_parameters = ['parameter1','parameter2']).
Any parameter in the hyperlink table that has the automl_enabled column set to True can be removed from the AutoML search space.
For example, for DetectNet_v2, training_config.regularizer.weight can be removed from the AutoML search space.
Weighted List Options
For parameters with discrete list options (valid_options), you can specify weights to prioritize certain options over others during AutoML search. This is particularly useful when you know certain hyperparameter values are more likely to perform well.
option_weights: Assign weights to list options to control their sampling probability. The weights must be positive numbers and their length must match the number of valid options.
For example, if a parameter has valid_options: [0.1, 0.5, 1.0, 2.0] and you want to favor the middle values, you can set:
custom_automl_ranges = {
    "parameter_name": {
        "option_weights": [0.1, 0.4, 0.4, 0.1]  # Higher weights = higher probability
    }
}
The weights are automatically normalized to sum to 1, so [1, 4, 4, 1] is equivalent to [0.1, 0.4, 0.4, 0.1].
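To make the normalization concrete, here is a minimal sketch of how weighted options could be validated, normalized, and sampled. It illustrates the idea only and is not the TAO implementation; `normalize_weights` and `sample_option` are hypothetical helper names.

```python
import random

def normalize_weights(option_weights):
    """Scale positive weights so that they sum to 1."""
    if any(w <= 0 for w in option_weights):
        raise ValueError("option_weights must be positive numbers")
    total = sum(option_weights)
    return [w / total for w in option_weights]

def sample_option(valid_options, option_weights, rng=random):
    """Draw one option with probability proportional to its weight."""
    if len(valid_options) != len(option_weights):
        raise ValueError("option_weights must have one entry per valid option")
    probabilities = normalize_weights(option_weights)
    return rng.choices(valid_options, weights=probabilities, k=1)[0]

valid_options = [0.1, 0.5, 1.0, 2.0]
print(normalize_weights([1, 4, 4, 1]))             # same proportions as [0.1, 0.4, 0.4, 0.1]
print(sample_option(valid_options, [1, 4, 4, 1]))  # most often one of the two middle values
```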
[Optional] AutoML Algorithm-Specific Parameters#
There are some algorithm-specific parameters set to default values that determine the AutoML experiment schedules. You can optionally modify them as follows:
Bayesian
max_recommendations: The maximum number of full-scale training experiments to run. The default value is 20.
For example, setting this value to 10 runs 10 training experiments in sequential order, because the training config file of the nth experiment is computed from the (n-1)th experiment. At the end of the 10 experiments, the algorithm returns the training config file and binary weights of the experiment that achieved the best accuracy.
In the API notebook, enable this as follows:
automl_max_recommendations: <number of recommendations you want to set (integer)> inside the automl_information dictionary variable in the Set AutoML related configurations cell.
In the CLI notebook, enable this as follows:
metadata["automl_max_recommendations"]: <number of recommendations you want to set (integer)> in the Set AutoML related configurations cell.
Hyperband
R: The maximum resources (i.e. the number of recommendations and maximum epochs).
The default value is set to 27 and is adjusted as explained in the Hyperband Parameter Auto-adjustment Mechanism section.
In the API notebook, enable this as follows:
automl_R: <integer value> inside the automl_information dictionary variable in the Set AutoML related configurations cell.
In the CLI notebook, enable this as follows:
metadata["automl_R"]: <integer value> in the Set AutoML related configurations cell.
nu: The proportion of recommendations discarded in each use of the SuccessiveHalving algorithm. The default value is 3.
In the API notebook, enable this as follows:
automl_nu: <value of nu you want to set (integer)> inside the automl_information dictionary variable in the Set AutoML related configurations cell.
In the CLI notebook, enable this as follows:
metadata["automl_nu"]: <value of nu you want to set (integer)> in the Set AutoML related configurations cell.
epoch_multiplier: The number of epochs for each trial is determined by multiplying this value with the resources (r). The default value is 10 and can be adjusted as explained in the Hyperband Parameter Auto-adjustment Mechanism section.
In the API notebook, enable this as follows:
epoch_multiplier: <value of epoch_multiplier you want to set (integer)> inside the automl_information dictionary variable in the Set AutoML related configurations cell.
In the CLI notebook, enable this as follows:
metadata["epoch_multiplier"]: <value of epoch_multiplier you want to set (integer)> in the Set AutoML related configurations cell.
The values of R and nu determine the number of experiments to run within each stage of the Hyperband run and the corresponding number of epochs for each experiment. For example, setting R=27, nu=3, epoch_multiplier=10 will run three stages of experiments, as described in Table 3 of the AutoML Algorithm Explanation section:
The first stage proposes 27 new recommendations to run for 10 epochs. Then, Hyperband keeps the 9 (27/3) best performing recommendations to run another 30 (10*3) epochs, and repeats until one recommendation remains.
The second and third stages respectively propose 9 and 6 new recommendations (based on Table 3) and follow the same procedure.
After all the experiments conclude, Hyperband returns the training config file and binary weights of the experiment that achieved the best accuracy.
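As a compact illustration of the settings described in this section, the sketch below writes them as Python dictionaries and derives the epoch budget of the first Hyperband stage. Only the key names come from the text; the values are examples, the two dictionaries stand for entries that would go inside automl_information (which may need other keys in the actual notebook), and `first_stage_epochs` is a hypothetical helper.

```python
# Hypothetical values for illustration; only the key names come from the text above.
bayesian_settings = {"automl_max_recommendations": 10}
hyperband_settings = {"automl_R": 27, "automl_nu": 3, "epoch_multiplier": 10}

def first_stage_epochs(settings):
    """Return (recommendations, epochs) for each round of the first Hyperband stage."""
    nu = settings["automl_nu"]
    if nu < 2:
        raise ValueError("automl_nu must be at least 2")
    n, r = settings["automl_R"], 1  # the first stage starts with n = R and r = 1
    rounds = []
    while n >= 1:
        rounds.append((n, r * settings["epoch_multiplier"]))
        n //= nu  # keep the best 1/nu of the recommendations
        r *= nu   # multiply the resources of the survivors by nu
    return rounds

print(first_stage_epochs(hyperband_settings))
```

For the example values this yields 27 recommendations at 10 epochs, then 9 at 30, 3 at 90, and 1 at 270, matching the first-stage description above.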
Finally, you can change any training-related spec parameters (such as batch size, learning rate, weight regularizer, and frequency of checkpoints), and the changes are reflected in each experiment. This step is necessary because these kinds of parameters are often dependent on the computing hardware specifications (e.g. GPU memory size).
After changing the above parameters, you can run the cells of the notebook. While running the notebook, some cells take time to complete execution. In the API notebook, the cells poll the status of the job every 15 seconds. The cell execution is completed only when the status switches to Done or Error.
AutoML Outcomes#
For the AutoML train cell, you can see the following status indicators during cell execution. For the TAO-Client AutoML notebook, you can either view these indicators or view the training logs of the experiments. By default, viewing status indicators is enabled; you can switch to viewing logs by setting poll_automl_stats = True.
{
    "c9efeab0-0756-47d5-b6ae-3b51e2c02b34": {
        "detailed_status": {
            "message": "AutoML run is successful with best checkpoints under /results/c9efeab0-0756-47d5-b6ae-3b51e2c02b34",
            "status": "SUCCESS"
        }
    },
    "ea7bba9f-c88f-41ba-8b44-75b9c6fc1ec5": {
        "detailed_status": {
            "date": "5/6/2025",
            "message": "train action completed successfully for segformer",
            "status": "SUCCESS",
            "time": "15:36:29"
        },
        "epoch": 9,
        "eta": "0:00:00",
        "kpi": [
            {
                "metric": "train_loss",
                "values": {
                    "0": 0.7115289568901062,
                    "9": 0.3484381139278412
                }
            },
            {
                "metric": "val_loss",
                "values": {
                    "8": 0.3261469602584839,
                    "9": 0.3261469602584839
                }
            },
            {
                "metric": "val_miou",
                "values": {
                    "8": 0.5810916124069727,
                    "9": 0.5810916124069727
                }
            }
        ],
        "max_epoch": 9,
        "time_per_epoch": "0:00:01.477164"
    }
}
Understanding the AutoML Status#
The status response provides detailed information about the AutoML run and its experiments:
AutoML Brain ID (e.g., c9efeab0-0756-47d5-b6ae-3b51e2c02b34): This is a unique identifier for the AutoML orchestrator that manages the entire AutoML run. The status shows the overall run information, including where the best checkpoints are stored.
Experiment ID (e.g., ea7bba9f-c88f-41ba-8b44-75b9c6fc1ec5): Each experiment within the AutoML run has its own unique identifier. The AutoML brain runs multiple experiments with different hyperparameter configurations, each with its own UUID.
Detailed Status: Contains information about the experiment’s current state:
message: Descriptive information about the experiment status
status: Current state (e.g., “SUCCESS”)
date and time: When the status was last updated
Training Progress:
epoch: Current training epoch
max_epoch: Total number of epochs for this experiment
eta: Estimated time remaining for the current experiment
time_per_epoch: Average time taken per epoch
Performance Metrics (kpi): Lists various metrics tracked during training:
train_loss: Training loss values at different epochs
val_loss: Validation loss values
val_miou: Validation mean intersection over union (for segmentation tasks)
Other metrics specific to the model type
This status information helps you monitor the progress of your AutoML run, track the performance of individual experiments, and understand how the hyperparameter optimization is proceeding.
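If you want to summarize such a status response programmatically, a small helper along these lines may be enough. It assumes the response has already been loaded into a Python dictionary shaped like the sample above (IDs at the top level, an optional kpi list per entry, epoch numbers as string keys); `latest_kpis` is a hypothetical name, not part of the TAO API.

```python
def latest_kpis(status):
    """Map each experiment ID to {metric: (last_epoch, value)} using the newest epoch reported."""
    summary = {}
    for job_id, info in status.items():
        kpis = info.get("kpi")
        if not kpis:
            continue  # entries without KPIs, such as the AutoML brain in the sample, are skipped
        summary[job_id] = {}
        for entry in kpis:
            values = entry["values"]
            last_epoch = max(values, key=int)  # epoch keys are strings such as "8" and "9"
            summary[job_id][entry["metric"]] = (int(last_epoch), values[last_epoch])
    return summary

# Trimmed copy of the sample response.
sample_status = {
    "c9efeab0-0756-47d5-b6ae-3b51e2c02b34": {"detailed_status": {"status": "SUCCESS"}},
    "ea7bba9f-c88f-41ba-8b44-75b9c6fc1ec5": {
        "epoch": 9,
        "kpi": [
            {"metric": "val_loss", "values": {"8": 0.3261469602584839, "9": 0.3261469602584839}},
            {"metric": "val_miou", "values": {"8": 0.5810916124069727, "9": 0.5810916124069727}},
        ],
    },
}
print(latest_kpis(sample_status))
```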
Note
The eta reported for an AutoML experiment covers only the training time. The actual time to completion also includes the following, which depend on the config values you set:
The time for evaluation at every n epochs
The time for miscellaneous actions like saving a checkpoint at every n epochs
The time for loading the model from disk for resuming training in Hyperband
Results of AutoML experiments#
The table below provides experimental results for network models on different AI applications.
These numbers are profiled with the default AutoML-specific parameters: max_recommendations=20 for Bayesian and R=27, nu=3 for Hyperband. For some network models (efficientdet, mask_rcnn, multitask_classification), the default automl parameter is adjusted by the Hyperband Parameter Auto-adjustment Mechanism to be R=9, nu=3 for Hyperband.
All experiments are executed on a single GPU (Tesla V100) with an Intel Xeon CPU E5-2698 v4.
When using multi-GPU mode, the time taken is expected to scale accordingly; similarly, using more powerful GPUs will reduce the execution time.
Table 1. The estimated time for an AutoML (Bayesian/Hyperband) experiment for each network model
| model | epoch | dataset | single experiment | Bayesian | Hyperband |
|---|---|---|---|---|---|
| detectnet_v2 | 80 | FLIR20 | 45.7min | 913.2min | 534.2min |
| efficientdet | 6 | FLIR20 | 82min | 1640min | 410min |
| faster_rcnn | 80 | FLIR20 | 252min | 5040min | 2948.4min |
| retinanet | 100 | FLIR20 | 75min | 1500min | 702min |
| ssd | 80 | FLIR20 | 174min | 3480min | 2035min |
| yolo3 | 80 | FLIR20 | 60min | 1200min | 702min |
| yolo4 | 80 | FLIR20 | 110min | 2200min | 1287min |
| yolo4_tiny | 80 | FLIR20 | 87min | 1740min | 1017.9min |
| mask_rcnn | 5 | FLIR20 | 66min | 1320min | 237.6min |
| lprnet | 24 | OpenALPR | 1.1min | 22min | 39.2min |
| multiclass_classification | 80 | Pascal VOC | 80min | 1600min | 1279.7min |
| multitask_classification | 10 | Fashion Product | 12.5min | 250min | 799.8min |
| unet | 50 | ISBI | 4.5min | 90min | 98.5min |