AutoML#
AutoML is a TAO API service that automatically selects deep learning hyperparameters for a chosen model and dataset. The TAO API provides a Jupyter notebook interface to try the AutoML feature.
Each AutoML run contains multiple training experiments. At the end of an AutoML run, you can access the config containing the hyperparameters of the best performing model amongst the multiple experiments, along with the binary weight file for deploying the model to your application.
AutoML supports all TAO models except the MAL - Auto Labeling model.
AutoML Algorithm Explanation#
This section gives additional details of the supported AutoML algorithms.
Choosing an Algorithm#
If you are unsure which algorithm to pick, use this guidance:
- New to AutoML / quick start: use bayesian. It is the most well-understood algorithm and produces consistently good results with minimal tuning.
- Limited GPU budget, need speed: use hyperband. It discards poor trials early and converges faster than a full Bayesian search.
- Large search space (many hyperparameters): use bohb or dehb. Both scale better than pure Bayesian approaches in high-dimensional spaces.
- Multiple GPUs available: use bfbo (generates a whole batch of recommendations in parallel) or asha (asynchronous, no sync barrier between workers).
- Want hyperparameters to evolve during training: use pbt. Unlike other algorithms, PBT can change learning rate or other parameters mid-training, not just between runs.
- Quick pruning of bad trials: use hyperband_es. It adds explicit early-stopping thresholds on top of Hyperband to cut off clearly failing trials sooner.
Bayesian Optimization#
Bayesian optimization aims to identify optimal configurations more quickly than standard baselines (such as standard random search) by adaptively selecting hyperparameters based on past experimental information.
Bayesian optimization uses a Gaussian Process (GP) as a surrogate model, fitted to the existing data (X, y), where X is the set of recommendations evaluated so far and y is the observed validation metric (for example, validation mAP).
Based on the fitted GP, Bayesian Optimization adaptively proposes the new recommendation that produces the best improvement in expectation, and repeats until reaching the number of max_recommendations.
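The "best improvement in expectation" criterion can be illustrated with the expected improvement (EI) acquisition function, a common choice for GP-based search. The sketch below is illustrative only: the source does not state which acquisition function TAO uses, and the candidate names and GP predictions are hypothetical.

```python
import math

def expected_improvement(mean, std, best_so_far):
    """Expected improvement of a candidate over the best observed score (higher is better)."""
    if std <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return (mean - best_so_far) * cdf + std * pdf

# Hypothetical GP predictions (mean, std) of the validation metric for three candidates.
candidates = {"cfg_a": (0.62, 0.01), "cfg_b": (0.60, 0.08), "cfg_c": (0.55, 0.02)}
best_observed = 0.61

scores = {name: expected_improvement(m, s, best_observed) for name, (m, s) in candidates.items()}
next_config = max(scores, key=scores.get)  # candidate to train next
```

In this example the uncertain candidate cfg_b scores higher than cfg_a, even though its predicted mean is lower, which shows how EI balances exploitation against exploration.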
Hyperband#
Hyperband addresses the issue of expensive hyperparameter optimization by speeding up random search through adaptive
resource allocation. Hyperband follows the SuccessiveHalving algorithm to uniformly allocate a budget to a
set of hyperparameter recommendations. During AutoML runs, Hyperband evaluates the performance of all recommendations,
discards the worst performers (keeping roughly 1/nu of them), and repeats this process until one recommendation remains (see this research paper
for more details). Three hyperband-related parameters (R, nu, epoch_multiplier) pre-define a hyperband
sequence of trials based on how many trials to perform in each round and how many resources to give to each trial in
that round.
- The R parameter determines the maximum resources (i.e. the number of epochs and recommendations).
- The nu parameter controls the proportion of recommendations discarded in each use of the SuccessiveHalving algorithm.
- The epoch_multiplier and the resources (r) are multiplied in each SuccessiveHalving iteration (i) to determine the epoch number of each trial. Each run of SuccessiveHalving is referred to as a stage (s), and each stage contains several SuccessiveHalving iterations (i).

The tables below list, for each stage (s) and SuccessiveHalving iteration (i), the number of hyperparameter recommendations (n)
and the resources (r) for the given R and nu.
Table 2. The pre-defined Hyperband experiment schedule for R = 81, nu = 3.

| i | n (s=0) | r (s=0) | n (s=1) | r (s=1) | n (s=2) | r (s=2) | n (s=3) | r (s=3) |
|---|---|---|---|---|---|---|---|---|
| 0 | 81 | 1 | 27 | 3 | 9 | 9 | 6 | 27 |
| 1 | 27 | 3 | 9 | 9 | 3 | 27 | 2 | 81 |
| 2 | 9 | 9 | 3 | 27 | 1 | 81 | | |
| 3 | 3 | 27 | 1 | 81 | | | | |
| 4 | 1 | 81 | | | | | | |
Table 3. The pre-defined Hyperband experiment schedule for R = 27, nu = 3.

| i | n (s=0) | r (s=0) | n (s=1) | r (s=1) | n (s=2) | r (s=2) |
|---|---|---|---|---|---|---|
| 0 | 27 | 1 | 9 | 3 | 6 | 9 |
| 1 | 9 | 3 | 3 | 9 | 2 | 27 |
| 2 | 3 | 9 | 1 | 27 | | |
| 3 | 1 | 27 | | | | |
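The schedules in the two tables above can be reproduced with a few lines of integer arithmetic. The sketch below is inferred from the table values only and is not taken from the TAO implementation; the function name hyperband_schedule is hypothetical.

```python
def hyperband_schedule(R, nu):
    """Return one list per stage; each entry is (n, r) for a SuccessiveHalving iteration."""
    # Largest exponent s_max such that nu ** s_max <= R (integer logarithm).
    s_max = 0
    while nu ** (s_max + 1) <= R:
        s_max += 1

    stages = []
    # The tables list s_max stages, starting with the most aggressive one.
    for stage in range(s_max):
        s = s_max - stage
        n = ((s_max + 1) // (s + 1)) * nu ** s   # initial number of recommendations
        r = R // nu ** s                         # initial resources per recommendation
        # Each iteration keeps 1/nu of the recommendations and multiplies resources by nu.
        stages.append([(n // nu ** i, r * nu ** i) for i in range(s + 1)])
    return stages

for stage_index, iterations in enumerate(hyperband_schedule(27, 3)):
    print(f"s={stage_index}: {iterations}")
```

Multiplying each r by epoch_multiplier gives the number of epochs for the trials in that iteration.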
BOHB (Bayesian Optimization and Hyperband)#
BOHB combines Bayesian optimization with Hyperband’s early stopping. It builds a statistical model of which hyperparameter regions tend to do well, then uses Hyperband’s resource scheduling to allocate more epochs to the most promising trials. The result is faster convergence than vanilla Bayesian search when you have a large number of hyperparameters.
Key parameters:
- automl_kde_samples: Number of completed trials used to fit the performance model. Raise this if the search feels too random early on; lower it to react faster to new results.
- automl_top_n_percent (default: 15): What fraction of past trials counts as “good” when building the model. Lower = more selective = more focused search.
- automl_min_points_in_model: How many trials to run randomly before the model kicks in. Useful for very cold starts where you have no data yet.
BFBO (Batch First Bayesian Optimization)#
A variant of Bayesian optimization that generates a full batch of recommendations at once before waiting for results. This means you can fill all your available GPUs immediately rather than waiting for each experiment to finish before picking the next configuration.
Key parameters:
- automl_max_recommendations: Total experiments to run (same meaning as the bayesian algorithm’s max_recommendations).
ASHA (Asynchronous Successive Halving Algorithm)#
ASHA is like Hyperband but without requiring all workers to finish a stage before the next begins. Workers that finish early move straight to the next round, keeping all GPUs busy even when some experiments run faster than others. Best for heterogeneous clusters or when your experiments have very different runtimes.
Key parameters:
- automl_max_concurrent: How many trials to run at the same time.
- automl_max_trials: Total trials before stopping the search.
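The asynchronous promotion rule can be sketched as follows: whenever a worker becomes free, any trial that sits in the top 1/reduction_factor of the results reported so far for its round may move on, without waiting for the rest of the round. This is a generic illustration of ASHA; the function name and the reduction_factor argument are assumptions, not TAO parameters.

```python
def promotable_trials(rung_results, already_promoted, reduction_factor=3):
    """Return ids of trials that may advance to the next round right now.

    rung_results maps trial id -> metric reported at this round (higher is better).
    """
    # Only the top 1/reduction_factor of the results seen so far qualify.
    k = len(rung_results) // reduction_factor
    ranked = sorted(rung_results, key=rung_results.get, reverse=True)
    return [trial for trial in ranked[:k] if trial not in already_promoted]

# Three trials have reported so far; the best one can be promoted immediately.
results = {"trial_1": 0.50, "trial_2": 0.70, "trial_3": 0.60}
print(promotable_trials(results, already_promoted=set()))
```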
PBT (Population-Based Training)#
PBT is fundamentally different from the other algorithms: it does not pick hyperparameters once at the start and then train. Instead it runs a group of models in parallel and periodically replaces the worst-performing ones with copies of the best-performing ones, then randomly mutates their hyperparameters. This lets the learning rate, batch size, and other parameters evolve during training — which is especially useful for schedules that are hard to set in advance.
Key parameters:
- automl_population_size: Number of parallel models. Larger populations explore more of the hyperparameter space but require more GPUs.
- automl_eval_interval: How often (in epochs) the population is ranked and the weakest members are replaced.
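One ranking-and-replacement round can be sketched as below. This is a generic PBT illustration, not TAO's implementation: the replace_fraction and perturb arguments, the dictionary layout, and the function name are all hypothetical, and a real system would also copy the winner's checkpoint.

```python
import random

def exploit_and_explore(population, rng, replace_fraction=0.25, perturb=(0.8, 1.2)):
    """One PBT round: overwrite the weakest members with perturbed copies of the strongest.

    Assumes at least two members and replace_fraction <= 0.5 so the groups do not overlap.
    """
    ranked = sorted(population, key=lambda member: member["score"], reverse=True)
    k = max(1, int(len(ranked) * replace_fraction))
    top, bottom = ranked[:k], ranked[-k:]
    for loser in bottom:
        winner = rng.choice(top)
        # Exploit: inherit the winner's hyperparameters.
        # Explore: scale each value by a randomly chosen perturbation factor.
        loser["hparams"] = {
            name: value * rng.choice(perturb) for name, value in winner["hparams"].items()
        }
        loser["copied_from"] = winner["id"]
    return population

population = [
    {"id": "m0", "score": 0.71, "hparams": {"learning_rate": 1e-3}},
    {"id": "m1", "score": 0.64, "hparams": {"learning_rate": 5e-4}},
    {"id": "m2", "score": 0.58, "hparams": {"learning_rate": 2e-3}},
    {"id": "m3", "score": 0.40, "hparams": {"learning_rate": 1e-2}},
]
exploit_and_explore(population, random.Random(0))
```

After this call the weakest member (m3) continues training with a perturbed copy of the learning rate used by m0.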
DEHB (Differential Evolution with Hyperband)#
DEHB pairs Hyperband’s multi-fidelity scheduling (cheap early evaluation, expensive late evaluation) with differential evolution — a mutation-based global optimizer that explores the hyperparameter space by combining existing solutions. It tends to outperform vanilla Hyperband on very large or continuous search spaces.
Key parameters:
- automl_mutation_factor: How far a mutant configuration can move from its parent (0–2). Higher values = more exploration; lower values = more exploitation.
- automl_crossover_prob: Fraction of hyperparameters inherited from the mutant vs. the original (0–1). Higher = mutant-leaning; lower = parent-leaning.
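The roles of the mutation factor and the crossover probability can be seen in a minimal differential-evolution step. The sketch below uses the classic rand/1/bin scheme on hyperparameters scaled to [0, 1]; whether TAO uses this exact variant is an assumption, and de_offspring is a hypothetical name.

```python
import random

def de_offspring(parent, a, b, c, mutation_factor, crossover_prob, rng):
    """Build one child configuration from a parent and three other population members."""
    # Mutation: move from a along the difference (b - c), clipped to the unit interval.
    mutant = [
        min(1.0, max(0.0, ai + mutation_factor * (bi - ci)))
        for ai, bi, ci in zip(a, b, c)
    ]
    # Crossover: each gene comes from the mutant with probability crossover_prob;
    # one randomly chosen gene always comes from the mutant.
    forced = rng.randrange(len(parent))
    return [
        m if (rng.random() < crossover_prob or j == forced) else p
        for j, (p, m) in enumerate(zip(parent, mutant))
    ]

child = de_offspring(
    parent=[0.2, 0.3, 0.6],
    a=[0.5, 0.5, 0.5], b=[0.9, 0.1, 0.5], c=[0.1, 0.3, 0.5],
    mutation_factor=0.5, crossover_prob=0.5, rng=random.Random(0),
)
```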
Hyperband with Early Stopping (hyperband_es)#
A drop-in extension of Hyperband that can terminate a trial mid-stage if its validation metric falls below a threshold — even before the stage’s epoch budget is exhausted. This is useful when you know that any run below a certain accuracy at epoch 5 is never going to recover by epoch 20.
Key parameters:
- automl_early_stop_threshold: The metric value below which a trial is stopped. Set this based on the minimum acceptable validation accuracy/mIoU/etc. for your task.
- automl_min_early_stop_epochs: A trial must run at least this many epochs before it can be early-stopped. Prevents killing trials before they have had a chance to warm up.
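The interaction of the two parameters can be written as a single predicate. This is a sketch that assumes a higher-is-better metric and that epoch counts completed epochs; the function name is hypothetical.

```python
def should_stop_early(metric_value, epoch, early_stop_threshold, min_early_stop_epochs):
    """Return True when a trial has warmed up and its metric is still below the threshold."""
    if epoch < min_early_stop_epochs:
        return False  # still inside the warm-up window
    return metric_value < early_stop_threshold

# A trial at 0.22 accuracy after 5 epochs, with a threshold of 0.30 and a 5-epoch warm-up.
print(should_stop_early(0.22, epoch=5, early_stop_threshold=0.30, min_early_stop_epochs=5))
```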
Hyperband Parameter Auto-adjustment Mechanism#
By default, the Hyperband parameter R is set to 27 and nu is set to 3. Based on
these values and the training max_epoch of each experiment, the hyperband parameter
auto-adjustment mechanism calculates the epoch_multiplier by max_epoch^(log(nu)/log(R)).
Here, if the calculated epoch_multiplier is less than 3, then R is reduced by 1 and
the epoch_multiplier is recalculated until this value is equal to or greater than 3.
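The adjustment loop described above can be sketched as follows. Rounding behavior and the lower bound on R are not specified in this section, so the sketch leaves the multiplier unrounded and stops at R = 2 to keep log(R) positive; both choices are assumptions.

```python
import math

def auto_adjust_hyperband(max_epoch, R=27, nu=3, min_multiplier=3.0):
    """Lower R until max_epoch ** (log(nu) / log(R)) reaches min_multiplier."""
    multiplier = max_epoch ** (math.log(nu) / math.log(R))
    while multiplier < min_multiplier and R > 2:
        R -= 1
        multiplier = max_epoch ** (math.log(nu) / math.log(R))
    return R, multiplier

# With 80 training epochs the defaults already satisfy the condition, so R stays at 27.
R, epoch_multiplier = auto_adjust_hyperband(max_epoch=80)
```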
Prerequisites#
Before using AutoML, you need to deploy the TAO REST API service using the TAO API Setup steps.
The Hardware and Software requirements mentioned in the TAO API Setup section are also applicable to AutoML.
After deploying the TAO API, you can use the following commands on the host machine to obtain the node
ip_address and port_number for use in the notebooks:
- ip_address: hostname -i
- port_number: kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'
AutoML Notebooks#
AutoML supports all the notebooks that support the training functionality, except for MAL - Auto Labeling.
To enable AutoML, set the automl_enabled variable to True at the beginning of the notebook, then run through the notebook.
Once a notebook is available on your computer, you can run the notebook with the following:
cd <path_to_your_notebook>
jupyter-lab --ip 0.0.0.0 --allow-root --port 8888
Note
We recommend running only one AutoML notebook at a time. Each experiment within AutoML is a separate Kubernetes job, and Kubernetes performs scheduling based on jobs submitted to the queue. Hence, running multiple models across notebooks will fill the queue with AutoML experiments of different models, extending the execution time to complete a single AutoML experiment.
Getting Started#
[Mandatory] User Inputs#
For each notebook, you need to provide/modify the following parameters:
- model_name: The list of supported network models is shown in the notebook cell.
- dataset path: For each task, the dataset should follow a specific folder structure (provided in the dataset description in the notebooks).
- automl_algorithm: The AutoML algorithm to use. A brief explanation of the algorithms can be found in the AutoML Algorithm Explanation section.
  - bayesian: Adaptively proposes new hyperparameters based on past experiments. This algorithm generally provides better accuracy results, but with a longer execution time for all experiments, when compared to Hyperband.
  - hyperband: Hyperband accelerates the expensive hyperparameter optimization using random search with adaptive resource allocation.
  - bohb: Combines Bayesian optimization with Hyperband for sample-efficient search.
  - bfbo: Batch First Bayesian Optimization. Generates batches of recommendations in parallel.
  - asha: Asynchronous Successive Halving. Well-suited for distributed and heterogeneous compute.
  - pbt: Population-Based Training. Adapts hyperparameters online during training.
  - dehb: Differential Evolution with Hyperband. Effective for large, high-dimensional search spaces.
  - hyperband_es: Hyperband with early stopping. Adds per-trial threshold-based early termination.
[Optional] User Inputs#
For each model, TAO has set a default set of parameters to run the AutoML search. You can add new
valid parameters via additional_automl_parameters or remove some parameters from the default list
via remove_default_automl_parameters.
Each notebook contains hyperlinks in the Set AutoML related configurations cell to view the valid
list of parameters of a particular model:
- additional_automl_parameters: Add additional parameters to the AutoML search algorithm using this list of strings (e.g. additional_automl_parameters = ['parameter1','parameter2']).
  - Any parameter in the hyperlink table that does not have the automl_enabled column set to True/False can be added to the AutoML search space.
  - For example, for DetectNet_v2:
    - dataset_config.target_class_mapping can’t be added to the additional_automl_parameters list as it is not eligible to be included in the search space.
    - training_config.regularizer.weight doesn’t make sense to add to this list, as it is already enabled.
    - augmentation_config.preprocessing.output_image_width can be added, as the automl_enabled column is set to neither True nor False.
- remove_default_automl_parameters: Remove parameters that are enabled by default for AutoML search (e.g. remove_default_automl_parameters = ['parameter1','parameter2']).
  - Any parameter in the hyperlink table that has the automl_enabled column set to True can be removed from the AutoML search space.
  - For example, for DetectNet_v2, training_config.regularizer.weight can be removed from the AutoML search space.
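Using the DetectNet_v2 parameter names from the examples above, the two lists might be set as follows. The effective_search_space helper and the single-entry default list are hypothetical and only illustrate how the lists combine; the real default list comes from the hyperlink table for the model.

```python
# Values taken from the DetectNet_v2 examples in this section.
additional_automl_parameters = ["augmentation_config.preprocessing.output_image_width"]
remove_default_automl_parameters = ["training_config.regularizer.weight"]

def effective_search_space(default_parameters, additional, removed):
    """Hypothetical helper: defaults minus removed entries, plus additional entries."""
    removed_set = set(removed)
    kept = [p for p in default_parameters if p not in removed_set]
    added = [p for p in additional if p not in kept]
    return kept + added

# Partial, illustrative default list (only the entry named in this section).
defaults = ["training_config.regularizer.weight"]
search_space = effective_search_space(
    defaults, additional_automl_parameters, remove_default_automl_parameters
)
```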
Treating a List Parameter as a Continuous Range
Some parameters have a predefined list of valid values (valid_options). By default, the
AutoML optimizer samples only from those values. If you want the optimizer to treat the
parameter as a continuous float instead — exploring values between and beyond the list items
— set disable_list: True for that parameter:
custom_automl_ranges = {
"train.learning_rate": {
"disable_list": True # Explore as a continuous float, not just the listed values
}
}
This gives the optimizer more freedom and can find better configurations when the predefined list is sparse.
Weighted List Options
For parameters with discrete list options (valid_options), you can specify weights to prioritize certain options over others during AutoML search. This is particularly useful when you know certain hyperparameter values are more likely to perform well.
option_weights: Assign weights to list options to control their sampling probability. The weights must be positive numbers and their length must match the number of valid options.
For example, if a parameter has valid_options: [0.1, 0.5, 1.0, 2.0] and you want to favor the middle values, you can set:
custom_automl_ranges = {
    "parameter_name": {
        "option_weights": [0.1, 0.4, 0.4, 0.1]  # Higher weights = higher probability
    }
}
The weights are automatically normalized, so [1, 4, 4, 1] is equivalent to [0.1, 0.4, 0.4, 0.1].
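Normalization divides each weight by the sum of all weights, so only the ratios between weights matter. The sketch below shows the arithmetic and a weighted draw using the standard library; normalize_weights is a hypothetical helper, not a TAO function.

```python
import random

def normalize_weights(option_weights):
    """Scale positive weights so that they sum to 1."""
    if any(w <= 0 for w in option_weights):
        raise ValueError("option_weights must be positive")
    total = sum(option_weights)
    return [w / total for w in option_weights]

valid_options = [0.1, 0.5, 1.0, 2.0]
probabilities = normalize_weights([1, 4, 4, 1])  # same ratios as [0.1, 0.4, 0.4, 0.1]

# Draw one option according to the normalized probabilities.
rng = random.Random(0)
sampled_value = rng.choices(valid_options, weights=probabilities, k=1)[0]
```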
[Optional] AutoML Algorithm-Specific Parameters#
There are some algorithm-specific parameters set to default values that determine the AutoML experiment schedules. You can optionally modify them as follows:
Bayesian
- max_recommendations: The maximum number of full-scale training experiments to run. The default value is 20.
  - For example, setting this value to 10 runs 10 training experiments in sequential order, because the training config file of the nth experiment is computed from the (n-1)th experiment. At the end of the 10 experiments, the algorithm returns the training config file and binary weights of the experiment that achieved the best accuracy.
  - In the API notebook, enable this as follows: automl_max_recommendations: number of recommendations you want to set (integer), inside the automl_information dictionary variable in the Set AutoML related configurations cell.
  - In the CLI notebook, enable this as follows: metadata["automl_max_recommendations"]: number of recommendations you want to set (integer), in the Set AutoML related configurations cell.
Hyperband
- R: The maximum resources (i.e. the number of recommendations and maximum epochs). The default value is 27 and is adjusted as explained in the Hyperband Parameter Auto-adjustment Mechanism section.
  - In the API notebook, enable this as follows: automl_R: integer value, inside the automl_information dictionary variable in the Set AutoML related configurations cell.
  - In the CLI notebook, enable this as follows: metadata["automl_R"]: integer value, in the Set AutoML related configurations cell.
- nu: The proportion of recommendations discarded in each use of the SuccessiveHalving algorithm. The default value is 3.
  - In the API notebook, enable this as follows: automl_nu: value of nu you want to set (integer), inside the automl_information dictionary variable in the Set AutoML related configurations cell.
  - In the CLI notebook, enable this as follows: metadata["automl_nu"]: value of nu you want to set (integer), in the Set AutoML related configurations cell.
- epoch_multiplier: The number of epochs for a trial is determined by multiplying this value with the resources (r). The default value is 10 and can be adjusted as explained in the Hyperband Parameter Auto-adjustment Mechanism section.
  - In the API notebook, enable this as follows: epoch_multiplier: value of epoch_multiplier you want to set (integer), inside the automl_information dictionary variable in the Set AutoML related configurations cell.
  - In the CLI notebook, enable this as follows: metadata["epoch_multiplier"]: value of epoch_multiplier you want to set (integer), in the Set AutoML related configurations cell.

The values of R and nu determine the number of experiments to run within each stage of the Hyperband run and the corresponding number of epochs for each experiment. For example, setting "R=27, nu=3, epoch_multiplier=10" runs three stages of experiments, as described in Table 3 of the AutoML Algorithm Explanation section:
- The first stage proposes 27 new recommendations to run for 10 epochs. Then, Hyperband keeps the 9 (27/3) best performing recommendations to run another 30 (10*3) epochs, and repeats until one recommendation remains.
- The second and third stages respectively propose 9 and 6 new recommendations (based on Table 3) and follow the same procedure.
- After all the experiments conclude, Hyperband returns the training config file and binary weights of the experiment that achieved the best accuracy.
BOHB ("automl_algorithm": "bohb")
- automl_kde_samples: trials used to fit the internal performance model (raise if search feels too random).
- automl_top_n_percent: fraction of top trials that define “good” configs (default 15; lower = more selective).
- automl_min_points_in_model: random trials to run before model-guided search begins.

BFBO ("automl_algorithm": "bfbo")
- automl_max_recommendations: total experiments (same meaning as Bayesian max_recommendations).

ASHA ("automl_algorithm": "asha")
- automl_max_concurrent: maximum trials running at the same time.
- automl_max_trials: total trial budget before the search stops.

PBT ("automl_algorithm": "pbt")
- automl_population_size: number of parallel models (needs at least this many GPU slots).
- automl_eval_interval: epochs between population ranking and replacement rounds.

DEHB ("automl_algorithm": "dehb")
- automl_mutation_factor: exploration vs. exploitation trade-off (0–2; default ~0.5).
- automl_crossover_prob: fraction of params inherited from the mutant (0–1; default ~0.5).

Hyperband-ES ("automl_algorithm": "hyperband_es")
- automl_early_stop_threshold: validation metric below this value triggers trial termination.
- automl_min_early_stop_epochs: minimum epochs a trial must run before early stopping applies.
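The parameter names listed in this section can be collected into a small lookup table to catch a key that belongs to a different algorithm before a run is submitted. This is a sketch: it assumes the newer algorithms' keys are placed in the automl_information dictionary in the same way as the Bayesian and Hyperband keys, the example values are hypothetical, and keys flagged here may still be valid settings documented elsewhere.

```python
# Parameter names as listed in this section, grouped by algorithm.
ALGORITHM_PARAMETERS = {
    "bayesian": {"automl_max_recommendations"},
    "hyperband": {"automl_R", "automl_nu", "epoch_multiplier"},
    "bohb": {"automl_kde_samples", "automl_top_n_percent", "automl_min_points_in_model"},
    "bfbo": {"automl_max_recommendations"},
    "asha": {"automl_max_concurrent", "automl_max_trials"},
    "pbt": {"automl_population_size", "automl_eval_interval"},
    "dehb": {"automl_mutation_factor", "automl_crossover_prob"},
    "hyperband_es": {"automl_early_stop_threshold", "automl_min_early_stop_epochs"},
}

def unlisted_parameters(automl_information):
    """Return keys that this section does not list for the selected algorithm."""
    algorithm = automl_information["automl_algorithm"]
    allowed = ALGORITHM_PARAMETERS[algorithm] | {"automl_algorithm"}
    return sorted(set(automl_information) - allowed)

automl_information = {
    "automl_algorithm": "bohb",
    "automl_kde_samples": 64,     # hypothetical value
    "automl_top_n_percent": 15,   # documented default
    "automl_R": 27,               # Hyperband key, not listed for bohb
}
print(unlisted_parameters(automl_information))
```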
SLURM Auto-Resume
When running AutoML on SLURM clusters, interrupted trials are automatically re-queued and resumed from the last available checkpoint. No additional configuration is required; the SLURM backend detects preemptions and restarts the affected trial automatically.
Finally, you can change any training-related spec parameters (such as batch size, learning rate, weight regularizer, and frequency of checkpoints) so that they are reflected in each experiment. This step is necessary because these kinds of parameters often depend on the computing hardware specifications (e.g. GPU memory size).
After changing the above parameters, you can run the cells of the notebook. While running the notebook, some cells
take time to complete execution. In the API notebook, the cells poll the status of the job every 15 seconds. The
cell execution is completed only when the status switches to Done or Error.
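The polling behavior described above amounts to a simple loop. In the sketch below, get_status is a hypothetical callable standing in for whatever request the notebook makes; only the 15-second interval and the terminal states Done and Error come from the text.

```python
import time

def wait_for_job(get_status, poll_interval=15, sleep=time.sleep):
    """Poll get_status() until it returns Done or Error, then return that status."""
    while True:
        status = get_status()
        if status in ("Done", "Error"):
            return status
        sleep(poll_interval)

# Simulated job that finishes on the third poll; sleeping is replaced by a recorder.
simulated = iter(["Pending", "Running", "Done"])
waits = []
final_status = wait_for_job(lambda: next(simulated), sleep=waits.append)
```

Passing the sleep function as an argument keeps the loop easy to test without waiting in real time.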
AutoML Outcomes#
For the AutoML train cell, you can see the following status indicators during cell execution. For the TAO-Client
AutoML notebook, you can either view these indicators or view the training logs of the experiments. By default, viewing
status indicators is enabled; you can switch to viewing logs by setting poll_automl_stats = True. A sample status output:
{
    "c9efeab0-0756-47d5-b6ae-3b51e2c02b34": {
        "detailed_status": {
            "message": "AutoML run is successful with best checkpoints under /results/c9efeab0-0756-47d5-b6ae-3b51e2c02b34",
            "status": "SUCCESS"
        }
    },
    "ea7bba9f-c88f-41ba-8b44-75b9c6fc1ec5": {
        "detailed_status": {
            "date": "5/6/2025",
            "message": "train action completed successfully for segformer",
            "status": "SUCCESS",
            "time": "15:36:29"
        },
        "epoch": 9,
        "eta": "0:00:00",
        "kpi": [
            {
                "metric": "train_loss",
                "values": {
                    "0": 0.7115289568901062,
                    "9": 0.3484381139278412
                }
            },
            {
                "metric": "val_loss",
                "values": {
                    "8": 0.3261469602584839,
                    "9": 0.3261469602584839
                }
            },
            {
                "metric": "val_miou",
                "values": {
                    "8": 0.5810916124069727,
                    "9": 0.5810916124069727
                }
            }
        ],
        "max_epoch": 9,
        "time_per_epoch": "0:00:01.477164"
    }
}
Understanding the AutoML Status#
The status response provides detailed information about the AutoML run and its experiments:
- AutoML Brain ID (e.g., c9efeab0-0756-47d5-b6ae-3b51e2c02b34): This is a unique identifier for the AutoML orchestrator that manages the entire AutoML run. The status shows the overall run information, including where the best checkpoints are stored.
- Experiment ID (e.g., ea7bba9f-c88f-41ba-8b44-75b9c6fc1ec5): Each experiment within the AutoML run has its own unique identifier. The AutoML brain runs multiple experiments with different hyperparameter configurations, each with its own UUID.
- Detailed Status: Contains information about the experiment’s current state:
  - message: Descriptive information about the experiment status
  - status: Current state (e.g., “SUCCESS”)
  - date and time: When the status was last updated
- Training Progress:
  - epoch: Current training epoch
  - max_epoch: Total number of epochs for this experiment
  - eta: Estimated time remaining for the current experiment
  - time_per_epoch: Average time taken per epoch
- Performance Metrics (kpi): Lists various metrics tracked during training:
  - train_loss: Training loss values at different epochs
  - val_loss: Validation loss values
  - val_miou: Validation mean intersection over union (for segmentation tasks)
  - Other metrics specific to the model type
This status information helps you monitor the progress of your AutoML run, track the performance of individual experiments, and understand how the hyperparameter optimization is proceeding.
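Once the status response is loaded as a dictionary, the latest value of a metric can be pulled out per experiment. The sketch below assumes the response has exactly the shape of the sample shown earlier (epoch numbers as string keys, a kpi list per experiment); latest_kpi is a hypothetical helper.

```python
def latest_kpi(status, metric):
    """Map each experiment id to the most recent value of metric, skipping entries without KPIs."""
    latest = {}
    for experiment_id, info in status.items():
        for kpi in info.get("kpi", []):
            if kpi["metric"] == metric and kpi["values"]:
                last_epoch = max(kpi["values"], key=int)  # epoch keys are strings
                latest[experiment_id] = kpi["values"][last_epoch]
    return latest

# Trimmed copy of the sample status shown in this section.
sample_status = {
    "c9efeab0-0756-47d5-b6ae-3b51e2c02b34": {
        "detailed_status": {"status": "SUCCESS"},
    },
    "ea7bba9f-c88f-41ba-8b44-75b9c6fc1ec5": {
        "epoch": 9,
        "kpi": [
            {"metric": "train_loss", "values": {"0": 0.7115289568901062, "9": 0.3484381139278412}},
            {"metric": "val_miou", "values": {"8": 0.5810916124069727, "9": 0.5810916124069727}},
        ],
    },
}
miou_by_experiment = latest_kpi(sample_status, "val_miou")
```

The AutoML brain entry has no kpi list, so it is skipped and only individual experiments appear in the result.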
Note
The eta reported for AutoML experiment completion covers only the training time. The actual completion time also includes the following, which depend on the config values you set:
- The time for evaluation at every n epochs
- The time for miscellaneous actions like saving a checkpoint at every n epochs
- The time for loading the model from disk for resuming training in Hyperband
Results of AutoML experiments#
The table below provides experimental results for network models on different AI applications.
These numbers are profiled with the default AutoML-specific parameters:
- max_recommendations=20 for Bayesian and R=27, nu=3 for Hyperband. For some network models (efficientdet, mask_rcnn, multitask_classification), the default automl parameter is adjusted by the Hyperband Parameter Auto-adjustment Mechanism to be R=9, nu=3 for Hyperband.
- All experiments are executed on a single GPU (Tesla V100) with an Intel Xeon CPU E5-2698 v4.
- When using multi-GPU mode, the time taken is expected to scale accordingly; similarly, using more powerful GPUs will reduce the execution time.
Table 1. The estimated time for an AutoML (Bayesian/Hyperband) experiment for each network model

| Model | Epochs | Dataset | Single experiment (min) | Bayesian (min) | Hyperband (min) |
|---|---|---|---|---|---|
| detectnet_v2 | 80 | FLIR20 | 45.7 | 913.2 | 534.2 |
| efficientdet | 6 | FLIR20 | 82 | 1640 | 410 |
| faster_rcnn | 80 | FLIR20 | 252 | 5040 | 2948.4 |
| retinanet | 100 | FLIR20 | 75 | 1500 | 702 |
| ssd | 80 | FLIR20 | 174 | 3480 | 2035 |
| yolo3 | 80 | FLIR20 | 60 | 1200 | 702 |
| yolo4 | 80 | FLIR20 | 110 | 2200 | 1287 |
| yolo4_tiny | 80 | FLIR20 | 87 | 1740 | 1017.9 |
| mask_rcnn | 5 | FLIR20 | 66 | 1320 | 237.6 |
| lprnet | 24 | OpenALPR | 1.1 | 22 | 39.2 |
| multiclass_classification | 80 | Pascal VOC | 80 | 1600 | 1279.7 |
| multitask_classification | 10 | Fashion Product | 12.5 | 250 | 799.8 |
| unet | 50 | ISBI | 4.5 | 90 | 98.5 |