AutoML#

AutoML is a TAO API service that automatically selects deep learning hyperparameters for a chosen model and dataset. The TAO API provides a Jupyter notebook interface to try the AutoML feature.

Each AutoML run contains multiple training experiments. At the end of an AutoML run, you can access the config containing the hyperparameters of the best performing model amongst the multiple experiments, along with the binary weight file for deploying the model to your application.

AutoML supports all TAO models except MAL (Auto Labeling).

AutoML Algorithm Explanation#

This section gives additional details of the supported AutoML algorithms.

Choosing an Algorithm#

If you are unsure which algorithm to pick, use this guidance:

  • New to AutoML / quick start — use bayesian. It is the most well-understood algorithm and produces consistently good results with minimal tuning.

  • Limited GPU budget, need speed — use hyperband. It discards poor trials early and converges faster than a full Bayesian search.

  • Large search space (many hyperparameters) — use bohb or dehb. Both scale better than pure Bayesian approaches in high-dimensional spaces.

  • Multiple GPUs available — use bfbo (generates a whole batch of recommendations in parallel) or asha (asynchronous, no sync barrier between workers).

  • Want hyperparameters to evolve during training — use pbt. Unlike other algorithms, PBT can change learning rate or other parameters mid-training, not just between runs.

  • Quick pruning of bad trials — use hyperband_es. It adds explicit early-stopping thresholds on top of Hyperband to cut off clearly failing trials sooner.

Bayesian Optimization#

Bayesian optimization aims to identify optimal configurations more quickly than baselines such as standard random search by adaptively selecting hyperparameters based on past experimental information.

  • A Gaussian Process (GP) surrogate model is fit to the existing data (X, y), where X is the set of hyperparameter recommendations and y is the observed validation mAP.

  • Based on the fitted GP, Bayesian optimization adaptively proposes the new recommendation with the best expected improvement, repeating until max_recommendations is reached.
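As a concrete illustration, the loop above can be sketched in a few dozen lines. Everything here is a toy stand-in, not the TAO implementation: the objective function, the RBF kernel, the candidate grid, and the warm-up sampling are all illustrative assumptions.

```python
import numpy as np
from math import erf, sqrt, pi

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential kernel between two 1-D point sets
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-4):
    # GP posterior mean/stddev at x_new, conditioned on observations (x_obs, y_obs)
    k = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_s = rbf_kernel(x_obs, x_new)
    k_inv = np.linalg.inv(k)
    mu = k_s.T @ k_inv @ y_obs
    var = 1.0 - np.sum(k_s * (k_inv @ k_s), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best_y):
    # EI for maximization: how much we expect to beat best_y at each candidate
    z = (mu - best_y) / sigma
    cdf = np.array([0.5 * (1.0 + erf(v / sqrt(2))) for v in z])
    pdf = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)
    return (mu - best_y) * cdf + sigma * pdf

def validation_map(log_lr):
    # Toy stand-in for the observed validation mAP: peaks at log10(lr) = -2
    return -(log_lr + 2.0) ** 2

rng = np.random.default_rng(0)
x_obs = rng.uniform(-4.0, 0.0, size=3)       # a few random warm-up trials
y_obs = validation_map(x_obs)
candidates = np.linspace(-4.0, 0.0, 81)      # search space for log10(lr)

max_recommendations = 10
for _ in range(max_recommendations):
    mu, sigma = gp_posterior(x_obs, y_obs, candidates)
    x_next = candidates[np.argmax(expected_improvement(mu, sigma, y_obs.max()))]
    x_obs = np.append(x_obs, x_next)         # "run" the recommended experiment
    y_obs = np.append(y_obs, validation_map(x_next))

best_log_lr = x_obs[np.argmax(y_obs)]
print(f"best log10(lr) found: {best_log_lr:.2f}")
```

This mirrors the described loop: fit the GP to the data gathered so far, pick the candidate with the best expected improvement, observe its result, and repeat until the recommendation budget is spent.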

Hyperband#

Hyperband addresses the cost of expensive hyperparameter optimization by speeding up random search through adaptive resource allocation. Hyperband follows the SuccessiveHalving algorithm: it uniformly allocates a budget to a set of hyperparameter recommendations, evaluates the performance of all recommendations, discards the worst-performing ones (the proportion is controlled by nu), and repeats this process until one recommendation remains (see the original Hyperband research paper for more details). Three Hyperband-related parameters (R, nu, epoch_multiplier) pre-define a sequence of trials based on how many trials to perform in each round and how many resources to give to each trial in that round.

  • The R parameter determines the maximum resources (i.e. the number of epochs and recommendations).

  • The Nu parameter controls the proportion of recommendations discarded in each use of the SuccessiveHalving algorithm.

  • The epoch_multiplier is multiplied by the resources (r) at each SuccessiveHalving iteration (i) to determine the number of epochs for each trial. Each run of SuccessiveHalving is referred to as a stage (s), and each stage contains multiple SuccessiveHalving iterations (i).

The tables below list the SuccessiveHalving iteration (i), the number of hyperparameter recommendations (n), and the resources (r) for given values of R and nu.

Table 2. The pre-defined Hyperband experiment schedule when set with R = 81, nu = 3.

| i | n (s=0) | r (s=0) | n (s=1) | r (s=1) | n (s=2) | r (s=2) | n (s=3) | r (s=3) |
|---|---------|---------|---------|---------|---------|---------|---------|---------|
| 0 | 81 | 1 | 27 | 3 | 9 | 9 | 6 | 27 |
| 1 | 27 | 3 | 9 | 9 | 3 | 27 | 2 | 81 |
| 2 | 9 | 9 | 3 | 27 | 1 | 81 | | |
| 3 | 3 | 27 | 1 | 81 | | | | |
| 4 | 1 | 81 | | | | | | |

Table 3. The pre-defined Hyperband experiment schedule when set with R = 27, nu = 3.

| i | n (s=0) | r (s=0) | n (s=1) | r (s=1) | n (s=2) | r (s=2) |
|---|---------|---------|---------|---------|---------|---------|
| 0 | 27 | 1 | 9 | 3 | 6 | 9 |
| 1 | 9 | 3 | 3 | 9 | 2 | 27 |
| 2 | 3 | 9 | 1 | 27 | | |
| 3 | 1 | 27 | | | | |

BOHB (Bayesian Optimization and Hyperband)#

BOHB combines Bayesian optimization with Hyperband’s early stopping. It builds a statistical model of which hyperparameter regions tend to do well, then uses Hyperband’s resource scheduling to allocate more epochs to the most promising trials. The result is faster convergence than vanilla Bayesian search when you have a large number of hyperparameters.

Key parameters:

  • automl_kde_samples: Number of completed trials used to fit the performance model. Raise this if the search feels too random early on; lower it to react faster to new results.

  • automl_top_n_percent (default: 15): What fraction of past trials counts as “good” when building the model. Lower = more selective = more focused search.

  • automl_min_points_in_model: How many trials to run randomly before the model kicks in. Useful for very cold starts where you have no data yet.
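Conceptually, the model-building step partitions finished trials by automl_top_n_percent once automl_min_points_in_model trials exist. A small sketch of that partitioning (illustrative only; the hypothetical split_good_bad helper is not part of the TAO API):

```python
def split_good_bad(trials, top_n_percent=15, min_points_in_model=5):
    # Conceptual sketch of BOHB's model-building step (not the TAO source):
    # partition completed trials into the "good" top fraction used to model
    # promising regions and the "bad" rest. Returns None while there is too
    # little data and the search should keep sampling randomly
    # (cf. automl_min_points_in_model).
    if len(trials) < min_points_in_model:
        return None
    ranked = sorted(trials, key=lambda t: t[1], reverse=True)  # higher = better
    n_good = max(1, int(len(ranked) * top_n_percent / 100))
    return ranked[:n_good], ranked[n_good:]

trials = [({"lr": 10 ** -i}, 0.1 * i) for i in range(1, 11)]  # 10 fake trials
good, bad = split_good_bad(trials)
print(len(good), len(bad))  # 1 9
```

Lowering top_n_percent shrinks the "good" set, which is why the search becomes more selective and more focused.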

BFBO (Batch First Bayesian Optimization)#

A variant of Bayesian optimization that generates a full batch of recommendations at once before waiting for results. This means you can fill all your available GPUs immediately rather than waiting for each experiment to finish before picking the next configuration.

Key parameters:

  • automl_max_recommendations: Total experiments to run (same meaning as the bayesian algorithm’s max_recommendations).

ASHA (Asynchronous Successive Halving Algorithm)#

ASHA is like Hyperband but without requiring all workers to finish a stage before the next begins. Workers that finish early move straight to the next round, keeping all GPUs busy even when some experiments run faster than others. Best for heterogeneous clusters or when your experiments have very different runtimes.

Key parameters:

  • automl_max_concurrent: How many trials to run at the same time.

  • automl_max_trials: Total trials before stopping the search.
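The asynchronous promotion decision can be sketched as follows (illustrative only; the hypothetical should_promote helper is not part of the TAO API):

```python
def should_promote(trial_metric, rung_metrics, nu=3):
    # Illustrative promotion rule in the spirit of ASHA (not the TAO source):
    # a trial that just finished a rung is promoted to the next, longer rung
    # if it ranks in the top 1/nu of the results seen at that rung so far --
    # without waiting for the remaining workers to finish.
    k = max(1, len(rung_metrics) // nu)
    top_k = sorted(rung_metrics, reverse=True)[:k]
    return trial_metric >= min(top_k)

# Three results are in so far; only the best of them is promoted right away.
rung = [0.52, 0.71, 0.90]
print(should_promote(0.90, rung), should_promote(0.71, rung))  # True False
```

Because the decision only looks at results seen "so far", a fast worker never blocks on a slow one, which is what keeps all GPUs busy on heterogeneous clusters.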

PBT (Population-Based Training)#

PBT is fundamentally different from the other algorithms: it does not pick hyperparameters once at the start and then train. Instead it runs a group of models in parallel and periodically replaces the worst-performing ones with copies of the best-performing ones, then randomly mutates their hyperparameters. This lets the learning rate, batch size, and other parameters evolve during training — which is especially useful for schedules that are hard to set in advance.

Key parameters:

  • automl_population_size: Number of parallel models. Larger populations explore more of the hyperparameter space but require more GPUs.

  • automl_eval_interval: How often (in epochs) the population is ranked and the weakest members are replaced.
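One exploit/explore round of a PBT-style loop can be sketched like this (illustrative only; the population layout and perturbation factors are assumptions, not the TAO implementation):

```python
import random

def pbt_exploit_explore(population, rng=None):
    # Illustrative exploit/explore round (a PBT-style sketch, not the TAO
    # source). Each member is a dict with an "lr" hyperparameter and a
    # "score" metric. The worst quartile copies the learning rate of a
    # randomly chosen top-quartile member (exploit), then perturbs it
    # (explore), so hyperparameters keep evolving during training.
    rng = rng or random.Random(0)
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    quartile = max(1, len(ranked) // 4)
    top, bottom = ranked[:quartile], ranked[-quartile:]
    for member in bottom:
        winner = rng.choice(top)
        member["lr"] = winner["lr"] * rng.choice([0.8, 1.2])
    return population

population = [{"lr": 10 ** -i, "score": i / 8} for i in range(8)]
pbt_exploit_explore(population)
print(max(m["lr"] for m in population))  # the worst members no longer use lr = 1.0
```

In a real run this round would repeat every automl_eval_interval epochs, with the copied member also inheriting the winner's weights.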

DEHB (Differential Evolution with Hyperband)#

DEHB pairs Hyperband’s multi-fidelity scheduling (cheap early evaluation, expensive late evaluation) with differential evolution — a mutation-based global optimizer that explores the hyperparameter space by combining existing solutions. It tends to outperform vanilla Hyperband on very large or continuous search spaces.

Key parameters:

  • automl_mutation_factor: How far a mutant configuration can move from its parent (0–2). Higher values = more exploration; lower values = more exploitation.

  • automl_crossover_prob: Fraction of hyperparameters inherited from the mutant vs. the original (0–1). Higher = mutant-leaning; lower = parent-leaning.
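The mutation and crossover steps can be sketched as follows (illustrative only; the hypothetical de_offspring helper and the example configurations are assumptions, not the TAO implementation):

```python
import random

def de_offspring(parent, a, b, c, mutation_factor=0.5, crossover_prob=0.5, rng=None):
    # Illustrative differential-evolution step (a sketch, not the TAO source).
    # Configurations are dicts of numeric hyperparameters; mutation_factor
    # plays the role of automl_mutation_factor and crossover_prob the role
    # of automl_crossover_prob.
    rng = rng or random.Random(0)
    child = {}
    for key in parent:
        # mutation: combine three existing solutions into a mutant value
        mutant = a[key] + mutation_factor * (b[key] - c[key])
        # crossover: inherit each hyperparameter from the mutant or the parent
        child[key] = mutant if rng.random() < crossover_prob else parent[key]
    return child

parent = {"lr": 0.01, "momentum": 0.9}
a, b, c = ({"lr": 0.02, "momentum": 0.80},
           {"lr": 0.03, "momentum": 0.95},
           {"lr": 0.01, "momentum": 0.85})
child = de_offspring(parent, a, b, c)
print(child)  # each value comes from either the parent or the mutant
```

A larger mutation_factor lets the mutant move further from its parents (more exploration); a larger crossover_prob makes the child lean toward the mutant.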

Hyperband with Early Stopping (hyperband_es)#

A drop-in extension of Hyperband that can terminate a trial mid-stage if its validation metric falls below a threshold — even before the stage’s epoch budget is exhausted. This is useful when you know that any run below a certain accuracy at epoch 5 is never going to recover by epoch 20.

Key parameters:

  • automl_early_stop_threshold: The metric value below which a trial is stopped. Set this based on the minimum acceptable validation accuracy/mIoU/etc. for your task.

  • automl_min_early_stop_epochs: A trial must run at least this many epochs before it can be early-stopped. Prevents killing trials before they have had a chance to warm up.
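The threshold rule can be sketched in a single predicate (illustrative only; the hypothetical should_stop_early helper is not part of the TAO API):

```python
def should_stop_early(epoch, metric, threshold, min_epochs):
    # The per-trial rule described above: stop once the trial has had
    # min_epochs to warm up but its validation metric is still below the
    # threshold. The arguments mirror automl_early_stop_threshold and
    # automl_min_early_stop_epochs.
    return epoch >= min_epochs and metric < threshold

# A trial at 0.30 mIoU is cut at epoch 5 but spared at epoch 2; a trial
# above the threshold is never stopped.
print(should_stop_early(5, 0.30, threshold=0.40, min_epochs=3))  # True
print(should_stop_early(2, 0.30, threshold=0.40, min_epochs=3))  # False
```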

Hyperband Parameter Auto-adjustment Mechanism#

By default, the Hyperband parameter R is set to 27 and nu is set to 3. Based on these values and the training max_epoch of each experiment, the Hyperband parameter auto-adjustment mechanism calculates the epoch_multiplier as max_epoch^(log(nu)/log(R)). If the calculated epoch_multiplier is less than 3, R is reduced by 1 and the epoch_multiplier is recalculated, repeating until the value is equal to or greater than 3.

Prerequisites#

  • Before using AutoML, you need to deploy the TAO REST API service using the TAO API Setup steps.

    • The Hardware and Software requirements mentioned in the TAO API Setup section are also applicable to AutoML.

  • After deploying the TAO API, you can use the following commands on the host machine to obtain the node ip_address and port_number for use in the notebooks:

    • ip_address: hostname -i

    • port_number: kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'

AutoML Notebooks#

AutoML supports all the notebooks that support the training functionality, except for MAL (Auto Labeling).

You have to change the automl_enabled variable to True at the beginning of the notebook and then run through the notebook.

Once a notebook is available on your computer, you can run the notebook with the following:

cd <path_to_your_notebook>
jupyter-lab --ip 0.0.0.0 --allow-root --port 8888

Note

We recommend running only one AutoML notebook at a time. Each experiment within AutoML is a separate Kubernetes job, and Kubernetes performs scheduling based on jobs submitted to the queue. Hence, running multiple models across notebooks will fill the queue with AutoML experiments of different models, extending the execution time to complete a single AutoML experiment.

Getting Started#

[Mandatory] User Inputs#

For each notebook, you need to provide/modify the following parameters:

  • model_name: The list of supported network models will be present in the notebook cell.

  • dataset path: For each task, the dataset should be in a certain folder structure (provided in the dataset description in the notebooks).

  • automl_algorithm: The AutoML algorithm options to use. A brief explanation of the algorithms can be found in the AutoML Algorithm Explanation section.

    • bayesian: Adaptively proposes new hyperparameters based on past experiments. This algorithm generally provides better accuracy results, but with a longer execution time for all experiments, when compared to Hyperband.

    • hyperband: Hyperband accelerates the expensive hyperparameter optimization using random search with adaptive resource allocation.

    • bohb: Combines Bayesian optimization with Hyperband for sample-efficient search.

    • bfbo: Batch First Bayesian Optimization — generates batches of recommendations in parallel.

    • asha: Asynchronous Successive Halving — well-suited for distributed and heterogeneous compute.

    • pbt: Population-Based Training — adapts hyperparameters online during training.

    • dehb: Differential Evolution with Hyperband — effective for large, high-dimensional search spaces.

    • hyperband_es: Hyperband with early stopping — adds per-trial threshold-based early termination.

[Optional] User Inputs#

For each model, TAO has set a default set of parameters to run the AutoML search. You can add new valid parameters via additional_automl_parameters or remove some parameters from the default list via remove_default_automl_parameters.

Each notebook contains hyperlinks in the Set AutoML related configurations cell to view the valid list of parameters of a particular model:

  • additional_automl_parameters: Add additional parameters to the AutoML search algorithm using this list of strings (e.g. additional_automl_parameters = ['parameter1','parameter2']).

    Any parameter in the hyperlink table that does not have the automl_enabled column set to True/False can be added to the AutoML search space.

    For example, for DetectNet_v2:

    • dataset_config.target_class_mapping can’t be added to the additional_automl_parameters list as it is not eligible to be included in the search space.

    • training_config.regularizer.weight doesn’t make sense to add to this list, as it is already enabled.

    • augmentation_config.preprocessing.output_image_width can be added, as the automl_enabled column is set to neither True nor False.

  • remove_default_automl_parameters: Remove parameters that are enabled by default for AutoML search (e.g. remove_default_automl_parameters = ['parameter1','parameter2']).

    Any parameter in the hyperlink table that has the automl_enabled column set to True can be removed from the AutoML search space.

    For example, for DetectNet_v2, training_config.regularizer.weight can be removed from the AutoML search space.
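Putting the DetectNet_v2 examples above together, a notebook cell might look like this (the two variable names are the ones the notebooks already define; the parameter choices are just the examples from this section):

```python
# Add a parameter that is eligible for the search space but not enabled by
# default, and remove one that is enabled by default (DetectNet_v2 examples
# from this section).
additional_automl_parameters = [
    "augmentation_config.preprocessing.output_image_width",
]
remove_default_automl_parameters = [
    "training_config.regularizer.weight",
]
```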

Treating a List Parameter as a Continuous Range

Some parameters have a predefined list of valid values (valid_options). By default, the AutoML optimizer samples only from those values. If you want the optimizer to treat the parameter as a continuous float instead — exploring values between and beyond the list items — set disable_list: True for that parameter:

custom_automl_ranges = {
    "train.learning_rate": {
        "disable_list": True  # Explore as a continuous float, not just the listed values
    }
}

This gives the optimizer more freedom and can find better configurations when the predefined list is sparse.

Weighted List Options

For parameters with discrete list options (valid_options), you can specify weights to prioritize certain options over others during AutoML search. This is particularly useful when you know certain hyperparameter values are more likely to perform well.

  • option_weights: Assign weights to list options to control their sampling probability. The weights must be positive numbers and their length must match the number of valid options.

    For example, if a parameter has valid_options: [0.1, 0.5, 1.0, 2.0] and you want to favor the middle values, you can set:

    custom_automl_ranges = {
        "parameter_name": {
            "option_weights": [0.1, 0.4, 0.4, 0.1]  # Higher weights = higher probability
        }
    }
    

    The weights are automatically normalized, so [1, 2, 2, 1] is equivalent to [0.1, 0.4, 0.4, 0.1].
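Because the weights are normalized into sampling probabilities, Python's random.choices can illustrate the effect (an illustrative sketch of the sampling behavior, not TAO's internal sampler):

```python
import random

# The predefined list from the example above, weighted toward the middle values
valid_options = [0.1, 0.5, 1.0, 2.0]
option_weights = [1, 2, 2, 1]   # equivalent to [0.1, 0.4, 0.4, 0.1] once normalized

# random.choices normalizes the weights the same way, so the middle two
# options should receive about 4/6 of the draws.
rng = random.Random(0)
samples = rng.choices(valid_options, weights=option_weights, k=10_000)
middle_fraction = sum(s in (0.5, 1.0) for s in samples) / len(samples)
print(f"fraction drawn from the middle options: {middle_fraction:.2f}")
```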

[Optional] AutoML Algorithm-Specific Parameters#

There are some algorithm-specific parameters set to default values that determine the AutoML experiment schedules. You can optionally modify them as follows:

Bayesian

  • max_recommendations: The maximum number of full-scale training experiments to run. The default value is 20.

    For example, setting this value to 10 will run 10 training experiments in sequential order, as the training config file of nth experiment is computed from the (n-1)th experiment. At the end of 10 experiments, the algorithm returns the training config file and binary weights to the experiment that achieved the best accuracy.

    • In the API notebook, enable this as follows:

      • Set automl_max_recommendations to the number of recommendations you want (integer) inside the automl_information dictionary variable in the Set AutoML related configurations cell.

    • In the CLI notebook, enable this as follows:

      • Set metadata["automl_max_recommendations"] to the number of recommendations you want (integer) in the Set AutoML related configurations cell.
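Putting the two variants together, the cells might look like this (a sketch; automl_information and metadata are the dictionaries the respective notebooks define):

```python
# API-notebook sketch: cap the Bayesian search at 10 experiments instead of
# the default 20, inside the automl_information dictionary.
automl_information = {
    "automl_max_recommendations": 10,
}

# CLI-notebook equivalent, where metadata is the dictionary the notebook defines:
metadata = {}
metadata["automl_max_recommendations"] = 10
```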

Hyperband

  • R: The maximum resources (i.e. the number of recommendations and maximum epochs)

    • The default value is set to 27 and is adjusted as explained in the Hyperband Parameter Auto-adjustment Mechanism section.

    • In the API notebook, enable this as follows:

      • automl_R: integer value inside automl_information dictionary variable in the Set AutoML related configurations cell

    • In the CLI notebook, enable this as follows:

      • metadata["automl_R"]: integer value in the Set AutoML related configurations cell.

  • Nu: The proportion of recommendations discarded in each use of the SuccessiveHalving algorithm. The default value is 3.

    • In the API notebook, enable this as follows:

      • Set automl_nu to the value of nu you want (integer) inside the automl_information dictionary variable in the Set AutoML related configurations cell.

    • In the CLI notebook, enable this as follows:

      • Set metadata["automl_nu"] to the value of nu you want (integer) in the Set AutoML related configurations cell.

  • epoch_multiplier: The number of epochs, determined by multiplying this value with the resources (R). The default value is 10 and can be adjusted as explained in the Hyperband Parameter Auto-adjustment Mechanism section.

    • In the API notebook, enable this as follows:

      • Set epoch_multiplier to the value of epoch_multiplier you want (integer) inside the automl_information dictionary variable in the Set AutoML related configurations cell.

    • In the CLI notebook, enable this as follows:

      • Set metadata["epoch_multiplier"] to the value of epoch_multiplier you want (integer) in the Set AutoML related configurations cell.

    The values of R and nu determine the number of experiments to run within each stage of the Hyperband run and the corresponding number of epochs for each experiment. For example, setting "R=27, nu=3, epoch_multiplier=10" will run three stages of experiments, as described in Table 3 of the AutoML Algorithm Explanation section:

    • The first stage proposes 27 new recommendations to run for 10 epochs. Then, Hyperband keeps the 9 (27/3) best performing recommendations to run another 30 (10*3) epochs, and repeats until one recommendation remains.

    • The second and third stages respectively propose 9 and 3 new recommendations (based on Table 3) and follow the same procedure.

    • After all the experiments conclude, Hyperband returns the training config file and binary weights to the experiment that achieved the best accuracy.
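The per-round epoch counts in the walk-through above follow directly from multiplying the resources r by epoch_multiplier. A small sketch (not the TAO source) that reproduces the rounds of a stage, taking each stage's starting (n, r) from Table 3:

```python
def stage_rounds(n_start, r_start, nu=3, epoch_multiplier=10):
    # SuccessiveHalving rounds for one Hyperband stage: every round keeps
    # the best n // nu recommendations and multiplies each trial's resource
    # r by nu; the epochs a trial trains for in a round are
    # r * epoch_multiplier.
    rounds, n, r = [], n_start, r_start
    while n >= 1:
        rounds.append((n, r * epoch_multiplier))
        n, r = n // nu, r * nu
    return rounds

# Stage 0 of the R=27, nu=3, epoch_multiplier=10 schedule (Table 3):
print(stage_rounds(27, 1))  # [(27, 10), (9, 30), (3, 90), (1, 270)]
```

Stages 1 and 2 of Table 3 come out of the same rule with starting points (9, 3) and (6, 9).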

BOHB ("automl_algorithm": "bohb")

  • automl_kde_samples — trials used to fit the internal performance model (raise if search feels too random).

  • automl_top_n_percent — fraction of top trials that define “good” configs (default 15; lower = more selective).

  • automl_min_points_in_model — random trials to run before model-guided search begins.

BFBO ("automl_algorithm": "bfbo")

  • automl_max_recommendations — total experiments (same meaning as Bayesian max_recommendations).

ASHA ("automl_algorithm": "asha")

  • automl_max_concurrent — maximum trials running at the same time.

  • automl_max_trials — total trial budget before the search stops.

PBT ("automl_algorithm": "pbt")

  • automl_population_size — number of parallel models (needs at least this many GPU slots).

  • automl_eval_interval — epochs between population ranking and replacement rounds.

DEHB ("automl_algorithm": "dehb")

  • automl_mutation_factor — exploration vs. exploitation trade-off (0–2; default ~0.5).

  • automl_crossover_prob — fraction of params inherited from the mutant (0–1; default ~0.5).

Hyperband-ES ("automl_algorithm": "hyperband_es")

  • automl_early_stop_threshold — validation metric below this value triggers trial termination.

  • automl_min_early_stop_epochs — minimum epochs a trial must run before early stopping applies.

SLURM Auto-Resume

When running AutoML on SLURM clusters, interrupted trials are automatically re-queued and resumed from the last available checkpoint. No additional configuration is required; the SLURM backend detects preemptions and restarts the affected trial automatically.

Finally, the user can change any training-related spec parameters–such as batch size, learning rate, weight regularizer, and frequency of checkpoints–to be reflected during each experiment. This step is necessary because these kinds of parameters are often dependent on the computing hardware specifications (e.g. GPU memory size).

After changing the above parameters, you can run the cells of the notebook. While running the notebook, some cells take time to complete execution. In the API notebook, the cells poll the status of the job every 15 seconds. The cell execution is completed only when the status switches to Done or Error.

AutoML Outcomes#

For the AutoML train cell, you can see the following status indicators during cell execution. For the TAO-Client AutoML notebook, you can either view these indicators or view the training logs of the experiments. By default, viewing status indicators is enabled; you can switch to viewing logs by setting poll_automl_stats = False

{
    "c9efeab0-0756-47d5-b6ae-3b51e2c02b34": {
        "detailed_status": {
            "message": "AutoML run is successful with best checkpoints under /results/c9efeab0-0756-47d5-b6ae-3b51e2c02b34",
            "status": "SUCCESS"
        }
    },
    "ea7bba9f-c88f-41ba-8b44-75b9c6fc1ec5": {
        "detailed_status": {
            "date": "5/6/2025",
            "message": "train action completed successfully for segformer",
            "status": "SUCCESS",
            "time": "15:36:29"
        },
        "epoch": 9,
        "eta": "0:00:00",
        "kpi": [
            {
                "metric": "train_loss",
                "values": {
                    "0": 0.7115289568901062,
                    "9": 0.3484381139278412
                }
            },
            {
                "metric": "val_loss",
                "values": {
                    "8": 0.3261469602584839,
                    "9": 0.3261469602584839
                }
            },
            {
                "metric": "val_miou",
                "values": {
                    "8": 0.5810916124069727,
                    "9": 0.5810916124069727
                }
            }
        ],
        "max_epoch": 9,
        "time_per_epoch": "0:00:01.477164"
    }
}

Understanding the AutoML Status#

The status response provides detailed information about the AutoML run and its experiments:

  • AutoML Brain ID (e.g., c9efeab0-0756-47d5-b6ae-3b51e2c02b34): This is a unique identifier for the AutoML orchestrator that manages the entire AutoML run. The status shows the overall run information, including where the best checkpoints are stored.

  • Experiment ID (e.g., ea7bba9f-c88f-41ba-8b44-75b9c6fc1ec5): Each experiment within the AutoML run has its own unique identifier. The AutoML brain runs multiple experiments with different hyperparameter configurations, each with its own UUID.

  • Detailed Status: Contains information about the experiment’s current state:

    • message: Descriptive information about the experiment status

    • status: Current state (e.g., “SUCCESS”)

    • date and time: When the status was last updated

  • Training Progress:

    • epoch: Current training epoch

    • max_epoch: Total number of epochs for this experiment

    • eta: Estimated time remaining for the current experiment

    • time_per_epoch: Average time taken per epoch

  • Performance Metrics (kpi): Lists various metrics tracked during training:

    • train_loss: Training loss values at different epochs

    • val_loss: Validation loss values

    • val_miou: Validation mean intersection over union (for segmentation tasks)

    • Other metrics specific to the model type

This status information helps you monitor the progress of your AutoML run, track the performance of individual experiments, and understand how the hyperparameter optimization is proceeding.
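For monitoring, a status payload of this shape can be walked programmatically. A minimal sketch (the dict literal is a trimmed copy of the sample above, not live API output):

```python
# Walk a status payload shaped like the example above and report each job's
# state plus the last value of every KPI metric.
status = {
    "ea7bba9f-c88f-41ba-8b44-75b9c6fc1ec5": {
        "detailed_status": {
            "message": "train action completed successfully for segformer",
            "status": "SUCCESS",
        },
        "epoch": 9,
        "max_epoch": 9,
        "kpi": [
            {"metric": "val_miou", "values": {"8": 0.5810916124069727,
                                              "9": 0.5810916124069727}},
        ],
    },
}

for job_id, info in status.items():
    state = info["detailed_status"]["status"]
    print(f"{job_id}: {state}")
    for kpi in info.get("kpi", []):
        # epoch keys are strings, so pick the latest one numerically
        last_epoch = max(kpi["values"], key=int)
        print(f"  {kpi['metric']} @ epoch {last_epoch}: {kpi['values'][last_epoch]}")
```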

Note

The eta for AutoML experiment completion covers only the training time. The following times, which depend on the config values you set, are additional:

  • The time for evaluation at every n epochs

  • The time for miscellaneous actions like saving a checkpoint at every n epochs

  • The time for loading the model from disk for resuming training in Hyperband

Results of AutoML experiments#

The table below provides several experimental results of network models on different AI applications.

  • These numbers are profiled with the default AutoML-specific parameters: max_recommendations=20 for Bayesian, and R=27, nu=3 for Hyperband. For some network models (efficientdet, mask_rcnn, multitask_classification), the default Hyperband parameters are adjusted by the Hyperband Parameter Auto-adjustment Mechanism to R=9, nu=3.

  • All experiments are executed on a single GPU (Tesla V100) with Intel Xeon CPU E5-2698 v4.

  • When using multi-GPU mode, the time taken is expected to scale accordingly; similarly, using more powerful GPUs will reduce the execution time.

Table 1. The estimated time for an AutoML (Bayesian/Hyperband) experiment for each network model

| model | epoch | dataset | single experiment | Bayesian | Hyperband |
|---|---|---|---|---|---|
| detectnet_v2 | 80 | FLIR20 | 45.7 min | 913.2 min | 534.2 min |
| efficientdet | 6 | FLIR20 | 82 min | 1640 min | 410 min |
| faster_rcnn | 80 | FLIR20 | 252 min | 5040 min | 2948.4 min |
| retinanet | 100 | FLIR20 | 75 min | 1500 min | 702 min |
| ssd | 80 | FLIR20 | 174 min | 3480 min | 2035 min |
| yolo3 | 80 | FLIR20 | 60 min | 1200 min | 702 min |
| yolo4 | 80 | FLIR20 | 110 min | 2200 min | 1287 min |
| yolo4_tiny | 80 | FLIR20 | 87 min | 1740 min | 1017.9 min |
| mask_rcnn | 5 | FLIR20 | 66 min | 1320 min | 237.6 min |
| lprnet | 24 | OpenALPR | 1.1 min | 22 min | 39.2 min |
| multiclass_classification | 80 | Pascal VOC | 80 min | 1600 min | 1279.7 min |
| multitask_classification | 10 | Fashion Product | 12.5 min | 250 min | 799.8 min |
| unet | 50 | ISBI | 4.5 min | 90 min | 98.5 min |