Clara AutoML high-level design

Training a good model can take a lot of parameter tuning and testing, which can be time-consuming and tedious. AutoML tries to make the process less painful by searching for optimal parameter settings for the training workflow.

The primary goal is to build the AutoML system as a platform for researchers to find the best AutoML strategies without having to deal with the complexity of getting those strategies executed. To achieve this goal, the system is built on abstractions of the essential AutoML components and allows researchers to bring their own component implementations.

The second goal is to provide a reference implementation of the key AutoML components to demonstrate how AutoML mechanically works end-to-end.

The third goal is to use the reference implementation to produce a model that can obtain better results than a model from a typical manual configuration.

Clara Train

The user already has a completely defined Clara Train training config but the performance of the trained model may not be optimal. The user wants AutoML to find the best training configuration and to produce the best model possible.

BYOC

The user can bring their own AutoML component implementation and conduct AutoML with Clara Train.

The core of AutoML is parameter search, which generates parameter setting recommendations based on a given search space and performance score.

The search space defines the types of components, the parameters to be searched, and the search range of each parameter. In general, parameter search can be done for the following aspects of model training:

  • Network architecture selection

  • Network parameter settings

  • Learning rate settings

  • Transform settings

At the high level, the AutoML process can be viewed as repeated interactions between a controller and an executor.

The controller implements the recommendation logic that, when requested, produces 0, 1 or more recommendations.

The executor is responsible for the execution of a recommendation.
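
As an illustration, this division of responsibilities can be sketched as two abstract interfaces. This is a minimal sketch only; the class and method names below are illustrative placeholders, not the actual Clara AutoML API:

    # Conceptual sketch of the two essential AutoML components.
    # Class and method names are illustrative placeholders.
    from abc import ABC, abstractmethod
    from typing import Any, List

    class Controller(ABC):
        @abstractmethod
        def set_search_space(self, search_space: dict) -> None:
            """Initialize the controller with the search space definition."""

        @abstractmethod
        def initial_recommendations(self) -> List[Any]:
            """Produce one or more initial parameter-setting recommendations."""

        @abstractmethod
        def refine_recommendations(self, score: Any) -> List[Any]:
            """Given a finished job's score, produce 0, 1 or more new recommendations."""

    class Executor(ABC):
        @abstractmethod
        def determine_search_space(self) -> dict:
            """Generate the search space definition."""

        @abstractmethod
        def execute(self, recommendation: Any) -> Any:
            """Execute one job for a recommendation and return its score."""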

The AutoML workflow engine is responsible for managing the interactions between the controller and the executor, and for efficient scheduling of recommendation execution. The high-level control flow is as follows.

Step 1: Generate search space definition

Call the executor to generate the search space definition.

Step 2: Initialize the controller

The controller is initialized with the search space definition.

Step 3: Generate initial recommendations

Call the controller to generate the initial set of recommendations for parameter settings. The controller must produce one or more recommendations.

Step 4: Schedule jobs

A job is scheduled for each recommendation. Depending on the configured number of workers, jobs will be executed in parallel as much as possible.

Workers are shared resources. A job is assigned to a worker as soon as the worker is available. Workers run in parallel.

Step 5: Execute jobs

Each job is assigned to a worker, which calls the executor to execute the job. At the end of normal execution, a score is produced that measures the quality of the recommendation. The worker now becomes available for the next job.

Note

Score is conceptual - it could be a simple number or an object of any type. The workflow engine does not interpret the meaning of the score - it simply passes it to the controller.


Step 6: Refine recommendations

Whenever a job is finished, the engine calls the controller for a refined set of recommendations based on the job’s finishing score. The controller can produce 0, 1 or more recommendations.

If recommendations are produced, a job is scheduled for each recommendation, as in Step 4.

Steps 4 to 6 are repeated until one of the following conditions occurs:

  • All jobs are done and the controller produces no more recommendations.

  • Any component asks to stop the workflow. For example, a handler can ask to stop because a satisfactory score has been achieved, or the controller can ask to stop if it runs into a condition that makes it impossible to continue.

  • Any execution error occurs (e.g. a runtime exception).

The AutoML workflow engine’s processing logic is very generic. The actual details of how recommendations are produced are supplied by the controller component, and the execution of each recommendation is handled by the executor component.
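
To make the control flow concrete, here is a minimal, single-threaded sketch of the engine loop described in Steps 1 to 6. It ignores parallel workers and handler events, and uses the hypothetical component interfaces sketched earlier:

    # Simplified, sequential sketch of the AutoML engine loop (Steps 1-6).
    # A real engine dispatches jobs to parallel workers and fires handler events.
    def run_automl(controller, executor):
        search_space = executor.determine_search_space()              # Step 1
        controller.set_search_space(search_space)                     # Step 2
        pending = list(controller.initial_recommendations())          # Step 3
        while pending:                                                 # Steps 4-6
            recommendation = pending.pop(0)                            # schedule one job
            score = executor.execute(recommendation)                   # Step 5: job produces a score
            pending.extend(controller.refine_recommendations(score))   # Step 6
        # The loop ends when all jobs are done and the controller
        # produces no more recommendations.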

In addition to these two essential components, an end-to-end system usually needs functionality beyond what they provide. To enable open-ended extension, the AutoML workflow engine implements a handler mechanism.

A handler is an object that listens to certain AutoML workflow events and takes actions on such events. For example, you could write a handler to collect and analyze stats during the AutoML process, or to manage the disk space consumption during the process.

A more important role of handlers is perhaps to adapt the AutoML system to a particular way of execution. If your execution system requires a complex setup, you can put all that complexity into your executor, but that may not be the best strategy. A cleaner way could be to move the complexity into handlers and let the executor focus solely on the execution logic.

The following are the supported workflow events:

  • Start AutoML - the AutoML process is about to start. This is the first event of the workflow and will happen only once.

  • Search space available - the search space has been determined. This is the second event of the workflow and will happen only once.

  • Recommendations available - recommendations from the controller are available. This event can occur many times.

  • Start job - a job is about to get started. This event can occur many times.

  • End job - a job has finished. This event can occur many times.

  • End AutoML - the AutoML process is finished. This is the last event of the workflow and will happen only once.

Note
  • “Start AutoML” and “End AutoML” are guaranteed to happen once.

  • If “Start Job” ever happens, then a pairing “End Job” event is guaranteed to happen.

  • Other event types are not guaranteed to happen.

  • For each event, all handlers are called in the order they are configured.

  • Any handler can ask to stop the workflow. Currently the workflow engine will stop if any handler does so. This policy may be revisited in the future.
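
For example, a handler that reacts to these events might look roughly like the following. This is a hypothetical sketch: the event names, context fields, and method signature are placeholders, not the actual Clara AutoML handler API.

    # Hypothetical handler sketch: times each job and asks the engine to stop
    # once a target score is reached.
    import time

    class TimingHandler:
        def __init__(self, stop_score=None):
            self.start_times = {}
            self.stop_score = stop_score

        def handle_event(self, event, ctx):
            if event == "start_job":
                self.start_times[ctx["job_id"]] = time.time()
            elif event == "end_job":
                started = self.start_times.pop(ctx["job_id"], time.time())
                print(f"Job {ctx['job_id']} took {time.time() - started:.1f}s, score={ctx['score']}")
                if self.stop_score is not None and ctx["score"] >= self.stop_score:
                    ctx["stop_work"] = True   # ask the engine to stop the workflow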

AutoML has been adapted to Clara Train’s Medical Model Archive (MMAR) based training by providing an MMAR-specific implementation of the executor and handlers.

Folder structure

AutoML training is a process that has to go through many rounds of model training, which will produce many files and artifacts. To properly manage these artifacts, a standard folder structure is used, based on the MMAR convention.

A new “automl” folder is added under the MMAR’s root. This folder is the place for all AutoML runs.

The user can perform any number of AutoML experiments. Each experiment is called a run and must have a unique name. For each run, a folder named after the run is created in the “automl” folder.

Each run can have any number of job executions. Each job is an MMAR-based Clara Train training run and has its own MMAR folder, named after the job.

In summary, the folder structure looks like this, using the segmentation_ct_spleen MMAR as example:

    segmentation_ct_spleen (the main MMAR)
        config
            config_automl.json
            config_train.json
            …
        commands
            automl.sh
            automl_train_round.sh
            …
        automl
            run_a
                W1_1_J1 (this is the MMAR for Job 1, executed by Worker 1)
                W1_2_J3
                W2_1_J2
                W3_1_J4
                …
            run_b
            …

You may notice that a few new files are added to the MMAR structure (config_automl.json in “config”, automl.sh and automl_train_round.sh in “commands”). These will be described later in this doc.

Job MMAR names

Job MMARs are named with the following convention:

    W<workerId>_<workerJobSeqNum>_J<jobId>

where:

  • <workerId> is the worker’s ID, an integer starting from 1

  • <workerJobSeqNum> is the job’s sequence number within the worker, starting from 1

  • <jobId> is the overall job ID, starting from 1
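
For example, a job MMAR name could be composed like this (illustrative Python only; the engine handles the numbering internally):

    # Illustrative only: compose a job MMAR folder name per the convention above.
    def job_mmar_name(worker_id: int, worker_job_seq: int, job_id: int) -> str:
        return f"W{worker_id}_{worker_job_seq}_J{job_id}"

    print(job_mmar_name(1, 2, 3))   # -> W1_2_J3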


ReinforcementController

This controller performs the following operations sequentially:

  1. Separate the SearchSpace into an enum subspace and a float subspace.

  2. Generate every combination from the enum subspace. Pair each combination with the default values of the float subspace to form a SearchResult. Return all of the generated SearchResults in a list of Recommendations when initial_recommendation is called (see the sketch after this list).

  3. Collect all Outcomes from step 2 to determine the best score and its SearchResult.

  4. Extract the enum portion of the best-scoring SearchResult from the above step.

  5. With that enum portion unchanged, generate SearchResults whose float portions are updated as guided by reinforcement learning (Yang D., Roth H., Xu Z., Milletari F., Zhang L., Xu D. (2019) Searching Learning Strategy with Reinforcement Learning for 3D Medical Image Segmentation. In: Shen D. et al. (eds) Medical Image Computing and Computer Assisted Intervention – MICCAI 2019. MICCAI 2019.)

  6. Repeat the previous step.
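
As a rough illustration of steps 1 and 2, generating the initial recommendations amounts to enumerating all combinations of the enum subspace and pairing each with the default float values. The sketch below assumes a simple dictionary representation of the search space; the parameter names and the (min, max, default) tuple form are illustrative, not the actual SearchSpace/Recommendation classes:

    from itertools import product

    # Hypothetical search space: enum parameters list their choices,
    # float parameters carry (min, max, default).
    enum_space = {"network": ["unet", "segresnet"], "loss": ["dice", "focal"]}
    float_space = {"learning_rate": (1e-4, 1e-2, 1e-3)}

    # Step 2: every enum combination, paired with the float defaults.
    initial_recommendations = []
    for combo in product(*enum_space.values()):
        rec = dict(zip(enum_space.keys(), combo))
        rec.update({name: default for name, (_, _, default) in float_space.items()})
        initial_recommendations.append(rec)

    print(len(initial_recommendations))   # 2 networks x 2 losses -> 4 initial recommendations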

Users can set max_rounds so that this controller requests AutoML to stop when it reaches the maximum number of rounds, as follows:

"controller": { "name": "ReinforcementController", "args": { "max_rounds": 500 } }


MMARExecutor

This component implements MMAR-based training execution:

  • When called for the search space, it determines the search space from config_train.json defined in the main MMAR.

  • When called to execute the job, it creates a config_train.json with the recommendation, creates the job MMAR, and starts the training with Clara Train.
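
Conceptually, job execution amounts to something like the following greatly simplified sketch. It assumes the recommendation is a flat dictionary of config_train.json overrides and that training is launched via the automl_train_round.sh command; the actual MMARExecutor differs in its details:

    import json
    import shutil
    import subprocess
    from pathlib import Path

    # Simplified sketch of MMAR-based job execution; not the actual implementation.
    def execute_job(main_mmar: Path, job_mmar: Path, recommendation: dict) -> None:
        shutil.copytree(main_mmar / "config", job_mmar / "config")        # create the job MMAR
        shutil.copytree(main_mmar / "commands", job_mmar / "commands")
        config_path = job_mmar / "config" / "config_train.json"
        config = json.loads(config_path.read_text())
        config.update(recommendation)                                      # apply the recommendation
        config_path.write_text(json.dumps(config, indent=2))
        subprocess.run(["bash", str(job_mmar / "commands" / "automl_train_round.sh")], check=True)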

MMARHandler

This handler adapts the workflow to the MMAR standard:

  • In processing the “Start AutoML” event, it creates the “automl” folder if it does not already exist. It also checks the validity of the specified run_id and creates the run folder in “automl”. This ensures that the MMARExecutor will be able to create the job MMAR folder in the run folder.

In addition, this handler can be configured to:

  • Manage disk space. You can set the max number of job MMARs to keep. When processing the “End Job” event, it checks the score and only keeps the specified number of MMARs with the top scores. Other MMARs are removed.

  • Early stopping. You can specify a score threshold. When processing the “End Job” event, if the job finishes with a score that meets or exceeds the specified threshold, the handler will ask to stop the whole workflow (by setting the “stop work” flag).

This handler also keeps track of the MMAR with the best training score. It creates or updates the “best” symlink to point to the current best MMAR.

When processing the “End AutoML” event, this handler prints the overall training result of the top MMARs.
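
The disk-management and best-MMAR tracking behaviors could be sketched roughly as follows. This is hypothetical code that assumes job scores are tracked in a dictionary keyed by job MMAR name; it is not the actual MMARHandler implementation:

    import shutil
    from pathlib import Path

    # Hypothetical sketch: keep only the top-N job MMARs and point "best" at the winner.
    def prune_and_link(run_dir: Path, scores: dict, num_mmars_to_keep: int) -> None:
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        if not ranked:
            return
        for name, _ in ranked[num_mmars_to_keep:]:     # remove everything below the top N
            shutil.rmtree(run_dir / name, ignore_errors=True)
        best_link = run_dir / "best"
        if best_link.is_symlink():
            best_link.unlink()
        best_link.symlink_to(run_dir / ranked[0][0])   # "best" -> current best job MMAR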

StatsHandler

This is a simple general-purpose handler that collects job running stats for each worker. For each worker, it computes things like the number of jobs started, number of jobs finished and unfinished, and total amount of time the worker worked. It also shows the total amount of time used by the whole workflow.

Configure AutoML

Following Clara Train convention, AutoML’s component configuration is done via a JSON file, “config_automl.json”, in the “config” folder of the main MMAR. The following is a typical example:

{ "handlers": [ { "name": "MMARHandler", "args": { "num_mmars_to_keep": 3, "stop_threshold": 0.9 } }, { "name": "StatsHandler" } ], "executor": { "name": "MMARExecutor" }, "controller": { "name": "TestController", "args": { "total_recs": 10, "max_recs_each_time": 3 } } }

Option to only copy essential config and command files

In Clara Train 3.1, AutoML has been enhanced with a flag that copies only the essential config and command files of the MMAR, to avoid duplicating any test files, data, or scripts the MMAR may contain. This flag is set in the configuration of the MMARHandler, as shown in the following example.

To copy only the key content from the base MMAR, set “key_mmar_content_only” to true (default is false). In this case, you must also specify the data_list_file_name explicitly if it is in the MMAR’s “config” folder.

{ "handlers": [ { "name": "MMARHandler", "args": { "num_mmars_to_keep": 20, "stop_threshold": 0.8, "train_config_file": "{trainconf}", "work_dir": "{workdir}", "key_mmar_content_only": true,"data_list_file_name": "dataset_0.json" } }, ], ...

BYOC for AutoML

You can bring your own components (BYOC) for AutoML by providing your own implementation of any of these AutoML components. When implementing your components, you must follow the API signatures of the component definitions; please refer to the component API specs in the AutoML user guide.

Once you have developed your component class, you can use/configure it in the config_automl.json. For example:

"controller": { "path": "path.to.your.controller.class", "args": { ... } }

You can do the same for other components (handlers and executor).

Note

Use “path” (instead of “name”) when specifying your class path, and make sure that “path.to.your.controller.class” is accessible through PYTHONPATH.
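
For instance, a BYOC controller configured with "path": "mypkg.my_controller.RandomController" might look like the skeleton below. The method names and the search-space schema are placeholders; the real signatures must follow the component API specs in the AutoML user guide:

    # mypkg/my_controller.py -- skeleton of a BYOC controller (placeholder API names).
    # Assumes, for illustration only, that every range definition is a float range
    # with "minimum" and "maximum" keys.
    import random

    class RandomController:
        def __init__(self, total_recs=10):
            self.total_recs = total_recs
            self.produced = 0
            self.search_space = None

        def set_search_space(self, search_space):
            self.search_space = search_space

        def initial_recommendations(self):
            return [self._sample()]

        def refine_recommendations(self, score):
            if self.produced >= self.total_recs:
                return []                  # no more recommendations -> workflow can finish
            return [self._sample()]

        def _sample(self):
            self.produced += 1
            return {prl: random.uniform(rng["minimum"], rng["maximum"])
                    for prl, rng in self.search_space.items()}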


Specify search space

A Parameter Range Locator (PRL) is a formatted string that uniquely identifies a search parameter range definition with a format like:

    domain.dtype.extra

  • “domain” specifies the type of the object (learning rate, network, transform, etc.).

  • “dtype” specifies the data type of the range definition (float, enum, dynamic, etc.). Note that the dynamic type is not supported in the first version of AutoML.

  • “extra” is optional extra information to make the PRL unique.

Attention

A search space is defined as a dictionary of PRLs mapped to their range definitions.

To define the search space for the AutoML process, the user augments “config_train.json” in the “config” folder of the MMAR.

The user does not need to specify the entire PRL; they only need to specify enough information for the actual PRL to be computed by the MMARExecutor when determining the search space.
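
As a purely illustrative example, a computed search space might conceptually look like a dictionary of PRLs mapped to range definitions. The range-definition schema below is hypothetical; the actual schema is determined by the MMARExecutor:

    # Hypothetical search space: PRL strings mapped to range definitions.
    search_space = {
        "learning_rate.float": {"minimum": 0.0001, "maximum": 0.01},
        "network.enum": {"choices": ["Unet", "SegResnet"]},
        "transform.enum.RandomRotate": {"choices": [True, False]},
    }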

Early termination

To prevent a job from training to completion for a large preconfigured number of epochs, add the EarlyStop handler to the “train” section of config_train.json in the MMAR. This handler checks whether the key metric has improved after the configured number of validations (val_times_threshold). If not, it stops the training process.

The following is an example:

"train": { ... "handlers": [ { "name": "EarlyStop", "args": { "val_times_threshold": 10 } } ], ... }

