NVIDIA Clara Train 3.1

AutoML user guide

Add this line to the training config file:


"lr_search": [0.0001, 0.001]

This line tells AutoML to search for the learning rate in the range of 0.0001 to 0.001.
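
For orientation, here is a minimal sketch of where this entry would sit, assuming the usual MMAR layout in which training parameters live at the top level of config_train.json (the epochs and learning_rate values are illustrative):

{
    "epochs": 1250,
    "learning_rate": 0.0001,
    "lr_search": [0.0001, 0.001],
    ...
}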

Add the “search” section to the “model” component:


"model": { "name": "SegAhnet", "args": { "num_classes": 2, "if_use_psp": false, "pretrain_weight_name": "{PRETRAIN_WEIGHTS_FILE}", "plane": "z", "final_activation": "softmax", "n_spatial_dim": 3 }, "search": [ { "args": ["if_use_psp", "final_activation"], "domain": "net", "type": "enum", "targets": [[true, "softmax"], [false, "sigmoid"]] } ] }

Please note that the “search” section is added at the same level as the existing “name” and “args” keys of the component definition. Also, in this example, the args “if_use_psp” and “final_activation” are grouped so that only the combinations [true, “softmax”] and [false, “sigmoid”] are tried. This grouping is freely customizable (more detailed explanations follow in the sections below), but a configuration that searches the args individually would require a “search” section like:


"search": [ { "args": "if_use_psp", "domain": "net", "type": "enum", "targets": [[true], [false]] }, { "args": "final_activation", "domain": "net", "type": "enum", "targets": [["softmax"], ["sigmoid"]] } ]


Add the “search” section to any transform component:


{ "name": "RandomAxisFlip", "args": { "fields": [ "image", "label" ], "probability": 0.0 }, "search": [ { "domain": "transform", "type": "float", "args": ["probability"], "targets": [0.0, 1.0] } ] }

Attention

The “targets” in this example is a single list specifying a continuous range, as opposed to a list of lists denoting discrete values.


You make a component’s init args searchable by adding them to the component’s “search” section, which is defined as a list.

Each item in the “search” list specifies the search ranges for one or more args:

  • domain - the search domain of the args. Currently supported: lr, net, transform.

  • type - the data type of the search. Currently supported: float, enum.

  • args - the list of arg names. They must be existing args in the component’s init args.

  • targets - the search range candidates. Its format depends on the “type”.

Float type

For the “float” type, “targets” is a list of two numbers - min and max of the range. If multiple args are specified in the “args”, the same search result (which is a float number) is applied to all of them.
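
As a sketch (the arg names probability_a and probability_b are hypothetical, purely for illustration), a single float search applied to two args at once would look like:

"search": [
    {
        "domain": "transform",
        "type": "float",
        "args": ["probability_a", "probability_b"],
        "targets": [0.0, 1.0]
    }
]

If the search samples 0.3, for example, both probability_a and probability_b are set to 0.3.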

Enum type

For the “enum” type, “targets” is a list of choices, and each choice is a list of values, one for each arg in the args list.

Examine the following example:


"search": [ { "args": ["if_use_psp", "final_activation"], "domain": "net", "type": "enum", "targets": [[true, "softmax"], [false, "sigmoid"], [true, "sigmoid"]] } ]

Two args are specified in “args” (“if_use_psp” and “final_activation”). There are three target choices:

  • Choice 0: [true, “softmax”]

  • Choice 1: [false, “sigmoid”]

  • Choice 2: [true, “sigmoid”]

If the search result is choice 2, then true is assigned to “if_use_psp”, and “sigmoid” is assigned to “final_activation”.

This supports the use case of args being related and needing to be searched together.

automl.sh

To start Clara Train-based AutoML, simply run automl.sh in the “commands” folder of the MMAR.

automl.sh is a very simple shell script:


#!/usr/bin/env bash
my_dir="$(dirname "$0")"
. $my_dir/set_env.sh

echo "MMAR_ROOT set to $MMAR_ROOT"

additional_options="$*"

# Data list containing all data
python -u -m nvmidl.apps.automl.train \
    -m $MMAR_ROOT \
    --set \
    run_id=a \
    workers=0,1,2:1,2:1,3 \
    ${additional_options}

The most important details to note are the settings of run_id and workers. The script sets their default values, but you can override them by specifying them explicitly on the command line.

Specify run_id

As described above, run_id represents one AutoML experiment. Each experiment must have a unique run_id. To specify a run_id, simply append the following to the command line when running automl.sh:


run_id=<run_id>


Specify workers

You must define how many workers to use and assign GPU devices to each worker. The syntax is this:


workers=<gpu_id_list_for_worker1>:<gpu_id_list_for_worker2>:...

For each worker, you specify a list of GPU device IDs, separated by commas. Worker specs are separated by colons.

To output the contents of trace.log to the main console, append the following to the command when running automl.sh:


traceout=both
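
For example, to run an experiment with trace output sent to the console as well (the run_id and worker assignment are illustrative):

automl.sh run_id=test1 workers=0:1 traceout=both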


Examples for running AutoML

To run AutoML with run ID “test1” and two workers assigned to GPU 0 and 1 respectively:


automl.sh run_id=test1 workers=0:1

To run AutoML with run ID “test2” and two workers, with worker 1 assigned to GPUs 0 and 1, and worker 2 assigned to GPUs 2 and 3:


automl.sh run_id=test2 workers=0,1:2,3

Note

You can assign the same GPU to multiple workers, provided the GPU has enough memory for all of these workers at the same time.

For example, if you want 4 workers to share two GPUs:


automl.sh run_id=test3 workers=0:0:1:1


AutoML worker names

Workers are named like:


W<workerId>

where workerId is an integer starting from 1 (e.g. W1, W2, etc.).

Note

Worker names are used as a prefix to jobs’ MMAR names.


How to configure workers efficiently for AutoML?

When multiple GPUs are available, how can they be used efficiently? Should each job be executed with multiple GPUs, or should each job be assigned a single GPU? The answer is: it depends.

If the controller produces multiple recommendations each time, it might be more efficient to run each job with a single GPU. You still keep all GPUs busy, since all jobs run in parallel, and you avoid the cross-device synchronization overhead of multi-GPU training (in the case of Horovod).

However, if the controller always produces a single recommendation each time based on the previous job score, then there would be no parallel job execution. In this case, you should arrange to run the job with multiple GPUs.

If the controller is implemented in a phased approach, producing multiple recommendations in some phases and single recommendations in others, configuring the workers optimally can get tricky.
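
For example, with four GPUs available, the two scenarios above could be configured as follows (the run_id values are illustrative):

# Controller produces several recommendations per cycle: four single-GPU workers
automl.sh run_id=parallel4 workers=0:1:2:3

# Controller produces one recommendation per cycle: one worker with all four GPUs
automl.sh run_id=serial1 workers=0,1,2,3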

Custom name for config_automl.json

In Clara Train 3.1, AutoML has been enhanced to support user-specified names for the AutoML config file via the command line, as highlighted in this example of automl.sh:


...
python -u -m nvmidl.apps.automl.train \
    -m $MMAR_ROOT \
    --automlconf my_custom_config_automl.json \
    --set \
    run_id=a \
    workers=0:1 \
    traceout=both \
    trainconf=config_train_for_automl.json \
    ${additional_options}

Note

my_custom_config_automl.json must be in the MMAR’s “config” folder!

When AutoML starts, it prints the file name of the AutoML config file in use. Make sure it is what you specified.
