Federated learning user guide

Introduction

In Clara Train 3.1, federated learning has been enhanced to enable easy server and client deployment through the use of an administration client. This reduces the amount of human coordination needed to set up a federated learning project and gives an admin the ability to deploy the server and client configurations, start the server / clients, abort the training, restart the training, and more. A provisioning tool can be configured to create a startup kit for each site in an encrypted package. These packages can then be delivered to each site ready to go, streamlining the process to provision, start, and operate federated learning.

Provision - Start - Operate

Provision

Lead IT generates the packages for the server / clients / admins, protected with passwords

Start

Each site's IT installs its own package, starts the services, and maps the data location

Operate

Lead scientists / administrators control the federated learning process: deploy MMAR, check statuses, start / abort / shutdown training

Provision: Configure and generate packages for the server, clients, and admins

One party leads the process of configuring the provisioning tool and using it to generate startup kits for each party in the federated learning training project:

Preparation for using the provisioning tool

A copy of everything needed is located at /opt/nvidia/medical/tools/ inside the Clara Train SDK Docker. If you have the docker available and can use the provisioning tool from there, skip directly to Provisioning a federated learning project.

If you are not using the Clara Train SDK Docker, make sure you have Python3 installed. We recommend creating and using a virtual environment for the provisioning tool:

virtualenv --python=python3 $VIRT_ENV_DIR
source $VIRT_ENV_DIR/bin/activate

Tip

Replace $VIRT_ENV_DIR in the commands above with a name for your virtual environment.

Download the provisioning tool for generating the startup kits by going to the “Version History” tab on: https://ngc.nvidia.com/resources/ea-nvidia-clara-train:startup_kits.

If you are not using the Clara Train SDK Docker, install clara_hci-3.1.0-py3-none-any.whl (in your virtual environment if you are using one):

python3 -m pip install clara_hci-3.1.0-py3-none-any.whl

Provisioning a federated learning project

The Federated learning provisioning tool page has details on the contents of the provisioning tool. Edit the project.yml configuration file in the directory with the provisioning tool to meet your project requirements, then generate the startup kits by running the provisioning tool with:

python3 provision.py

A directory named “packages” containing each of the generated zip files is created where provision.py is run. One file, audit.pkl, is also created. The console displays a list of zip files and their passwords. We suggest you copy the console output and “packages” folder to a safe location. The passwords shown below are for demonstration purposes only:

===> password: 7EBXVYb80FpcnzCt for server.zip
===> password: xXEI0HJOjD4nyFu7 for flclient1.zip to optional.email@flclient.org
===> password: VOBfHF5ohew0lvm9 for flclient2.zip
===> password: s20yWrOPK5J8w9Ce for email@hello.world.com.zip to email@hello.world.com
===> password: snH1XVZkMEPh5FJG for email@foo.bar.com.zip to email@foo.bar.com

Tip

For security reasons, it is recommended to send the password to each participant separately from the package itself.
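Because the passwords only appear in the console output, it can help to turn that output into structured records before distributing the packages. The following is a minimal Python sketch, assuming the console lines follow exactly the format of the demonstration output above. The names parse_provision_output and PackageInfo are hypothetical helpers, not part of the provisioning tool.

```python
import re
from typing import List, NamedTuple, Optional

# Matches lines in the format of the demonstration output, for example:
#   ===> password: 7EBXVYb80FpcnzCt for server.zip
#   ===> password: xXEI0HJOjD4nyFu7 for flclient1.zip to optional.email@flclient.org
LINE_RE = re.compile(
    r"^===> password: (?P<password>\S+) for (?P<package>\S+\.zip)"
    r"(?: to (?P<recipient>\S+))?$"
)


class PackageInfo(NamedTuple):
    """Hypothetical record for one generated package."""
    package: str
    password: str
    recipient: Optional[str]  # None when no email was printed for the package


def parse_provision_output(text: str) -> List[PackageInfo]:
    """Collect one PackageInfo per password line; other lines are ignored."""
    infos = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            infos.append(
                PackageInfo(
                    package=match.group("package"),
                    password=match.group("password"),
                    recipient=match.group("recipient"),
                )
            )
    return infos
```

With the records in hand, each password can be sent to its recipient through a different channel than the package itself, as the tip above recommends.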

After generating packages: Distribute and extract

Please let each participant know that the packages are password protected. In Ubuntu, the following command can be used to extract the packages:

unzip -P $PASSWORD $ZIP_FILE -d $DIRECTORY_TO_EXTRACT_TO

Using -d $DIRECTORY_TO_EXTRACT_TO is optional, and without it, a “startup” folder will be extracted to the current directory the package is in. Either way, the parent folder containing this “startup” folder ($DIRECTORY_TO_EXTRACT_TO if the -d option was used) will be the server, client, or admin client workspace root directory, and the party running the package will need write access there.
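On platforms where unzip is not available, the same extraction can be sketched with Python's standard zipfile module. This is an illustration under assumptions: it assumes the package uses the traditional ZIP password encryption that zipfile can decrypt (zipfile cannot decrypt AES-encrypted archives), and the function name extract_startup_kit is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_startup_kit(zip_path, password, workspace_root):
    """Extract a package into workspace_root and return its "startup" folder.

    Assumption: the archive uses traditional ZIP encryption, which is the
    only scheme the standard zipfile module can decrypt.
    """
    root = Path(workspace_root)
    root.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # zipfile expects the password as bytes, not str.
        archive.extractall(path=root, pwd=password.encode("utf-8"))
    startup = root / "startup"
    if not startup.is_dir():
        raise FileNotFoundError('no "startup" folder found in {}'.format(root))
    return startup
```

The directory passed as workspace_root plays the role of $DIRECTORY_TO_EXTRACT_TO and becomes the workspace root described above.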

Note

It is important that this “startup” folder is not renamed because the code relies upon this for operation. Please note that a “transfer” directory and deployed MMARs will be created at the level of this “startup” folder. See the section on Standard folder and file structures below for more details.

Start: Instructions for each participant to start running FL with their startup kits

Attention

Please always safeguard .key files!

Federated learning server (server.zip)

One single server will coordinate the federated learning training and be the main hub all clients and administrator clients connect to.

After unzipping the package server.zip, run the start.sh file from the “startup” folder you unzipped to start the server (inside the Clara Train SDK docker).

The rootCA.pem file is pointed to by “ssl_root_cert” in fed_server.json. If you plan to move/copy it to a different place, you will need to modify fed_server.json. The same applies to the other two files, server.crt and server.key.
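If the certificate files are moved, the configuration edit can be scripted. The sketch below is a hedged example: the internal layout of fed_server.json is not described here, so the helper searches for the key at any nesting depth rather than assuming where it lives. The function names are hypothetical, and only the "ssl_root_cert" key is taken from this guide; the keys for server.crt and server.key would need to be looked up in the file itself.

```python
import json


def set_key_everywhere(node, key, value):
    """Recursively set key to value wherever it appears; return the hit count."""
    hits = 0
    if isinstance(node, dict):
        for name, child in node.items():
            if name == key:
                node[name] = value  # replacing a value is safe while iterating
                hits += 1
            else:
                hits += set_key_everywhere(child, key, value)
    elif isinstance(node, list):
        for child in node:
            hits += set_key_everywhere(child, key, value)
    return hits


def update_config_path(config_file, key, new_path):
    """Point key (for example "ssl_root_cert") at new_path inside config_file."""
    with open(config_file) as handle:
        config = json.load(handle)
    hits = set_key_everywhere(config, key, new_path)
    if hits == 0:
        raise KeyError("{!r} not found in {}".format(key, config_file))
    with open(config_file, "w") as handle:
        json.dump(config, handle, indent=2)
    return hits
```

A call such as update_config_path("startup/fed_server.json", "ssl_root_cert", "/new/location/rootCA.pem") would then rewrite the file in place; keeping a backup copy first is advisable.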

Note

When launching the FL server docker with docker run, users also need to expose the port via the -p option if not using --net=host. The port that the server communicates on must also not be blocked by any firewalls.

If clients from other machines cannot connect to the server, make sure that the host name specified when generating the startup kits in the provisioning process resolves to the correct IP. In Ubuntu, an entry may need to be added to /etc/hosts with the IP and the host name.
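A quick way to see what a host name resolves to on a given machine is a few lines of Python using the standard socket module. This is a minimal sketch; the function name resolve_host is hypothetical, and the host name to pass in is whatever was entered during provisioning.

```python
import socket


def resolve_host(hostname):
    """Return the sorted IP addresses hostname resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the address string is the first element of sockaddr.
    return sorted({info[4][0] for info in infos})
```

If the returned list is empty or does not contain the server's IP, the /etc/hosts entry mentioned above (or the DNS record) is the first thing to check.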

Federated learning client ($CLIENT_NAME.zip)

Each site participating in federated learning training is a client. Each package for a client is named after the client name specified when provisioning the project.

After unzipping the package (for details see After generating packages: Distribute and extract), run start.sh from the “startup” folder you unzipped to start the client (inside the Clara Train SDK docker).

Tip

You can use docker.sh in the “startup” directory to easily start the Clara Train SDK docker, but make sure to properly set the data dir, MY_DATA_DIR, and the GPUs to use for your configuration.

Note

Coordination for where to mount the data may be needed depending on where the DATA_ROOT is configured in the MMAR to be deployed.

The rootCA.pem file is pointed to by “ssl_root_cert” in fed_client.json. If you plan to move/copy it to a different place, you will need to modify fed_client.json. The same applies to the other two files, client.crt and client.key.

The client name in your submission to participate in this federated learning project is embedded in the CN field of the client certificate, which uniquely identifies the participant. As such, please safeguard its private key, client.key.

When a client successfully connects to the FL server, the server and that client will both log a token confirming that the client successfully connected:

Server:

2020-07-07 03:48:49,712 - ClientManager - INFO - Client: New client abcd@127.0.0.1 joined. Sent token: f279157b-df8c-aa1b-8560-2c43efa257bc.  Total clients: 1

Client:

2020-07-07 03:48:49,713 - FederatedClient - INFO - Successfully registered client:abcd for exampletraining. Got token:f279157b-df8c-aa1b-8560-2c43efa257bc
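To confirm that both sides logged the same token, the two log lines can be compared programmatically. The sketch below assumes the log lines have the same shape as the two examples above; the function name extract_token is hypothetical.

```python
import re

# The server logs "Sent token: <token>" and the client logs "Got token:<token>".
TOKEN_RE = re.compile(r"(?:Sent token|Got token):\s*([0-9a-f-]{36})")


def extract_token(log_line):
    """Return the 36-character token found in a log line, or None."""
    match = TOKEN_RE.search(log_line)
    return match.group(1) if match else None
```

If extract_token returns the same value for the server line and the client line, the registration shown in the logs refers to the same connection.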

If a connection cannot be made, the client will repeatedly try to connect and for each failure log:

Could not connect to server. Setting flag for stopping training. failed to connect to all addresses

If the server is up, you may need to troubleshoot with settings for firewall ports to make sure that the proper permissions are in place. This could require coordination between the lead IT and site IT personnel.
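As a first troubleshooting step, a plain TCP connection attempt from the client machine shows whether the server port is reachable at all. This is a minimal sketch using the standard socket module; the function name can_connect is hypothetical, and the host and port to test are the ones chosen during provisioning. A successful result only means the port is open, not that the server will accept the client's certificate.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved host names.
        return False
```

A False result while the server is running usually points to a firewall rule, a port that was not exposed from the docker container, or a host name that resolves to the wrong address.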

For more information about configuring the network to enable federated learning, see FL network configuration.

Federated learning administration client ($EMAIL.zip)

Each admin client will be able to connect and submit commands to the server. Each admin client package is named after the email specified when provisioning the project, and the same email will need to be entered for authentication when the admin client is launched.

Install the wheel package first with:

python3 -m pip install clara_hci-3.1.0-py3-none-any.whl

or in a python3 virtual environment:

pip3 install clara_hci-3.1.0-py3-none-any.whl

After installation, run fl_admin.sh or fl_admin.bat, depending on your platform, to start communicating with the FL server. The FL server must be running and there must be a successful connection between the admin client and the FL server in order for the admin client to start. For the prompt User Name:, enter the email that was used for that admin client in the provisioning of the project.

Note

The Clara Train docker is not required for running the admin client, just Python and clara_hci-3.1.0-py3-none-any.whl.

The rootCA.pem file is pointed to by “ca_cert” in fl_admin.sh/fl_admin.bat. If you plan to move/copy it to a different place, you will need to modify the corresponding script. The same applies to the other two files, client.crt and client.key.

The email used to participate in this FL project is embedded in the CN field of the client certificate, which uniquely identifies the participant. As such, please safeguard its private key, client.key.

Attention

You will need write access in the directory containing the “startup” folder because the “transfer” directory for uploading files as well as directories created for federated learning runs will live here. For details, see Standard folder and file structures.
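The two workspace requirements named in this guide (an unrenamed “startup” folder and write access to its parent directory) can be checked before launching anything. The following is a minimal sketch; the function name check_workspace is hypothetical.

```python
import os
from pathlib import Path


def check_workspace(workspace_root):
    """Return a list of problems found; an empty list means both checks passed."""
    root = Path(workspace_root)
    problems = []
    if not (root / "startup").is_dir():
        problems.append('missing "startup" folder')
    # Creating "transfer" and run folders needs write and search permission.
    if not os.access(root, os.W_OK | os.X_OK):
        problems.append("workspace root is not writable")
    return problems
```

Running the check as the same user that will start the admin client, server, or client gives the most meaningful result, since os.access evaluates permissions for the current user.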

Operate: Running federated learning as an administrator

Example of running federated learning from the administration client

With all connections between the FL server, FL clients, and administration clients open and all of the parties started successfully as described in the preceding section, Federated learning administration client ($EMAIL.zip), the following is an example of a series of admin commands and their outputs to operate a federated learning project. For a complete list of admin commands, see Federated learning administrator commands.

Check status of FL server and FL client

> check_status server
FL run number has not been set.
FL server status: training not started
Registered clients: 2
client name:flclient1    instance name:flclient1    token: 53b6f87a-79f0-4127-9735-72051de2dad9
client name:flclient2       instance name:flclient2 token: b37644d4-7910-4a11-adf9-14cd790c1151

Done [9325 usecs] 2020-07-07 16:31:15.175153

> check_status client
instance:flclient1  : client name: flclient1        token: 53b6f87a-79f0-4127-9735-72051de2dad9     status: training not started
instance:flclient2 : client name: flclient2 token: b37644d4-7910-4a11-adf9-14cd790c1151     status: training not started

Done [308172 usecs] 2020-07-07 16:31:27.300195

The two commands above are optional, but they confirm that the clients are registered and that the FL server and FL client statuses are all “training not started”.
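When many clients are registered, reading the status output by eye becomes error-prone. The sketch below parses text copied from the check_status client output and reports whether every expected client is in the “training not started” state. It assumes the output lines have the same shape as the example above; the function names are hypothetical and this is not part of the admin client.

```python
import re

# Matches lines such as:
#   instance:flclient1  : client name: flclient1   token: ...   status: training not started
STATUS_RE = re.compile(
    r"^instance:(?P<instance>\S+)\s*:.*\bstatus:\s*(?P<status>.+?)\s*$"
)


def client_statuses(output):
    """Map each instance name to its reported status."""
    statuses = {}
    for line in output.splitlines():
        match = STATUS_RE.match(line.strip())
        if match:
            statuses[match.group("instance")] = match.group("status")
    return statuses


def all_ready(output, expected_clients):
    """True if every expected client reports "training not started"."""
    statuses = client_statuses(output)
    return all(
        statuses.get(name) == "training not started" for name in expected_clients
    )
```

A client that is missing from the output counts as not ready, which catches sites that have not yet connected.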

Please note that in the status of the server above, FL run number has not been set. Let us set a new run number next:

Set run number for FL training

> set_run_number 123
Create a new run folder: run_123
Done [8810 usecs] 2020-07-07 17:05:05.145405

The FL run number is critical to starting and managing FL training projects because all training happens within a run.

Attention

You must not start a new run before the previous run is finished completely.

After setting a run number, a folder for that run is created on the server under the root of the server’s workspace at the same level as the startup directory (see the section on Standard folder and file structures below for more details). Initially, this folder will be empty when created for a new run.

In order to start training for a federated learning project, an MMAR with the proper configurations needs to be uploaded and deployed to the server and all the participating clients.

Upload MMAR from admin client to FL server

> upload_folder segmentation_ct_spleen
Created folder /path_on_server/fl_server_workspace_root/startup/../transfer/segmentation_ct_spleen
Done [962809 usecs] 2020-07-07 17:16:09.947498

The admin command upload_folder uploads an MMAR from the transfer directory on the administrator’s machine to the server’s transfer directory (both at the same level as their respective startup directories). The server will validate and only save the config and resources directories of the uploaded MMAR to ensure proper structure. The administrator can run upload_folder either before or after set_run_number, but in order to deploy the MMAR for the next step, a run number must be set:

Deploy MMAR from transfer directory of FL server to active run on server and client

> deploy segmentation_ct_spleen server
mmar_server has been deployed.
Done [14036 usecs] 2020-07-07 17:27:21.365732

> deploy segmentation_ct_spleen client
instance:flclient1 : MMAR deployed.
instance:flclient2 : MMAR deployed.

Done [521724 usecs] 2020-07-07 17:27:28.337640

The two commands above deployed the specified MMAR that had already been uploaded to the server’s transfer directory into directories corresponding to the run number. In this case, since the run number was set to 123, the segmentation_ct_spleen MMAR was copied into the run_123 directory on the FL server and FL clients (see the section on Standard folder and file structures below for a more visual representation).

Note

The first command deployed the MMAR to the FL server, and the second command deployed the MMAR to all active FL clients. You can also choose to deploy an MMAR to a specific client by specifying the client instance name after the command, for example, the following would only deploy the MMAR to flclient1: deploy segmentation_ct_spleen client flclient1

After the MMAR is deployed to the FL server and participating FL clients, training can be started:

Start training by starting FL server and FL clients

> start server
Server training is starting....
Done [13004 usecs] 2020-07-07 13:27:37.325887

> start client
instance:flclient1 : Starting the client...
instance:flclient2 : Starting the client...

Done [210790 usecs] 2020-07-07 13:27:46.968446

The server status should proceed from “training not started” to “training starting” then to “training started”, and this can be checked with the command check_status server. The clients should begin training after the server is in the “training started” status.

Note

The command start_mgpu client <gpu number> <client name> can be used to start multiple GPU federated learning training on clients.

Stopping federated learning training

Do not use ctrl-c to stop the FL server or FL clients, as there are known issues that can leave clients stuck and unable to communicate with the FL server. Instead, use the following command to abort a client and stop training:

> abort client flclient1
instance:flclient1 : Aborting the client...

Done [511128 usecs] 2020-07-08 13:28:32.235777

You can also use abort client to stop training on all clients. Please note that abort will not work unless the client has already successfully started training.

Note

The “Done” displayed on the admin client indicates that the message has been sent. The actual completion of aborting the client may occur after a slight delay.

To then shut down the client entirely and end the FL client process:

> shutdown client flclient1
Are you sure (Y/N): Y
instance:flclient1 : Shutdown the client...

Done [272239 usecs] 2020-07-08 13:33:31.377445

The shutdown client command can also be issued without a specific client to shut down all clients, and this command requires the administrator to enter “Y” for confirmation.

Both abort and shutdown can be used to stop training and end the FL server process as well, with abort server and shutdown server. Please note that you should shut down the clients before the server because the clients receive commands through the server. If a client does not shut down, wait the duration of the heartbeat to see if it gets dropped, then after all clients are shut down, you can shut down the server. If a client still does not shut down, IT personnel at the site may have to kill the process.

Standard folder and file structures for Admin FL training:

Server side folder and file structure

/some_path_on_fl_server/fl_server_workspace_root/
    admin_audit.log
    log.txt
    startup/
        rootCA.pem
        server.crt
        server.key
        fed_server.json
        server.sh
    transfer/
    run_1/
        mmar_server/
            config/
            models/
            resources/
        mmar_client1/
            config/
            models/
            evals/
            resources/
        mmar_client2/
            config/
            models/
            evals/
            resources/
        cross_validation/
    run_2/
        ......

Client side folder and file structure

/some_path_on_fl_client/fl_client_workspace_root/
    log.txt
    startup/
        rootCA.pem
        client.crt
        client.key
        fed_client.json
        server.sh
    transfer/
    run_1/
        mmar_client1/
            config/
            models/
            evals/
            resources/
    run_2/
        mmar_client1/
            config/
            models/
            evals/
            resources/
    run_3/
        ......

Administrator side folder and file structure

/some_path_on_fl_admin/fl_administrator_workspace_root/
    startup/
        clara_hci-3.1.0-py3-none-any.whl
        rootCA.pem
        client.crt
        client.key
        fl_admin.bat
        fl_admin.sh
    transfer/
        MMAR_for_uploading/
            config/
            models/
            evals/
            resources/
        MMAR2_for_uploading/
            config/
            models/
            evals/
            resources/

Known issues

  1. If the server dies and is then restarted, intentionally or unintentionally, all clients will have to be restarted.

  2. Running out of memory can happen at any time, especially if the server and clients are running on the same machine. This can cause the server to die unexpectedly.

  3. If MMARs are placed in the transfer folders without using the upload_folder command, or if the models folder inside an MMAR is not deleted first, a mysterious error may occur when running the deploy command because the MMAR folder is too large to be uploaded, which causes a timeout.
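One way to reduce the risk described in the third issue is to stage a slimmed-down copy of the MMAR in the administrator's transfer directory before running upload_folder. The sketch below copies only the config and resources folders, on the basis that the server keeps only those two directories from an uploaded MMAR. The function name stage_mmar is hypothetical, and this is an illustration rather than a tool shipped with Clara Train.

```python
import shutil
from pathlib import Path

# The only directories the server saves from an uploaded MMAR.
KEEP = ("config", "resources")


def stage_mmar(mmar_dir, transfer_dir):
    """Copy only config/ and resources/ of an MMAR into transfer_dir.

    Returns the path of the staged copy, named after the source folder.
    """
    src = Path(mmar_dir)
    dest = Path(transfer_dir) / src.name
    dest.mkdir(parents=True, exist_ok=True)
    for name in KEEP:
        target = dest / name
        if target.exists():
            shutil.rmtree(target)  # replace any stale copy from an earlier run
        if (src / name).is_dir():
            shutil.copytree(src / name, target)
    return dest
```

Because models and evals are never copied, the staged folder stays small regardless of how many checkpoints the original MMAR contains.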