Federated learning user guide
In Clara Train 3.1, federated learning has been enhanced to enable easy server and client deployment through the use of an administration client. This reduces the amount of human coordination involved in setting up a federated learning project and gives an admin the ability to deploy the server and client configurations, start the server / clients, abort the training, restart the training, and more. A provisioning tool can be configured to create a startup kit for each site in an encrypted package. These packages can then be delivered to each site ready to go, streamlining the process to provision, start, and operate federated learning.
Provision - Start - Operate
Provision
Lead IT generates the packages for the server / clients / admins, protected with passwords
Start
Each site's IT installs their own package, starts the services, and maps the data location
Operate
Lead scientists / administrators control the federated learning process: deploy MMAR, check statuses, start / abort / shutdown training
One party leads the process of configuring the provisioning tool and using it to generate startup kits for each party in the federated learning training project:
Preparation for using the provisioning tool
A copy of everything needed is located at /opt/nvidia/medical/tools/ inside the Clara Train SDK Docker. If you have the docker available and can use the provisioning tool from there, skip directly to Provisioning a federated learning project.
If you are not using the Clara Train SDK Docker, make sure you have Python3 installed. We recommend creating and using a virtual environment for the provisioning tool:
virtualenv --python=python3 --no-site-packages $VIRT_ENV_DIR
source $VIRT_ENV_DIR/bin/activate
Replace $VIRT_ENV_DIR in the commands above with a name for your virtual environment.
Download the provisioning tool for generating the startup kits by going to the “Version History” tab on: https://ngc.nvidia.com/resources/ea-nvidia-clara-train:startup_kits.
If you are not using the Clara Train SDK Docker, install clara_hci-3.1.0-py3-none-any.whl (in your virtual environment if you are using one):
python3 -m pip install clara_hci-3.1.0-py3-none-any.whl
Provisioning a federated learning project
The Federated learning provisioning tool page has details on the contents of the provisioning tool. Edit the project.yml configuration file in the directory with the provisioning tool to meet your project requirements, then generate the startup kits with:
python3 provision.py
A directory named “packages” containing each of the generated zip files is created where provision.py is run. One file, audit.pkl, is also created. The console displays a list of zip files and their passwords. We suggest you copy the console output and “packages” folder to a safe location. The passwords shown below are for demonstration purposes only:
===> password: 7EBXVYb80FpcnzCt for server.zip
===> password: xXEI0HJOjD4nyFu7 for flclient1.zip to optional.email@flclient.org
===> password: VOBfHF5ohew0lvm9 for flclient2.zip
===> password: s20yWrOPK5J8w9Ce for email@hello.world.com.zip to email@hello.world.com
===> password: snH1XVZkMEPh5FJG for email@foo.bar.com.zip to email@foo.bar.com
For security reasons, it is recommended to send the password to each participant separately from the package itself.
After generating packages: Distribute and extract
Please let each participant know that the packages are password protected. In Ubuntu, the following command can be used to extract the packages:
unzip -P $PASSWORD $ZIP_FILE -d $DIRECTORY_TO_EXTRACT_TO
Using -d $DIRECTORY_TO_EXTRACT_TO is optional; without it, a “startup” folder will be extracted to the current directory the package is in. Either way, the parent folder containing this “startup” folder ($DIRECTORY_TO_EXTRACT_TO if the -d option was used) will be the server, client, or admin client workspace root directory, and the party running the package will need write access there.
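For example, a server site could extract its startup kit into a dedicated workspace directory. The password below is the demonstration value from the provisioning output above, and /opt/fl/server_workspace is only an illustrative path:
unzip -P 7EBXVYb80FpcnzCt server.zip -d /opt/fl/server_workspace
ls /opt/fl/server_workspace/startup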
It is important that this “startup” folder is not renamed because the code relies upon this for operation. Please note that a “transfer” directory and deployed MMARs will be created at the level of this “startup” folder. See the section on Standard folder and file structures below for more details.
Please always safeguard .key files!
Federated learning server (server.zip)
One single server will coordinate the federated learning training and be the main hub all clients and administrator clients connect to.
After unzipping the package server.zip, run the start.sh file from the “startup” folder you unzipped to start the server (inside the Clara Train SDK docker).
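As a minimal sketch, assuming the server package was extracted to the illustrative /opt/fl/server_workspace path used above and you are already inside the Clara Train SDK docker with that workspace mounted:
cd /opt/fl/server_workspace/startup
./start.sh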
The rootCA.pem file is pointed to by “ssl_root_cert” in fed_server.json. If you plan to move/copy it to a different place, you will need to modify fed_server.json. The same applies to the other two files, server.crt and server.key.
When launching the FL server docker with docker run, users also need to expose the port via the -p option if not using --net=host. The port that the server communicates on must also not be blocked by any firewalls.
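The following docker run sketch illustrates both approaches; the image name ($CLARA_TRAIN_IMAGE), the workspace path, and port 8002 are placeholders and must match your own environment and the port configured during provisioning:
# publish the FL server port explicitly (placeholder port 8002)
docker run -it --rm -p 8002:8002 \
    -v /opt/fl/server_workspace:/opt/fl/server_workspace \
    $CLARA_TRAIN_IMAGE /bin/bash

# or share the host network instead of publishing individual ports
docker run -it --rm --net=host \
    -v /opt/fl/server_workspace:/opt/fl/server_workspace \
    $CLARA_TRAIN_IMAGE /bin/bash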
If clients from other machines cannot connect to the server, make sure that the host name specified when generating the startup kits in the provisioning process resolves to the correct IP. In Ubuntu, an entry may need to be added to /etc/hosts with the IP and the host name.
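For example, if the host name entered during provisioning were flserver.example.com (a placeholder) and the server's address were 192.168.1.10, the client machine's /etc/hosts would need a line such as:
192.168.1.10    flserver.example.com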
Federated learning client ($CLIENT_NAME.zip)
Each site participating in federated learning training is a client. Each package for a client is named after the client name specified when provisioning the project.
After unzipping the package (for details see After generating packages: Distribute and extract), run start.sh from the “startup” folder you unzipped to start the client (inside the Clara Train SDK docker).
You can use docker.sh in the “startup” directory to easily start the Clara Train SDK docker, but make sure to properly set the data dir, MY_DATA_DIR, and the GPUs to use for your configuration.
Coordination for where to mount the data may be needed depending on where the DATA_ROOT is configured in the MMAR to be deployed.
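As an illustrative sketch only (the exact variable names, apart from MY_DATA_DIR, depend on the docker.sh shipped in your startup kit), a client site might review settings like these inside startup/docker.sh before launching:
MY_DATA_DIR=/raid/datasets   # host directory mounted as the data root (placeholder path)
# ...also confirm the GPU selection exposed by the script matches the devices you intend to use...
After docker.sh brings up the container, run start.sh from the same “startup” folder as described above.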
The rootCA.pem file is pointed to by “ssl_root_cert” in fed_client.json. If you plan to move/copy it to a different place, you will need to modify fed_client.json. The same applies to the other two files, client.crt and client.key.
The client name from your submission to participate in this federated learning project is embedded in the CN field of the client certificate, which uniquely identifies the participant. As such, please safeguard its private key, client.key.
When a client successfully connects to the FL server, the server and that client will both log a token confirming that the client successfully connected:
Server:
2020-07-07 03:48:49,712 - ClientManager - INFO - Client: New client abcd@127.0.0.1 joined. Sent token: f279157b-df8c-aa1b-8560-2c43efa257bc. Total clients: 1
Client:
2020-07-07 03:48:49,713 - FederatedClient - INFO - Successfully registered client:abcd for exampletraining. Got token:f279157b-df8c-aa1b-8560-2c43efa257bc
If a connection cannot be made, the client will repeatedly try to connect and for each failure log:
Could not connect to server. Setting flag for stopping training. failed to connect to all addresses
If the server is up, you may need to troubleshoot with settings for firewall ports to make sure that the proper permissions are in place. This could require coordination between the lead IT and site IT personnel.
For more information about configuring the network to enable federated learning, see FL network configuration.
Federated learning administration client ($EMAIL.zip)
Each admin client will be able to connect and submit commands to the server. Each admin client package is named after the email specified when provisioning the project, and the same email will need to be entered for authentication when the admin client is launched.
Install the wheel package first with:
python3 -m pip install clara_hci-3.1.0-py3-none-any.whl
or in a python3 virtual environment:
pip3 install clara_hci-3.1.0-py3-none-any.whl
After installation, run the fl_admin.sh or fl_admin.bat file (depending on your platform) to start communicating with the FL server. The FL server must be running and there must be a successful connection between the admin client and the FL server in order for the admin client to start. For the prompt User Name:, enter the email that was used for that admin client in the provisioning of the project.
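For example, on Linux a session might begin as follows, where the path is illustrative and the email is the placeholder admin address from the provisioning example above:
cd /some_path_on_fl_admin/fl_administrator_workspace_root/startup
./fl_admin.sh
User Name: email@hello.world.com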
The Clara Train docker is not required for running the admin client, just python and clara_hci-3.1.0-py3-none-any.whl.
The rootCA.pem file is pointed to by “ca_cert” in fl_admin.sh/fl_admin.bat. If you plan to move/copy it to a different place, you will need to modify the corresponding script. The same applies to the other two files, client.crt and client.key.
The email used to participate in this FL project is embedded in the CN field of the client certificate, which uniquely identifies the participant. As such, please safeguard its private key, client.key.
You will need write access in the directory containing the “startup” folder because the “transfer” directory for uploading files as well as directories created for federated learning runs will live here. For details, see Standard folder and file structures.
Example of running federated learning from the administration client
With all connections between the FL server, FL clients, and administration clients open and all of the parties started successfully as described in the preceding section, Federated learning administration client ($EMAIL.zip), the following is an example of a series of admin commands and their outputs to operate a federated learning project. For a complete list of admin commands, see Federated learning administrator commands.
Check status of FL server and FL client
> check_status server
FL run number has not been set.
FL server status: training not started
Registered clients: 2
client name:flclient1 instance name:flclient1 token: 53b6f87a-79f0-4127-9735-72051de2dad9
client name:flclient2 instance name:flclient2 token: b37644d4-7910-4a11-adf9-14cd790c1151
Done [9325 usecs] 2020-07-07 16:31:15.175153
> check_status client
instance:flclient1 : client name: flclient1 token: 53b6f87a-79f0-4127-9735-72051de2dad9 status: training not started
instance:flclient2 : client name: flclient2 token: b37644d4-7910-4a11-adf9-14cd790c1151 status: training not started
Done [308172 usecs] 2020-07-07 16:31:27.300195
The two commands above are not required, but they check the status of the server and clients to confirm that the clients are registered and that the FL server and FL client statuses are all “training not started”.
Please note that in the status of the server above, FL run number has not been set. Let us set a new run number next:
Set run number for FL training
> set_run_number 123
Create a new run folder: run_123
Done [8810 usecs] 2020-07-07 17:05:05.145405
The FL run number is critical to starting and managing FL training projects because all training happens within a run.
You must not start a new run before the previous run is finished completely.
After setting a run number, a folder for that run is created on the server under the root of the server’s workspace at the same level as the startup directory (see the section on Standard folder and file structures below for more details). Initially, this folder will be empty when created for a new run.
In order to start training for a federated learning project, an MMAR with the proper configurations needs to be uploaded and deployed to the server and all the participating clients.
Upload mmar from admin client to FL server
> upload_folder segmentation_ct_spleen
Created folder /path_on_server/fl_server_workspace_root/startup/../transfer/segmentation_ct_spleen
Done [962809 usecs] 2020-07-07 17:16:09.947498
The admin command upload_folder uploads an MMAR from the administrator’s machine’s transfer directory to the server’s transfer directory (both at the same level as their respective startup directories). The server will validate the uploaded MMAR and only save its config and resources directories to ensure proper structure. The administrator can run upload_folder either before or after set_run_number, but in order to deploy the MMAR for the next step, a run number must be set:
Deploy mmar from transfer directory of FL server to active run on server and client
> deploy segmentation_ct_spleen server
mmar_server has been deployed.
Done [14036 usecs] 2020-07-07 17:27:21.365732
> deploy segmentation_ct_spleen client
instance:flclient1 : MMAR deployed.
instance:flclient2 : MMAR deployed.
Done [521724 usecs] 2020-07-07 17:27:28.337640
The two commands above deployed the specified MMAR that had already been uploaded to the server’s transfer directory into directories corresponding to the run number. In this case, since the run number was set to 123, the segmentation_ct_spleen MMAR was copied into the run_123 directory on the FL server and FL clients (see the section on Standard folder and file structures below for a more visual representation).
The first command deployed the MMAR to the FL server, and the second command deployed the MMAR to all active FL clients. You can also choose to deploy an MMAR to a specific client by specifying the client instance name after the command, for example, the following would only deploy the MMAR to flclient1: deploy segmentation_ct_spleen client flclient1
After the MMAR is deployed to the FL server and participating FL clients, training can be started:
Start training by starting FL server and FL clients
> start server
Server training is starting....
Done [13004 usecs] 2020-07-07 13:27:37.325887
> start client
instance:flclient1 : Starting the client...
instance:flclient2 : Starting the client...
Done [210790 usecs] 2020-07-07 13:27:46.968446
The server status should proceed from “training not started” to “training starting” and then to “training started”, and this can be checked with the command check_status server. The clients should begin training after the server is in the “training started” status.
The command start_mgpu client <gpu number> <client name> can be used to start multi-GPU federated learning training on clients.
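For example, to launch training with 2 GPUs on flclient1 (assuming that machine has at least two GPUs available; output omitted):
> start_mgpu client 2 flclient1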
Stopping federated learning training
Do not use ctrl-c to stop the FL server or FL clients, as there are known issues that can leave clients in a state where they are stuck and unable to communicate with the FL server. Instead, the following command can be used to abort a client and stop its training:
> abort client flclient1
instance:flclient1 : Aborting the client...
Done [511128 usecs] 2020-07-08 13:28:32.235777
You can also use abort client without specifying a client name to stop training on all clients. Please note that abort will not work unless the client has already successfully started training.
The “Done” displayed on the admin client is referring to the message sending being done. The actual completion of aborting the client may occur after some slight delay.
To then shut down the client entirely and end the FL client process:
> shutdown client flclient1
Are you sure (Y/N): Y
instance:flclient1 : Shutdown the client...
Done [272239 usecs] 2020-07-08 13:33:31.377445
The shutdown client command can also be issued without a specific client to shut down all clients, and this command requires the administrator to enter “Y” for confirmation.
Both abort and shutdown can be used to stop training and end the FL server process as well, with abort server and shutdown server. Please note that you should shut down the clients before the server because the clients receive commands through the server. If a client refuses to die, wait the duration of the heartbeat to see if it gets dropped; then, after all clients are shut down, you can shut down the server. If a client still refuses to die, IT personnel at the site may have to kill it.
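Putting this together, a typical end-of-session sequence aborts training first and shuts the clients down before the server (outputs omitted; answer Y to the confirmation prompts where they appear):
> abort client
> abort server
> shutdown client
> shutdown server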
Server side folder and file structure
/some_path_on_fl_server/fl_server_workspace_root/
    admin_audit.log
    log.txt
    startup/
        rootCA.pem
        server.crt
        server.key
        fed_server.json
        start.sh
    transfer/
    run_1/
        mmar_server/
            config/
            models/
            resources/
        mmar_client1/
            config/
            models/
            evals/
            resources/
        mmar_client2/
            config/
            models/
            evals/
            resources/
        cross_validation/
    run_2/
        ......
Client side folder and file structure
/some_path_on_fl_client/fl_client_workspace_root/
    log.txt
    startup/
        rootCA.pem
        client.crt
        client.key
        fed_client.json
        start.sh
    transfer/
    run_1/
        mmar_client1/
            config/
            models/
            evals/
            resources/
    run_2/
        mmar_client1/
            config/
            models/
            evals/
            resources/
    run_3/
        ......
Administrator side folder and file structure
/some_path_on_fl_admin/fl_administrator_workspace_root/
    startup/
        clara_hci-3.1.0-py3-none-any.whl
        rootCA.pem
        client.crt
        client.key
        fl_admin.bat
        fl_admin.sh
    transfer/
        MMAR_for_uploading/
            config/
            models/
            evals/
            resources/
        MMAR2_for_uploading/
            config/
            models/
            evals/
            resources/
If the server dies and is then restarted, intentionally or unintentionally, all clients will have to be restarted.
Running out of memory can happen at any time, especially if the server and clients are running on the same machine. This can cause the server to die unexpectedly.
If you put MMARs in the transfer folders without using the upload_folder command, or forget to delete the models folder inside them, a mysterious error may occur when running the deploy command because the MMAR folder is too large to be uploaded and the operation times out.
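For example, before running upload_folder from the admin client, you can check the size of the MMAR in the admin workspace's transfer directory and remove its models folder so only the lightweight configuration is transferred (paths are illustrative):
du -sh transfer/segmentation_ct_spleen
rm -rf transfer/segmentation_ct_spleen/models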