Federated learning user guide
In Clara Train 3.1, federated learning was enhanced to enable easy server and client deployment through the use of an administration client. This reduces the amount of human coordination involved to set up a federated learning project and provides an admin the ability to deploy the server and client configurations, start the server / clients, abort the training, restart the training, and more. A provisioning tool can be configured to create a startup kit for each site in an encrypted package. These packages can then be delivered to each site ready to go, streamlining the process to provision, start, and operate federated learning.
Clara Train v4.0 and later continue to use this type of architecture to allow deployment through an admin client although there have been changes. For details on changes, see Notes on changes to Clara 4.0 from Clara 3.1.
Provision - Start - Operate
Lead IT generates the packages for the server / clients / admins, protected with passwords
Site IT each installs their own packages, starts the services, and maps the data location
Lead scientists / administrators control the federated learning process: deploy MMAR, check statuses, start / abort / shutdown training
One party leads the process of configuring the provisioning tool and using it to generate startup kits for each party in the federated learning training project:
Preparation for using the provisioning tool
A copy of everything needed is located at /opt/nvidia/medical/tools/ inside the Clara Train SDK Docker. If you have the docker available and can use the provisioning tool from there, skip directly to Provisioning a federated learning project.
If you are not using the Clara Train SDK Docker, make sure you have Python 3 installed. We recommend creating and using a virtual environment for the provisioning tool:
virtualenv --python=python3 --no-site-packages $VIRT_ENV_DIR
Replace $VIRT_ENV_DIR in the commands above with a name for your virtual environment.
Download the provisioning tool for generating the startup kits by going to the “Version History” tab on: https://ngc.nvidia.com/resources/ea-nvidia-clara-train:startup_kits.
If you are not using the Clara Train SDK Docker, install clara_hci-4.0.0-py3-none-any.whl (in your virtual environment if you are using one):
python3 -m pip install clara_hci-4.0.0-py3-none-any.whl
Provisioning a federated learning project
The Federated learning provisioning tool page has details on the contents of the provisioning tool. Edit the project.yml configuration file in the directory with the provisioning tool to meet your project requirements, then run the provisioning tool (provision.py) to generate the startup kits.
A directory named “packages” containing each of the generated zip files is created where provision.py is run. One file, audit.pkl, is also created. The console displays a list of zip files and their passwords. We suggest you copy the console output and “packages” folder to a safe location. The passwords shown below are for demonstration purposes only:
===> password: 7EBXVYb80FpcnzCt for server.zip
===> password: xXEI0HJOjD4nyFu7 for flclient1.zip to firstname.lastname@example.org
===> password: VOBfHF5ohew0lvm9 for flclient2.zip
===> password: s20yWrOPK5J8w9Ce for email@example.com to firstname.lastname@example.org
===> password: snH1XVZkMEPh5FJG for email@example.com to firstname.lastname@example.org
For security reasons, it is recommended to send the password to each participant separately from the package itself.
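To send each password separately from its package, it can help to pair the two programmatically from the saved console output. The following is a minimal sketch, assuming the output lines follow the "===> password: ... for ..." pattern shown above; the function name parse_passwords is hypothetical and not part of the provisioning tool.

```python
import re

# Matches lines such as "===> password: 7EBXVYb80FpcnzCt for server.zip".
# The pattern is inferred from the sample console output; any trailing
# "to <recipient>" text is ignored.
LINE_RE = re.compile(r"===> password: (\S+) for (\S+)")


def parse_passwords(console_output):
    """Return a {package_name: password} dict from the saved console output."""
    return {package: password for password, package in LINE_RE.findall(console_output)}


sample = """\
===> password: 7EBXVYb80FpcnzCt for server.zip
===> password: VOBfHF5ohew0lvm9 for flclient2.zip
"""
passwords = parse_passwords(sample)
for package in passwords:
    # Print only the package name so passwords do not end up in shared logs.
    print(f"{package}: password recorded, deliver it through a separate channel")
```

The dictionary can then drive whatever delivery process your organization uses, keeping the package and its password on different channels.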
After generating packages: Distribute and extract
Please let each participant know that the packages are password protected. In Ubuntu, the following command can be used to extract the packages:
unzip -P $PASSWORD $ZIP_FILE -d $DIRECTORY_TO_EXTRACT_TO
-d $DIRECTORY_TO_EXTRACT_TO is optional; without it, a “startup” folder is extracted to the current directory. Either way, the parent folder containing this “startup” folder ($DIRECTORY_TO_EXTRACT_TO if the -d option was used) will be the server, client, or admin client workspace root directory, and the party running the package will need write access there.
It is important that this “startup” folder is not renamed because the code relies upon this for operation. Please note that a “transfer” directory and deployed MMARs will be created at the level of this “startup” folder. See the section on Standard folder and file structures below for more details.
Please always safeguard .key files!
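The requirements above (an unrenamed “startup” folder, write access to its parent, and protected .key files) can be checked before starting anything. This is a minimal sketch, not part of the Clara Train tooling: check_workspace is a hypothetical helper, and treating group- or world-accessible key files as a problem is an assumption about what safeguarding means at your site.

```python
import os
import stat
import tempfile


def check_workspace(root):
    """Return a list of problems found in an extracted workspace root."""
    problems = []
    startup = os.path.join(root, "startup")
    if not os.path.isdir(startup):
        problems.append("missing 'startup' folder (it must not be renamed)")
        return problems
    if not os.access(root, os.W_OK):
        problems.append("workspace root is not writable")
    for name in sorted(os.listdir(startup)):
        if name.endswith(".key"):
            mode = stat.S_IMODE(os.stat(os.path.join(startup, name)).st_mode)
            # Assumption: private keys should be accessible to the owner only.
            if mode & (stat.S_IRWXG | stat.S_IRWXO):
                problems.append(f"{name} is accessible to group/others")
    return problems


# Demonstration on a throwaway directory that mimics an extracted package.
with tempfile.TemporaryDirectory() as demo_root:
    os.mkdir(os.path.join(demo_root, "startup"))
    key_path = os.path.join(demo_root, "startup", "client.key")
    open(key_path, "w").close()
    os.chmod(key_path, 0o600)
    print(check_workspace(demo_root))
```

An empty list means none of the checked problems were found; it does not prove the package itself is valid.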
Federated learning server (server.zip)
A single server coordinates the federated learning training and is the main hub that all clients and administrator clients connect to.
After unzipping the package server.zip, run the start.sh file from the “startup” folder you unzipped to start the server (inside the Clara Train SDK docker).
The rootCA.pem file is pointed to by “ssl_root_cert” in fed_server.json. If you plan to move/copy it to a different place, you will need to modify fed_server.json. The same applies to the other two files, server.crt and server.key.
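If a certificate or key file is moved, the matching entry in the JSON file has to be updated. The sketch below shows one way to do that without knowing the exact layout of the file: it walks the parsed JSON and rewrites every existing occurrence of a key. The helper set_key_everywhere and the simplified config shown are hypothetical; the real fed_server.json may be structured differently.

```python
import json


def set_key_everywhere(node, key, value):
    """Set `key` to `value` wherever it already exists; return how many were changed."""
    count = 0
    if isinstance(node, dict):
        for k in node:
            if k == key:
                node[k] = value
                count += 1
            else:
                count += set_key_everywhere(node[k], key, value)
    elif isinstance(node, list):
        for item in node:
            count += set_key_everywhere(item, key, value)
    return count


# Hypothetical, simplified stand-in for fed_server.json.
config = {"servers": [{"name": "example", "ssl_root_cert": "rootCA.pem"}]}
changed = set_key_everywhere(config, "ssl_root_cert", "/secure/certs/rootCA.pem")
print(changed, json.dumps(config))
```

In practice the dictionary would come from json.load on the real file and be written back with json.dump after keeping a backup copy.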
When launching the FL server docker with docker run, users also need to expose the port via the -p option if not using --net=host. The port that the server communicates on must also not be blocked by any firewalls.
If clients from other machines cannot connect to the server, make sure that the host name specified when generating the startup kits in the provisioning process resolves to the correct IP. In Ubuntu, an entry may need to be added to /etc/hosts with the IP and the host name.
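Both conditions described above, name resolution and an unblocked port, can be tested from a client machine with the Python standard library. This is a rough diagnostic sketch; resolve_host and port_is_open are hypothetical helper names, and the host name and port you pass in must be the ones used for your own project.

```python
import socket


def resolve_host(host):
    """Return the sorted IP addresses that `host` resolves to (empty list if none)."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# "localhost" is only a placeholder; use the server name from project.yml.
print(resolve_host("localhost"))
```

A successful TCP connection only shows that the port is reachable; it does not confirm that the certificates are accepted.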
Federated learning client ($CLIENT_NAME.zip)
Each site participating in federated learning training is a client. Each package for a client is named after the client name specified when provisioning the project.
After unzipping the package (for details see After generating packages: Distribute and extract), run start.sh from the “startup” folder you unzipped to start the client (inside the Clara Train SDK docker).
You can use docker.sh in the “startup” directory to easily start the Clara Train SDK docker, but make sure to properly set the data dir, MY_DATA_DIR, and the GPUs to use for your configuration.
Coordination for where to mount the data may be needed depending on where the DATA_ROOT is configured in the MMAR to be deployed.
The rootCA.pem file is pointed to by “ssl_root_cert” in fed_client.json. If you plan to move/copy it to a different place, you will need to modify fed_client.json. The same applies to the other two files, client.crt and client.key.
The client name in your submission to participate in this federated learning project is embedded in the CN field of the client certificate, which uniquely identifies the participant. As such, please safeguard its private key, client.key.
When a client successfully connects to the FL server, the server and that client will both log a token confirming that the client successfully connected:
2021-04-21 03:48:49,712 - ClientManager - INFO - Client: New client email@example.com joined. Sent token: f279157b-df8c-aa1b-8560-2c43efa257bc. Total clients: 1
2021-04-21 03:48:49,713 - FederatedClient - INFO - Successfully registered client:abcd for exampletraining. Got token:f279157b-df8c-aa1b-8560-2c43efa257bc
If a connection cannot be made, the client will repeatedly try to connect and for each failure log:
Could not connect to server. Setting flag for stopping training. failed to connect to all addresses
If the server is up, you may need to troubleshoot with settings for firewall ports to make sure that the proper permissions are in place. This could require coordination between the lead IT and site IT personnel.
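When troubleshooting, the two kinds of log message shown above can be used to tell whether a client is currently connected. The sketch below scans log lines and reports the most recent state; connection_state is a hypothetical helper, and the substrings it looks for are taken from the sample messages in this guide, so they may need adjusting for other versions.

```python
def connection_state(log_lines):
    """Return 'connected', 'failing', or 'unknown' based on the latest relevant line."""
    state = "unknown"
    for line in log_lines:
        if "Successfully registered client" in line:
            state = "connected"
        elif "Could not connect to server" in line:
            state = "failing"
    return state


sample_log = [
    "Could not connect to server. Setting flag for stopping training. failed to connect to all addresses",
    "2021-04-21 03:48:49,713 - FederatedClient - INFO - Successfully registered client:abcd for exampletraining. Got token:f279157b-df8c-aa1b-8560-2c43efa257bc",
]
print(connection_state(sample_log))
```

To use it on a real client, pass in the lines of the client log file instead of the sample list.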
For more information about configuring the network to enable federated learning, see FL network configuration.
Federated learning administration client ($EMAIL.zip)
Each admin client will be able to connect and submit commands to the server. Each admin client package is named after the email specified when provisioning the project, and the same email will need to be entered for authentication when the admin client is launched.
Install the wheel package first with:
python3 -m pip install clara_hci-4.0.0-py3-none-any.whl
or in a python3 virtual environment:
pip3 install clara_hci-4.0.0-py3-none-any.whl
After installation, run fl_admin.sh or fl_admin.bat, depending on your platform, to start communicating with the FL server. The FL server must be running and there must be a successful connection between the admin client and the FL server in order for the admin client to start. For the prompt User Name:, enter the email that was used for that admin client in the provisioning of the project.
The Clara Train docker is not required for running the admin client, just python and clara_hci-4.0.0-py3-none-any.whl.
The rootCA.pem file is pointed to by “ca_cert” in fl_admin.sh/fl_admin.bat. If you plan to move/copy it to a different place, you will need to modify the corresponding script. The same applies to the other two files, client.crt and client.key.
The email used to participate in this FL project is embedded in the CN field of the client certificate, which uniquely identifies the participant. As such, please safeguard its private key, client.key.
You will need write access in the directory containing the “startup” folder because the “transfer” directory for uploading files as well as directories created for federated learning runs will live here. For details, see Standard folder and file structures.
Example of running federated learning from the administration client
With all connections between the FL server, FL clients, and administration clients open and all of the parties started successfully as described in the preceding section, Federated learning administration client ($EMAIL.zip), the following is an example of a series of admin commands and their outputs to operate a federated learning project. For a complete list of admin commands, see Federated learning administrator commands.
Check status of FL server and FL client
> check_status server
FL run number has not been set.
FL server status: training not started
Registered clients: 2
-------------------------------------------------------------------------------------------------
| CLIENT NAME | TOKEN                                | LAST ACCEPTED ROUND | CONTRIBUTION COUNT |
-------------------------------------------------------------------------------------------------
| org1-a      | f4168b29-eaa1-40c1-86f0-d73be5a58a32 |                     | 0                  |
| org2        | daaa2d9a-95ab-4dc2-bde0-1d1367d8670f |                     | 0                  |
-------------------------------------------------------------------------------------------------
Done [9325 usecs] 2021-04-21 16:31:15.175153
> check_status client
instance:flclient1 : client name: flclient1 token: 53b6f87a-79f0-4127-9735-72051de2dad9 status: training not started
instance:flclient2 : client name: flclient2 token: b37644d4-7910-4a11-adf9-14cd790c1151 status: training not started
Done [308172 usecs] 2021-04-21 16:31:27.300195
The two commands above are not necessary, but check the status of the server and clients to confirm that they are registered and that the FL server and FL client statuses are all “training not started”.
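If the output of check_status server is captured to a file or pasted from the console, the client table can be turned into structured records for quick checks. This is a small parsing sketch that assumes the "|"-delimited layout shown in this example; parse_client_table is a hypothetical helper and not an admin client API.

```python
def parse_client_table(output):
    """Parse the '|'-delimited client table into a list of dicts keyed by column header."""
    rows = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("|") and line.endswith("|"):
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


sample = """\
Registered clients: 2
| CLIENT NAME | TOKEN | LAST ACCEPTED ROUND | CONTRIBUTION COUNT |
| org1-a | f4168b29-eaa1-40c1-86f0-d73be5a58a32 | | 0 |
| org2 | daaa2d9a-95ab-4dc2-bde0-1d1367d8670f | | 0 |
"""
clients = parse_client_table(sample)
print([client["CLIENT NAME"] for client in clients])
```

Empty cells, such as LAST ACCEPTED ROUND before training starts, come back as empty strings.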
Please note that in the status of the server above, FL run number has not been set. Let us set a new run number next:
Set run number for FL training
> set_run_number 123
Create a new run folder: run_123
Done [8810 usecs] 2021-04-21 17:05:05.145405
The FL run number is critical to starting and managing FL training projects because all training happens within a run.
You must not start a new run before the previous run is finished completely. Also, please note that deleting a run number deletes all the corresponding files associated with that run on the server and all clients.
After setting a run number, a folder for that run is created on the server under the root of the server’s workspace at the same level as the startup directory (see the section on Standard folder and file structures below for more details). Initially, this folder will be empty when created for a new run.
In order to start training for a federated learning project, an MMAR with the proper configurations needs to be uploaded and deployed to the server and all the participating clients.
Upload MMAR from admin client to FL server
> upload_folder segmentation_ct_spleen
Created folder /path_on_server/fl_server_workspace_root/startup/../transfer/segmentation_ct_spleen
Done [962809 usecs] 2021-04-21 17:16:09.947498
The admin command upload_folder uploads an MMAR from the transfer directory on the administrator’s machine to the server’s transfer directory (both at the same level as their respective startup directories). To ensure proper structure, the server validates the uploaded MMAR and saves only its config and resources directories. The administrator can run upload_folder either before or after set_run_number, but in order to deploy the MMAR for the next step, a run number must be set:
Deploy MMAR from transfer directory of FL server to active run on server and client
> deploy segmentation_ct_spleen server
mmar_server has been deployed.
Done [14036 usecs] 2021-04-21 17:27:21.365732
> deploy segmentation_ct_spleen client
instance:flclient1 : MMAR deployed.
instance:flclient2 : MMAR deployed.
Done [521724 usecs] 2021-04-21 17:27:28.337640
The two commands above deployed the specified MMAR that had already been uploaded to the server’s transfer directory into directories corresponding to the run number. In this case, since the run number was set to 123, the segmentation_ct_spleen MMAR was copied into the run_123 directory on the FL server and FL clients (see the section on Standard folder and file structures below for a more visual representation).
The first command deployed the MMAR to the FL server, and the second command deployed the MMAR to all active FL clients. You can also choose to deploy an MMAR to a specific client by specifying the client instance name after the command, for example, the following would only deploy the MMAR to flclient1: deploy segmentation_ct_spleen client flclient1
After the MMAR is deployed to the FL server and participating FL clients, training can be started:
Start training by starting FL server and FL clients
> start server
Server training is starting....
Done [13004 usecs] 2021-04-21 13:27:37.325887
> start client
instance:flclient1 : Starting the client...
instance:flclient2 : Starting the client...
Done [210790 usecs] 2021-04-21 13:27:46.968446
The server status should proceed from “training not started” to “training starting” then to “training started”, and this can be checked with the command check_status server. The clients should begin training after the server is in the “training started” status.
start_mgpu client <gpu number> <client name> can be used to start multiple GPU federated learning training on clients.
Cross site validation
If cross site validation is configured, this step runs automatically after training completes: the clients request models from the server, validate them locally, and then submit the results. Cross site validation is enabled by setting "cross_validate": true in config_fed_client.json in the MMAR’s config directory and setting the relevant arguments. If new clients join the training but are not in the training status, cross site validation may not end, and clients may need to be aborted manually to escape this situation.
Stopping federated learning training
Do not use Ctrl-C to stop the FL server or FL clients, as there are known issues that can leave clients in a state where they are stuck and unable to communicate with the FL server. Instead, the following command can be used to abort a client to stop training:
> abort client flclient1
instance:flclient1 : Aborting the client...
Done [511128 usecs] 2020-07-08 13:28:32.235777
You can also use abort client to stop training on all clients. Please note that abort will not work unless the client has already successfully started training.
The “Done” displayed on the admin client is referring to the message sending being done. The actual completion of aborting the client may occur after some slight delay.
To then shut down the client entirely and end the FL client process:
> shutdown client flclient1
Are you sure (Y/N): Y
instance:flclient1 : Shutdown the client...
Done [272239 usecs] 2020-07-08 13:33:31.377445
The shutdown client command can also be issued without a specific client to shut down all clients, and this command requires the administrator to enter “Y” for confirmation.
shutdown can be used to stop training and end the FL server process as well, with shutdown server. Please note that you should shut down the clients before the server because the clients receive commands through the server. If a client refuses to die, wait the duration of the heartbeat to see if it gets dropped, then after all clients are shut down, you can shut down the server. If a client still refuses to die, IT personnel at the site may have to kill it.
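As a recap of this example, the sketch below lists the admin commands in the order they were used, including the rule that clients are shut down before the server. It only builds the command strings for review or for a checklist; build_session is a hypothetical helper, and nothing here sends commands to a server.

```python
def build_session(run_number, mmar_name, client_names):
    """Return the admin commands for one complete run, in the order used in this example."""
    commands = [
        "check_status server",
        "check_status client",
        f"set_run_number {run_number}",  # a run number must be set before deploy
        f"upload_folder {mmar_name}",
        f"deploy {mmar_name} server",
        f"deploy {mmar_name} client",
        "start server",
        "start client",
    ]
    # Stop phase, issued only when training should end: abort and shut down
    # every client first, because clients receive commands through the server.
    commands += [f"abort client {name}" for name in client_names]
    commands += [f"shutdown client {name}" for name in client_names]
    commands.append("shutdown server")
    return commands


session = build_session(123, "segmentation_ct_spleen", ["flclient1", "flclient2"])
for step, command in enumerate(session, start=1):
    print(f"{step:2d}. {command}")
```

Keeping the stop commands per client makes it easy to see which site still needs attention if one of them does not shut down.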
Server side folder and file structure
/some_path_on_fl_server/fl_server_workspace_root/
    admin_audit.log
    log.txt
    startup/
        rootCA.pem
        server.crt
        server.key
        fed_server.json
        server.sh
    transfer/
    run_1/
        mmar_server/
            config/
            models/
            resources/
        mmar_client1/
            config/
            models/
            evals/
            resources/
        mmar_client2/
            config/
            models/
            evals/
            resources/
        cross_validation/
    run_2/
        ......
Client side folder and file structure
/some_path_on_fl_client/fl_client_workspace_root/
    log.txt
    startup/
        rootCA.pem
        client.crt
        client.key
        fed_client.json
        server.sh
    transfer/
    run_1/
        mmar_client1/
            config/
            models/
            evals/
            resources/
    run_2/
        mmar_client1/
            config/
            models/
            evals/
            resources/
    run_3/
        ......
Administrator side folder and file structure
/some_path_on_fl_admin/fl_administrator_workspace_root/
    startup/
        clara_hci-4.0.0-py3-none-any.whl
        rootCA.pem
        client.crt
        client.key
        fl_admin.bat
        fl_admin.sh
    transfer/
        MMAR_for_uploading/
            config/
            models/
            evals/
            resources/
        MMAR2_for_uploading/
            config/
            models/
            evals/
            resources/
Notes on changes to Clara 4.0 from Clara 3.1
check_status server now returns a LAST ACCEPTED ROUND and CONTRIBUTION COUNT for each client, providing more information in the response.
The “epochs” number reported on each client now only tracks the local epoch number.
If users want to run FL and have a working MMAR for local training, they must remove "MMAR_CKPT": "models/model.pt" and can optionally remove "MMAR_TORCHSCRIPT": "models/model.ts".
Log output is now controlled mainly by the logger level set in log.config.
The config_fed_server.json and config_fed_client.json files require some changes, which are described in Federated learning configuration details.
If the server dies and is then restarted, intentionally or unintentionally, all clients will have to be restarted.
Running out of memory can happen at any time, especially if the server and clients are running on the same machine. This can cause the server to die unexpectedly.
If an MMAR is placed in a transfer folder without using the upload_folder command, or its models folder is not deleted first, the deploy command may fail with an obscure error: the MMAR folder is too large to be transferred, which causes a timeout.
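To reduce the chance of the timeout described above, an MMAR folder can be inspected before it is placed in a transfer directory. This is a minimal sketch under the assumption that a valid upload needs the config and resources directories and should not carry a models folder; inspect_mmar is a hypothetical helper, and no size limit is asserted because this guide does not state one.

```python
import os
import tempfile


def inspect_mmar(mmar_dir):
    """Summarize an MMAR folder: missing directories, models folder presence, total size."""
    def has_dir(name):
        return os.path.isdir(os.path.join(mmar_dir, name))

    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(mmar_dir):
        for filename in filenames:
            total_bytes += os.path.getsize(os.path.join(dirpath, filename))
    return {
        "missing": [name for name in ("config", "resources") if not has_dir(name)],
        "has_models": has_dir("models"),
        "size_bytes": total_bytes,
    }


# Demonstration on a throwaway folder standing in for an MMAR.
with tempfile.TemporaryDirectory() as demo_mmar:
    os.mkdir(os.path.join(demo_mmar, "config"))
    os.mkdir(os.path.join(demo_mmar, "resources"))
    print(inspect_mmar(demo_mmar))
```

A report with has_models set to True or a large size_bytes value suggests cleaning up the folder before deploying it.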