NVIDIA Clara Train 3.1

Federated learning configuration details

FL server configuration file: config_fed_server.json

Example:


{ "servers": [ { "name": "spleen_segmentation", "service": { "target": "localhost:8002", "options": [ ["grpc.max_send_message_length", 1000000000], ["grpc.max_receive_message_length", 1000000000] ] }, "ssl_private_key": "resources/certs/server.key", "ssl_cert": "resources/certs/server.crt", "ssl_root_cert": "resources/certs/rootCA.pem", "min_num_clients": 2, "max_num_clients": 100, "wait_after_min_clients": 10, "heart_beat_timeout": 600, "start_round": 0, "num_rounds": 200, "exclude_vars": "dummy", "num_server_workers": 20, "compression": "Gzip" } ], "aggregator": { "name": "ModelAggregator", "args": { "exclude_vars": "dummy", "aggregation_weights": { "client0": 1, "client1": 1.5, "client2": 0.8 } } }, "pre_processors": [ { "name": "ModelEncryptor", "args": {} }, { "name": "DataCompressor", "args": {} } ], "post_processors": [ { "name": "DataDeCompressor", "args": {} }, { "name": "ModelDecryptor", "args": {} } ], "model_saver": { "name": "TFModelSaver", "args": { "exclude_vars": "dummy" } }, "admin_cmd_modules": [ { "name": "TrainingCommandModule" }, { "name": "ValidationCommandModule" }, { "name": "ShellCommandModule" }, { "name": "SystemCommandModule" } ], "result_processors": [ { "name": "ValidateResultProcessor" } ] }

servers: The list of servers that run the FL service.

name: The FL model training task name.

target: The FL gRPC service location URL.

grpc.max_send_message_length: Maximum length of a gRPC message that can be sent.

grpc.max_receive_message_length: Maximum length of a gRPC message that can be received.

ssl_private_key: Private key for gRPC secure communication.

ssl_cert: SSL certificate for gRPC secure communication.

ssl_root_cert: Trusted root certificate for gRPC secure communication.

min_num_clients: Minimum number of clients required for FL model training.

max_num_clients: Maximum number of clients allowed for FL model training.

wait_after_min_clients: How many seconds the FL server waits after receiving the minimum number of model updates from the clients before starting aggregation, when the number of active clients is greater than min_num_clients.

heart_beat_timeout: Number of seconds the FL server waits for client heartbeat calls before treating a client as dead and removing it from the active list.

start_round: FL training starting round number.

num_rounds: The round number up to which training is conducted.

exclude_vars: Variables excluded from privacy preserving.

num_server_workers: Maximum number of workers supporting the FL model training.

More details on the variables above:

start_round

FL server training starts from this round number and continues until the value of num_rounds is reached. Depending on the status of the FL training, you can adjust this accordingly.

exclude_vars

This option accepts a string argument, and this string will be interpreted as a regular expression. The exclude_vars regex is then used to filter out server model parameters which are not to be shared with the clients.
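
For example (an illustrative sketch, not Clara's internal code; the function name is made up), variables whose names match the exclude_vars regular expression would be filtered out before the global model is sent to the clients:

import re

def filter_shared_vars(model_vars, exclude_vars="dummy"):
    """Keep only the variables whose names do not match the exclude_vars regex."""
    pattern = re.compile(exclude_vars)
    return {name: value for name, value in model_vars.items()
            if not pattern.search(name)}

# With exclude_vars="dummy", "dummy_bias" is filtered out while
# "conv1/kernel" is still shared with the clients.
shared = filter_shared_vars({"conv1/kernel": [0.1, 0.2], "dummy_bias": [0.0]})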

num_server_workers

This controls how many workers are allocated to the gRPC service on the server side. It may slightly affect the performance of gRPC communication.
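
Clara Train creates and manages this gRPC service internally. Conceptually, num_server_workers corresponds to the size of the thread pool serving gRPC requests, as in this generic grpcio snippet (an illustration only, not Clara code):

from concurrent import futures
import grpc

# A bare gRPC server with a 20-worker thread pool and the message-length
# options from the example configuration above.
server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=20),  # analogous to num_server_workers
    options=[
        ("grpc.max_send_message_length", 1000000000),
        ("grpc.max_receive_message_length", 1000000000),
    ],
)
server.add_insecure_port("localhost:8002")  # the "target" from the configuration
server.start()
server.wait_for_termination()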

aggregator

The “aggregator” key is used to configure the FL server aggregator component. The configuration uses the same approach as the configurations for other components, with “name” for the component name and “args” containing all the parameters. The standard implementation provided is “ModelAggregator”, a weighted aggregation implementation that takes the number of training iterations of each FL client into account as a factor. The aggregation weight of each FL client can be controlled depending on the data distribution of the clients. The default value for each client is 1. If a client name is not in the “aggregation_weights” list, the default value is used; otherwise, the FL server gives that client more or less aggregation weight accordingly.
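
As a rough illustration of this weighted-factor idea (a sketch only, not Clara's actual ModelAggregator code; the data layout is assumed), the aggregation can be viewed as a weighted average in which each client's update is scaled by its configured aggregation weight and by the number of local training iterations it performed:

import numpy as np

def aggregate(global_vars, client_updates, aggregation_weights, default_weight=1.0):
    """Illustrative weighted aggregation, not Clara's internal implementation.

    client_updates maps a client name to {"vars": {var_name: np.ndarray},
    "n_iter": int}. Each client's contribution is scaled by its configured
    aggregation weight and by its number of local training iterations.
    """
    new_vars = {}
    for var_name in global_vars:
        weighted_sum, total_weight = 0.0, 0.0
        for client, update in client_updates.items():
            w = aggregation_weights.get(client, default_weight) * update["n_iter"]
            weighted_sum = weighted_sum + w * update["vars"][var_name]
            total_weight += w
        new_vars[var_name] = weighted_sum / total_weight
    return new_vars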

“aggregator” also supports a bring-your-own aggregation implementation: you can provide your own aggregation logic and plug it into the FL server using the Bring your own components for federated learning approach.

FL client configuration file: config_fed_client.json

Example:


{ "servers": [ { "name": "prostate_segmentation", "service": { "target": "localhost:8002", "options": [ ["grpc.max_send_message_length", 1000000000], ["grpc.max_receive_message_length", 1000000000] ] } } ], "client": { "local_epochs": 20, "steps_aggregation": 0, "exclude_vars": "dummy", "privacy": { "name": "PercentileProtocol", "args": { "percentile": 75, "gamma": 1 } }, "retry_timeout": 30, "ssl_private_key": "resources/certs/client1.key", "ssl_cert": "resources/certs/client1.crt", "ssl_root_cert": "resources/certs/rootCA.pem" } }

servers: Same as in the server configuration; identifies the FL training task and the service location URL.

client: The section describing the FL client.

local_epochs: How many epochs to run in each FL training round.

steps_aggregation: Instead of sending the updated model after training local_epochs epochs, the FL client sends the updated model after this number of iterations/steps.

exclude_vars: Variables excluded from privacy preserving.

privacy: Privacy preserving algorithm.

retry_timeout: If the FL client loses the connection to the server, it waits this many seconds before shutting itself down.

ssl_private_key: Private key for gRPC secure communication.

ssl_cert: SSL certificate for gRPC secure communication.

ssl_root_cert: Trusted root certificate for gRPC secure communication.

More details on the variables above:

servers

Currently this list supports one element only.

privacy

“privacy” configures the privacy preserving component. It uses the standard component configuration approach, with a component “name” and parameters in “args”. Three standard privacy preserving algorithms are provided: “PercentileProtocol”, “SVTProtocol”, and “LaplacianProtocol”. You can also bring your own privacy preserving algorithm and plug it into the FL client training, following the Bring your own components (BYOC) approach.
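
The exact behavior of these protocols is defined by the Clara Train implementation. Purely as an illustrative sketch of the kind of transformation a percentile-based scheme can apply (not Clara's actual PercentileProtocol), a model update may be clipped to a range derived from a percentile of its magnitudes before it is shared:

import numpy as np

def percentile_clip(delta, percentile=75, gamma=1.0):
    """Illustrative only: clip each element of a model update to
    +/- (gamma * the given percentile of the absolute update values)."""
    threshold = gamma * np.percentile(np.abs(delta), percentile)
    return np.clip(delta, -threshold, threshold)

# Example: limit how much any single parameter update can reveal
# before the update is sent to the FL server.
clipped = percentile_clip(np.array([0.02, -0.5, 1.3, -0.01]))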

In the above example, target is configured as localhost:8002. Depending on the network environment in which the FL server and clients are deployed, users need to ensure DNS is configured correctly so clients can find the server by the server’s name. Another way to resolve the name without DNS is to edit the client’s /etc/hosts file: before starting clients, add the server’s IP address and name to the /etc/hosts file. The following is one example:


127.0.0.1       localhost
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.18.0.2      c96cc6ba74ab
# Add the following for IP and name of FL server
10.110.11.22    fedserver

When launching the FL server docker with docker run, users also need to expose the port via the -p option if not using --net=host. The FL server line added above (the server’s IP address and name) may also need to be added to the server’s own /etc/hosts file so that the IP the server is running on resolves properly.
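
For example, a docker run invocation might look like the following (a sketch only: the image name and MMAR path are placeholders, and 8002 is the port from the example configuration above):

# -p exposes the FL gRPC port outside the container when --net=host is not used.
docker run --rm -it \
    -p 8002:8002 \
    -v /path/to/server/mmar:/workspace/mmar \
    <clara-train-image> \
    /bin/bash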

If you are using the automatically generated startup kits created by the provisioning tool as described in the Federated learning user guide, the following would have already been taken care of at the creation of the startup kits. The code that generates the startup kits automatically creates certificates for the server and clients and puts them all in their default expected locations.

If you are not using the startup kits and want to create certificates yourself manually, the following is an example of how that could be done:


## 1. Server root key and certificate

## 1.1 Server creates the root private key
openssl genrsa -out rootCA.key 2048
## Or, with a password:
openssl genrsa -des3 -out rootCA.key 2048

## 1.2 Server creates the self-signed root certificate
openssl req -x509 -new -nodes -key rootCA.key -sha256 -days 1024 -out rootCA.pem

## 2. Server private key and CSR

## 2.1 Server creates private key
openssl genrsa -out server.key 2048

## 2.2 Server creates certificate signing request (CSR)
openssl req -new -key server.key -out server.csr

## 2.3 Server signs the CSR using the root certificate rootCA.pem
openssl x509 -req -in server.csr -CA rootCA.pem -CAkey rootCA.key -CAcreateserial -out server.crt -days 500 -sha256

## 3. Client certificate
## Important: the client must input a common name in the command below

## 3.1 Client creates private key
openssl genrsa -out client3.key 2048

## 3.2 Client creates CSR
openssl req -new -key client3.key -out client3.csr

## 4. Sign the CSR using the root certificate and place the signed certificate in the client's config path

## 4.1 Server runs this after getting client3.csr
openssl x509 -req -in client3.csr -CA rootCA.pem -CAkey rootCA.key -CAcreateserial -out client3.crt -days 500 -sha256

## 4.2 Server gives client3.crt to the client to place in the client's config path
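
Optionally (not part of the original steps above), the signed certificates can be checked against the root CA:

## Optional sanity check: verify that the signed certificates chain to the root CA
openssl verify -CAfile rootCA.pem server.crt
openssl verify -CAfile rootCA.pem client3.crt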

For development purposes, insecure gRPC communication between the server and clients is supported. This communication mode is not recommended for federated learning production deployments.

The recommended way to use Clara federated learning is through the administrator tool as described in the Federated learning user guide.

To start running Clara federated learning model training manually without the admin tool, you can use the commands: server_train.sh (from the server machine) and client_train.sh (from the client machine).

There are two options for starting the federated learning server: training from scratch, or starting from a previously trained model. By adding the option “MMAR_CKPT=$MMAR_ROOT/models/FL_global_model.ckpt” to the command line, the FL server starts FL training from the pre-trained model “FL_global_model.ckpt”. Without this option, the FL server starts the training from scratch.

Example (training from scratch)

server_train.sh


python3 -u -m nvmidl.apps.fed_learn.server.fed_aggregate \
    -m $MMAR_ROOT \
    -c $CLARA_TRAIN_CONFIG_FILE \
    -e $ENVIRONMENT_FILE \
    -s $FL_SERVER_CONFIG_FILE \
    --set \
    secure_train=true

env_server.json


{ "PROCESSING_TASK": "segmentation", "MMAR_CKPT_DIR": "models" }


Example (starting from previously trained model)

server_train.sh:


python3 -u -m nvmidl.apps.fed_learn.server.fed_aggregate \
    -m $MMAR_ROOT \
    -c $CLARA_TRAIN_CONFIG_FILE \
    -e $ENVIRONMENT_FILE \
    -s $FL_SERVER_CONFIG_FILE \
    --set \
    MMAR_CKPT=$MMAR_ROOT/models/FL_global_model.ckpt \
    secure_train=true

env_server.json:


{ "MMAR_CKPT": "FL_global_model.ckpt", "PROCESSING_TASK": "segmentation", "MMAR_CKPT_DIR": "models" }

Note

To train from scratch, make sure that “MMAR_CKPT” is not set to a pre-existing checkpoint, either in the command used to launch training or in any environment files.

To start federated learning model training, start the server first, then start the clients to join the FL training. A client requires the server’s service to be available during startup in order to log in and obtain an FL token. The server prints out the status of how many clients have joined the FL training and the FL token issued to each client.

Note

Federated learning with multiple GPUs uses the same mpirun commands as the example MMARs’ train_2gpu.sh scripts. Different clients can choose to run local training with different numbers of GPUs. The FL server aggregates the trained models, so aggregation does not depend on the number of GPUs used for training.

Note

Very rarely, FL training clients may not exit gracefully after training has finished successfully. You can use Ctrl-C to shut down the FL client, or kill the FL client process. This will not affect any model training results, and the trained model checkpoint will be saved correctly.

