Federated learning FAQ

Here is a list of frequently asked questions that may not have been answered in the Federated learning documentation:

1. What is the federated learning system’s architecture?

The federated learning system uses a client-server architecture. A single server coordinates the training activities among all clients, and one client is deployed in each participating institution.

2. How is the model trained with federated learning?

The model is trained over multiple rounds (the number of rounds is configurable). In each round, each participating client fetches the model weights from the server and then trains on its local training dataset (local training uses Clara Train SDK). At the end of local training, the client reports its weight updates to the server. Once the required number of updates has been received from the clients, the server aggregates them to produce new model weights, which the clients use for the next round. Each client validates the model and keeps its own “best” model over time.

The training continues in this fashion until the configured number of rounds is reached. At the end, each client will have obtained a local best trained model.
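The round structure described above can be sketched in a few lines of Python. This is an illustrative simulation only, not the Clara Train implementation: the text does not state the aggregation rule, so the sketch assumes an unweighted mean of the client updates, and `local_train`, `aggregate`, and `run_rounds` are hypothetical names. Local training is replaced by a toy step that moves each weight halfway toward a client-specific target.

```python
from typing import List

Weights = List[float]


def local_train(global_weights: Weights, target: Weights) -> Weights:
    """Toy stand-in for local training: return an update (delta) that moves
    each weight halfway toward this client's hypothetical target."""
    return [0.5 * (t - w) for w, t in zip(global_weights, target)]


def aggregate(global_weights: Weights, updates: List[Weights]) -> Weights:
    """Assumed aggregation rule: apply the unweighted mean of the updates."""
    n = len(updates)
    mean_update = [sum(column) / n for column in zip(*updates)]
    return [w + d for w, d in zip(global_weights, mean_update)]


def run_rounds(initial: Weights, client_targets: List[Weights], num_rounds: int) -> Weights:
    """Run the configured number of rounds and return the final global weights."""
    weights = list(initial)
    for _ in range(num_rounds):
        # Each client fetches the current weights and trains locally.
        updates = [local_train(weights, target) for target in client_targets]
        # The server aggregates the updates into new global weights.
        weights = aggregate(weights, updates)
    return weights


# Two simulated clients whose local data pull the model toward different targets.
final = run_rounds([0.0, 0.0], [[1.0, 2.0], [3.0, 6.0]], num_rounds=20)
```

In this toy setup the global weights settle near the mean of the two client targets, which illustrates how only updates, and never training data, need to reach the server.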

3. What is communicated between clients and the server?

Clients only send model update information to the server. Clients never send training data to the server.

4. How do FL clients get identified?

The federated learning clients are identified by a dynamically generated FL token that the server issues at runtime. When an FL client joins an FL training, it must first send a login request to the FL server. During the login process, the FL server and client exchange SSL certificates for bi-directional authentication. Once authentication succeeds, the FL server sends an FL token to the client. The FL client uses this token to identify itself in all subsequent requests for the global model and in all model update operations.
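The token flow above can be illustrated with a minimal sketch. The class `TokenRegistry`, its methods, and the client names are hypothetical, and the certificate exchange is reduced to a boolean flag; the actual token format and storage used by the FL server are not described in the text.

```python
import secrets
from typing import Dict, Optional


class TokenRegistry:
    """Hypothetical server-side registry mapping FL tokens to client names."""

    def __init__(self) -> None:
        self._clients: Dict[str, str] = {}

    def login(self, client_name: str, authenticated: bool) -> Optional[str]:
        """Issue a fresh random token only after authentication succeeded."""
        if not authenticated:
            return None
        token = secrets.token_hex(16)
        self._clients[token] = client_name
        return token

    def identify(self, token: str) -> Optional[str]:
        """Return the client name for a token, or None if the token is unknown."""
        return self._clients.get(token)


registry = TokenRegistry()
token_a = registry.login("site-a", authenticated=True)
token_b = registry.login("site-a-second-process", authenticated=True)
```

Because each successful login yields its own token, two clients started from the same machine remain distinguishable in this sketch.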

5. How do I set up secure federated learning training?

FL uses self-signed SSL certificates for authentication. First, a root certificate is created on the FL server side. The server and each participating client then create their own private key and certificate signing request (CSR). Each CSR is then signed by the server to generate the corresponding SSL certificate.

Instructions on how to create the self-signed SSL certificates can be found in Federated learning deployment security.

6. Can I run multiple FL clients from the same machine?

Yes. The FL clients are identified by FL token, not machine IP. Each FL client will have its own FL token.

7. What happens if the FL server crashes?

There are two scenarios when the FL server crashes during FL training. If the server crashes while an FL client is trying to connect to it for model exchange, the FL client keeps trying to connect for up to 30 seconds. If the server is still down after that, the FL client shuts itself down. If the server crashes while the FL client is training its model, there is no impact on the FL client as long as the server restarts before the client attempts to submit its model update.

When restarting the FL server, you can find the last training round number in the previous log. You can then choose to train from scratch or to continue training from the previously trained model.
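The 30-second reconnection window from the first crash scenario can be sketched as a retry loop with a deadline. The function name `connect_with_deadline`, the one-second retry interval, and the injectable `clock` and `sleep` parameters are assumptions made for illustration; only the 30-second limit comes from the text.

```python
import time
from typing import Callable


def connect_with_deadline(
    try_connect: Callable[[], bool],
    deadline_seconds: float = 30.0,   # from the text: retry for up to 30 seconds
    retry_interval: float = 1.0,      # assumed; the text gives no interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry try_connect until it succeeds or the deadline passes.

    Returns True on success; False means the caller should shut down.
    """
    start = clock()
    while True:
        if try_connect():
            return True
        if clock() - start >= deadline_seconds:
            return False
        sleep(retry_interval)
```

Passing the clock and sleep functions as parameters keeps the loop testable without waiting in real time.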

8. What happens if a federated learning client crashes?

Federated learning clients send a heartbeat call to the FL server once every minute. If an FL client crashes and the FL server receives no heartbeat from that client for 10 minutes, the FL server removes that client from the training client list.
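The heartbeat bookkeeping above can be sketched as follows. `HeartbeatMonitor` and its methods are hypothetical names, and timestamps are passed in explicitly as seconds; only the 10-minute timeout is taken from the text.

```python
from typing import Dict, List

HEARTBEAT_TIMEOUT_SECONDS = 10 * 60  # from the text: removed after 10 minutes of silence


class HeartbeatMonitor:
    """Hypothetical server-side tracker of the last heartbeat per FL token."""

    def __init__(self, timeout: float = HEARTBEAT_TIMEOUT_SECONDS) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, token: str, now: float) -> None:
        """Record that a heartbeat arrived from this client at time `now`."""
        self.last_seen[token] = now

    def prune(self, now: float) -> List[str]:
        """Remove and return the clients that have been silent for the timeout."""
        dead = [t for t, seen in self.last_seen.items() if now - seen >= self.timeout]
        for token in dead:
            del self.last_seen[token]
        return dead


monitor = HeartbeatMonitor()
monitor.heartbeat("client-a", now=0)
monitor.heartbeat("client-b", now=0)
monitor.heartbeat("client-a", now=540)  # client-a is still alive after 9 minutes
removed = monitor.prune(now=600)        # client-b has been silent for 10 minutes
```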

9. Can FL clients quit federated learning training?

Yes, an FL client can quit FL training at any time. The FL server automatically removes the FL client after it quits.

10. What happens if an FL client joins during the FL training?

An FL client can join at any time during FL training, as long as the number of participating FL clients stays within the configured maximum. The newly joined client gets the global model for the current round, trains on it, and contributes to the current global model.

11. What if the number of participating FL clients is below the minimum number of clients required?

When an FL client passes authentication, it can request the current round of the global model and start FL training right away. There is no need to wait for other clients. Once the client finishes its own training, it sends its update to the server for aggregation. However, if the server does not receive enough updates from other clients, the FL server does not start the next round of FL training, and the finished FL client waits for the next round’s model.

12. What happens if more than the minimum number of FL clients submit an updated model?

The FL server begins model aggregation after accepting updates from the required minimum number of FL clients. Updates from any additional clients are discarded. All clients then receive the next round of the global model and start the next round of FL training.
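The minimum-client behavior described in the two answers above can be sketched as a per-round collector. `RoundCollector` and its methods are hypothetical names, and the unweighted mean is an assumed aggregation rule; the text only states that aggregation starts once the minimum number of updates is accepted and that extra updates are discarded.

```python
from typing import Dict, List, Optional


class RoundCollector:
    """Hypothetical collector for the client updates of a single round."""

    def __init__(self, min_clients: int) -> None:
        self.min_clients = min_clients
        self.accepted: Dict[str, List[float]] = {}
        self.discarded: List[str] = []

    def submit(self, token: str, update: List[float]) -> bool:
        """Accept the update unless the round already has enough updates."""
        if len(self.accepted) >= self.min_clients:
            self.discarded.append(token)
            return False
        self.accepted[token] = update
        return True

    def ready(self) -> bool:
        """True once the minimum number of updates has been accepted."""
        return len(self.accepted) >= self.min_clients

    def aggregate(self) -> Optional[List[float]]:
        """Return the mean update once ready, else None (the server keeps waiting)."""
        if not self.ready():
            return None
        n = len(self.accepted)
        return [sum(column) / n for column in zip(*self.accepted.values())]


collector = RoundCollector(min_clients=2)
collector.submit("client-a", [1.0, 2.0])
waiting = collector.aggregate()            # below the minimum: no aggregation yet
collector.submit("client-b", [3.0, 4.0])
late = collector.submit("client-c", [100.0, 100.0])  # extra update is discarded
result = collector.aggregate()
```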

13. How does FL support continuous training?

Use the MMAR_CKPT option in server_train.sh to continue FL training using a pre-trained model.

14. Does the federated learning server need a GPU?

No, the FL server does not need a GPU.

15. Can I run the FL server on AWS while running the FL client within my institution?

Yes, the steps to set up the AWS FL server instance are:

  • Set up an AWS instance of at least the t2.2xlarge instance type

  • Run:

    docker login nvcr.io
    
    docker pull nvcr.io/nvidian/dlmed/clara-train-sdk:v2.0
    
  • Set up the ROOT certificate and the server SSL certificate (make sure to use the AWS instance name to create the SSL server certificate, e.g., ec2-3-99-123-456.compute-1.amazonaws.com)

  • Create the client certificate using the ROOT certificate to sign

  • Configure config_fed_server.json and config_fed_client.json properly

  • Start the server and clients

16. What port do I need to open from the firewall on the FL server network?

The port depends on config_fed_server.json, which controls the port on which the gRPC service is deployed. The FL server network must open that port so that outside clients can reach the FL server. For example, with the following configuration, port 8002 must be open for TCP traffic to reach the AWS instance:

"service": {
    "target": "ec2-3-81-120-62.compute-1.amazonaws.com:8002",
    "options": [
        ["grpc.max_send_message_length",    1000000000],
        ["grpc.max_receive_message_length", 1000000000]
    ]
}
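To find out which port to open, the "target" value can be read programmatically. The sketch below is illustrative: it wraps the fragment above in braces so that it parses as JSON, and the helper name `service_endpoint` is hypothetical. The full layout of config_fed_server.json is not shown in the text, so the position of "service" inside the real file is an assumption.

```python
import json

# The fragment from the text, wrapped in braces so that it parses as JSON.
config_text = """
{
  "service": {
    "target": "ec2-3-81-120-62.compute-1.amazonaws.com:8002",
    "options": [
      ["grpc.max_send_message_length", 1000000000],
      ["grpc.max_receive_message_length", 1000000000]
    ]
  }
}
"""


def service_endpoint(config: dict) -> tuple:
    """Split the "target" value into (host, port) at the last colon."""
    host, _, port = config["service"]["target"].rpartition(":")
    return host, int(port)


host, port = service_endpoint(json.loads(config_text))
```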

17. Do federated learning clients need to open any ports for the FL server to reach the FL client?

No, federated learning training does not require FL clients to open their network for inbound traffic. The server never sends unsolicited requests to clients; it only responds to client requests.

18. How do I deploy a federated learning server in the cloud for training between different institutes?

[Figure: federated learning server deployed in the cloud (federated_learning_cloud_server.png)]

19. What if the federated learning server is behind a load balancer?

Currently, federated learning does not support load balancing between multiple FL servers. This is a possible future enhancement.

20. How does the federated learning server decide when to stop FL?

The FL server always runs from “start_round” to “num_rounds” and stops the training when the current round reaches “num_rounds”.

21. How does a client decide to quit federated learning training?

The FL client always asks the server for the current round of training. If the server is not ready, the FL client will wait. The client will only stop if the server becomes unreachable. The FL client can also be killed with Ctrl-C.

22. How can we test the network and the SSL certification without setting up Clara?

A tool for testing the SSL connection between the federated learning server and client is planned.