Multi-Node Training with FTMS
Distributed training is supported through FTMS. For large models, training across multiple GPU-enabled nodes can significantly reduce training time.
Verify that your cluster has multiple GPU enabled nodes available for training by running this command:
kubectl get nodes -o wide
The command lists the nodes in your cluster. If it does not list multiple GPU-enabled nodes, contact your cluster administrator to add more nodes to the cluster.
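To confirm that nodes actually expose GPUs, you can inspect the allocatable resources in the output of kubectl get nodes -o json. The sketch below parses a trimmed sample of that output in Python; the node names and GPU counts are illustrative, and it assumes GPUs are advertised under the standard nvidia.com/gpu resource name.

```python
import json

# Sample `kubectl get nodes -o json` output, trimmed to the fields used
# here. Node names and GPU counts are illustrative.
nodes_json = """
{
  "items": [
    {"metadata": {"name": "node-a"},
     "status": {"allocatable": {"nvidia.com/gpu": "8"}}},
    {"metadata": {"name": "node-b"},
     "status": {"allocatable": {"nvidia.com/gpu": "8"}}},
    {"metadata": {"name": "node-c"},
     "status": {"allocatable": {}}}
  ]
}
"""

def gpu_nodes(nodes):
    """Return {node_name: gpu_count} for nodes that allocate GPUs."""
    result = {}
    for item in nodes["items"]:
        gpus = int(item["status"]["allocatable"].get("nvidia.com/gpu", 0))
        if gpus > 0:
            result[item["metadata"]["name"]] = gpus
    return result

print(gpu_nodes(json.loads(nodes_json)))
# {'node-a': 8, 'node-b': 8}
```

In this sample, node-c exposes no GPUs and is excluded, so two GPU-enabled nodes with eight GPUs each are available for training.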
To run a multi-node training job through FTMS, modify these fields in the training job specification:
{
  "train": {
    "num_gpus": 8,   // Number of GPUs per node
    "num_nodes": 2   // Number of nodes to use for training
  }
}
If these fields are not specified, FTMS uses the default values of one GPU per node and one node.
Note
The number of GPUs specified in the num_gpus field must not exceed the number of GPUs per node in the cluster.
The number of nodes specified in the num_nodes field must not exceed the number of nodes in the cluster.
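The defaults and limits above can be checked before a job is submitted. The helper below is a hypothetical client-side sketch, not part of the FTMS API: it applies the documented defaults (one GPU per node, one node) and rejects specs that exceed the cluster's capacity.

```python
# Hypothetical validation helper; the function name and parameters are
# illustrative and not part of the FTMS API.
def validate_train_spec(spec, gpus_per_node, cluster_nodes):
    """Apply FTMS defaults and check the spec against cluster limits."""
    train = spec.get("train", {})
    num_gpus = train.get("num_gpus", 1)    # FTMS default: 1 GPU per node
    num_nodes = train.get("num_nodes", 1)  # FTMS default: 1 node
    if num_gpus > gpus_per_node:
        raise ValueError(
            f"num_gpus={num_gpus} exceeds {gpus_per_node} GPUs per node")
    if num_nodes > cluster_nodes:
        raise ValueError(
            f"num_nodes={num_nodes} exceeds {cluster_nodes} nodes in cluster")
    return num_gpus, num_nodes

# A spec requesting 2 nodes x 8 GPUs passes on a matching cluster:
print(validate_train_spec({"train": {"num_gpus": 8, "num_nodes": 2}}, 8, 2))
# (8, 2)
```

An empty spec resolves to the documented defaults of one GPU and one node, while a request for more nodes than the cluster has raises an error before the job is submitted.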