Cluster Management
Cluster management tools go beyond resource managers and job schedulers, managing the state of each node in an entire cluster. They typically include mechanisms to provision the nodes in the cluster (install the operating system image, firmware, and drivers), deploy a job scheduler, monitor and manage hardware, configure user access, and make modifications to the software stack.
Provisioning and cluster management of DGX systems can be bootstrapped with DeepOps. DeepOps is open source and highly modular; its defaults can be configured to meet organizational needs, and it incorporates best practices for deploying GPU-accelerated Kubernetes and Slurm.
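To illustrate where a deployment starts, the sketch below generates a minimal Ansible-style inventory of the kind DeepOps consumes. The host names, addresses, and group names (slurm-master, slurm-node) are assumptions for a hypothetical two-node cluster; check them against the configuration examples shipped with your DeepOps release.

#!/usr/bin/env python3
"""Generate a minimal Ansible-style inventory for a DeepOps deployment.

The group names below (slurm-master, slurm-node) follow the DeepOps
examples, but verify them against the config/ directory of the DeepOps
release you are using -- they are assumptions here, not a stable API.
"""

HEAD_NODE = ("head01", "10.0.0.10")   # hypothetical management node
DGX_NODES = [("dgx01", "10.0.0.11"),  # hypothetical DGX systems
             ("dgx02", "10.0.0.12")]

def render_inventory() -> str:
    lines = ["[all]"]
    for name, ip in [HEAD_NODE, *DGX_NODES]:
        lines.append(f"{name} ansible_host={ip}")
    lines += ["", "[slurm-master]", HEAD_NODE[0],
              "", "[slurm-node]", *[name for name, _ in DGX_NODES]]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(render_inventory())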
Head Node
A head node is a dedicated server that coordinates the rest of the cluster. Typically, it runs the cluster management software, the resource manager, and any monitoring tools in use. For smaller clusters, it also serves as the login node where users create and submit jobs.
For clusters of any size that include DGX-2 or DGX-1 systems, or even a group of DGX Stations, a head node is helpful: it frees the DGX systems to focus solely on computation rather than on interactive logins or post-processing. As the number of nodes in a cluster increases, using a head node becomes increasingly important.
Size the head node to handle tasks such as the following (a rough sizing sketch appears after this list):
Interactive user logins
Resource management (running a job scheduler)
Graphical pre-processing and post-processing (consider a GPU in the head node for visualization)
Cluster monitoring
Cluster management
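As a rough illustration of how these line items translate into hardware, the following sketch adds up head node memory from per-task estimates. Every figure in it is an assumed placeholder, not a recommendation; substitute measurements from your own users and monitoring stack.

#!/usr/bin/env python3
"""Back-of-the-envelope head node memory sizing.

All per-task figures are illustrative assumptions for a hypothetical
cluster -- replace them with measurements from your own workload.
"""

BASE_OS_GB = 16          # assumed OS + cluster management overhead
PER_LOGIN_GB = 4         # assumed per interactive user session
PER_VIZ_SESSION_GB = 16  # assumed per graphical pre/post-processing session
MONITORING_GB = 8        # assumed monitoring stack footprint

def head_node_memory_gb(logins: int, viz_sessions: int) -> int:
    return (BASE_OS_GB + MONITORING_GB
            + logins * PER_LOGIN_GB
            + viz_sessions * PER_VIZ_SESSION_GB)

if __name__ == "__main__":
    # e.g., 10 concurrent logins and 2 visualization sessions
    print(f"Suggested minimum RAM: {head_node_memory_gb(10, 2)} GB")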
Because the head node is central to the operation of the cluster, consider using RAID-1 for its OS drive as well as redundant power supplies; both improve the head node's uptime.
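As a quick health check for a software RAID-1 OS drive, a sketch along these lines can run periodically on the head node. It assumes Linux md RAID reported through /proc/mdstat; hardware RAID controllers need their vendor tools instead.

#!/usr/bin/env python3
"""Warn if a Linux software RAID (md) array is degraded.

A minimal sketch: it only covers md arrays listed in /proc/mdstat.
"""
import re
import sys
from pathlib import Path

def degraded_arrays(mdstat_text: str) -> list[str]:
    # /proc/mdstat marks member state like "[UU]" (healthy) or "[U_]" (degraded)
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        if m := re.match(r"^(md\d+)\s*:", line):
            current = m.group(1)
        elif current and (m := re.search(r"\[[U_]+\]", line)):
            if "_" in m.group(0):
                degraded.append(current)
            current = None
    return degraded

if __name__ == "__main__":
    bad = degraded_arrays(Path("/proc/mdstat").read_text())
    if bad:
        sys.exit(f"Degraded arrays: {', '.join(bad)}")
    print("All md arrays healthy")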
For smaller clusters, the head node can double as an NFS server: add storage and more memory to the head node and export the storage over NFS to the cluster clients. For larger clusters, dedicated storage is recommended, either NFS or a parallel file system.
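As an illustration, the following sketch appends an export entry on the head node and reloads the NFS export table with exportfs. The export path, client subnet, and mount options are assumptions for a hypothetical cluster network; adjust them before use, and run as root.

#!/usr/bin/env python3
"""Append an NFS export on the head node and reload the export table.

A minimal sketch assuming a Linux NFS server; the path, subnet, and
options below are illustrative placeholders. Must run as root.
"""
import subprocess
from pathlib import Path

EXPORT = "/export/data  10.0.0.0/24(rw,sync,no_root_squash)\n"  # assumed values

def main() -> None:
    exports = Path("/etc/exports")
    if EXPORT not in exports.read_text():
        with exports.open("a") as f:
            f.write(EXPORT)
    # Re-export everything so clients can mount the new share
    subprocess.run(["exportfs", "-ra"], check=True)

if __name__ == "__main__":
    main()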
For InfiniBand networks, the head node can also run the software subnet manager (SM). For some high availability (HA) of the SM, run the primary SM on the head node and use the hardware SM on the InfiniBand switch as a secondary.
As the cluster grows, consider splitting the login and data processing functions off the head node onto one or more dedicated login nodes. This is also true as the number of users grows. In that layout, run the primary SM on the head node and additional SMs on the login nodes, with the hardware SMs on the switches as backups.
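To see which SM currently holds the master role, a sketch like the following wraps the sminfo tool from infiniband-diags. The output format it parses is an assumption based on common sminfo builds, so verify it against your installed version. Note that with opensm, the master election favors the higher SM priority (opensm's -p option), so give the SM you intend as primary the highest priority.

#!/usr/bin/env python3
"""Report which subnet manager is currently master on an IB fabric.

A minimal sketch around sminfo from infiniband-diags; the output
format parsed here is an assumption -- verify it against your build.
"""
import re
import subprocess

def query_master_sm() -> str:
    out = subprocess.run(["sminfo"], capture_output=True,
                         text=True, check=True).stdout
    # Typical line: "... sm lid 1 ... priority 14 state 3 SMINFO_MASTER"
    m = re.search(r"sm lid (\d+).*priority (\d+).*(SMINFO_\w+)", out)
    if not m:
        raise RuntimeError(f"Unrecognized sminfo output: {out!r}")
    lid, priority, state = m.groups()
    return f"Master SM: LID {lid}, priority {priority}, state {state}"

if __name__ == "__main__":
    print(query_master_sm())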
Bright Computing Cluster Manager
Alternatively, Bright Cluster Manager deploys complete DGX PODs over bare metal and manages them effectively. It provides management for the entire DGX POD, including the hardware, operating system, and users, as well as the data analytics software, NGC containers, Bright Data Science, Kubernetes, and Docker and Singularity containers. With Bright Cluster Manager, a system administrator can quickly stand up DGX PODs and keep them running reliably throughout their life cycle, all with the ease and elegance of a fully featured, enterprise-grade cluster manager.
Example: Bright Computing cluster. See the knowledge base article "How do I add NVIDIA DGX nodes to a Bright cluster using the official Ubuntu DGX software stack?"