Managing Resources

A common question from DGX customers is how they can effectively share the DGX system among users without inadvertent interference or data exchange. The generic term for this is resource management, and the tools are called resource managers. They are also called schedulers or job schedulers, and the terms are often used interchangeably.

Everything on the DGX system can be viewed as a resource, including memory, CPUs, GPUs, and even storage. Users submit a request, known as a "job", to the resource manager along with their requirements. If the requested resources are available, the resource manager assigns them to the job. Otherwise, it places the job in a queue until the resources become available and then assigns them.

A resource manager provides functionality to act on jobs, such as starting, canceling, or monitoring them. It manages a queue of jobs for a single cluster of resources, with each job using a subset of the computing resources. It also monitors resource configuration and health, launching jobs to a single FIFO queue.

A job scheduler ties together multiple resource managers into one integrated domain, managing jobs across all machines in the domain. It implements policy mechanisms to achieve efficient utilization of resources, manages software licenses, and collects and reports resource usage statistics.

Some tools that started as resource managers have graduated to include job scheduler features, making the terms largely synonymous.

Resource managers and job schedulers have been around for a long time and are used extensively in the HPC world. The end of this section includes examples of how to run solutions such as Slurm, Univa Grid Engine, IBM Spectrum LSF, and Altair PBS Pro. If you haven’t used these tools before, perform some simple experiments first to understand how they work. For example, install the software on a single server and try running some simple jobs using the cores on that server. Then expand as desired by adding more nodes to the cluster.

The following subsections discuss how to install and use a job scheduler on a DGX system. For DGX systems, NVIDIA supports deploying the Slurm or Kubernetes resource managers through DeepOps. DeepOps is a modular collection of Ansible scripts that automate the deployment of Kubernetes, Slurm, or a hybrid combination of the two, along with monitoring services and other ancillary functionality to help manage systems.

Attention: DGX systems do not come pre-installed with job schedulers, although NPN (NVIDIA Partner Network) partners may elect to install a job scheduler as part of the larger deployment service. NVIDIA Support may request that the job scheduler software be disabled or removed for debugging purposes. They may also ask for a factory image to be installed. Without these changes, NVIDIA Support will not be able to continue with the debugging process.

Example: SLURM

Slurm is a batch scheduler often used in HPC environments. Because it is simple to install and flexible to configure, it has seen wide adoption in a variety of other areas as well. The following are suggested methods for installing Slurm.

After Slurm is installed and configured on a DGX-2, DGX-1, or DGX Station, the next step is to plan how to use the DGX system. The first, and by far the easiest, configuration is to give each user exclusive access to the entire node. In this case the user gets the entire DGX system, that is, access to all GPUs and CPU cores. No other users can use the resources while the first user is using them.

The second way is to make the GPUs a consumable resource. Users then request the number of GPUs they need, from 1 to 8 on the DGX-1 and from 1 to 16 on the DGX-2.

At a high level, there are two basic options for configuring SLURM with GPUs and DGX systems. The first is to use what is called exclusive mode access and the second allows each GPU to be scheduled independently of the others.

Simple GPU Scheduling With Exclusive Node Access

If there is no interest in allowing multiple simultaneous jobs per compute node, then Slurm might not need to be aware of the GPUs in the system, and the configuration can be greatly simplified.

One way of scheduling GPUs without making use of GRES (generic resource scheduling) is to create partitions or queues for logical groups of GPUs. For example, grouping nodes with V100 GPUs into a V100 partition would result in something like the following:
$ sinfo -s
v100     up   infinite         4/9/3/16  node[212-213,215-218,220-229]
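In this output, 4/9/3/16 gives the number of allocated, idle, other, and total nodes in the partition. The corresponding partition configuration in the Slurm configuration file, slurm.conf, would be something like the following: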
PartitionName=v100 Default=NO DefaultTime=01:00:00 State=UP Nodes=node[212-213,215-218,220-229]

If a user requests a node from the v100 partition, then they would have access to all of the resources in that node, and other users would not. This is what is called exclusive access.
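As an illustration of exclusive access, the following batch script is a minimal sketch that requests one whole node from the v100 partition shown above. The job name, the time limit, and the file name exclusive.sh are hypothetical. The --exclusive option is added as an assumption, because whether a partition alone guarantees a whole node depends on how node sharing is configured at the site.

```shell
#!/bin/bash
# Hypothetical job name and time limit; the partition is the one defined above.
#SBATCH --job-name=exclusive-test
#SBATCH --partition=v100
# Request one whole DGX system and do not share it with other jobs.
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --time=00:10:00

# Slurm sets SLURM_JOB_ID inside a job; fall back to "none" when the
# script is run directly, outside of Slurm.
summary="job=${SLURM_JOB_ID:-none} node=$(uname -n)"
echo "$summary"

# List the GPUs on the node if the NVIDIA driver tools are installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L || echo "nvidia-smi could not query the GPUs"
fi
```

The script would be submitted with sbatch exclusive.sh and its state checked with squeue. Because the #SBATCH lines are ordinary comments to the shell, the script also runs unchanged outside Slurm, which is convenient for a quick syntax check.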

This approach can be advantageous if there is concern that sharing resources might cause performance issues on the node or overload the node resources. For example, in the case of a DGX-1, if multiple users might overwhelm the 8TB NFS read cache, exclusive mode should be considered. Exclusive mode might also be useful if the concern is that users may consume all of the physical memory, causing page swapping and a corresponding reduction in performance.

Scheduling Resources At the Per-GPU Level

A second option for using Slurm is to treat the GPUs as a consumable resource and allow users to request them in integer units (that is, 1, 2, 3, and so on). Once Slurm is made aware of the GPUs as a consumable resource, jobs can request any number of them. This feature requires job accounting to be enabled first; for more information, see Accounting and Resource Limits. A very quick overview is below.

The SLURM configuration file, slurm.conf, needs parameters set to enable cgroups for resource management and GPU resource scheduling. An example is the following:
# General

# Scheduling

# Logging and Accounting
DebugFlags=CPU_Bind,gres                # show detailed information in Slurm logs about GPU binding and affinity
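As a hedged sketch of what the General and Scheduling sections typically hold for this purpose, the fragment below uses standard slurm.conf parameters to enable cgroup-based process tracking and task containment, scheduling by core and memory, and the gpu GRES type. The values are assumptions based on common Slurm usage rather than settings taken from this guide, and the select plugin name depends on the Slurm release (older releases used select/cons_res).

```text
# General (assumed values)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# Scheduling (assumed values)
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# Generic resources (assumed values)
GresTypes=gpu
```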
The node and partition information in slurm.conf defines the GPUs available on each node. Here is an example:
# Partitions
NodeName=slurm-node-0[0-1] Gres=gpu:2 CPUs=10 Sockets=1 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=30000 State=UNKNOWN
PartitionName=compute Nodes=ALL Default=YES MaxTime=48:00:00 DefaultTime=04:00:00 MaxNodes=2 State=UP DefMemPerCPU=3000
Resource management is enforced through cgroups. The cgroups configuration requires a separate configuration file, cgroup.conf.
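A minimal cgroup.conf sketch, assuming the goal is to confine each job to its allocated cores, devices, and memory. The three parameters are standard cgroup.conf options, but the values shown are illustrative rather than taken from this guide:

```text
# cgroup.conf (assumed values)
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
```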

Scheduling GPU resources requires a configuration file that defines the available GPUs and their CPU affinity. An example configuration file, gres.conf, is below:
Name=gpu File=/dev/nvidia0 CPUs=0-4
Name=gpu File=/dev/nvidia1 CPUs=5-9
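For nodes with more GPUs, writing these lines by hand becomes tedious. The following sketch generates them with a hypothetical helper, gen_gres, under the assumption that each GPU is associated with an equal, contiguous block of CPU cores. That assumption does not hold on every system, so the real affinity should be checked (for example with nvidia-smi topo -m) before the output is used.

```shell
#!/bin/bash
# Print one gres.conf line per GPU, giving each GPU an equal, contiguous
# block of CPU cores. The even split is an assumption for illustration.
gen_gres() {
    local gpus="$1"            # number of GPUs on the node
    local cores_per_gpu="$2"   # CPU cores to associate with each GPU
    local i=0 first last
    while [ "$i" -lt "$gpus" ]; do
        first=$(( i * cores_per_gpu ))
        last=$(( first + cores_per_gpu - 1 ))
        echo "Name=gpu File=/dev/nvidia${i} CPUs=${first}-${last}"
        i=$(( i + 1 ))
    done
}

# Reproduces the two-GPU, ten-core example above.
gen_gres 2 5
```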
Running a job that uses GPU resources requires the --gres flag with the srun command. For example, to run a job requiring a single GPU, use the following srun command:
$ srun --gres=gpu:1 nvidia-smi
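The same kind of request can be placed in a batch script. The sketch below asks for two GPUs and reports which devices the job can see. It assumes that Slurm exports CUDA_VISIBLE_DEVICES for jobs that request the gpu GRES; the job name, the time limit, and the helper function count_gpus are hypothetical.

```shell
#!/bin/bash
# Hypothetical job name and time limit; --gres=gpu:2 requests two GPUs.
#SBATCH --job-name=gpu-test
#SBATCH --gres=gpu:2
#SBATCH --time=00:10:00

# Count the entries in a comma-separated device list such as "0,1".
# The function body runs in a subshell so the IFS change stays local.
count_gpus() (
    if [ -z "$1" ]; then
        echo 0
        exit 0
    fi
    IFS=','
    set -- $1
    echo "$#"
)

# Empty when the script runs outside a GPU job.
visible="${CUDA_VISIBLE_DEVICES:-}"
echo "visible GPUs: ${visible:-none} (count=$(count_gpus "$visible"))"
```

If the count reported inside a job differs from the number requested with --gres, the GRES or cgroup device configuration is a reasonable first place to look.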

You may also want to restrict memory usage on shared nodes so that one user's job doesn’t cause swapping that affects other users or system processes. A convenient way to do this is with memory cgroups.

Using memory cgroups to restrict jobs to their allocated memory requires setting kernel parameters. On Ubuntu systems, these are configured in the file /etc/default/grub.
GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"
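An edit to /etc/default/grub takes effect only after the boot configuration is regenerated and the system is rebooted; on Ubuntu this is typically done with update-grub. The sketch below, built around a hypothetical helper has_param, checks whether the running kernel was booted with both parameters by inspecting /proc/cmdline.

```shell
#!/bin/bash
# Apply the edited /etc/default/grub first (run manually, needs root):
#   sudo update-grub
#   sudo reboot

# Return success if the kernel command line $1 contains the exact word $2.
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# /proc/cmdline holds the parameters the running kernel was booted with.
cmdline="$(cat /proc/cmdline 2>/dev/null || true)"
for param in cgroup_enable=memory swapaccount=1; do
    if has_param "$cmdline" "$param"; then
        echo "${param}: active"
    else
        echo "${param}: missing"
    fi
done
```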

Example: Altair PBS Pro

See the following site for a link to the technical whitepaper: Altair PBS Professional Support on NVIDIA DGX Systems