Managing Resources

A common question from DGX customers is how they can effectively share the DGX system between users without inadvertent problems or data exchange. The generic term for this is resource management, and the tools are called resource managers, schedulers, or job schedulers. These terms are often used interchangeably.

You can view everything on the DGX system as a resource, including memory, CPUs, GPUs, and even storage. Users submit a request to the resource manager with their requirements. If the requested resources are available and not in use, the resource manager assigns them to the user. Otherwise, it puts the request in a queue and assigns the resources once they become available.
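
The request, assign, and queue cycle described above can be sketched as a toy model. This is only an illustration under simplifying assumptions (GPUs are the only resource, and waiting requests are served strictly first-in, first-out); the class and method names are hypothetical and do not reflect how SLURM or any other scheduler is implemented.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    user: str
    gpus: int


@dataclass
class ToyResourceManager:
    """Toy FIFO model of a resource manager (hypothetical, for illustration)."""
    free_gpus: int
    running: list = field(default_factory=list)
    queue: deque = field(default_factory=deque)

    def submit(self, request: Request) -> str:
        # Start immediately only if nobody is waiting and enough GPUs are free.
        if not self.queue and request.gpus <= self.free_gpus:
            self._start(request)
            return "running"
        self.queue.append(request)
        return "queued"

    def release(self, request: Request) -> None:
        # A finished job returns its GPUs; waiting requests are retried in order.
        self.running.remove(request)
        self.free_gpus += request.gpus
        while self.queue and self.queue[0].gpus <= self.free_gpus:
            self._start(self.queue.popleft())

    def _start(self, request: Request) -> None:
        self.free_gpus -= request.gpus
        self.running.append(request)


manager = ToyResourceManager(free_gpus=8)
alice = Request("alice", gpus=6)
bob = Request("bob", gpus=4)
print(manager.submit(alice))  # enough free GPUs, so the request runs
print(manager.submit(bob))    # only 2 GPUs are free, so the request waits
manager.release(alice)        # alice finishes, which lets bob's request start
print([r.user for r in manager.running])
```

Real resource managers add priorities, time limits, and backfill on top of this basic cycle, but the idea of holding a request until its resources are free is the same.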

Resource management, which lets users effectively share a centralized resource (in this case, the DGX appliance), has been around for a long time. There are many open-source solutions, mostly from the HPC world, such as PBS Pro, Torque, SLURM, Openlava, SGE, HTCondor, and Mesos. There are also commercial resource management tools such as UGE and IBM Spectrum LSF.

If you haven’t used job scheduling before, you should perform some simple experiments first to understand how it works. For example, take a single server and install the resource manager. Then try running some simple jobs using the cores on the server.

The following subsections discuss how you might install and use a job scheduler on a DGX system using some example resource management solutions. The process generally applies to any job scheduler that works with GPUs.

Example: SLURM

Attention: DGX systems do not come prepackaged or pre-installed with job schedulers, although NPN (NVIDIA Partner Network) partners may package specific job schedulers. NVIDIA Support may request that you disable or remove the resource manager for debugging purposes. They may also ask for a factory image to be installed. Without these changes, NVIDIA Support will not be able to continue with the debugging process.

As an example, let's say SLURM is installed and configured on a DGX-2, DGX-1, or DGX Station. The first step is to plan how you want to use the DGX system. The first and by far the easiest configuration is to give a user exclusive access to the entire node. In this case the user gets the entire DGX system, that is, access to all GPUs and CPU cores. No other users can use the resources while the first user is using them.

The second way is to make the GPUs a consumable resource. The user then asks for the number of GPUs they need, ranging from 1 to 8 for the DGX-1 and 1 to 16 for the DGX-2.

There are two public git repositories containing information on SLURM and GPUs that can help you get started with scheduling jobs.
Note: You may have to configure SLURM to match your specifications.

At a high level, there are two basic options for configuring SLURM with GPUs and DGX systems. The first is to use exclusive mode access, and the second is to allow each GPU to be scheduled independently of the others.

Simple GPU Scheduling With Exclusive Node Access

If you're not interested in allowing multiple simultaneous jobs per compute node, you may not need to make SLURM aware of the GPUs in the system, and the configuration can be greatly simplified.

One way of scheduling GPUs without making use of GRES (generic resource) scheduling is to create partitions or queues for logical groups of GPUs. For example, grouping nodes with P100 GPUs into a P100 partition would result in something like the following:
$ sinfo -s
PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
p100     up   infinite         4/9/3/16  node[212-213,215-218,220-229]
The corresponding partition configuration via the SLURM configuration file, slurm.conf, would be something like the following:
NodeName=node[212-213,215-218,220-229]
PartitionName=p100 Default=NO DefaultTime=01:00:00 State=UP Nodes=node[212-213,215-218,220-229]

If a user requests a node from the p100 partition, then they would have access to all of the resources in that node, and other users would not. This is what is called exclusive access.
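
With exclusive access, the number of idle nodes tells you how many whole-node jobs could start right away. In the sinfo -s output above, the NODES(A/I/O/T) column packs four counts into one field: allocated, idle, other, and total nodes. The following is a minimal sketch of pulling those counts apart; the helper name parse_node_counts is hypothetical, and the input string is copied from the example output rather than read from a live cluster.

```python
from typing import NamedTuple


class NodeCounts(NamedTuple):
    allocated: int
    idle: int
    other: int
    total: int


def parse_node_counts(field: str) -> NodeCounts:
    """Split an 'A/I/O/T' field such as '4/9/3/16' into named integer counts."""
    parts = field.strip().split("/")
    if len(parts) != 4:
        raise ValueError(f"expected four '/'-separated counts, got {field!r}")
    return NodeCounts(*(int(p) for p in parts))


counts = parse_node_counts("4/9/3/16")
print(f"{counts.idle} of {counts.total} nodes are idle")
```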

This approach can be advantageous if you are concerned that sharing resources might result in performance issues on the node or if you are concerned about overloading the node resources. For example, in the case of a DGX-1, if you think multiple users might overwhelm the 8TB NFS read cache, then you might want to consider using exclusive mode. Or, if you are concerned that users may use all of the physical memory, causing page swapping with a corresponding reduction in performance, then exclusive mode might be useful.

Scheduling Resources At The Per GPU Level

A second option for using SLURM is to treat the GPUs as a consumable resource and allow users to request them in integer units (1, 2, 3, and so on), so that jobs can request any number of GPUs. This feature requires job accounting to be enabled first; for more information, see Accounting and Resource Limits. A very quick overview is below.

The SLURM configuration file, slurm.conf, needs parameters set to enable cgroups for resource management and GPU resource scheduling. An example is the following:
# General
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# Scheduling
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# Logging and Accounting
AccountingStorageTRES=gres/gpu
DebugFlags=CPU_Bind,gres                # show detailed information in Slurm logs about GPU binding and affinity
JobAcctGatherType=jobacct_gather/cgroup
The node and partition definitions in slurm.conf define the GPUs available on each node. Here is an example:
# Partitions
GresTypes=gpu
NodeName=slurm-node-0[0-1] Gres=gpu:2 CPUs=10 Sockets=1 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=30000 State=UNKNOWN
PartitionName=compute Nodes=ALL Default=YES MaxTime=48:00:00 DefaultTime=04:00:00 MaxNodes=2 State=UP DefMemPerCPU=3000
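
The numbers in the node and partition lines above are meant to agree with each other, which a short check can make explicit. The sketch below parses the two example lines and confirms that Sockets x CoresPerSocket x ThreadsPerCore matches CPUs, and that DefMemPerCPU multiplied by the CPU count does not exceed RealMemory (both are in megabytes). The helper parse_key_values is hypothetical and only handles simple Key=Value tokens like the ones shown here.

```python
def parse_key_values(line: str) -> dict:
    """Parse a whitespace-separated 'Key=Value' line from slurm.conf into a dict."""
    pairs = (token.split("=", 1) for token in line.split())
    return {key: value for key, value in pairs}


node = parse_key_values(
    "NodeName=slurm-node-0[0-1] Gres=gpu:2 CPUs=10 Sockets=1 CoresPerSocket=10 "
    "ThreadsPerCore=1 RealMemory=30000 State=UNKNOWN"
)
partition = parse_key_values(
    "PartitionName=compute Nodes=ALL Default=YES MaxTime=48:00:00 "
    "DefaultTime=04:00:00 MaxNodes=2 State=UP DefMemPerCPU=3000"
)

cpus = int(node["CPUs"])
# Sockets x CoresPerSocket x ThreadsPerCore should match the CPUs value.
topology_cpus = (
    int(node["Sockets"]) * int(node["CoresPerSocket"]) * int(node["ThreadsPerCore"])
)
# Default memory (MB) a job receives if it uses every CPU on the node.
default_memory = cpus * int(partition["DefMemPerCPU"])

print(topology_cpus == cpus)
print(default_memory <= int(node["RealMemory"]))
```
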
Resource management is enforced through cgroups. The cgroups configuration requires a separate configuration file, cgroup.conf, such as the following:
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"

ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
#TaskAffinity=yes
Scheduling GPU resources requires a configuration file that defines the available GPUs and their CPU affinity. An example configuration file, gres.conf, is below:
Name=gpu File=/dev/nvidia0 CPUs=0-4
Name=gpu File=/dev/nvidia1 CPUs=5-9
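
For nodes with more GPUs, writing these lines by hand becomes tedious. The following is a minimal sketch that generates lines in the same format as the gres.conf example above. It assumes that the CPUs divide evenly among the GPUs in contiguous ranges, which matches the two-GPU example but is not guaranteed on real hardware; check the actual affinity (for example with nvidia-smi topo -m) before using generated values. The function name gres_conf_lines is hypothetical.

```python
def gres_conf_lines(num_gpus: int, num_cpus: int) -> list:
    """Return gres.conf lines giving each GPU an equal, contiguous CPU range.

    Assumption: an even, contiguous split reflects the real CPU affinity.
    """
    if num_gpus <= 0 or num_cpus <= 0 or num_cpus % num_gpus != 0:
        raise ValueError("num_cpus must be a positive multiple of num_gpus")
    per_gpu = num_cpus // num_gpus
    lines = []
    for gpu in range(num_gpus):
        first = gpu * per_gpu
        last = first + per_gpu - 1
        lines.append(f"Name=gpu File=/dev/nvidia{gpu} CPUs={first}-{last}")
    return lines


# Reproduces the two-GPU, ten-CPU example shown above.
print("\n".join(gres_conf_lines(num_gpus=2, num_cpus=10)))
```
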
Running a job that uses GPU resources requires the --gres flag with the srun command. For example, to run a job requiring a single GPU, use the following srun command:
$ srun --gres=gpu:1 nvidia-smi
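
If jobs are launched from scripts, it can help to validate the GPU count before calling srun. The sketch below builds the same kind of command line as the example above and rejects counts outside the ranges given earlier (1 to 8 for the DGX-1, 1 to 16 for the DGX-2). The function build_srun_command and the GPUS_PER_SYSTEM table are hypothetical helpers; the code only prints the command and does not contact a SLURM cluster.

```python
import shlex

# GPU counts per system as stated earlier in this section.
GPUS_PER_SYSTEM = {"DGX-1": 8, "DGX-2": 16}


def build_srun_command(gpus: int, command: list, system: str = "DGX-1") -> list:
    """Build an srun argument list that requests `gpus` GPUs through --gres."""
    limit = GPUS_PER_SYSTEM[system]
    if not 1 <= gpus <= limit:
        raise ValueError(f"{system} jobs can request 1 to {limit} GPUs, got {gpus}")
    return ["srun", f"--gres=gpu:{gpus}", *command]


args = build_srun_command(2, ["nvidia-smi"])
print(shlex.join(args))
# To actually launch the job on a SLURM cluster (not done here):
#   subprocess.run(args, check=True)
```

Keeping the command as an argument list rather than a single string avoids shell quoting problems when it is later passed to subprocess.run.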

You may also want to restrict memory usage on shared nodes so that one user's jobs don't cause swapping that affects other users or system processes. A convenient way to do this is with memory cgroups.

Using memory cgroups to restrict jobs to their allocated memory requires setting kernel parameters. On Ubuntu systems, these are configured in the file /etc/default/grub:
GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"
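
Changes to /etc/default/grub take effect only after the GRUB configuration is regenerated (update-grub on Ubuntu) and the system is rebooted. A quick way to confirm that the running kernel picked up the parameters is to look at its command line, which Linux exposes in /proc/cmdline. The following sketch checks a command-line string for the two parameters above; the function name and the sample strings are hypothetical, and on a real system you would pass in the contents of /proc/cmdline.

```python
REQUIRED_PARAMS = ("cgroup_enable=memory", "swapaccount=1")


def missing_cgroup_params(cmdline: str) -> list:
    """Return the required kernel parameters that are absent from `cmdline`."""
    present = set(cmdline.split())
    return [param for param in REQUIRED_PARAMS if param not in present]


# Sample strings for illustration; on a real system, read /proc/cmdline instead.
booted = "BOOT_IMAGE=/boot/vmlinuz root=/dev/sda1 ro cgroup_enable=memory swapaccount=1"
print(missing_cgroup_params(booted))
print(missing_cgroup_params("BOOT_IMAGE=/boot/vmlinuz root=/dev/sda1 ro"))
```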