Abstract

This DGX Best Practices Guide provides recommendations to help administrators and users manage and operate DGX products, such as the DGX-2, DGX-1, and DGX Station.

1. Overview

NVIDIA has created the DGX family as appliances to make administration and operation as simple as possible. However, like any computational resource, they still require administration. This section discusses some of the best practices for configuring and administering a single DGX or several DGX appliances.

There is also some discussion about how to plan for external storage, networking, and other configuration aspects for the DGX, focusing on DGX-2, DGX-1, and DGX Station (the DGX "family").

For detailed information about implementation, see:

2. Storage

For deep learning to be effective and to take full advantage of the DGX family, the various aspects of the system have to be balanced. This includes storage and I/O. This is particularly important for feeding data to the GPUs to keep them busy and to dramatically reduce run times for models. This section presents some best practices for storage within and outside of the DGX-2, DGX-1, or DGX Station. It also discusses storage considerations as the number of DGX units is scaled out, primarily for the DGX-1 and DGX-2.

2.1. Internal Storage (NFS Cache)

The first storage consideration is storage within the DGX itself. The focus of the internal storage, outside of the OS drive, is performance. The DGX appliances have some SSDs that are in a RAID-0 group for the best possible performance. The RAID-0 group is used as an NFS read cache using the Linux FS-Cache facility (cachefilesd).

For the DGX Station, there are three 1.92 TB SSDs in a RAID-0 group. The Linux software RAID tool, mdadm, is used to manage and monitor the RAID-0 group. For the DGX-1, there are four 1.92TB SSD drives, but they are connected to an LSI RAID card that creates and manages a RAID-0 group. The Storage Command Line Tool (StorCLI) is used to manage and monitor the RAID group on the LSI card. As with all DGX appliances, the RAID-0 group is used as a read cache for NFS mounts.

The DGX-2 is very similar to the DGX-1, except it has 8 or 16 3.84 TB NVMe drives that are managed by the OS using mdadm (software RAID). For more information, see DGX-2 Service Manual.

Deep Learning I/O patterns typically consist of multiple iterations of reading the training data. The first epoch of training reads the data that is used to start training the model. Subsequent passes through the data can avoid rereading the data from NFS if adequate local caching is provided on the node. If you can estimate the maximum size of your data, you can architect your system to provide enough cache so that the data only needs to be read once during any training job. A set of very fast SSD disks can provide an inexpensive and scalable way of providing adequate caching for your applications. The DGX family NFS read cache was created for precisely this purpose.

For training the best possible model, the input data is randomized. This adds some additional statistical noise to the training and also keeps the model from being “overfit” on the training data (in other words, trained very well on the training data but doesn’t do well on the validation data). Randomizing the order of the data for training puts pressure on the data access. The I/O pattern becomes random oriented rather than streaming oriented. The DGX family NFS cache is SSD based with a very high level of random IOPs performance.

The benefit of adequate caching is that your external filesystem does not have to provide maximum performance during a cold start (the first epoch), since this first pass through the data is only a small part of the overall training time. For example, typical training sessions can iterate over the data 100 times. If we assume a 5x slower read access time during the first cold start iteration vs the remaining iterations with cached access, then the total run time of training increases by the following amount.
  • 5x slower shared storage for the 1st iteration + 99 locally cached storage iterations
    • (5 + 99) / 100 = 1.04, approximately a 4% increase in runtime over 100 iterations
Even if your external file system cannot sustain peak training IO performance, it has only a small impact on overall training time. This should be considered when creating your storage system to allow you to develop the most cost-effective storage systems for your workloads.

For either the DGX Station or the DGX-1, you cannot put additional drives into the system without voiding your warranty. For the DGX-2, you can add 8 additional U.2 NVMe drives to the 8 already in the system.

2.2. RAID-0

The internal SSD drives are put into a RAID-0 group, formatted with ext4, and mounted as a file system. This is then used as an NFS read cache to cache data reads. Recall that its number-one focus is performance.

RAID-0 refers to the RAID “level” of the drives that are grouped together. It stripes (splits the data evenly) across the collective disk space but doesn’t do any mirroring or parity checks. This reduces the availability of the RAID group but it also improves its performance and capacity.

It stripes the contents of each file across all disks in the RAID group. The capacity of a RAID-0 group is the sum of the capacities of the drives in the set, where the capacity used on each drive is limited by the size of the smallest drive. For example, if you have two 2TB drives and one 3TB drive, then the RAID-0 group will provide 6TB total (3 x 2TB).

The performance of a RAID-0 group, which shows up as improved throughput of read and write operations to any file, is roughly the performance of a single drive multiplied by the number of drives. For example, if the drives are capable of a sequential read throughput of 550 MB/s and you have three drives in the RAID group, then the theoretical sequential throughput is 3 x 550 MB/s = 1650 MB/s.

Remember, you can create a RAID-0 group from various types of drives, but the capacity contribution of each is limited by the size of the smallest drive. The performance of the RAID group is the sum of the performance of each drive.

2.3. DGX-1 Internal Storage

In the DGX-1, there are a total of five 1.92TB SSDs. These are plugged into the LSI controller. There are two arrays: a 1-drive RAID-0 for the OS, and a 4-drive RAID-0 for /raid. The 4-drive RAID-0 group is used as the read cache for NFS file systems.

2.4. DGX Station Internal Storage

The RAID-0 group is created and managed by Linux software, mdadm. Mdadm is also referred to as “software RAID” because all of the common RAID functions are carried out by the host CPUs and the host OS instead of a dedicated RAID controller processor. Linux software RAID configurations can include anything presented to the Linux kernel as a block device. Examples include whole hard drives (for example, /dev/sda), and their partitions (for example, /dev/sda1).

Of particular importance is that since version 3.7 of the Linux kernel mainline, mdadm supports TRIM operations for the underlying solid-state drives (SSDs), for linear, RAID 0, RAID 1, RAID 5 and RAID 10 layouts. TRIM is very important because it helps with garbage collection on SSDs. This reduces write amplification and reduces the wear on the drive.
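As a quick check, you can verify that the RAID device reports TRIM (discard) support and, if desired, run a manual TRIM pass. This is a minimal sketch; /dev/md0 and the /raid mount point match the defaults shown later in this section, but confirm them on your own system.

# Check that the RAID device reports TRIM (discard) support
$ lsblk --discard /dev/md0
# Optionally run a manual TRIM pass on the mounted cache filesystem
$ sudo fstrim -v /raid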

There are some very simple mdadm commands that you can use for monitoring the status of the RAID array. The first thing you should do is find the mount point for the RAID group. You can do this by simply running the mount command and looking through the output for an md device (for example, /dev/md0).
# mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
…
/dev/md0 on /raid type ext4 (rw,relatime,discard,stripe=384,data=ordered)
/dev/sdb1 on /boot/efi type vfat (rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)
...

This is the RAID-0 group (/dev/md0, mounted at /raid) that acts as a read cache for NFS file systems.

One of the first commands that can be run is to check the status of the RAID group. The command is simple, cat /proc/mdstat.
# cat /proc/mdstat
Personalities : [raid0] [linear] [multipath] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid0 sde[2] sdd[1] sdc[0]
  	5625729024 blocks super 1.2 512k chunks

unused devices: <none>

This is a command to read the mdstat file in the /proc filesystem. The output looks compact, but there is a great deal of information in it. The first line of output (Personalities) lists the RAID levels supported by the md driver in this version of Linux.

The next lines of output present some details on each md device. In this case, the DGX Station only has one RAID group, /dev/md0. The output for /dev/md0 shows that it is an active RAID-0 group. It has three devices:
  • sdc
  • sdd
  • sde

It also lists the number of blocks in the device, the version of the superblock (1.2), and the chunk size (512k). The chunk size is the amount of data written to each device in turn when mdadm stripes a file. This information would be repeated for each md device (if there is more than one).

Another option you can use with mdadm is to examine/query the individual block devices in the RAID group and examine/query the RAID groups themselves. A simple example from a DGX Station is below. The command queries the RAID group.
# mdadm --query /dev/md0
/dev/md0: 5365.11GiB raid0 3 devices, 0 spares. Use mdadm --detail for more detail.

Notice that there are 3 devices with a total capacity of 5,365.11 GiB (note that GiB is different from GB). It also has no spares (you can use spare drives in RAID groups with mdadm).

Next is an example of querying a block device that is part of a RAID group.
# mdadm --query /dev/sdc
/dev/sdc: is not an md array
/dev/sdc: device 0 in 3 device active raid0 /dev/md0.  Use mdadm --examine for more detail.

The query informs you that the drive is not a RAID group itself but is part of a RAID group (/dev/md0). It also advises you to examine the device using the “examine” (-E) option.

By querying the block devices and the RAID group itself, you can piece together how the block devices make up the RAID group. Also notice that the commands are run by the root user (or someone with root privileges).

To get even more detail about the md RAID group, you can use the -E (--examine) option. It prints the md superblock (if present) from a block device that could be an array component.
# mdadm -E /dev/sdc
/dev/sdc:
           Magic : a92b4efc
         Version : 1.2
     Feature Map : 0x0
      Array UUID : 1feabd66:ec5037af:9a40a569:d7023bc5
            Name : demouser-DGX-Station:0  (local to host demouser-DGX-Station)
   Creation Time : Wed Mar 14 16:01:24 2018
      Raid Level : raid0
    Raid Devices : 3

  Avail Dev Size : 3750486704 (1788.37 GiB 1920.25 GB)
     Data Offset : 262144 sectors
    Super Offset : 8 sectors
    Unused Space : before=262056 sectors, after=0 sectors
           State : clean
     Device UUID : 482e0074:35289a95:7d15e226:fe5cbf30

     Update Time : Wed Mar 14 16:01:24 2018
   Bad Block Log : 512 entries available at offset 72 sectors
        Checksum : ee25db67 - correct
          Events : 0

      Chunk Size : 512K

     Device Role : Active device 0
     Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)
It provides information about the RAID array (group) including things such as:
  • Creation time
  • UUID of the array (RAID group)
  • RAID level (this is RAID-0)
  • Number of RAID devices
  • Size of the device both in GiB and GB (they are different)
  • The state of the device (clean)
  • Number of active devices in RAID array (3)
  • The role of the device (it is device 0 in the RAID array)
  • The checksum and if it is correct
  • The number of events on the array
Another way to get much the same information, with some extra detail, is to use the --detail option with the RAID array, as shown below.
# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Mar 14 16:01:24 2018
     Raid Level : raid0
     Array Size : 5625729024 (5365.11 GiB 5760.75 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Wed Mar 14 16:01:24 2018
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 512K

           Name : demouser-DGX-Station:0  (local to host demouser-DGX-Station)
           UUID : 1feabd66:ec5037af:9a40a569:d7023bc5
         Events : 0

	Number   Major   Minor   RaidDevice State
   	0   	8   	32    	0   active sync   /dev/sdc
   	1   	8   	48    	1   active sync   /dev/sdd
   	2   	8   	64    	2   active sync   /dev/sde
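Because these commands are easy to script, a simple periodic check can alert you if the cache array ever stops being active. This is a minimal sketch; the warning text and how you deliver it (email, monitoring system, and so on) are placeholders for your own tooling.

# Cron-friendly health check: verify the RAID-0 cache array is still active
$ grep -q "^md0 : active raid0" /proc/mdstat || echo "WARNING: /dev/md0 is not active on $(hostname)"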

2.5. External Storage

As an organization scales out their GPU enabled data center, there are many shared storage technologies which pair well with GPU applications. Since the performance of a GPU enabled server is so much greater than a traditional CPU server, special care needs to be taken to ensure the performance of your storage system is not a bottleneck to your workflow.

Different data types require different considerations for efficient access from filesystems. For example:
  • Running parallel HPC applications may require the storage technology to support multiple processes accessing the same files simultaneously.
  • To support accelerated analytics, storage technologies often need to support many threads with quick access to small pieces of data.
  • For vision based deep learning, accessing images or video used in classification, object detection or segmentation may require high streaming bandwidth, fast random access, or fast memory mapped (mmap()) performance.
  • For other deep learning techniques, such as recurrent networks, working with text or speech can require any combination of fast bandwidth with random and small files.

HPC workloads typically drive high simultaneous multi-system write performance and benefit greatly from traditional scalable parallel file system solutions. You can size HPC storage and network performance to meet the increased dense compute needs of GPU servers. It is not uncommon to see per-node performance increases of 10-40x for a 4-GPU system vs a CPU-only system for many HPC applications.

Data Analytics workloads, similar to HPC, drive high simultaneous access, but are more read focused than HPC. Again, it is important to size Data Analytics storage to match the dense compute performance of GPU servers. As you adopt accelerated analytics technologies such as GPU-enabled in-memory databases, make sure that you can populate the database from your data warehousing solution quickly to minimize startup time when you change database schemas. This may require a network of 10 GbE or greater performance. To support clients at this rate, you may have to revisit your data warehouse architecture to identify and eliminate bottlenecks.

Deep learning is a fast evolving computational paradigm, and it is important to know your requirements in the near and long term to properly architect a storage system. The ImageNet database is often used as a reference when benchmarking deep learning frameworks and networks. The resolution of the images in ImageNet is 256x256. However, it is more common to find images at 1080p or 4k. Images in 1080p resolution are 30 times larger than those in ImageNet. Images in 4k resolution are 4 times larger than that (120x the size of ImageNet images). Uncompressed images are 5-10 times larger than compressed images. If your data cannot be compressed for some reason, for example if you are using a custom image format, the bandwidth requirements increase dramatically.

For AI-driven storage, it is suggested that you make use of deep learning framework features that build databases and archives rather than accessing small files directly; reading and writing many small files reduces performance on the network and local file systems. Storing files in formats such as HDF5, LMDB, or LevelDB can reduce metadata accesses to the filesystem, which helps performance. However, these formats can introduce their own challenges, such as additional memory overhead or the need for fast mmap() performance. All this means that you should plan to be able to read data at 150-200 MB/s per GPU for files at 1080p resolution. Consider more if you are working with 4k or uncompressed files.

2.5.1. NFS Storage

NFS can provide a good starting point for AI workloads on small GPU server configurations with properly sized storage and network bandwidth. NFS-based solutions can scale well for larger deployments, but be aware of your single-node and aggregate bandwidth requirements and make sure that your vendor of choice can meet them. As you scale your data center to need more than 10 GB/s, or as it grows to hundreds or thousands of nodes, other technologies may be more efficient and scale better.

Generally, it is a good idea to start with NFS using one or more of the 10 Gb/s Ethernet connections on the DGX family. After this is configured, it is recommended that you run your applications and check if IO performance is a bottleneck. Typically, NFS over 10Gb/s Ethernet provides up to 1.25 GB/s of IO throughput for large block sizes. If, in your testing, you see NFS performance that is significantly lower than this, check the network between the NFS server and a DGX server to make sure there are no bottlenecks (for example, a 1 GigE network connection somewhere, a misconfigured NFS server, or a smaller MTU somewhere in the network).
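One simple way to sanity-check NFS throughput from a DGX client is a large sequential write and read with direct I/O, so that the client page cache is bypassed. This is a minimal sketch; the mount point /mnt/nfs is a placeholder, and the test writes a 16 GB file, so make sure the export has space for it.

# Write a large test file to the NFS mount, then read it back, bypassing the page cache
$ dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=16384 oflag=direct
$ dd if=/mnt/nfs/testfile of=/dev/null bs=1M iflag=direct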

There are a number of online articles that list suggestions for tuning NFS performance on both the client and the server. For example:
  • Increasing Read, Write buffer sizes
  • TCP optimizations including larger buffer sizes
  • Increasing the MTU size to 9000
  • Sync vs. Async
  • NFS Server options
  • Increasing the number of NFS server daemons
  • Increasing the amount of NFS server memory
Linux is very flexible and by default most distributions are conservative about their choice of IO buffer sizes since the amount of memory on the client system is unknown. A quick example is increasing the size of the read buffers on the DGX (the NFS client). This can be achieved with the following system parameters:
  • net.core.rmem_max=67108864
  • net.core.rmem_default=67108864
  • net.core.optmem_max=67108864

The values after the variable are example values (they are in bytes). You can change these values on the NFS client and the NFS server, and then run experiments to determine if the IO performance improves.

The previous examples are for the kernel read buffer values. You can also do the same thing for the write buffers, where you use wmem instead of rmem.

You can also tune the TCP parameters in the NFS client to make them larger. For example, you could change the net.ipv4.tcp_rmem=”4096 87380 33554432” system parameter.

This changes the TCP buffer size, for iPv4, to 4,096 bytes as a minimum, 87,380 bytes as the default, and 33,554,432 bytes as the maximum.
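As a sketch, the settings above can be applied at runtime with sysctl and then made persistent; the values are the example values from this section, not tuned recommendations for your environment.

# Apply the example buffer sizes at runtime on the NFS client
$ sudo sysctl -w net.core.rmem_max=67108864
$ sudo sysctl -w net.core.rmem_default=67108864
$ sudo sysctl -w net.core.wmem_max=67108864
$ sudo sysctl -w net.core.wmem_default=67108864
$ sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 33554432"
# To persist across reboots, add the same settings to /etc/sysctl.conf and reload
$ sudo sysctl -p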

If you can control the NFS server, one suggestion is to increase the number of NFS daemons on the server. By default, NFS only starts with eight nfsd processes (eight threads), which, given that CPUs today have very large core counts, is not really enough.

You can find the number of NFS daemons in two ways. The first is to look at the process table and count the number of nfsd processes via the ps aux | grep nfsd command.

The second way is to look at the NFS config file (for example, /etc/sysconfig/nfs) for an entry that says RPCNFSDCOUNT. This tells you the number of NFS daemons for the server.

If the NFS server has a large number of cores and a fair amount of memory, you can increase RPCNFSDCOUNT. There are cases where good performance has been achieved using 256 on an NFS server with 16 cores and 128GB of memory.

You should also increase RPCNFSDCOUNT when you have a large number of NFS clients performing I/O at the same time. For this situation, it is recommended that you also increase the amount of memory on the NFS server to a larger number, such as 128 or 256GB. Don't forget that if you change the value of RPCNFSDCOUNT, you will have to restart NFS for the change to take effect.
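On an Ubuntu-based NFS server, the thread count is typically set in /etc/default/nfs-kernel-server. The following is a sketch only; the value of 64 is an example and should be sized to your server.

# On the NFS server: raise the number of nfsd threads and restart the service
$ sudo sed -i 's/^RPCNFSDCOUNT=.*/RPCNFSDCOUNT=64/' /etc/default/nfs-kernel-server
$ sudo systemctl restart nfs-kernel-server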

One way to determine whether more NFS threads would help performance is to check the data in the /proc/net/rpc/nfsd file on the NFS server for the load on the NFS daemons. The output line that starts with th lists the number of threads, and the last 10 numbers are a histogram of the number of seconds the first 10% of threads were busy, the second 10%, and so on.

Ideally, you want the last two numbers to be zero or close to zero, indicating that the threads are busy and you are not "wasting" any threads. If the last two numbers are fairly high, you should add NFS daemons, because the NFS server has become the bottleneck. If the last two, three, or four numbers are zero, then some threads are probably not being used.
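A minimal way to inspect this data on the NFS server is to print the th line directly; this assumes you are logged into the server itself.

# On the NFS server, display the daemon thread statistics line
$ grep ^th /proc/net/rpc/nfsd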

One other option, while a little more complex, can prove to be useful if the IO pattern becomes more write intensive. If you are not getting the IO performance you need, change the mount behavior on the NFS clients from “sync” to “async”.
CAUTION:
By default, NFS file systems are mounted “sync”, which means the NFS client is told the data is on the NFS server only after it has actually been written to the storage, indicating the data is safe. Some systems will respond that the data is safe once it has made it to the write buffer on the NFS server rather than the actual storage.

Switching from “sync” to “async” means that the NFS server responds to the NFS client that the data has been received when the data is in the NFS buffers on the server (in other words, in memory). The data hasn’t actually been written to the storage yet, it’s still in memory. Typically, writing to the storage is much slower than writing to memory, so write performance with “async” is much faster than with “sync”. However, if, for some reason, the NFS server goes down before the data in memory is written to the storage, then the data is lost.

If you try using “async” on the NFS client (in other words, the DGX system), ensure that the data on the NFS server is replicated somewhere else so that if the server goes down, there is always a copy of the original data. The reason is if the NFS clients are using “async” and the NFS server goes down, data that is in memory on the NFS server will be lost and cannot be recovered.

NFS “async” mode is very useful for write IO, both streaming (sequential) and random IO. It is also very useful for “scratch” file systems where data is stored temporarily (in other words, not permanent storage or storage that is not replicated or backed up).
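A sketch of what an “async” client mount might look like is below; the server name, export path, mount point, and NFS version are placeholders to adapt to your environment.

# Example /etc/fstab entry for an NFS scratch area mounted with async writes
nfs-server:/export/scratch  /mnt/scratch  nfs  rw,async,vers=3  0  0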

If you find that the IO performance is not what you expected and your applications are spending a great deal of time waiting for data, then you can also connect NFS to the DGX system over InfiniBand using IPoIB (IP over IB). This is part of the DGX family software stack and can be easily configured. The main point is that the NFS server should be InfiniBand attached as well as the NFS clients. This can greatly improve IO performance.

2.5.2. Distributed Filesystems

Distributed filesystems such as EXAScaler, GRIDScaler, Ceph, Lustre, MapR-FS, General Parallel File System (GPFS), WekaIO, and Gluster can provide features like improved aggregate IO performance, scalability, and/or reliability (fault tolerance). These filesystems are supported by their respective providers unless otherwise noted.

2.5.3. Scaling Out Recommendations

Based on the general IO patterns of deep learning frameworks (see External Storage), below are suggestions for storage needs based on the use case. These are suggestions only and are to be viewed as general guidelines.

Table 1. Scaling out suggestions and guidelines
Use Case | Adequate Read Cache? | Network Type | Recommended Network File System Options
Data Analytics | NA | 10 GbE | Object storage, NFS, or other system with good multithreaded read and small-file performance
HPC | NA | 10/40/100 GbE, InfiniBand | NFS or HPC-targeted filesystem with support for large numbers of clients and fast single-node performance
DL, 256x256 images | Yes | 10 GbE | NFS or storage with good small-file support
DL, 1080p images | Yes | 10/40 GbE, InfiniBand | High-end NFS, HPC filesystem, or storage with fast streaming performance
DL, 4k images | Yes | 40 GbE, InfiniBand | HPC filesystem, high-end NFS, or storage with fast streaming performance capable of 3+ GB/s per node
DL, uncompressed images | Yes | InfiniBand, 40/100 GbE | HPC filesystem, high-end NFS, or storage with fast streaming performance capable of 3+ GB/s per node
DL, datasets that are not cached | No | InfiniBand, 10/40/100 GbE | Same as above; aggregate storage performance must scale to meet the needs of all applications running simultaneously

As always, it is best to understand your own applications’ requirements to architect the optimal storage system.

Lastly, this discussion has focused only on performance needs. Reliability, resiliency and manageability are as important as the performance characteristics. When choosing between different solutions that meet your performance needs, make sure that you have considered all aspects of running a storage system and the needs of your organization to select the solution that will provide the maximum overall value.

3. Authenticating Users

To make the DGX useful, users need to be added to the system so that they can be authenticated to use it. Generally, this is referred to as user authentication. There are several ways this can be accomplished; however, each method has its own pros and cons.

3.1. Local

The first way is to create users directly on the DGX system using the useradd command. Let’s assume you want to add a user dgxuser. You would first add the user via the following command.
$ sudo useradd -m -s /bin/bash dgxuser
Where -s refers to the default shell for the user and -m creates the user’s home directory. After creating the user you need to add them to the docker group on the DGX.
$ sudo usermod -aG docker dgxuser

This adds the user dgxuser to the group docker. Any user that runs Docker containers has to be a member of this group.

Using local authentication on the DGX is simple but not without its issues. First, there have been occasions when an OS upgrade on the DGX requires reformatting all the drives in the appliance. If this happens, you must first make sure all user data is copied somewhere off the DGX before the upgrade. Second, you will have to recreate the users, add them to the docker group, and copy their home data back to the DGX. This adds work and time to upgrading the system.
Important: While the 2x 960GB NVMe SSDs on the DGX-2, meant for the OS partition, are in a RAID-1 configuration, there is no RAID-1 on the OS drive for the DGX-1 and DGX Station. Hence, if the OS drive fails on the DGX-1 or the DGX Station, you will lose all the users and everything in the /home directories. Therefore, it is highly recommended that you back up the pertinent files on the DGX system as well as /home for the users.
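A minimal sketch of such a backup is below; the destination host and paths are placeholders, and many sites will prefer their existing backup tooling instead.

# Periodically copy home directories and local account files to a backup host
$ sudo rsync -a /home/ backup-server:/backups/dgx01/home/
$ sudo rsync -a /etc/passwd /etc/group /etc/shadow backup-server:/backups/dgx01/etc/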

3.2. NIS or NIS+

Another authentication option is to use NIS or NIS+. In this case, the DGX would be a client in the NIS/NIS+ configuration. As with using local authentication as previously discussed, there is the possibility that the OS drive in the DGX could be overwritten during an upgrade (not all upgrades reformat the drives, but it’s possible). This means that the administrator may have to reinstall the NIS configuration on the DGX.

Also, remember that the DGX-1 and DGX Station have a single OS drive. If this drive fails, the administrator will have to re-configure the NIS/NIS+ configuration, therefore, backups are encouraged; even for DGX-2 systems, which do have 2x OS drives in a RAID-1 configuration.
Note: In the unlikely event that technical support for the DGX is needed, NVIDIA engineers may require the administrator to disconnect the system from the NIS/NIS+ server.

3.3. LDAP

A third option for authentication is LDAP (Lightweight Directory Access Protocol). It has become very popular in the clustering world, particularly for Linux. You can configure LDAP on the DGX for user information and authentication from an LDAP server. However, as with NIS, there are possible repercussions.
CAUTION:
  • The first is that the OS drive is a single drive on the DGX-1 and DGX Station. If the drive fails, you will have to rebuild the LDAP configuration (backups are highly recommended).
  • The second is that, as previously mentioned, in the unlikely event of needing tech support, you may be asked to disconnect the DGX system from the LDAP server so that the system can be triaged.

3.4. Active Directory

One other option for user authentication is connecting the DGX system to an Active Directory (AD) server. This may require the system administrator to install some extra tools on the DGX. The two cautions mentioned previously also apply to this approach: the single OS drive may be reformatted during an upgrade, or it may fail (again, backups are highly recommended). It also means that in the unlikely case of needing to involve NVIDIA technical support, you may be asked to take the system off the AD network and remove any added software (this is unlikely but possible).

4. Time Synchronization

Time synchronization is very important for clusters of systems including storage. It is especially true for MPI (Message Passing Interface) applications such as those in the HPC world. Without time synchronization, you can get wrong answers or your application can fail. Therefore, it is a good idea to sync the DGX-2, DGX-1, or DGX Station time.

4.1. Ubuntu 16.04

If you are using Ubuntu 16.04 as the base for your DGX OS image, realize that it uses systemd instead of init, so the process of configuring NTP (Network Time Protocol) is a little different than with Ubuntu 14.04. If you are unsure how to accomplish this, basic instructions are below.
For more information, you can:
  • Run the following commands:
    $ man timedatectl
    $ man systemd-timesyncd.service
    
  • Read the timesyncd.conf article.

Here is an outline of the steps you should follow:

  1. Edit the /etc/systemd/timesyncd.conf file and set NTP and other options (an example configuration is shown after these steps). For more information, see the timesyncd.conf article.
  2. Run the following command as root:
    $ timedatectl set-ntp true
  3. Check the status of timesyncd by running the following command:
    $ systemctl status systemd-timesyncd.service
    
  4. If timesyncd is not enabled, enable and start it by running the following command.
    $ systemctl enable systemd-timesyncd.service &&  systemctl start systemd-timesyncd.service
    

    You can also check via timedatectl that you configured the correct timezone and other basic options.
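As an example of step 1, a minimal /etc/systemd/timesyncd.conf might look like the following; the NTP server names are placeholders for your site's time servers.

[Time]
NTP=ntp1.example.com ntp2.example.com
FallbackNTP=ntp.ubuntu.com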

5. Monitoring

Being able to monitor your systems is the first step in being able to manage them. NVIDIA provides some very useful command line tools that can be used specifically for monitoring the GPUs.

5.1. DCGM

The NVIDIA Data Center GPU Manager (DCGM) simplifies GPU administration in the data center. It improves resource reliability and uptime, automates administrative tasks, and helps drive overall infrastructure efficiency. It can perform the following tasks with very low overhead on the appliance.
  • Active health monitoring
  • Diagnostics
  • System validation
  • Policies
  • Power and clock management
  • Group configuration and accounting

DCGM comes with a User Guide that explains how to use the command-line tool called dcgmi, as well as an API Guide (there is no GUI with DCGM). In addition to the command-line tool, DCGM also comes with headers and libraries for writing your own tools in Python or C.

Rather than treat each GPU as a separate resource, DCGM allows you to group them and then apply policies or tuning options to the group. This also includes being able to run diagnostics on the group.

There are several best practices for using DCGM with the DGX appliances. The first is that the dcgmi command-line tool can run diagnostics on the GPUs. You could create a simple cron job on the DGX to check the GPUs and store the results either into a simple flat file or into a simple database.

There are three levels of diagnostics that can be run starting with level 1.
  • Level 1 runs in just a few seconds.
  • Level 3 takes about 4 minutes to run. An example of the output from running a level 3 diagnostic is below.
    Figure 1. Levels of diagnostics

It is fairly easy to parse this output looking for Error. You can easily send an email or raise some other alert if an Error is discovered.
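A minimal sketch of such a cron job is below; the diagnostic level, log path, alert address, and use of the mail command are placeholders for your own alerting setup.

#!/bin/bash
# Cron job sketch: run a short DCGM diagnostic, keep a history, and alert on any reported error
OUT=$(dcgmi diag -r 1)
echo "$OUT" >> /var/log/dcgm-diag.log
if echo "$OUT" | grep -qi "error"; then
    echo "DCGM diagnostic reported an error on $(hostname)" | mail -s "DGX GPU alert" admin@example.com
fi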

A second best practice for utilizing DCGM applies if you have a resource manager (in other words, a job scheduler) installed. Before the user’s job is run, the resource manager can usually perform what is termed a prologue. That is, any system calls before the user’s job is executed. This is a good place to run a quick diagnostic and also to use DCGM to start gathering statistics on the job. Below is an example of statistics gathering for a particular job:
Figure 2. Statistics gathering

When the user’s job is complete, the resource manager can run something called an epilogue. This is a place where the system can run some system calls for doing such things as cleaning up the environment or summarizing the results of the run, including the GPU stats from the above command. Consult the DCGM User Guide to learn more about job statistics.
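A minimal sketch of prologue and epilogue hooks using dcgmi job statistics is below; the GPU group ID (0), the log path, and the JOB_ID variable (for example, SLURM_JOB_ID under SLURM) are assumptions to adapt to your resource manager.

# Prologue: enable GPU stats watches on group 0 and start recording for this job
dcgmi stats -g 0 -e
dcgmi stats -g 0 -s "$JOB_ID"

# Epilogue: stop recording and append the job's GPU statistics to a log
dcgmi stats -x "$JOB_ID"
dcgmi stats -j "$JOB_ID" >> /var/log/dcgm-jobs.log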

If you create a set of prologue and epilogue scripts that run diagnostics you might want to consider storing the results in a flat file or a simple database. This allows you to keep a history of the diagnostics of the GPUs so you can pinpoint any issues (if there are any).

A third way to effectively use DCGM is to combine it with a parallel shell tool such as pdsh. With a parallel shell, you can run the same command across all of the nodes in a cluster or a specific subset of nodes. You can use it to run dcgmi diagnostics across several DGX appliances or a combination of DGX appliances and non-GPU systems, as shown in the sketch below. You can easily capture this output and store it in a flat file or a database, then parse the output and create warnings or emails based on it.
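As a sketch, assuming pdsh is installed and dgx01 through dgx04 are the hostnames of your DGX systems, a fleet-wide diagnostic could look like the following.

# Run a quick DCGM diagnostic across several DGX systems and keep a dated log
$ pdsh -w dgx[01-04] "dcgmi diag -r 1" | tee dcgm-fleet-$(date +%F).log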

Having all of this diagnostic output is also an excellent source of information for creating reports regarding topics such as utilization.

For more information about DCGM, see NVIDIA Data Center GPU Manager Simplifies Cluster Administration.

5.2. Using ctop For Monitoring

Containers can make monitoring a little more challenging than the classic system monitoring. One of the classic tools used by system administrators is top. By default, top displays the load on the system as well as the ordered list of processes on the system.

There is a top-like tool for Docker containers and runC, named ctop. It is easy to install, lists real-time metrics for multiple containers, and continuously updates the resource usage of the running containers.
Attention: ctop runs on a single DGX system at a time. Most likely you will have to log into the specific node and run ctop. A best practice is to use tmux and create a pane for ctop for each DGX appliance if the number of systems is fairly small (fewer than approximately 10).

5.3. Monitoring A Specific DGX Using nvidia-smi

As previously discussed, DCGM is a great tool for monitoring GPUs across multiple nodes. Sometimes, a system administrator may want to monitor a specific DGX system in real time. An easy way to do this is to log into the DGX and run nvidia-smi in conjunction with the watch command.

For example, you could run the command watch -n 1 nvidia-smi that runs the nvidia-smi command every second (-n 1 means to run the command with 1 second intervals). You could also add the -d option to watch so that it highlights changes or differences since the last time it was run. This allows you to easily see what has changed.

Just like ctop, you can use nvidia-smi and watch in a pane in a tmux terminal to keep an eye on a relatively small number of DGX servers.

6. Managing Resources

One of the common questions from DGX customers is how they can effectively share the DGX system between users without any inadvertent problems or data exchange. The generic phrase for this is resource management, and the tools are called resource managers. They can also be called schedulers or job schedulers. These terms are often used interchangeably.

You can view everything on the DGX system as a resource. This includes memory, CPUs, GPUs, and even storage. Users submit a request to the resource manager with their requirements and the resource manager assigns the resources to the user if they are available and not being used. Otherwise, the resource manager puts the request in a queue to wait for the resources to become available. When the resources are available, the resource manager assigns the resources to the user request.

Resource management, which lets users effectively share a centralized resource (in this case, the DGX appliance), has been around for a long time. There are many open-source solutions, mostly from the HPC world, such as PBS Pro, Torque, SLURM, Openlava, SGE, HTCondor, and Mesos. There are also commercial resource management tools such as UGE and IBM Spectrum LSF.

For more information about getting started, see Job scheduler.

If you haven’t used job scheduling before, you should perform some simple experiments first to understand how it works. For example, take a single server and install the resource manager. Then try running some simple jobs using the cores on the server.

The following subsections discuss how one might install and use a job scheduler on a DGX system. The example uses SLURM, but the process generally applies to any job scheduler that works with GPUs.

6.1. Example: Using SLURM

Attention: DGX systems do not come prepackaged or pre-installed with job schedulers, although NPN (NVIDIA Partner Network) partners may package specific job schedulers. NVIDIA Support may request that you disable or remove the resource manager for debugging purposes. They may also ask for a factory image to be installed. Without these changes, NVIDIA Support will not be able to continue with the debugging process.

As an example, let's say SLURM is installed and configured on a DGX-2, DGX-1, or DGX Station. The first step is to plan how you want to use the DGX system. The first, and by far the easiest, configuration is to assume that a user gets exclusive access to the entire node. In this case, the user gets the entire DGX system, that is, access to all GPUs and CPU cores. No other users can use those resources while the first user is using them.

The second way is to make the GPUs a consumable resource. The user then asks for the number of GPUs they need, ranging from 1 to 8 for the DGX-1 and 1 to 16 for the DGX-2.

There are two public git repositories containing information on SLURM and GPUs that can help you get started with scheduling jobs.
Note: You may have to configure SLURM to match your specifications.

At a high level, there are two basic options for configuring SLURM with GPUs and DGX systems. The first is to use what is called exclusive mode access and the second allows each GPU to be scheduled independently of the others.

6.1.1. Simple GPU Scheduling With Exclusive Node Access

If you're not interested in allowing multiple simultaneous jobs per compute node, you may not need to make SLURM aware of the GPUs in the system, and the configuration can be greatly simplified.

One way of scheduling GPUs without making use of GRES (Generic RESource) scheduling is to create partitions or queues for logical groups of GPUs. For example, grouping nodes with P100 GPUs into a P100 partition would result in something like the following:
$ sinfo -s
PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
p100     up   infinite         4/9/3/16  node[212-213,215-218,220-229]
The corresponding partition configuration via the SLURM configuration file, slurm.conf, would be something like the following:
NodeName=node[212-213,215-218,220-229]
PartitionName=p100 Default=NO DefaultTime=01:00:00 State=UP Nodes=node[212-213,215-218,220-229]

If a user requests a node from the p100 partition, then they would have access to all of the resources in that node, and other users would not. This is what is called exclusive access.

This approach can be advantageous if you are concerned that sharing resources might result in performance issues on the node or if you are concerned about overloading the node resources. For example, in the case of a DGX-1, if you think multiple users might overwhelm the 8TB NFS read cache, then you might want to consider using exclusive mode. Or if you are concerned that the users may use all of the physical memory, causing page swapping with a corresponding reduction in performance, then exclusive mode might be useful.

6.1.2. Scheduling Resources At The Per GPU Level

A second option for using SLURM, is to treat the GPUs like a consumable resource and allow users to request them in integer units (i.e. 1, 2, 3, etc.). SLURM can be made aware of GPUs as a consumable resource to allow jobs to request any number of GPUs. This feature requires job accounting to be enabled first; for more info, see Accounting and Resource Limits. A very quick overview is below.

The SLURM configuration file, slurm.conf, needs parameters set to enable cgroups for resource management and GPU resource scheduling. An example is the following:
# General
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# Scheduling
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# Logging and Accounting
AccountingStorageTRES=gres/gpu
DebugFlags=CPU_Bind,gres                # show detailed information in Slurm logs about GPU binding and affinity
JobAcctGatherType=jobacct_gather/cgroup
The partition information in slurm.conf defines the available GPUs for each resource. Here is an example:
# Partitions
GresTypes=gpu
NodeName=slurm-node-0[0-1] Gres=gpu:2 CPUs=10 Sockets=1 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=30000 State=UNKNOWN
PartitionName=compute Nodes=ALL Default=YES MaxTime=48:00:00 DefaultTime=04:00:00 MaxNodes=2 State=UP DefMemPerCPU=3000
The way that resource management is enforced is through cgroups. The cgroups configuration requires a separate configuration file, cgroup.conf, such as the following:
CgroupAutomount=yes 
CgroupReleaseAgentDir="/etc/slurm/cgroup" 

ConstrainCores=yes 
ConstrainDevices=yes
ConstrainRAMSpace=yes
#TaskAffinity=yes
Scheduling GPU resources requires a configuration file that defines the available GPUs and their CPU affinity. An example configuration file, gres.conf, is below:
Name=gpu File=/dev/nvidia0 CPUs=0-4
Name=gpu File=/dev/nvidia1 CPUs=5-9
Running a job that utilizes GPU resources requires using the --gres flag with the srun command. For example, to run a job requiring a single GPU, the following srun command can be used.
$ srun --gres=gpu:1 nvidia-smi
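As a further sketch, a batch script that requests GPUs as a consumable resource might look like the following; the partition name, GPU and CPU counts, and the training command are placeholders chosen to match the example configuration above.

#!/bin/bash
#SBATCH --partition=compute
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=10
#SBATCH --time=04:00:00
# Placeholder training command; with ConstrainDevices=yes the job only sees its allocated GPUs
srun python train.py

You would submit this script with sbatch, and SLURM confines the job to the requested GPUs through the cgroup settings shown earlier.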

You also may want to restrict memory usage on shared nodes so that a user doesn’t cause swapping with other users' or system processes. A convenient way to do this is with memory cgroups.

Using memory cgroups to restrict jobs to their allocated memory resources requires setting kernel parameters. On Ubuntu systems, this is configurable via the file /etc/default/grub.
GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"
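After editing /etc/default/grub, the change takes effect only after the GRUB configuration is regenerated and the system is rebooted.

$ sudo update-grub
$ sudo reboot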

7. Networking

Networking DGX appliances is an important topic because of the need to provide data to the GPUs for processing. GPUs are remarkably faster than CPUs for many tasks, particularly deep learning. Therefore, the network principles used for connecting CPU servers may not be sufficient for DGX appliances. This is particularly important as the number of DGX appliances grows over time.

7.1. DGX-1 Networking

To understand best practices for networking the DGX-1 and for planning for future growth, it is best to start with a brief review of the DGX-1 appliance itself. Recall that the DGX-1 comes with four EDR InfiniBand cards (100 Gb/s each) and two 10Gb/s Ethernet cards (copper). These networking interfaces can be used for connecting the DGX-1 to the network for both communications and storage.
Figure 3. Networking interfaces Networking interfaces

Notice that every two GPUs are connected to a single PCIe switch that is on the system board. The switch also connects to an InfiniBand (IB) network card. To reduce latency and improve throughput, network traffic from these two GPUs should go to the associated IB card. This is why there are four IB cards in the DGX-1 appliance.

7.1.1. DGX-1 InfiniBand Networking

If you want to use the InfiniBand (IB) network to connect DGX appliances, theoretically, you only have to use one of the IB cards. However, this will push data traffic over the QPI link between the CPUs, which is a very slow link for GPU traffic (i.e. it becomes a bottleneck). A better solution would be to use two IB cards, one connected to each CPU. This could be IB0 and IB2, or IB1 and IB3, or IB0 and IB3, or IB1 and IB2. This would greatly reduce the traffic that has to traverse the QPI link. The best performance is always going to be using all four of the IB links to an IB switch.

The best approach is to use IB links to connect all four IB cards to an IB fabric. This will result in the best performance (full bisectional bandwidth and lowest latency) if you are using multiple DGX appliances for training.

Typically, the smallest IB switch comes with 36-ports. This means a single IB switch could accommodate nine (9) DGX-1 appliances using all four IB cards. This allows 400 Gb/s of bandwidth from the DGX-1 to the switch.

If your applications do not need the bandwidth between DGX-1 appliances, you can use two IB connections per DGX-1 as mentioned previously. This allows you to connect up to 18 DGX-1 appliances to a single 36-port IB switch.
Note: It is not recommended to use only a single IB card, but if for some reason that is the configuration, then you can connect up to 36 DGX-1 appliances to a single switch.

For larger numbers of DGX-1 appliances, you will likely have to use two levels of switching. The classic HPC configuration is to use 36-port IB switches for the first level (sometimes called leaf switches) and connect them to a single large core switch, which is sometimes called a director class switch. The largest director class InfiniBand switch has 648 ports. You can use more than one core switch but the configuration will get rather complex. If this is something you are considering, please contact your NVIDIA sales team for a discussion.

For two tiers of switching, if all four IB cards per DGX-1 appliance are used to connect to a 36-port switch, and there is no over-subscription, the largest number of DGX-1 appliances per leaf switch is 4. This is 4 ports from each DGX-1 into the switch for a total of 16, with 16 uplinks from the leaf switch to the core switch (the director class switch). A total of 40 36-port leaf switches can be connected to the 648-port core switch (648/16, rounded down). This results in 160 DGX-1 appliances being connected with full bi-sectional bandwidth.

You can also use what is termed over-subscription in designing the IB network. Over-subscription means that the bandwidth from an uplink is less than the bandwidth coming into the unit (in other words, poorer bandwidth performance). If we use 2:1 over-subscription from the DGX-1 appliances to the first level of switches (36-port leaf switches), then each DGX-1 appliance is only using two IB cards to connect to the switches. This results in less bandwidth than if we used all four cards and also higher latency.

If we keep the network bandwidth from the leaf switches to the core director switch at 1:1 (in other words, no over-subscription, full bi-sectional bandwidth), then we can put nine (9) DGX-1 appliances into a single leaf switch (a total of 18 ports into the leaf switch from the DGX appliances and 18 uplink ports to the core switch). The result is that a total of 36 leaf switches can be connected to the core switch. This allows a grand total of 324 DGX-1 appliances to be connected together.

You can tailor the IB network even further by using over-subscription from the leaf switches to the core switch. This can be done using four IB connections to a leaf switch from each DGX appliance and then doing 2:1 over-subscription to the core switch or even using two IB connections to the leaf switches and then 2:1 over-subscription to the core switch. These designs are left up to the user to determine but if this is something you want to consider, please contact your NVIDIA sales team for a discussion.

Another important aspect of InfiniBand networking is the Subnet Manager (SM). The SM simply manages the IB network. There is one SM that manages the IB fabric at any one time, but you can have other SMs running and ready to take over if the first SM crashes. Choosing how many SMs to run and where to run them can have a major impact on the design of the cluster.

The first decision to make is where you want to run the SMs. They can be run on the IB switches if you desire. This is called a hardware SM since it runs on the switch hardware. The advantage of this is that you do not need any additional servers to run the SM. Running the SM on a server is called a software SM. A disadvantage of a hardware SM is that if the IB traffic is heavy, the SM could have a difficult time keeping up. For lots of IB traffic and for larger networks, it is a best practice to use a software SM on a dedicated server.

The second decision to make is how many SMs you want to run. At a minimum, you will have to run one SM. The least expensive solution is to run a single hardware SM. This will work fine for small clusters of DGX-1 appliances (perhaps 2-4). As the number of units grows, you will want to consider running two SMs at the same time to get HA (High Availability) capability. The reason you want HA is that more users are on the cluster, and having it go down has a larger impact than with just a small number of appliances.

As the number of appliances grows, consider running the SMs on dedicated servers (software SM). You will also want to run at least two SMs for the cluster. Ideally, this means two dedicated servers for the SMs, but there may be a better solution that solves some other problems: a master node.

7.1.2. DGX-1 Ethernet Networking

Each DGX-1 system comes with two 10Gb/s NICs. These can be used to connect the systems to the local network for a variety of functions such as logins and storage traffic. As a starting point, it is recommended to push NFS traffic over these NICs to the DGX-1. You should monitor the impact of IO on the performance of your models in this configuration.

If you need to go to more than one level of Ethernet switching to connect all of the DGX-1 units and the storage, be careful of how you configure the network. More than likely, you will have to enable the spanning tree protocol to prevent loops in the network. The spanning tree protocol can impact network performance, therefore, you could see a decrease in application performance.

The InfiniBand NICs that come with the DGX-1 can also be used as Ethernet NICs running TCP. The ports on the cards are QSFP28 so you can plug them into a compatible Ethernet network or a compatible InfiniBand network. You will have to add some software to the appliance and change the networking but you can use the NICs as 100GigE Ethernet cards.

For more information, see Switch InfiniBand and Ethernet in DGX-1.

7.1.3. DGX-1 Bonded NICs

The DGX-1 provides two 10GbE ports. Out of the factory, these two ports are not bonded, but they can be bonded if desired. In particular, you can configure VLAN-tagged, bonded NICs across the two 10GbE ports.

Before bonding the NICs together, ensure you are familiar with the following:
  • Ensure your network team is involved because you will need to choose a bonding mode for the NICs.
  • Ensure you have a working network connection to pull down the VLAN packages. To do so, first setup a basic, single NIC network (no VLAN/bonding) connection and download the appropriate packages. Then, reconfigure the switch for LACP/VLANs.
Tip: Since the networking goes up and down throughout this process, it's easier to work from a remote console.
The process below walks through the steps of an example for bonding the two NICs together.
  1. Edit the /etc/network/interfaces file to setup an interface on a standard network so that we can access required packages.
    auto em1
    	iface em1 inet static
    	   address 10.253.0.50
     	   netmask 255.255.255.0
     	   network 10.253.0.0
     	   gateway 10.253.0.1
     	   dns-nameservers 8.8.8.8
  2. Bring up the updated interface.
    sudo ifdown em1 && sudo ifup em1
  3. Pull down the required bonding and VLAN packages.
    sudo apt-get install vlan
    sudo apt-get install ifenslave
  4. Shut down the networking.
    sudo stop networking
  5. Add the following lines to /etc/modules to load appropriate drivers.
    echo "8021q" | sudo tee -a /etc/modules
    echo "bonding" | sudo tee -a /etc/modules
  6. Load the drivers.
    sudo modprobe 8021q
    sudo modprobe bonding
  7. Reconfigure your /etc/network/interfaces file. There are some configuration parameters that will be customer network dependent and you will want to work with one of your network engineers.
    The following example creates a bonded network over em1/em2 with IP 172.16.1.11 and VLAN ID 430. You specify the VLAN ID in the NIC name (bond0.###). Also notice that this example uses a bond-mode of 4. Which mode you use is up to you and your situation.
    auto lo
    iface lo inet loopback
    
    
    # The following 3 sections create the bond (bond0) and associated network ports (em1, em2)
    auto bond0
    iface bond0 inet manual
    bond-mode 4
    bond-miimon 100
    bond-slaves em1 em2
     
    auto em1
    iface em1 inet manual
    bond-master bond0
    bond-primary em1
     
    auto em2
    iface em2 inet manual
    bond-master bond0
    
    
    # This section creates a VLAN on top of the bond.  The naming format is device.vlan_id
    auto bond0.430
    iface bond0.430 inet static
    address 172.16.1.11
    netmask 255.255.255.0
    gateway 172.16.1.254
    dns-nameservers 172.16.1.254
    dns-search company.net
    vlan-raw-device bond0
  8. Restart the networking.
    sudo start networking
  9. Bring up the bonded interfaces.
    ifup bond0
  10. Engage your network engineers to re-configure LACP and VLANs on switch.
  11. Test the configuration, as shown below.
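A few quick checks, as a sketch, can confirm that the bond and VLAN interface came up as expected; the gateway address below is the one used in the example configuration above.

# Confirm the bond mode, slaves, and link state
$ cat /proc/net/bonding/bond0
# Confirm the VLAN interface has the expected address and can reach the gateway
$ ip addr show bond0.430
$ ping -c 3 172.16.1.254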

7.2. DGX-2 Networking

Because there are more network devices in the DGX-2 relative to the DGX-1 and DGX Station, and they can be used in different ways, to learn more about DGX-2 networking, see the DGX-2 User Guide.

7.3. DGX-2 KVM Networking

Introduction

This chapter describes the standard and most commonly used network configurations for KVM guest GPU VMs running on the NVIDIA® DGX-2™ server.  All the network configurations described in this document are based on Netplan - the preferred network configuration method for Ubuntu 18.04-based systems such as the DGX-2 server.

Network Configuration Options

The two common network configurations are "Virtual Network" and "Shared Physical Device". The former is identical across all Linux distributions and available out-of-the-box. The latter needs distribution-specific manual configuration.

The type of network configuration suitable for any deployment depends on the following factors:

  • Whether the guest VM needs to be accessible by users outside of the DGX-2 KVM host
  • Type of network services hosted by the guest VM
  • Number of available public IPv4 and IPv6 addresses
  • What kind of security is required for the guest VM

The rest of this document describes the following network configurations in detail.

  • Virtual Network
  • Bridged Network
  • SR-IOV

Acronyms

  • KVM - Linux Kernel based Virtual Machine
  • NAT - Network Address Translation
  • DHCP - Dynamic Host Configuration Protocol
  • SR-IOV - Single Root IO Virtualization
  • QOS - Quality of Service
  • MTU - Maximum Transmission Unit

Virtual Networking

Libvirt virtual networking uses the concept of a virtual network switch, also known as Usermode Networking. A virtual network switch is a software construct that operates on a physical server host to which guest VMs connect. By default, it operates in NAT mode. The network traffic for a guest VM is directed through this switch, and consequently all guest VMs will use the Host IP address of the connected Physical NIC interface when communicating with the external world.

Default Configuration

  • The Linux host physical server represents a virtual network switch as a network interface.

    When the libvirt daemon (libvirtd) is first installed and started, the default network interface representing the virtual network switch is virbr0.

  • By default, an instance of dnsmasq server is automatically configured and started by libvirt for each virtual network switch needing it.

    It is responsible for running a DHCP server (to decide which IP address to lease to each VM) and a DNS server (to respond to queries from VMs).

  • In the default virtual network switch configuration, the guest OS will get an IP address in the 192.168.122.0/24 address space and the host OS will be reachable at 192.168.122.1.

    You should be able to SSH into the host OS (at 192.168.122.1) from inside the guest OS and use SCP to copy files back and forth; see the example after this list.

  • In the default configuration, the guest OS will have access to network services, but will not be visible to other machines on the network.

    For example, the guest VM will be able to browse the web, but will not be able to host an accessible web server.

  • You can create additional virtual networks using the steps described later in this section, except that each new network must use a different range of DHCP IP addresses.

    For example, 192.168.123.0/24.
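
For example, from inside a guest VM on the default virtual network, the host can be reached at 192.168.122.1. This is a hedged illustration; the user name and file name are placeholders.

$ ssh <user>@192.168.122.1
$ scp results.tar <user>@192.168.122.1:/tmp/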

                                                                 

The following are limitations of the Virtual Network Configuration when used in NAT mode.

  • Guest VMs are not accessible from an external network.
  • Guest VMs will communicate with an external network using the Host IP address of the connected Physical NIC interface.

When using this configuration, you may encounter restrictions such as connection timeouts due to the number of active connections per host IP address, especially if all guest VMs are communicating with the same server at the same time; this also depends on the features and restrictions enforced on the server side. For example, you might see errors such as the following:

net/http: request canceled while waiting for connection(Client.Timeout exceeded while awaiting headers)

If the default configuration is suitable for your purposes, no other configuration is required.

A couple of advanced virtual network configurations can be used for better network performance. Refer to Improving Network Performance for more details.

Verifying the Host Configuration

Every standard libvirt installation provides NAT-based connectivity to virtual machines out of the box. This is referred to as the 'default virtual network'. Verify that it is available with the virsh net-list --all command.

$ virsh net-list --all
Name                 State      Autostart
-----------------------------------------
default              active     yes

If the default network is missing, the following example XML configuration file can be reloaded and activated.

$ virsh net-dumpxml default
<network>
  <name>default</name>
  <uuid>92d49672-3020-40a1-90f5-73fe07216122</uuid>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='virbr0' stp='on' delay='0'/>
  <mac address='52:54:00:40:cc:23'/>
  <ip address='192.168.122.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.122.2' end='192.168.122.254'/>
    </dhcp>
  </ip>
</network>

In the above XML contents, “default” is the name of the virtual network, and “virbr0” is the name of the virtual network switch.

$ virsh net-define /etc/libvirt/qemu/networks/default.xml
The default network is defined from /etc/libvirt/qemu/networks/default.xml

Mark the default network to automatically start:

$ virsh net-autostart default
Network default marked as autostarted

Start the default network:

$ virsh net-start default
Network default started

Once the libvirt default virtual network is running, you will see a virtual network switch device. This device does not have any physical interfaces added, since it uses NAT and IP forwarding to connect to the outside world. The virtual network switch simply uses whatever physical NIC interface the host is using; do not add new interfaces to it.

$ brctl show
bridge name     bridge id               STP enabled     interfaces
virbr0          8000.000000000000       yes

Once the host configuration is complete, a guest can be connected to the virtual network based on its name or bridge. To connect a guest VM using the virtual bridge name “virbr0”, the following XML can be used in the virsh configuration for the guest VM:

<interface type='bridge'>
  <source bridge='virbr0'/>
  <model type='virtio'/>
</interface>
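
As an alternative to editing the guest XML directly, the same interface can be attached with virsh. This is a hedged sketch; substitute your VM name, and add --live to also attach the interface to a running guest.

$ virsh attach-interface <VM name or ID> --type bridge --source virbr0 --model virtio --config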

Using Static IP

You can reserve and allocate a static IP address for specific guest VMs from the default DHCP range (192.168.122.2 - 192.168.122.254) of the virtual network switch. You should also exclude those reserved/assigned static IP addresses from the DHCP range.

Configurations Made from the Host

To use static IP addressing, check the MAC address of the guest VM.

$ virsh edit 1gpu-vm-1g0
<domain type='kvm' id='3'>
  <name>1gpu-vm-1g0</name>
  <uuid>c40f6b9d-ea15-45b0-ac42-83801eef73d4</uuid>
  ……..
   <interface type='bridge'>
    <mac address='52:54:00:e1:28:3e'/>
    <source bridge='virbr0'/>
    <model type='virtio'/>
    <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  </interface>
  ….
</domain>
$ virsh net-edit default
<network>
  <name>default</name>
  <uuid>92d49672-3020-40a1-90f5-73fe07216122</uuid>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='virbr0' stp='on' delay='0'/>
  <mac address='52:54:00:40:cc:23'/>
  <ip address='192.168.122.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.122.100' end='192.168.122.254'/>
      <host mac='52:54:00:e1:28:3e' ip='192.168.122.45'/>
    </dhcp>
  </ip>
</network>
$ virsh net-destroy default
$ virsh net-start default

Start/restart the guest VM after updating the “default” virtual network with the guest MAC address.

$ virsh net-dhcp-leases default
Expiry Time         MAC address        Protocol  IP address        Hostname Client ID or DUID
------------------------------------------------------------------------------------------------
2018-08-29 13:18:58 52:54:00:e1:28:3e  ipv4      192.168.122.45/24 1gpu-vm-1g0

Binding the Virtual Network to a Specific Physical NIC

KVM uses the virtual network switch as the default networking configuration for all guest VMs, and it operates in NAT mode. The network traffic for a guest is directed through this switch, and consequently all guests use one of the host's physical NIC interfaces when communicating with the external world. By default, the switch is not bound to any specific physical NIC interface, but you can restrict it to use a specific physical NIC interface; for example, you can limit the virtual network to use enp6s0 only.

$ virsh net-edit default
<network>
  <name>default</name>
  <uuid>92d49672-3020-40a1-90f5-73fe07216122</uuid>
  <forward dev='enp6s0'  mode='nat' />
  <bridge name='virbr0' stp='on' delay='0'/>
  <mac address='52:54:00:40:cc:23'/>
  <ip address='192.168.122.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.122.2' end='192.168.122.254'/>
    </dhcp>
  </ip>
</network>
$ virsh net-destroy default
$ virsh net-start default

Start/restart the guest after updating the “default” virtual network configuration.

Bridged Networking

Introduction

A bridged network shares a real Ethernet device with KVM guest VMs. When using Bridged mode, all the guest virtual machines appear within the same subnet as the host physical machine. All other physical machines on the same physical network are aware of, and can access, the virtual machines. Bridging operates on Layer 2 of the OSI networking model.

Each guest VM can bind directly to any available IPv4 or IPv6 addresses on the LAN, just like a physical server. Bridging offers the best performance with the least complication out of all the libvirt network types. A bridge is only possible when there are enough IP addresses to allocate one per guest VM. This is not a problem for IPv6, as hosting providers usually provide many free IPv6 addresses. However, extra IPv4 addresses are rarely free.

Using DHCP

Configuration from the Host

$ sudo vi /etc/netplan/01-netcfg.yaml
# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    enp134s0f0:
      dhcp4: yes
  bridges:
    br0:
      dhcp4: yes
      interfaces: [ enp134s0f0 ]
Note: Use the Host NIC interface (Ex: enp134s0f0) that is connected on your system.
$ sudo netplan apply
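
After applying the configuration, you can optionally confirm that the bridge came up and obtained a DHCP lease. The interface names follow the example above.

$ brctl show br0
$ ip addr show br0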

Guest VM Configuration

Once the host configuration is complete, a guest can be connected to the bridged network based on its name. To connect a guest to the 'br0' network, the following XML can be used for the guest:

$ virsh edit <VM name or ID> 
<interface type='bridge'>
  <source bridge='br0'/>
  <model type='virtio'/>
</interface>
$ sudo vi /etc/netplan/01-netcfg.yaml
# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    virtionetworks:
      match:
          driver: virtio_net
          name: en*
      dhcp4: yes
$ sudo netplan apply

Refer to Getting the Guest VM IP Address for instructions on how to determine the guest VM IP address.

Using Static IP

Host Configuration

$ sudo vi /etc/netplan/01-netcfg.yaml
# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    enp134s0f0:
      dhcp4: no
  bridges:
    br0:
      dhcp4: no
      addresses: [ 10.33.14.17/24 ]
      gateway4: 10.33.14.1
      nameservers:
          search: [ nvidia.com ]
          addresses: [ 172.16.200.26, 172.17.188.26 ]
      interfaces: [ enp134s0f0 ]
Note: Use the Host NIC interface (Ex: enp134s0f0) that you have connected to your network. Consult your network administrator for the actual IP addresses of your guest VM.
$ sudo netplan apply

Guest VM Configuration

Once the host configuration is complete, a guest VM can be connected to the bridged network based on its name. To connect a guest VM to the 'br0' network, the following XML can be used for the guest VM:

$ virsh edit <VM name or ID> 
<interface type='bridge'>
  <source bridge='br0'/>
  <model type='virtio'/>
</interface>
$ sudo vi /etc/netplan/01-netcfg.yaml
# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    virtionetworks:
      match:
          driver: virtio_net
          name: en*
      dhcp4: no
      dhcp6: no
      addresses: [ 10.33.14.18/24 ]
      gateway4: 10.33.14.1
      nameservers:
          search: [ nvidia.com ]
          addresses: [ 172.16.200.26, 172.17.188.26 ]
$ sudo netplan apply

Refer to Getting the Guest VM IP Address for instructions on how to determine the guest VM IP address.

Bridged Networking with Bonding

Introduction

Network bonding refers to the combining of multiple physical network interfaces on one host for redundancy and/or increased throughput. Redundancy is the key factor: we want to protect our virtualized environment from loss of service due to failure of a single physical link. This network bonding is the same as Linux network bonding. The bond is added to a bridge, and guest virtual machines are then added onto the bridge, similar to the bridged mode discussed in Bridged Networking. However, the bonding driver has several modes of operation, and only a few of these modes work with a bridge where virtual guest machines are in use.

There are three key modes of network bonding:

  • Active-Passive: one NIC is active while the other is on standby. If the active NIC goes down, the standby NIC becomes active.
  • Link Aggregation: aggregated NICs act as one NIC which results in a higher throughput.
  • Load Balanced: the network traffic is equally balanced over the NICs of the machine.

The following section explains the bonding configuration based on IEEE 802.3ad link aggregation. This mode, also known as Dynamic Link Aggregation, creates aggregation groups that share the same speed. It requires a switch that supports IEEE 802.3ad dynamic link aggregation.

Using DHCP

Configuration from the Host

$ sudo vi /etc/netplan/01-netcfg.yaml
# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    bond-ports:
      dhcp4: no
      match:
        name: enp134*
  bonds:
    bond0:
      dhcp4: no
      interfaces: [ bond-ports ]
      parameters:
        mode: 802.3ad
  bridges:
    br0:
      dhcp4: yes
      interfaces: [ bond0 ]
 
Note: Use the Host NIC interface (Ex: enp134*) based on what is connected on your system.
$ sudo netplan apply
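
Optionally, verify that the bond negotiated LACP with the switch and that the bond was added to the bridge. The names follow the example above.

$ cat /proc/net/bonding/bond0 | grep -i mode
$ brctl show br0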

Guest VM Configuration

Once the host configuration is complete, a guest can be connected to the bridged network based on its name. To connect a guest to the 'br0' network, the following XML can be used in the guest:

$ virsh edit <VM name or ID> 
<interface type='bridge'>
  <source bridge='br0'/>
  <model type='virtio'/>
</interface>
$ sudo vi /etc/netplan/01-netcfg.yaml
# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    virtionetworks:
      match:
          driver: virtio_net
          name: en*
      dhcp4: yes
$ sudo netplan apply  

Refer to Getting the Guest VM IP Address for instructions on how to determine the guest VM IP address.

Using Static IP

Host Configuration

$ sudo vi /etc/netplan/01-netcfg.yaml
# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    bond-ports:
      dhcp4: no
      match:
        name: enp134*
  bonds:
    bond0:
      dhcp4: no
      interfaces: [ bond-ports ]
      parameters:
        mode: 802.3ad
  bridges:
    br0:
      addresses: [ 10.33.14.17/24 ]
      gateway4: 10.33.14.1
      nameservers:
          search: [ nvidia.com ]
          addresses: [ 172.16.200.26, 172.17.188.26 ]
      interfaces: [ bond0 ]
Note: Use the Host NIC interface (Ex: enp134*) based on what is connected on your system.
$ sudo netplan apply

Guest VM Configuration

Once the host configuration is complete, a guest can be connected to the bridged network based on its name. To connect a guest to the 'br0' network, the following XML can be used in the guest:

$ virsh edit <VM name or ID> 
<interface type='bridge'>
  <source bridge='br0'/>
  <model type='virtio'/>
</interface>
$ sudo vi /etc/netplan/01-netcfg.yaml
# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    virtionetworks:
      match:
          driver: virtio_net
          name: en*
      dhcp4: no
      dhcp6: no
      addresses: [ 10.33.14.18/24 ]
      gateway4: 10.33.14.1
      nameservers:
          search: [ nvidia.com ]
          addresses: [ 172.16.200.26, 172.17.188.26 ]
$ sudo netplan apply

Refer to Getting the Guest VM IP Address for instructions on how to determine the guest VM IP address.

SR-IOV

Introduction

The SR-IOV technology is a hardware-based virtualization solution that improves both performance and scalability. The SR-IOV standard enables efficient sharing of PCI Express (PCIe) devices among virtual machines and is implemented in the hardware to achieve I/O performance comparable to native performance. The SR-IOV specification defines a standard in which newly created devices enable the virtual machine to be directly connected to the I/O device.

The SR-IOV specification is defined and maintained by PCI-SIG at http://www.pcisig.com.

A single I/O resource can be shared by many virtual machines. The shared devices will provide dedicated resources and also utilize shared common resources. In this way, each virtual machine will have access to unique resources. Therefore, a PCIe device, such as an Ethernet Port, that is SR-IOV enabled with appropriate hardware and OS support can appear as multiple, separate physical devices, each with its own configuration space.

The following figure illustrates the SR-IOV technology for PCIe hardware.

Two new function types in SR-IOV are:

Physical Function (PF)

A PCI Function that supports the SR-IOV capabilities as defined in SR-IOV specification. A PF contains the SR-IOV capability structure and is used to manage the SR-IOV functionality. PFs are fully-featured PCIe functions that can be discovered, managed, and manipulated like any other PCIe device. PFs have full configuration resources and can be used to configure or control the PCIe device.

Virtual Function (VF)

A Virtual Function is a function that is associated with a Physical Function. A VF is a lightweight PCIe function that shares one or more physical resources with the Physical Function and with other VFs that are associated with the same PF. A VF is only allowed to have configuration resources for its own behavior.

An SR-IOV device can have hundreds of Virtual Functions (VFs) associated with a Physical Function (PF). The creation of VFs can be dynamically controlled by the PF through registers designed to turn on the SR-IOV capability. By default, the SR-IOV capability is turned off, and the PF behaves as a traditional PCIe device.

The following are the advantages and disadvantages of SR-IOV.

  • Advantages
    • Performance – Direct access to hardware from the virtual machine environment. Benefits include:
      • Lower CPU utilization

      • Lower network latency

      • Higher network throughput

    • Cost Reduction - Capital and operational expenditure savings include:
      • Power savings

      • Reduced adapter count

      • Less cabling

      • Fewer switch ports

  • Disadvantages
    • Guest VM Migration - Harder to migrate guest from one physical server to another

      There are several proposals being used or implemented in the industry, each with its own merits and demerits.

Device Configuration

SR-IOV and VFs are not enabled by default in all devices. For example, the dual port 100GbE Mellanox card in the DGX-2 doesn’t have VFs enabled by default. Follow the instructions in section 5 of the Mellanox SR-IOV NIC Configuration guide to enable the SR-IOV and the desired number of functions in firmware.

Generic Configuration

Use the following steps to enable SR-IOV on the KVM host. These steps define a network for a pool of virtual function (VF) devices associated with a physical NIC; libvirt then automatically assigns a VF from the pool to each guest VM.
Configuration from the Host
  1. Define a network for a pool of VFs.
  2. Read the supported number of VFs.
    $ cat /sys/class/net/enp134s0f0/device/sriov_totalvfs
    63
  3. Enable the required number of VFs (Ex: 16).
    $ echo 16 | sudo tee /sys/class/net/enp134s0f0/device/sriov_numvfs
  4. Create a new SR-IOV network. Generate an XML file with text similar to the following example.
    $ sudo vi /etc/libvirt/qemu/networks/iovnet0.xml
    <network>
         <name>iovnet0</name>
         <forward mode='hostdev' managed='yes'>
            <pf dev='enp134s0f0'/>
         </forward>
     </network>
    Note: Change the value of pf dev to the ethdev (Ex: enp134s0f0) corresponding to your SR-IOV device’s physical function.
  5. Execute the following commands.
    $ virsh net-define /etc/libvirt/qemu/networks/iovnet0.xml
    $ virsh net-autostart iovnet0
    $ virsh net-start iovnet0
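    After these steps, you can optionally confirm on the host that the VFs were created. This is a hedged check; the interface name follows the earlier example and the output varies by NIC.
    $ cat /sys/class/net/enp134s0f0/device/sriov_numvfs
    $ lspci | grep -i "virtual function"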

Guest VM Configuration

After defining and starting the SR-IOV (iovnet0) network, modify the guest XML definition to specify the network.

$ virsh edit <VM name or ID>
<interface type='network'>
    <source network='iovnet0'/>
</interface>

When the guest VM starts, a VF is automatically assigned to the guest VM.  If the guest VM is already running, you need to restart it.

Using DHCP

Configuration from the Host

$ sudo vi /etc/netplan/01-netcfg.yaml
# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    enp134s0f0:
      dhcp4: yes
Note: Use the Host NIC interface (Ex: enp134s0f0) based on what is connected on your system.
$ sudo netplan apply

Guest VM Configuration

$ sudo vi /etc/netplan/01-netcfg.yaml
# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    enp8s0:
      dhcp4: yes
Note: Use the guest VM NIC interface (Ex: enp8s0) by checking “ifconfig -a” output.
$ sudo netplan apply

Refer to Getting the Guest VM IP Address for instructions on how to determine the guest VM IP address.

Using Static IP

Configuration from the Host

$ sudo vi /etc/netplan/01-netcfg.yaml
# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    enp134s0f0:
      dhcp4: no
      addresses: [ 10.33.14.17/24 ]
      gateway4: 10.33.14.1
      nameservers:
          search: [ nvidia.com ]
          addresses: [ 172.16.200.26, 172.17.188.26 ]
Note: Use the Host NIC interface (Ex: enp134s0f0) based on what is connected on your system.
$ sudo netplan apply

Guest VM Configuration

$ sudo vi /etc/netplan/01-netcfg.yaml
# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    enp8s0:
      dhcp4: no
      addresses: [ 10.33.14.18/24 ]
      gateway4: 10.33.14.1
      nameservers:
          search: [ nvidia.com ]
          addresses: [ 172.16.200.26, 172.17.188.26 ]
Note: Use guest VM NIC interface (Ex: enp8s0) by checking “ifconfig -a” output.
$ sudo netplan apply

Refer to Getting the Guest VM IP Address for instructions on how to determine the guest VM IP address.

Getting the Guest VM IP Address

If you are using the Bridged or SR-IOV network configurations, use the following steps to determine the guest VM IP address from the host.

Install and configure QEMU Guest Agent to retrieve the guest VM IP address. The QEMU guest agent runs inside the guest VM and allows the host machine to issue commands to the guest VM operating system using libvirt. The guest VM operating system then responds to those commands asynchronously.

Note: It is only safe to rely on the guest agent when run by trusted guests. An untrusted guest may maliciously ignore or abuse the guest agent protocol, and although built-in safeguards exist to prevent a denial of service attack on the host, the host requires guest co-operation for operations to run as expected.

Configuration from the Host

Add the following lines to the guest VM XML file under <devices> using:

$ virsh edit <VM name or ID> 
    <channel type='unix'>
        <target type='virtio' name='org.qemu.guest_agent.0'/>
    </channel>

Guest VM Configuration

$ sudo apt-get install qemu-guest-agent
$ virsh shutdown <VM name or ID>
$ virsh start <VM name or ID>

After these steps, run the following command in the Host to check a specific guest VM IP address.

$ virsh domifaddr <VM name or ID> --source agent
Name       MAC address          Protocol     Address
-----------------------------------------------------------------------
 lo         00:00:00:00:00:00    ipv4        127.0.0.1/8
 -          -                    ipv6        ::1/128
 enp1s0     52:54:00:b2:d9:a7    ipv4        10.33.14.18/24
 -          -                    ipv6        fe80::5054:ff:feb2:d9a7/64
 docker0    02:42:3e:48:87:61    ipv4        172.17.0.1/16

Improving Network Performance

This section describes some ways to improve network performance.

Jumbo Frames

A jumbo frame is an Ethernet frame with a payload greater than the standard maximum transmission unit (MTU) of 1,500 bytes. Jumbo frames are used on local area networks that support at least 1 Gbps and can be as large as 9,000 bytes. Enabling jumbo frames can improve network performance by making data transmissions more efficient. The CPUs on switches and routers can only process one frame at a time; by putting a larger payload into each frame, the CPUs have fewer frames to process. Jumbo frames should be enabled only if each link in the network path, including servers and endpoints, is configured to use jumbo frames at the same MTU. Otherwise, performance may decrease as incompatible devices drop frames or fragment them; the latter can burden the CPU with higher processing requirements.
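
Any host physical NIC carrying this traffic (and the switch ports in the path) must also be configured with a matching MTU. The following is a hedged netplan sketch for the host interface; the interface name is an example.

$ sudo vi /etc/netplan/01-netcfg.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    enp134s0f0:
      dhcp4: yes
      mtu: 9000
$ sudo netplan apply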

In the case of a libvirt-managed network (one with a forward mode of NAT or Route), this will be the MTU assigned to the bridge device (virbr0) when libvirt creates it, and thereafter also assigned to all tap devices created to connect guest interfaces. If the MTU is unspecified, the default setting for the type of device being used is assumed, and it is usually 1500 bytes.

You can enable jumbo frames for the default virtual network switch using the following commands. All guest virtual network interfaces will then inherit the jumbo frame (MTU of 9,000 bytes) configuration.

$ virsh net-edit default
<network>
  <name>default</name>
  <uuid>a47b420d-608e-499a-96e4-e75fc45e60c4</uuid>
  <forward mode='nat'/>
  <bridge name='virbr0' stp='on' delay='0'/>
  <mtu size='9000'/>
  <mac address='52:54:00:f2:e3:2a'/>
  <ip address='192.168.122.1' netmask='255.255.255.0'>
     <dhcp>
     <range start='192.168.122.2' end='192.168.122.254'/>
     </dhcp>
  </ip>
</network>
$ virsh net-destroy default
$ virsh net-start default

Multi-Queue Support

This section describes multi-queue and, for KVM packages prior to dgx-kvm-image-4-0-3, provides instructions for enabling multi-queue. Starting with dgx-kvm-image-4-0-3, multi-queue is enabled by default.    

A KVM guest VM uses the virtio-net driver when its network interface is based on the Virtual Network Switch in either NAT or Bridged mode. By default, the virtio-net driver uses one pair of TX and RX queues, which can limit guest network performance even when the guest is configured with multiple vCPUs and its network interface is bound to a 10/100G host physical NIC. Multi-queue support in the virtio-net driver:

  • Enables packet send/receive processing to scale with the number of available virtual CPUs in a guest.
  • Gives each guest virtual CPU its own separate TX and RX queues and interrupts that can be used without influencing other virtual CPUs.
  • Provides better application scalability and improved network performance in many cases.

Multi-queue virtio-net provides the greatest performance benefit when:

  • Traffic packets are relatively large.
  • The guest is active on many connections at the same time, with traffic running between guests, guest to host, or guest to an external system.
  • The number of queues is equal to the number of vCPUs. This is because multi-queue support optimizes RX interrupt affinity and TX queue selection in order to make a specific queue private to a specific vCPU.

Note: Multi-queue virtio-net works well for incoming traffic, but can occasionally hurt performance for outgoing traffic. Enabling multi-queue virtio-net increases the total throughput, and in parallel increases CPU consumption.

To use multi-queue virtio-net, enable support in the guest by adding the following to the guest XML configuration (where the value of N is from 1 to 256, as the kernel supports up to 256 queues for a multi-queue tap device). For the best results, match the number of queues to the number of vCPU cores configured on the VM.

Note: This is not needed with KVM image dgx-kvm-image-4-0-3 or later.

 

$ virsh edit <VM name or ID>
<interface type='bridge'>
       <source bridge='virbr0'/>
       <model type='virtio'/>
       <driver name='vhost' queues='N'/>
</interface>

When running a virtual machine with N virtio-net queues in the guest VM, you can check the number of enabled queues using

$ ethtool -l <interface>
$ ls /sys/class/net/<interface>/queues

You can change the number of enabled queues (where the value of M is from 1 to N):

$ ethtool -L <interface> combined M
Note: When using multi-queue, it is recommended to change the max_files variable in the /etc/libvirt/qemu.conf file to 2048. The default limit of 1024 can be insufficient for multi-queue and cause guests to be unable to start when multi-queue is configured.
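
The following is a hedged sketch of that change; restart libvirtd afterward for it to take effect.

$ sudo vi /etc/libvirt/qemu.conf
max_files = 2048
$ sudo systemctl restart libvirtd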

This is enabled by default on DGX-2 guest VMs for the current release of KVM SW.

QOS

By default, the Virtual Network Switch treats the network traffic from all guests equally and processes packets in the order in which it receives them. Virtual machine network quality of service is a feature that allows you to limit both the inbound and outbound traffic of individual virtual network interface controllers or guests.

Virtual machine network quality of service settings allow you to configure bandwidth limits for both inbound and outbound traffic on three distinct levels.

  • Average: The average speed of inbound or outbound traffic.  Specifies the desired average bit rate for the interface being shaped (in kilobytes/second).
  • Peak: The speed of inbound or outbound traffic during peak times. Optional attribute which specifies the maximum rate at which the bridge can send data (in kilobytes/second). Note the limitation of implementation: this attribute in the outbound element is ignored (as Linux ingress filters don't know it yet).
  • Burst: The speed of inbound or outbound traffic during bursts. This is an optional attribute which specifies the number of kilobytes that can be transmitted in a single burst at peak speed.

The libvirt domain specification already includes this functionality. You can specify separate settings for incoming and outgoing traffic. When you open the XML file of your virtual machine, find the block with the interface type tag and add the following inside it.

$ virsh edit <VM name or ID>
<bandwidth>
  <inbound average='NNN' peak='NNN' burst='NNN'/>
  <outbound average='NNN' peak='NNN' burst='NNN'/>
</bandwidth>

Where NNN is the desired speed in kilobytes per second. The inbound and outbound values can differ, and average, peak, and burst can each have different values.
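
For example, to cap a guest at roughly 1 Gbps (about 125,000 kilobytes/second) average, with inbound bursts allowed up to roughly 2 Gbps, a hedged configuration could look like the following. The peak attribute is omitted on outbound because it is ignored there.

<bandwidth>
  <inbound average='125000' peak='250000' burst='5120'/>
  <outbound average='125000' burst='5120'/>
</bandwidth>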

This is not enabled by default on DGX-2 guest VMs for the current release of KVM SW.

9. SSH Tunneling

Some environments are not configured for, or limit, access (by a firewall or otherwise) to compute nodes within an intranet. SSH tunnels provide a way to reach services on those nodes, and they are also very useful for running Jupyter notebooks inside containers when working remotely. When running a container with a service or application exposed on a port, such as , remote access to that port on the DGX system must be enabled from the remote system. The following steps use PuTTY to create an SSH tunnel from a remote system into the DGX system. If you are using a command-line SSH utility, you can set up tunneling via the -L option.
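For example, the following command, run on your local system (the user and host names are placeholders), forwards local port 5000 to port 5000 on the DGX system; while the session stays open, connecting to http://localhost:5000 on the local system reaches the service running on the DGX system.
$ ssh -L 5000:localhost:5000 <user>@<dgx-hostname>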
Note: A PuTTY SSH tunnel session must be up, logged in, and running for the tunnel to function. SSH tunnels are commonly used for the following applications (with the listed port numbers).
Table 2. Commonly used applications for SSH tunnels
Application     Port          Notes
                5000          If multiple users, each selects own port
VNC Viewer      5901, 6901    5901 for VNC app, 6901 for web app
To create an SSH Tunnel session with PuTTY, perform the following steps:
  1. Run the PuTTY application.
  2. In the Host Name field, enter the host name you want to connect to.
  3. In the Saved Sessions section, enter a name to save the session under and click Save.
  4. Click Category > Connection, click + next to SSH to expand the section.
  5. Click Tunnels for Tunnel configuration.
  6. Add the port for forwarding.
    1. In the Source Port section, enter 5000, which is the port you need to forward for .
  7. In the Destination section, enter localhost:5000 for the local port that you will connect to.
  8. Click Add to save the added Tunnel.
  9. In the Category section, click Session.
  10. In the Saved Sessions section, click the name you previously created, then click Save to save the added Tunnels.
To use PuTTY with tunnels, perform the following steps:
  1. Run the PuTTY application.
  2. In the Saved Sessions section, select the Save Session that you created.
  3. Click Load.
  4. Click Open to start the session and log in. The SSH tunnel is created and you can connect to the remote system via the tunnel. As an example, for , you can start a web browser and connect to http://localhost:5000.

10. Master Node

A master node, also sometimes called a head node, is a very useful server within a cluster. Typically, it runs the cluster management software, the resource manager, and any monitoring tools that are used. For smaller clusters, it is also used as a login node for users to create and submit jobs.

For clusters of any size that include the DGX-2, DGX-1, or even a group of DGX Stations, a master node can be very helpful. It allows the DGX systems to focus solely on computing rather than any interactive logins or post-processing that users may be doing. As the number of nodes in a cluster increases, it is recommended to use a master node.

It is recommended to size the master node for things such as:
  • Interactive user logins
  • Resource management (running a job scheduler)
  • Graphical pre-processing and post-processing
    • Consider a GPU in the master node for visualization
  • Cluster monitoring
  • Cluster management

Since the master node becomes an important part of the operation of the cluster, consider using RAID-1 for the OS drive in the master node as well as redundant power supplies. This can help improve the uptime of the master node.

For smaller clusters, you can also use the master node as an NFS server by adding storage and more memory to the master node and NFS export the storage to the cluster clients. For larger clusters, it is recommended to have dedicated storage, either NFS or a parallel file system.

For InfiniBand networks, the master node can also be used to run the software subnet manager (SM). If you want some high availability (HA) for the SM, run the primary SM on the master node and use an SM on the IB switch as a secondary SM (hardware SM).

As the cluster grows, it is recommended to consider splitting the login and data processing functions from the master node to one or more dedicated login nodes. This is also true as the number of users grows. You can run the primary SM on the master node and additional SMs on the login nodes. You could even use the hardware SMs on the switches as backups.

Notices

Notice

THE INFORMATION IN THIS GUIDE AND ALL OTHER INFORMATION CONTAINED IN NVIDIA DOCUMENTATION REFERENCED IN THIS GUIDE IS PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE INFORMATION FOR THE PRODUCT, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the product described in this guide shall be limited in accordance with the NVIDIA terms and conditions of sale for the product.

THE NVIDIA PRODUCT DESCRIBED IN THIS GUIDE IS NOT FAULT TOLERANT AND IS NOT DESIGNED, MANUFACTURED OR INTENDED FOR USE IN CONNECTION WITH THE DESIGN, CONSTRUCTION, MAINTENANCE, AND/OR OPERATION OF ANY SYSTEM WHERE THE USE OR A FAILURE OF SUCH SYSTEM COULD RESULT IN A SITUATION THAT THREATENS THE SAFETY OF HUMAN LIFE OR SEVERE PHYSICAL HARM OR PROPERTY DAMAGE (INCLUDING, FOR EXAMPLE, USE IN CONNECTION WITH ANY NUCLEAR, AVIONICS, LIFE SUPPORT OR OTHER LIFE CRITICAL APPLICATION). NVIDIA EXPRESSLY DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY OF FITNESS FOR SUCH HIGH RISK USES. NVIDIA SHALL NOT BE LIABLE TO CUSTOMER OR ANY THIRD PARTY, IN WHOLE OR IN PART, FOR ANY CLAIMS OR DAMAGES ARISING FROM SUCH HIGH RISK USES.

NVIDIA makes no representation or warranty that the product described in this guide will be suitable for any specified use without further testing or modification. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to ensure the product is suitable and fit for the application planned by customer and to do the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this guide. NVIDIA does not accept any liability related to any default, damage, costs or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this guide, or (ii) customer product designs.

Other than the right for customer to use the information in this guide with the product, no other license, either expressed or implied, is hereby granted by NVIDIA under this guide. Reproduction of information in this guide is permissible only if reproduction is approved by NVIDIA in writing, is reproduced without alteration, and is accompanied by all associated conditions, limitations, and notices.

Trademarks

NVIDIA, the NVIDIA logo, DGX, DGX-1, DGX-2, and DGX Station are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.