Configuring Storage

By default, the DGX A100 System includes four SSDs in a RAID 0 configuration. These SSDs are intended for application caching, so you must set up your own NFS storage for long-term data storage. The instructions in this section describe how to mount the NFS on the DGX A100 System and how to cache the NFS using the DGX A100 SSDs for improved performance.

Disabling cachefilesd

The DGX A100 System uses cachefilesd to manage caching of the NFS. To stop cachefilesd and prevent it from starting at boot, run the following commands:

$ sudo systemctl stop cachefilesd
$ sudo systemctl disable cachefilesd
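
To confirm that both commands took effect, you can query the unit state. The following is a minimal sketch and not part of the DGX software: the helper name cachefilesd_is_off and the SYSTEMCTL override variable are hypothetical, and the check relies only on the standard systemctl is-active and is-enabled subcommands.

```shell
# Hypothetical helper: succeed when cachefilesd is neither running nor
# enabled at boot. SYSTEMCTL is an illustrative override so the check can be
# exercised with a stand-in command; it defaults to the real systemctl.
cachefilesd_is_off() {
    systemctl_cmd="${SYSTEMCTL:-systemctl}"
    # is-active prints "active" while the service is running.
    active_state="$("$systemctl_cmd" is-active cachefilesd 2>/dev/null)" || true
    # is-enabled prints "enabled" when the unit is set to start at boot.
    enabled_state="$("$systemctl_cmd" is-enabled cachefilesd 2>/dev/null)" || true
    [ "$active_state" != "active" ] && [ "$enabled_state" != "enabled" ]
}
```

The check treats any state other than active or enabled as off, so it also succeeds on a host where the unit does not exist.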

Using cachefilesd

The following instructions describe how to mount the NFS onto the DGX A100 system and how to cache the NFS using the DGX A100 SSDs for improved performance.

Before you begin, make sure that an NFS server exports the data that the DGX A100 System needs to access, and that the DGX A100 System can reach the NFS server over the network.

  1. Configure an NFS mount for the DGX A100 System.

    1. Edit the filesystem tables configuration.

      $ sudo vi /etc/fstab
      
    2. Add a new line for the NFS mount, using the local mount point of /mnt.

      <nfs_server>:<export_path> /mnt nfs rw,noatime,rsize=32768,wsize=32768,nolock,tcp,intr,fsc,nofail 0 0
      
      • /mnt is used here as an example mount point.

      • Consult your Network Administrator for the correct values for <nfs_server> and <export_path>.

      • The NFS mount options presented here are recommended values based on typical use cases. However, “fsc” must always be included because that option specifies use of FS-Cache.

    3. Save the changes.

  2. Verify the NFS server is reachable.

    $ ping <nfs-server-ip-address>
    

    Use the server IP address or the server name provided by your network administrator.

  3. Mount the NFS export.

    $ sudo mount /mnt
    

    /mnt is an example mount point.

  4. Verify caching is enabled.

    $ cat /proc/fs/nfsfs/volumes
    

    In the output, look for FSC=yes.

    After this, the NFS export is mounted and cached automatically each time the DGX A100 System reboots.
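
Because caching depends on the fsc option, it can help to check the fstab entry before you mount. The sketch below is illustrative only: fstab_has_fsc is a hypothetical helper, and it assumes the standard fstab field order (device, mount point, type, options).

```shell
# Hypothetical helper: succeed when the NFS entry for a mount point lists
# the fsc option. Arguments: path to an fstab file, mount point.
fstab_has_fsc() {
    fstab_file="$1"
    mount_point="$2"
    awk -v mp="$mount_point" '
        /^[ \t]*#/ { next }                  # skip comment lines
        $2 == mp && $3 ~ /^nfs/ {            # NFS entry for this mount point
            n = split($4, opts, ",")         # options are comma-separated
            for (i = 1; i <= n; i++)
                if (opts[i] == "fsc") found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$fstab_file"
}
```

For example, `fstab_has_fsc /etc/fstab /mnt && echo "fsc is set"` reports whether the /mnt entry from step 1 requests FS-Cache.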

Setting Filesystem Quotas

When running NGC containers, you might need to limit the amount of disk space that is used on a filesystem to avoid filling up the partition.

Refer to https://www.digitalocean.com/community/tutorials/how-to-set-filesystem-quotas-on-ubuntu-18-04 for information about how to set filesystem quotas on Ubuntu 18.04 and later.

Switching Between RAID 0 and RAID 5

As supplied from the factory, the RAID level of the DGX A100 RAID array is RAID 0, which provides the maximum storage capacity but does not provide any redundancy.

If one SSD in the array fails, all data stored on the array is lost. If you are willing to accept reduced total storage capacity in return for protection against the failure of a single SSD, you can change the level of the RAID array to RAID 5.
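
The capacity trade-off follows from how the two levels use the drives: RAID 0 uses every drive for data, while RAID 5 reserves the equivalent of one drive for parity. A rough sketch of the arithmetic follows, where raid_usable_gb is a hypothetical helper and the 3840 GB drive size in the usage note is only an assumed example value.

```shell
# Hypothetical helper: estimate the usable capacity of an array of
# equal-sized drives, ignoring filesystem and metadata overhead.
# Arguments: RAID level (0 or 5), drive count, size per drive in GB.
raid_usable_gb() {
    level="$1"
    drives="$2"
    size_gb="$3"
    case "$level" in
        0) echo $(( drives * size_gb )) ;;         # every drive holds data
        5) echo $(( (drives - 1) * size_gb )) ;;   # one drive's worth holds parity
        *) echo "unsupported RAID level: $level" >&2; return 1 ;;
    esac
}
```

With the four SSDs of the default configuration and an assumed 3840 GB per drive, `raid_usable_gb 0 4 3840` prints 15360 and `raid_usable_gb 5 4 3840` prints 11520, so RAID 5 gives up one quarter of the raw capacity.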

Before you change the RAID level of the DGX A100 RAID array, back up all data on the array that you want to preserve. Changing the RAID level of the DGX A100 RAID array erases all data stored on the array.

The DGX A100 software includes the configure_raid_array.py custom script, which you can use to change the level of the RAID array without unmounting the RAID volume.

  • To change the RAID level to RAID 5, run the following command:

    $ sudo configure_raid_array.py -m raid5
    

    After you change the RAID level to RAID 5, the RAID array is rebuilt. A RAID array that is being rebuilt is online and ready to be used, but a check on the health of the DGX system reports the status of the RAID volume as unhealthy.

    The time required to rebuild the RAID array depends on the workload on the system. On an idle system, the rebuild takes about 30 minutes.

  • To change the RAID level to RAID 0, run the following command:

    $ sudo configure_raid_array.py -m raid0
    

To confirm that the RAID level was changed as required, run the lsblk command. The entry in the TYPE column for each SSD in the RAID array indicates the RAID level of the array.
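
The lsblk check can also be scripted. The following sketch is an illustration rather than part of the DGX tooling: raid_levels_from_lsblk is a hypothetical helper that reads NAME and TYPE pairs on standard input, as produced by lsblk -rno NAME,TYPE, and prints each distinct RAID type it finds.

```shell
# Hypothetical helper: list the distinct RAID types in "NAME TYPE" rows
# read from standard input.
raid_levels_from_lsblk() {
    # Keep rows whose TYPE field looks like raid0, raid5, and so on, then
    # collapse repeats (an array is listed once under each member drive).
    awk '$2 ~ /^raid[0-9]+$/ { print $2 }' | sort -u
}
```

Run it as `lsblk -rno NAME,TYPE | raid_levels_from_lsblk`. After switching to RAID 5, you would expect the output to include raid5.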

Configuring Support for Custom Drive Partitioning

DGX A100 systems incorporate data drives configured as RAID 0 by default. You can alter the default configuration by adding or removing drives, or by switching between a RAID 0 configuration and a RAID 5 configuration.

If you alter the default configuration, you must let NVSM know so that the utility does not flag the configuration as an error, and so that NVSM can continue to monitor the health of the drives.

  1. Edit /etc/nvsm/nvsm.config and set the use_standard_config_storage parameter to false.

    "use_standard_config_storage":false
    
  2. Restart NVSM.

    $ sudo systemctl restart nvsm
    

If you later restore the drives to the default configuration, set the use_standard_config_storage parameter back to true.
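
If you switch configurations often, the edit in step 1 can be scripted. This is a sketch under stated assumptions: set_use_standard_config_storage is a hypothetical helper, and it assumes that the parameter appears on a single line in the "use_standard_config_storage":true or false form shown above. Back up the file before trying it.

```shell
# Hypothetical helper: set use_standard_config_storage to true or false.
# Arguments: path to the config file, new value (true or false).
set_use_standard_config_storage() {
    config_file="$1"
    value="$2"
    case "$value" in
        true|false) ;;
        *) echo "value must be true or false" >&2; return 2 ;;
    esac
    tmp_file="$(mktemp)" || return 1
    # Rewrite only the value after the key; everything else is copied as-is.
    if sed -E 's/("use_standard_config_storage"[[:space:]]*:[[:space:]]*)(true|false)/\1'"$value"'/' \
        "$config_file" > "$tmp_file"; then
        # Copy the contents back so the original file keeps its owner and mode.
        cat "$tmp_file" > "$config_file"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp_file"
    return "$status"
}
```

Because /etc/nvsm/nvsm.config is a system file, writing to it is likely to require root privileges, so call the function from a script that runs as root, then restart NVSM as described in step 2.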