Cloud-init Configuration File

This section provides instructions for creating a cloud-init configuration file for the Ubuntu Automated Server Installation.

Modifying the Configuration File

The following steps outline an example configuration file. It will not work as is; it requires the additional modifications described in each step. Refer to Ubuntu Automated Server Installation for more details.

  1. Begin the configuration file with the following header:

    #cloud-config
    autoinstall:
      version: 1
    
  2. Define a default user (the example uses ubuntu), localization, and keyboard layout.

    ##
    ## Set initial system and user information
    ## Use mkpasswd -m sha-512 <password> to create the password hash
    ##
      identity:
        realname: DGX Ubuntu User
        hostname: dgx-host
        password: <PASSWORD HASH>
        username: ubuntu
      locale: en_US
      keyboard:
        layout: us
        variant: ''
      reporting:
        builtin:
          type: print
    
  3. The network section describes the network configuration and supports fixed addresses, DHCP, and various other network options. The names of the network interfaces are system-dependent. For example, the primary management ports on various DGX systems are:

    • DGX-1: enp1s0f0

    • DGX-2: enp6s0

    • DGX A100: enp226s0

    ##
    ## Network Configuration
    ##
      network:
        version: 2
        ethernets:
          enp1s0f0:
            dhcp4: yes
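
    The example above uses DHCP. For a fixed address, a minimal sketch might look like the following. It assumes the netplan version 2 syntax already used in this section. The addresses are placeholders taken from the 192.0.2.0/24 documentation range, not values from this guide, and enp1s0f0 is the DGX-1 port listed above.

    ```yaml
    autoinstall:
      network:
        version: 2
        ethernets:
          enp1s0f0:                # replace with the port name for your system
            dhcp4: no
            addresses:
              - 192.0.2.10/24      # placeholder static address and prefix
            gateway4: 192.0.2.1    # placeholder default gateway
            nameservers:
              addresses:
                - 192.0.2.53       # placeholder DNS server
    ```

    The gateway4 key is the form accepted by the netplan release in Ubuntu 20.04; newer netplan releases prefer a routes entry instead.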
    
  4. Update the Subiquity installer to the edge channel. The NVIDIA repositories also require Apt preferences to be set up, which the version of Subiquity shipped with the Ubuntu 20.04 ISO images does not support.

      refresh-installer:
        channel: edge
        update: yes
    
  5. Provide details about the additional NVIDIA repositories. Refer to Drive Partitioning below for more information.

    ##
    ## Enable this for using the remote repositories
    ##
      apt:
        <Repository details for the CUDA Compute and DGX Repository>
    
        conf: |
          Dpkg::Options {
            "--force-confdef";
            "--force-confold";
          };
    
  6. Configure storage.

    The next section describes the storage configuration, including swap configuration and drive partitioning. Setting the swap size to 0 disables swap. Refer to Drive Partitioning.

    Setting the reorder_uefi flag to false tells the installer not to change the boot order. By default, the installer moves the currently booted entry (BootCurrent) to the first position.

    ##
    ## Storage Configuration
    ##
      storage:
        config:
          <Partition and other configurations>
        swap:
          size: 0
        grub:
          reorder_uefi: false
    
  7. Enable the SSH server.

    You can also set a default SSH key.

    ##
    ## SSH Server
    ##
      ssh:
        install-server: yes
        allow-pw: yes
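
    As a sketch of setting a default SSH key, the ssh section also accepts an authorized-keys list, which is installed for the default user. The key below is a placeholder in the same style as <PASSWORD HASH> and must be replaced with a real public key.

    ```yaml
    autoinstall:
      ssh:
        install-server: yes
        allow-pw: yes
        authorized-keys:
          - <SSH PUBLIC KEY>   # placeholder, for example the contents of an id_ed25519.pub file
    ```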
    
  8. Provide a list of packages that should be installed.

    Refer to the comments in this text for instructions on changing the package names for specific DGX systems and on enabling or disabling features.

    ##
    ## Packages
    ##
      packages:
    
    ##
    ## NVIDIA DGX system configurations and system tools
    ## Replace dgx-a100 for other DGX systems:
    ##  dgx1     for DGX-1
    ##  dgx2     for DGX-2
    ##  dgx-a100 for DGX A100
    ##
        - dgx-a100-system-configurations
        - dgx-a100-system-tools
        - dgx-a100-system-tools-extra
    
    ## Remove this if you don’t want to use cachefilesd
        - nvidia-conf-cachefilesd
    
    ## Remove this if boot drive encryption is enabled and you don’t
    ## want the passphrase dialog only visible on the serial console
        - nvidia-ipmisol
    
    ##
    ## NVIDIA CUDA driver and tools
    ## Change the driver version to the branch you want to install
    ##
        - datacenter-gpu-manager
        - libnvidia-nscq-450
        - linux-modules-nvidia-450-server-generic
        - nvidia-driver-450-server
        - nvidia-modprobe
        - nv-persistence-mode
    
    ## Uncomment these to support the NVSwitch on DGX-2 and DGX A100
    ## Ensure that the driver version matches the versions above
    #    - libnvidia-nscq-450
    #    - nvidia-fabricmanager-450
    
    ##
    ## Mellanox drivers and tools
    ##
        - mlnx-ofed-all
        - nvidia-mlnx-ofed-misc
    
    ##
    ## NVIDIA container support
    ##
        - docker-ce
        - nv-docker-options
        - nvidia-docker2
    
    ##
    ## NVIDIA system management tools
    ##
        - nvsm
        - nvidia-motd
    
  9. Add any additional software packages you want to install during autoinstall.

  10. Finally, add a list of additional commands to be executed at the end of the installation.

    • Disable unattended upgrades

    • Disable the ondemand governor so that the system defaults to performance mode

    • Enable DCGM and OpenIBD services

    • Enable nv-peer-mem

    ##
    ## Commands executed after completion of the installation
    ##
      late-commands:
        - curtin in-target --target=/target -- apt purge -y unattended-upgrades
        - curtin in-target --target=/target -- systemctl disable ondemand
        - curtin in-target --target=/target -- systemctl enable dcgm openibd
        - curtin in-target --target=/target -- update-rc.d nv_peer_mem defaults
    # DGX A100 …
        - curtin in-target -- mlnx_pxe_setup.bash
    

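Drive Partitioning

The following example storage configuration uses /dev/sda as the OS drive, with a 512M EFI partition mounted at /boot/efi and a 100G ext4 root partition, and formats /dev/sdb as a single ext4 partition mounted at /raid.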

storage:
   config:
   - id: disk-sda
     type: disk
     ptable: gpt
     path: /dev/sda
     name: osdisk
     wipe: superblock-recursive
   - id: partition-sda1
     type: partition
     device: disk-sda
     number: 1
     size: 512M
     flag: boot
     grub_device: true
   - id: partition-sda2
     type: partition
     device: disk-sda
     number: 2
     size: 100G
   - id: format-partition-sda1
     type: format
     fstype: fat32
     label: efi
     volume: partition-sda1
   - id: format-partition-sda2
     type: format
     fstype: ext4
     label: root
     volume: partition-sda2
   - id: root-mount
     type: mount
     path: /
     device: format-partition-sda2
     options: errors=remount-ro
     passno: 1
   - id: boot-mount
     type: mount
     path: /boot/efi
     device: format-partition-sda1
     passno: 1
   - id: disk-sdb
     type: disk
     ptable: gpt
     path: /dev/sdb
     name: raid
     wipe: superblock-recursive
   - id: partition-sdb1
     type: partition
     device: disk-sdb
     number: 1
   - id: format-partition-sdb1
     type: format
     fstype: ext4
     label: raid
     volume: partition-sdb1
   - id: raid-mount
     type: mount
     path: /raid
     device: format-partition-sdb1
     passno: 2
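
The listing above is only the config list. As a sketch of how it fits into the complete file, the fragment below nests it under autoinstall together with the swap and grub settings from the storage step. Only the first entry is repeated here; the remaining entries are assumed to follow unchanged.

```yaml
#cloud-config
autoinstall:
  version: 1
  storage:
    config:
      - id: disk-sda
        type: disk
        ptable: gpt
        path: /dev/sda
        name: osdisk
        wipe: superblock-recursive
      # ... remaining partition, format, and mount entries from the listing above ...
    swap:
      size: 0                # no swap
    grub:
      reorder_uefi: false    # leave the UEFI boot order unchanged
```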