PXE Boot Setup

The DGX-Server UEFI BIOS supports PXE boot. Several manual customization steps are required to get PXE to boot the Base OS image.

Caution

This document is meant to be used as a reference. Explicit instructions are not given to configure the DHCP, FTP, and TFTP servers. It is expected that the end user’s IT team will configure them to fit within their company’s security guidelines.

Prerequisites

  • TFTP server is set up
    • TFTP is configured to serve files from /local/tftp/

  • HTTP server is set up
    • HTTP is configured to serve files from /local/http/

  • DHCP server is set up

  • IP address is <FTP IP>

  • Fully qualified host is <FTP host>

This document is intended to provide detailed step-by-step instructions on how to set up a PXE boot environment for DGX systems. The examples are based on a DGX A100. There are several major components of the solution:

  • DHCP server: dnsmasq is used in this doc.

  • TFTP server: dnsmasq is also used as a TFTP server.

  • HTTP server: An HTTP server is used to transfer large files, such as the ISO image and initrd. Alternatively, FTP can be used for this purpose. HTTP is used in this doc.

  • Syslinux: Linux bootloader software package.

Overview of the PXE Server

The PXE Server is divided up into three general areas:

  • Bootloader (grub)

  • TFTP contents (the kernel and initrd)

  • HTTP contents (the ISO image)

The rough directory structure on the TFTP and HTTP server will look like this:

/local/
   http/
      base_os_6.0.0/
         base_os_6.0.0.iso
   tftp/
      grub2/
         base_os_6.0.0/
            vmlinuz
            initrd
      grub.cfg
      bootx64.efi

The tftp-server (controlled by the xinetd service and configuration found in /etc/xinetd.d/tftp) serves files from the /local/tftp directory when the system PXE boots. TFTP transfers the bootx64.efi file that is designated in the DHCP server’s dhcpd.conf file (see Configure your DHCP server). By default, after bootx64.efi is booted, it looks for a grub2/grub.cfg file with the menu options for booting further. That config file will look for its kernel and initrd files relative to the TFTP directory.

The following steps will assume the DHCP and PXE servers are configured to use the above directory structure. The lab admin, or whoever is in charge of deploying the PXE environment, should change the directory names and structure to fit their infrastructure.
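The directory structure above can be staged with a short shell sketch. PXE_ROOT is a placeholder for staging; in production, create the tree under /local (as root) as described above:

```shell
# Sketch: create the PXE server directory skeleton described above.
# PXE_ROOT is a hypothetical staging root; use /local in production.
PXE_ROOT="${PXE_ROOT:-/tmp/pxe-root}"

mkdir -p "$PXE_ROOT/http/base_os_6.0.0"          # ISO served over HTTP
mkdir -p "$PXE_ROOT/tftp/grub2/base_os_6.0.0"    # kernel and initrd served over TFTP
```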

Configuring the HTTP File Directory and ISO Image

Place a copy of the BaseOS 6.0.0 ISO in /local/http/base_os_6.0.0/

Mount the BaseOS 6.0.0 ISO

Assume your mount point is “/mnt”:

sudo mount -o loop /local/http/base_os_6.0.0/base_os_6.0.0.iso /mnt

Copy the kernel and initrd from the ISO to the TFTP Directory

cp /mnt/casper/vmlinuz /local/tftp/grub2/base_os_6.0.0/
cp /mnt/casper/initrd /local/tftp/grub2/base_os_6.0.0/

Configure the TFTP directory

Download GRUB Packages

For x86_64:

Download the relevant grub packages with the correct architecture specified:

wget http://mirror.centos.org/centos/7/updates/x86_64/Packages/grub2-efi-x64-2.02-0.87.el7.centos.7.x86_64.rpm
wget http://mirror.centos.org/centos/7/os/x86_64/Packages/shim-x64-15-8.el7.x86_64.rpm

Unpack the RPMs with the following commands:

rpm2cpio grub2-efi-x64-2.02-0.87.el7.centos.7.x86_64.rpm | cpio -idmv
rpm2cpio shim-x64-15-8.el7.x86_64.rpm | cpio -idmv

Copy the following binaries from the unpacked RPMs to /local/tftp/grub2/:

shim.efi
shimx64.efi
grubx64.efi

Make a copy of shimx64.efi in /local/tftp/grub2/ and name the copy bootx64.efi.
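The copy steps above can be sketched in shell. The source path assumes the CentOS 7 RPMs unpack their binaries under boot/efi/EFI/centos/ in the current directory, and the target defaults to a hypothetical staging path; both are placeholders to adapt:

```shell
# Sketch: copy the EFI binaries unpacked by rpm2cpio into the TFTP tree,
# then create bootx64.efi as a copy of shimx64.efi.
# SRC and TFTP_GRUB are placeholders; the production target is /local/tftp/grub2/.
SRC="${SRC:-boot/efi/EFI/centos}"
TFTP_GRUB="${TFTP_GRUB:-/tmp/pxe-root/tftp/grub2}"

mkdir -p "$TFTP_GRUB"
for f in shim.efi shimx64.efi grubx64.efi; do
    if [ -f "$SRC/$f" ]; then
        cp "$SRC/$f" "$TFTP_GRUB/"
    fi
done
if [ -f "$TFTP_GRUB/shimx64.efi" ]; then
    cp "$TFTP_GRUB/shimx64.efi" "$TFTP_GRUB/bootx64.efi"
fi
```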

For arm64:

Download the relevant grub package with the correct architecture specified:

wget http://mirror.centos.org/altarch/7/updates/aarch64/Packages/grub2-efi-aa64-2.02-0.87.el7.centos.7.aarch64.rpm

Unpack the RPM with the following command:

rpm2cpio grub2-efi-aa64-2.02-0.87.el7.centos.7.aarch64.rpm | cpio -idmv

Copy the following binary from the unpacked RPM to /local/tftp/grub2/:

grubaa64.efi

Create the grub configuration file

The contents of the /local/tftp/grub2/grub.cfg file should look something like:

set default=0
set timeout=-1
insmod all_video

menuentry 'Install BaseOS 6.0.0' {
  linuxefi /grub2/base_os_6.0.0/vmlinuz fsck.mode=skip autoinstall ip=dhcp url=http://<Server IP>/base_os_6.0.0/base_os_6.0.0.iso nvme-core.multipath=n nouveau.modeset=0
  initrdefi /grub2/base_os_6.0.0/initrd
}

Note

  • NOTE 1: The vmlinuz and initrd files are specified relative to the TFTP root (in this example, /local/tftp/). The location of the ISO is relative to the HTTP root (in this example, /local/http/).

  • NOTE 2: The kernel boot parameters should match the contents of the corresponding ISO’s boot menu, found in /mnt/boot/grub/grub.cfg.

  • NOTE 3: In some cases, the transfer of the initrd can time out over TFTP. A workaround is to host the requisite files (initrd, vmlinuz, and the ISO) over HTTP instead, which makes the transfer faster and more reliable. In this example, we assume that the HTTP server is hosted from /local/http. We will need to copy these files to this location:

/local/
   http/
      base_os_6.0.0/
         base_os_6.0.0.iso
         vmlinuz
         initrd
   tftp/
      grub2/
         grub.cfg
         bootx64.efi
         grubaa64.efi

When configured this way, the grub.cfg file may look something like:

set default=0
set timeout=-1
insmod all_video

menuentry 'Install BaseOS 6.0.0' {
    linuxefi (http,<HTTP Server IP>)/base_os_6.0.0/vmlinuz fsck.mode=skip autoinstall ip=dhcp url=http://<Server IP>/base_os_6.0.0/base_os_6.0.0.iso nvme-core.multipath=n nouveau.modeset=0
    initrdefi (http,<HTTP Server IP>)/base_os_6.0.0/initrd
}

Useful parameters for configuring your system’s network interfaces:

ip=dhcp: tells the initramfs to automatically configure the system’s interfaces using DHCP.

  • If only one interface is connected to the network, this should be enough.

  • If multiple interfaces are connected to the network, the first one that receives a reply is used.
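Where DHCP is not available at install time, the kernel’s static ip= syntax (documented in the kernel’s nfsroot documentation) may be usable instead. This is a sketch; every value below (address, gateway, netmask, hostname, interface name) is a placeholder:

```
ip=10.0.0.50::10.0.0.1:255.255.255.0:dgx01:enp1s0f0:none
```

The colon-separated fields are client IP, server IP (unused here, hence empty), gateway, netmask, hostname, device, and autoconfiguration method.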

Parameters unique to the Base OS installer

  • rebuild-raid tells the installer to rebuild the data RAID. Factory installs should always specify this; it is optional otherwise.

  • md5checkdisc prevents an installation from being performed. Instead, the installer unpacks the ISO and checks that its contents match what is described in md5sum.txt.

  • offwhendone powers off the system after the installation; otherwise, the system reboots when done. Factory installs specify this.

  • nooemconfig skips oemconfig and creates the default user “nvidia” with a seeded initial password. This is used for touchless installs, such as PXE installs or automatic VM creation/installation.

  • force-ai allows users to supply their own autoinstall file. If networking is set up, users can provide a URL; otherwise, the file must be one that exists in the installer.

For example:

force-ai=/ai/dgx2-ai.yaml
force-ai=http://your-server.com/your-ai.yaml

Note

Refer to the note in the Autoinstall Customizations section for special formatting considerations when using custom autoinstall files with the force-ai parameter.

Configure DHCP

The DHCP server is responsible for providing the IP address of the TFTP server and the name of the bootloader file, in addition to its usual role of providing dynamic IP addresses. The address of the TFTP server is specified in the DHCP configuration file as “next-server”, and the bootloader file is specified as “filename”. The architecture option can be used to detect the architecture of the client system and serve the correct version of the grub bootloader (x86, ia32, arm, etc.).

An example of the PXE portion of dhcpd.conf is:

next-server <TFTP_Server_IP>;

# x86 UEFI
if option arch = 00:06 {
   filename "grub2/bootx64.efi";
# x64 UEFI
} else if option arch = 00:07 {
   filename "grub2/bootx64.efi";
# ARM 64-bit UEFI
} else if option arch = 00:0b {
   filename "grub2/grubaa64.efi";
} else {
   filename "pxelinux.0";
}
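The component list above names dnsmasq as the DHCP/TFTP server; the equivalent dnsmasq configuration is sketched below. The option names (enable-tftp, tftp-root, dhcp-match, dhcp-boot) are standard dnsmasq syntax, but the tag names and paths are placeholders to adapt to your infrastructure:

```
# /etc/dnsmasq.d/pxe.conf (sketch)
enable-tftp
tftp-root=/local/tftp

# Match the client architecture (DHCP option 93) and serve the right loader
dhcp-match=set:efi-x86,option:client-arch,6
dhcp-match=set:efi-x86_64,option:client-arch,7
dhcp-match=set:efi-aa64,option:client-arch,11

dhcp-boot=tag:efi-x86,grub2/bootx64.efi
dhcp-boot=tag:efi-x86_64,grub2/bootx64.efi
dhcp-boot=tag:efi-aa64,grub2/grubaa64.efi
dhcp-boot=pxelinux.0
```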

Optional: Configure CX-4/5/6/7 cards to PXE boot

DGX-Servers may also PXE boot using the MLNX CX-4/5/6/7 cards. If you are logged into the DGX-Server host OS and running DGX Base OS 4.4 or later, you can perform this section’s steps using the “/usr/sbin/mlnx_pxe_setup.bash” tool, which enables the UEFI PXE ROM of every MLNX InfiniBand device found.

Otherwise, proceed with the manual steps below.

Query UEFI PXE ROM state

In order to PXE boot from the MLNX CX-4/5/6/7 cards, you must first enable the UEFI PXE ROM of the card you wish to PXE boot from, because it is disabled by default. This must be performed from the DGX-Server host OS itself; it can’t be done remotely.

DGX OS 6 provides the in-tree OFED stack by default, but users may optionally install MOFED on top. The commands used to query and enable the UEFI PXE ROM will differ based on whether you are using the in-tree OFED vs. MOFED stack.

MOFED Instructions

To determine the device name and current configurations of the MLNX CX cards, run “sudo mlxconfig query”:

user@dgx1server$ sudo mlxconfig query

Device #1:
----------

Device type:    ConnectX4
Name:           MCX455A-ECA_Ax
Description:    ConnectX-4 VPI adapter card; EDR IB (100Gb/s) and 100GbE; single-port QSFP28; PCIe3.0 x16; ROHS R6
Device:         /dev/mst/mt4115_pciconf3

Configurations:                              Next Boot
         ...
         ...
         EXP_ROM_UEFI_x86_ENABLE             False(0)
         ...
         ...

In-tree OFED Instructions

To determine the device name and current configurations of the MLNX CX cards, run "sudo mstconfig query":

user@dgxserver:~$ sudo mstconfig query

Device #1:
----------

Device type:    ConnectX7
Name:           MCX755206AS-NEA_Ax
Description:    NVIDIA ConnectX-7 VPI adapter card; 400Gb/s IB and 200GbE; dual-port QSFP; PCIe 5.0 x16 with x16 PCIe extension option; dual slot; secure boot; no crypto; tall bracket for Nvidia DGX storage
Device:         /sys/bus/pci/devices/0000:b1:00.0/config

Configurations:                              Next Boot
         ...
         ...
         EXP_ROM_UEFI_x86_ENABLE             False(0)
         ...
         ...

Enable UEFI PXE ROM

The "EXP_ROM_UEFI_x86_ENABLE" configuration must be set to True(1) for the MLNX CX card that you wish to PXE boot from, followed by a reboot.

MOFED Instructions

user@dgx1server$ sudo mlxconfig -y -d /dev/mst/mt4115_pciconf3 set EXP_ROM_UEFI_x86_ENABLE=1
user@dgx1server$ sudo reboot

Upon reboot, confirm the configuration was set.

user@dgx1server$ sudo mlxconfig query

Device #1:
----------

Device type:    ConnectX4
Name:           MCX455A-ECA_Ax
Description:    ConnectX-4 VPI adapter card; EDR IB (100Gb/s) and 100GbE; single-port QSFP28; PCIe3.0 x16; ROHS R6
Device:         /dev/mst/mt4115_pciconf3

Configurations:                              Next Boot
         ...
         ...
         EXP_ROM_UEFI_x86_ENABLE             True(1)
         ...
         ...
In-tree OFED Instructions

user@dgxserver:~$ sudo mstconfig -y -d b1:00.0 set EXP_ROM_UEFI_x86_ENABLE=1
user@dgxserver:~$ sudo reboot

Upon reboot, confirm the configuration was set.

user@dgxserver:~$ sudo mstconfig query

Device #1:
----------

Device type:    ConnectX7
Name:           MCX755206AS-NEA_Ax
Description:    NVIDIA ConnectX-7 VPI adapter card; 400Gb/s IB and 200GbE; dual-port QSFP; PCIe 5.0 x16 with x16 PCIe extension option; dual slot; secure boot; no crypto; tall bracket for Nvidia DGX storage
Device:         /sys/bus/pci/devices/0000:b1:00.0/config

Configurations:                              Next Boot
         ...
         ...
         EXP_ROM_UEFI_x86_ENABLE             True(1)
         ...
         ...

Optional: Configure the DGX-Server to PXE boot automatically

Add PXE to the top of the UEFI boot order

On systems that have a BMC, you can specify the DGX-Server to PXE boot by adding it to the top of the UEFI boot order. This may be done out-of-band via IPMI.

ipmitool -I lanplus -H <DGX_BMC_IP> -U <ADMIN> -P <PASSWORD> chassis bootdev pxe options=efiboot

Note

This only sets the DGX-Server to PXE boot; it doesn’t specify the order of network devices to attempt PXE from. That is a limitation of the current UEFI and BMC firmware. See the following section to specify the network device boot order.

Configure network boot priorities

The UEFI Network Drive BBS Priorities menu allows you to specify the order of network devices to PXE boot from. To modify it, reboot your DGX-Server and enter the UEFI boot selection menu by pressing “F2” or “Del” when you see the splash screen. Navigate to the “Boot” menu, and then scroll down to “UEFI NETWORK Drive BBS Priorities”.

Configure the order of devices to attempt network boots from using this menu.

Save and Exit

Once you’ve finished ordering the network boot priorities, save your changes and reset.

Make the DGX-Server PXE boot

Automated PXE Boot Process

If you’ve followed the optional steps above, then you can simply reboot, and UEFI will attempt to PXE boot using the devices in the order specified in the Network Drive BBS Priorities list.

Manual PXE Boot Process

If you want to manually trigger the PXE boot, then reboot your DGX-Server and enter the UEFI boot selection menu by pressing “F2” or “Del” when you see the splash screen.

Navigate to the “Save & Exit” menu, scroll down to the Boot Override section, and choose the appropriate network port to boot from. The MLNX cards will only appear if you enabled the UEFI PXE ROM of that particular card.

Alternatively, you can press “F12” at the SBIOS splash screen, and the SBIOS will iterate through each NIC and try PXE on each one. The order of the NICs attempted is specified by the Network Drive BBS Priorities.

Other IPMI boot options

For more information about specifying the boot order via IPMI, see the “ipmitool” man page, in particular the “chassis” command and its “bootdev” subcommand: https://linux.die.net/man/1/ipmitool

For more information about the IPMI specification, refer to Intelligent Platform Management Interface Specification v2.0 rev. 1.1.

Autoinstall Customizations

The Base OS 6.x installer has undergone major changes compared to the Base OS 5.x installer. It now uses subiquity, which supports autoinstall, instead of curtin.

Autoinstall and curtin both serve similar purposes but have some syntactic differences – be aware of these when porting old curtin files. There are many autoinstall files that users can reference inside the Base OS 6.x ISO; these are contained in:

casper/ubuntu-server-minimal.ubuntu-server.installer.kernel.nvidia.squashfs

Users can mount the ISO and then mount this squashfs to view the many autoinstall files that are packed within:

mkdir -p /tmp/iso_mnt
mkdir -p /tmp/squash_mnt
sudo mount /path/to/DGXOS-<version>-<date>.iso /tmp/iso_mnt/
sudo mount /tmp/iso_mnt/casper/ubuntu-server-minimal.ubuntu-server.installer.kernel.nvidia.squashfs /tmp/squash_mnt/
find /tmp/squash_mnt/ai/ -name '*.yaml'

For some deployments, users may want to use their own autoinstall files. This section will describe some sections contained in the built-in autoinstall files as well as how to perform some common customizations.

Note

The installer expects a unified autoinstall file rather than the typical split vendor/user/meta-data format. This means that the user-supplied autoinstall file will need to account for some formatting differences – namely, the autoinstall: keyword needs to be dropped and the indentations adjusted accordingly:

#
# typical user-data file
#
#cloud-config
autoinstall:
  version: 1
  identity:
    realname: 'DGX User'
    username: dgxuser
    password: '$6$g3vXaGj.MQpP/inN$l6.JtAueRAfMtQweK7qASjxXiEX8Vue3CvRcwON81Rt9BJmlEQKtnfOVSnCqHrTsy88PbMDDHq6k.iM6PWfHr1'

#
# unified autoinstall file
#
version: 1
identity:
  realname: 'DGX User'
  username: dgxuser
  password: '$6$g3vXaGj.MQpP/inN$l6.JtAueRAfMtQweK7qASjxXiEX8Vue3CvRcwON81Rt9BJmlEQKtnfOVSnCqHrTsy88PbMDDHq6k.iM6PWfHr1'
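The conversion above can be scripted. This is a sketch that assumes the top-level autoinstall keys are indented by exactly two spaces under autoinstall:; the file paths are placeholders, and the result should be reviewed before use:

```shell
# Sketch: convert a split-format user-data file into the unified format.
# A sample input (placeholder path) standing in for a real user-data file:
cat > /tmp/user-data <<'EOF'
#cloud-config
autoinstall:
  version: 1
  identity:
    realname: 'DGX User'
    username: dgxuser
EOF

# Drop the "#cloud-config" header and "autoinstall:" key, then dedent
# the remaining lines by two spaces.
sed -e '/^#cloud-config$/d' -e '/^autoinstall:$/d' -e 's/^  //' \
    /tmp/user-data > /tmp/unified-autoinstall.yaml
```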

NVIDIA-Specific Autoinstall Variables

The autoinstall files contained in the ISO are platform-specific and serve as a good starting point for custom versions. Many of them contain variables, prefixed with CHANGE_, which will be substituted by the installer:

  • CHANGE_STORAGE_REG The marker is removed and its stanza uncommented when the “ai-encrypt-root” boot parameter is not present. Uncommenting this stanza results in the standard disk partitioning scheme without LUKS encryption.

  • CHANGE_STORAGE_ENC The marker is removed and its stanza uncommented when the “ai-encrypt-root” boot parameter is present. Uncommenting this stanza results in an encrypted root partition.

  • CHANGE_BOOT_DISK_NAME_x This is a disk name, without the “/dev” prefix. There may be multiple (e.g., CHANGE_BOOT_DISK_NAME_1 and CHANGE_BOOT_DISK_NAME_2) for platforms that expect a RAIDed boot device, as is the case for DGX-2 and DGX A100.

    Note

    The installer will find the appropriate disk name to substitute here. Alternatively, the “force-bootdisk” parameter can be used to specify the disk name(s).

  • CHANGE_BOOT_DISK_PATH_x This is the same as the CHANGE_BOOT_DISK_NAME_x variable above, except that it is prefixed with “/dev/”.

  • CHANGE_DESC_PLATFORM The installer will substitute this with a platform-specific descriptive name.

  • CHANGE_SERIAL_NUMBER The installer will substitute this with the serial number reported by dmidecode.

  • CHANGE_INSTALL_PKGS The installer will substitute this value with a list of packages specific to the platform. The lists of packages are specified by the *-pkgs files in the squashfs.

  • CHANGE_REBUILD_RAID This gets replaced with either “true” or “false” based on whether or not the “rebuild-raid” boot parameter is present.

  • CHANGE_IPMISOL This gets replaced with either “true” or “false” based on whether or not the “ai-encrypt-root” boot parameter is present. When we set the system up with encryption, we also undo the IPMI serial-over-LAN configuration to ensure that the LUKS passphrase prompt shows up on the console rather than the serial-over-LAN interface.

Attention

While it is possible to replace these values on your own, we strongly recommend letting the installer handle this.

Common Customizations

In this section, we will describe some common customizations that may be useful in more custom deployments.

Network Configuration

To configure the network at install time, you can add a “network” section to your autoinstall file. In this example we will create a netplan configuration file that sets the enp1s0f0 interface to use DHCP:

network:
  version: 2
  ethernets:
    enp1s0f0:
      dhcp4: yes
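For a static address instead of DHCP, the same section might look like the sketch below. The addresses, gateway, and nameserver are placeholders for a hypothetical 10.0.0.0/24 network:

```
network:
  version: 2
  ethernets:
    enp1s0f0:
      addresses: [10.0.0.50/24]
      routes:
        - to: default
          via: 10.0.0.1
      nameservers:
        addresses: [10.0.0.2]
```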

Creating a User

To create a user at install time, you can add an “identity” section to your autoinstall file. In this example, we set the system’s hostname to “dgx” and create a user with the name/password of nvidia/nvidia.

#  To generate an encrypted password:
#    printf '<plaintext_password>' | openssl passwd -6 -stdin
#
#  For example:
#    printf 'nvidia' | openssl passwd -6 -stdin
identity:
  hostname: dgx
  password: $6$8fqF54QDoaLMtDXJ$J02iNH1xW9hHtzH6APpUX4X4HkRx2xY2ZKy9DQpGOQhW7OOuTk3DwHr9FnAAh1JIyqn3L277Jy9MEzW4MyVsV0
  username: nvidia

There are many more examples documented in the Ubuntu autoinstall reference.