Software Image Setup#

Each node category that is provisioned must have its own software image, so that each node type gets the customizations it needs.

For the control plane nodes, each image is a clone of the default image for the respective architecture. For example, on an ARM/aarch64 based head node, ARM/aarch64 based control plane nodes clone default-image. Conversely, the image for an x86 based node is cloned from the default image that was imported in the Mixed Architecture Setup section: default-image-ubuntu2404-x86_64.

  1. For DGX GB200 nodes, a DGX OS 7 image needs to be imported or created. If a BCM11-dgx.iso is used, it already includes the image.

  2. After the images are successfully created, cm-chroot-sw-img can be used to customize the control node images according to their individual requirements.

Example: Using cm-chroot-sw-img to update/modify software images

cm-chroot-sw-img /cm/images/<image name>

Example: Using cm-chroot-sw-img with the k8s-user-image

cm-chroot-sw-img /cm/images/k8s-user-image
mounted /cm/images/k8s-user-image/dev
mounted /cm/images/k8s-user-image/dev/pts
mounted /cm/images/k8s-user-image/proc
mounted /cm/images/k8s-user-image/sys
mounted /cm/images/k8s-user-image/run
mounted /run/systemd/resolve/stub-resolv.conf -> /cm/images/k8s-user-image/run/systemd/resolve/resolv.conf

Using chroot with mounted virtual filesystems to chroot in /cm/images/k8s-user-image....

Type 'exit' or ctrl-D to exit from the chroot in the software image.

This also unmounts the above mentioned /dev /dev/pts /proc /sys /run filesystems in the software image.

root@k8s-user-image:/# apt update
root@k8s-user-image:/# history
1 apt update
2 apt install cmdaemon
3 dpkg -l | grep cmdaemon

   # <hit Ctrl+D to exit>

If scripts or other files need to be present on a node after it is provisioned, copy them into the image at:

/cm/images/<node image>/<regular linux file directory>
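The copy step above can be sketched as a small shell function that places one file under an image directory and creates any missing parent directories. The function name `stage_in_image` and the paths in the usage note are hypothetical; only the /cm/images/<node image> layout comes from this section.

```shell
# Hypothetical helper: copy one file into a software image tree.
# Usage: stage_in_image <image root> <source file> <absolute path inside image> [mode]
stage_in_image() {
    local image_root="$1" src="$2" dest="$3" mode="${4:-0644}"

    # Refuse to continue if the image directory or the source file is missing.
    [ -d "$image_root" ] || { echo "no such image directory: $image_root" >&2; return 1; }
    [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }

    # Build the target path under the image root and create parent directories.
    local target="${image_root%/}/${dest#/}"
    mkdir -p "$(dirname "$target")" || return 1

    # Copy the file and set its permissions.
    cp "$src" "$target" && chmod "$mode" "$target"
}
```

For example, `stage_in_image /cm/images/k8s-user-image ./setup.sh /usr/local/bin/setup.sh 0755` would place an assumed local script at /usr/local/bin/setup.sh inside that image.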

Reference: Adding bonding module to an image

cmsh -c "softwareimage; use default-image; kernelmodules; add bonding;commit"

Example: Setting the kernel parameters for DGX GB200 nodes.

cmsh -c 'softwareimage; use <software image name>; set kernelparameters "nouveau.modeset=0 iommu.passthrough=1 systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller init_on_alloc=0 numa_balancing=disable acpi_power_meter.force_map_on=y"; commit'

Example: Disabling multipathing on NVMe devices

Check the current kernel parameters:

cmsh -c "softwareimage; use <software image name>; get kernelparameters"

If the multipath argument is not present, append it to the kernel parameters and commit the change:

cmsh -c 'softwareimage; use <software image name>; append kernelparameters " nvme_core.multipath=n"; commit'

Verify the result:

cmsh -c "softwareimage; use <software image name>; get kernelparameters"

rd.driver.blacklist=nouveau nvme_core.multipath=n

Note

If the appended parameter is not the first entry, start the appended string with a space so that it does not run into the previous parameter.
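The check-then-append logic and the spacing rule can be sketched as a pure string helper. It is an illustration only: the function name `with_kernel_param` is hypothetical, and it works on a parameter string that would be obtained separately (for example, from the `get kernelparameters` output) rather than calling cmsh itself.

```shell
# Hypothetical helper: print the kernel parameter string that results from
# adding one parameter, skipping the append when it is already present.
# Usage: with_kernel_param "<current parameters>" "<parameter>"
with_kernel_param() {
    local current="$1" param="$2"

    # Pad both sides with spaces so that only whole-word matches count.
    case " $current " in
        *" $param "*) printf '%s\n' "$current"; return 0 ;;
    esac

    if [ -z "$current" ]; then
        # First entry: no separating space is needed.
        printf '%s\n' "$param"
    else
        # Later entries: a single space separates the parameters.
        printf '%s\n' "$current $param"
    fi
}
```

Calling `with_kernel_param "rd.driver.blacklist=nouveau" "nvme_core.multipath=n"` prints `rd.driver.blacklist=nouveau nvme_core.multipath=n`, which matches the result shown above.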

Control Plane Software Image Setup#

For NVIDIA Mission Control 2.0, the control nodes that need to be provisioned are:
  • slogin

  • k8s-user

  • k8s-admin

Depending on the microarchitecture of the head node and the control node, the following generic guidance should be followed:

Example: Generic software image clone (Head node and control node of the same microarchitecture)

cmsh -c "softwareimage; use default-image; clone default-image <control node type>-image; commit"

Example: Generic software image clone (ARM/aarch64 head node with x86 control node)

cmsh -c "softwareimage; use default-image-ubuntu2404-x86_64; clone default-image-ubuntu2404-x86_64 <control node type>-image; commit"

Example: Generic software image clone (x86 head node with ARM/aarch64 control node)

cmsh -c "softwareimage; use default-image-ubuntu2404-aarch64; clone default-image-ubuntu2404-aarch64 <control node type>-image; commit"
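The three cases above reduce to one rule: clone default-image when the architectures match, otherwise clone the imported image for the control node's architecture. A minimal sketch of that rule follows. The function names are hypothetical, the architecture strings `x86_64` and `aarch64` are assumed to follow `uname -m` conventions, and `clone_command` only prints the command instead of running it.

```shell
# Hypothetical helper: print the name of the image to clone from.
# Usage: clone_source_image <head node arch> <control node arch>
clone_source_image() {
    local head_arch="$1" node_arch="$2"

    # Only the two architectures discussed in this section are handled.
    case "$node_arch" in
        x86_64|aarch64) ;;
        *) echo "unsupported architecture: $node_arch" >&2; return 1 ;;
    esac

    if [ "$head_arch" = "$node_arch" ]; then
        # Same architecture: the head node's default image applies.
        echo "default-image"
    else
        # Mixed architecture: use the image imported for the other architecture.
        echo "default-image-ubuntu2404-${node_arch}"
    fi
}

# Print (do not run) the clone command for a control node type.
# Usage: clone_command <head node arch> <control node arch> <control node type>
clone_command() {
    local src
    src="$(clone_source_image "$1" "$2")" || return 1
    printf 'cmsh -c "softwareimage; use %s; clone %s %s-image; commit"\n' "$src" "$src" "$3"
}
```

Printing the command first makes it possible to review the source image name before anything is committed on the head node.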

For DGX SuperPOD Reference Architectures (RA), do the following:

Slurm Login (slogin)#

The slogin nodes are aarch64/ARM based in the RA. In the field, they can be either aarch64/ARM or x86 based. The following example assumes the head node and the slogin node are both on aarch64/ARM microarchitecture.

cmsh -c "softwareimage; use default-image; clone default-image slogin-image; commit"

K8s-admin#

In NMC 2.0, the k8s-admin control plane nodes are x86 only, because NMX-M is supported only on x86. Because the head node in the RA is C2/ARM based, an x86 vanilla image must be created before this step; this is covered in the mixed architecture setup instructions. The resulting image is applied to all nodes that are assigned to the k8s-admin category.

cmsh -c "softwareimage; use default-image-ubuntu2404-x86_64; clone default-image-ubuntu2404-x86_64 k8s-admin-image; commit"

K8s-user#

In NMC 2.0, the k8s-user nodes can be either x86 or aarch64/ARM, since Run.ai can be installed on either microarchitecture. The following example assumes the head node and the k8s-user node are both on aarch64/ARM microarchitecture.

cmsh -c "softwareimage; use default-image; clone default-image k8s-user-image; commit"

DGX GB200 Software Image Setup#

If the DGX OS 7 image is not included directly in the ISO, it can be created from the packages in the ISO.

  1. Download the BCM ISO for ARM64, as described earlier in this document.

  2. Create the image from the ISO using the cm-create-image tool.

cm-create-image --cmdvd /root/bcm-11.0-ubuntu2404-dgx-os-7.2_arm64_RC3.iso \
--no-cm-cuda-repo --extra-pkg-group doca_ofed_2.10.0-093520 --dgx \
--dgx-type dgx_gb200 --imagename dgxos-7.2-image

Reference: Results of DGX OS 7 image creation

Running validate base tar........................ [  OK  ]
Running sanity check............................. [  OK  ]
Running unpack base tar.......................... [  OK  ]
******************** IMPORTANT ****************************
Please confirm that the base distribution repositories for
the software image are enabled. For instructions on how to
enable repositories for your software image, please refer
the administrator's manual.


Image creation can be resumed in one of the following ways:
-----------------------------------------------------------
1. Enter 'e' to exit, and configure repositories.
   Then, restart program with the -d (--fromdir) option.
   cm-create-image -d /cm/images/dgxos-7.2-image -n dgxos-7.2-image

2. Open a new console, and configure repositories.
   Then enter 'c' on this console, to continue software
   image creation.

***********************************************************

Continue(c)/Exit(e)? c


Finalize base distribution....................... [ OK ]
Copying cm repo files............................ [ OK ]
Validating repo configuration.................... [ OK ]
Installing distribution packages................. [ OK ]
Finalizing image services........................ [ OK ]
Installing CM packages........................... [ OK ]
Finalizing cluster services...................... [ OK ]
Copying cluster certificate to image............. [ OK ]
Adding/Updating software image................... [ OK ]

Software image summary#

When complete, the available software images should resemble the following example:

cmsh -c "softwareimage;list"
Name (key)                       Path (key)                                  Kernel version         Nodes
-------------------------------  ------------------------------------------  ---------------------  -----
k8s-admin-image                  /cm/images/k8s-admin-image                  6.8.0-51-generic       3
baseos7-image                    /cm/images/baseos7-image                    6.8.0-1021-nvidia-64k  0
default-image                    /cm/images/default-image                    6.8.0-51-generic-64k   0
default-image-ubuntu2404-x86_64  /cm/images/default-image-ubuntu2404-x86_64  6.8.0-51-generic       0
k8s-user-image                   /cm/images/k8s-user-image                   6.8.0-51-generic-64k   3
slogin-image                     /cm/images/slogin-image                     6.8.0-51-generic-64k   2
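As a quick sanity check of the listing, the following sketch reads the list output on standard input and reports any expected image names that are absent from the first column. The function name `missing_images` is hypothetical, and the sketch assumes the whitespace-separated column layout shown in the example above.

```shell
# Hypothetical helper: read image list output on stdin and print every
# expected image name that does not appear in the first column.
# Usage: cmsh -c "softwareimage;list" | missing_images <name>...
missing_images() {
    local listing name status=0

    # Keep only the first column (the image name) of each line.
    listing="$(awk '{print $1}')"

    for name in "$@"; do
        # -x matches the whole line, -F treats the name as a fixed string.
        if ! printf '%s\n' "$listing" | grep -qxF -- "$name"; then
            echo "$name"
            status=1
        fi
    done

    # Non-zero exit status when at least one image is missing.
    return "$status"
}
```

For the control plane images in this section, the call would be `cmsh -c "softwareimage;list" | missing_images slogin-image k8s-user-image k8s-admin-image`; no output means all three are present.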