Using Containers

Containers provide a way to encapsulate all the software dependencies of an application and enable it to be deployed on different systems. Containers are the preferred way to run applications on the DGX SuperPOD.

The DGX SuperPOD is deployed with two tools, Pyxis and Enroot, that simplify the secure use of containers. Pyxis extends Slurm so that jobs can be launched directly into a container with srun. Enroot is a lightweight container runtime that runs traditional container images in unprivileged mode.

Examples

Here are some example commands for working with user containers:

  • Submit a job to Slurm on a worker node.

srun grep PRETTY /etc/os-release

PRETTY_NAME="Ubuntu 20.04.4 LTS"
  • Submit a job to Slurm and launch it in a container.

The --container-image option specifies which container image to use.

srun --container-image=centos grep PRETTY /etc/os-release

PRETTY_NAME="CentOS Linux 7 (Core)"
  • Mount a file from the host and run the command on it from inside the container.

srun --container-image=nvcr.io/nvidia/pytorch:22.12-py3 \
--container-mounts=/etc/os-release:/host/os-release \
grep PRETTY /host/os-release

pyxis: importing docker image: nvcr.io/nvidia/pytorch:22.12-py3

pyxis: imported docker image: nvcr.io/nvidia/pytorch:22.12-py3

PRETTY_NAME="Ubuntu 20.04.4 LTS"
  • The --container-mounts option can be used to mount both files and directories into the container environment.

Multiple mounts should be separated by commas.

srun -N 2 --ntasks-per-node=1 \
--container-image=nvcr.io/nvidia/pytorch:22.12-py3 \
--container-mounts=/etc/os-release:/host/os-release \
grep PRETTY /host/os-release

pyxis: imported docker image: nvcr.io/nvidia/pytorch:22.12-py3

pyxis: imported docker image: nvcr.io/nvidia/pytorch:22.12-py3
  • Submit the same command across two nodes, mounting the current directory as /work in the container.

The registry portion of the image name can be written differently. In Enroot's image syntax, the registry hostname (nvcr.io in this case) is separated from the rest of the image name by a #, not a slash (/), as in nvcr.io#nvidia/pytorch:22.12-py3.

srun -N 2 --ntasks-per-node=1 \
--container-image=nvcr.io/nvidia/pytorch:22.12-py3 \
--container-mounts=$(pwd):/work \
/bin/bash -c 'uname -n && cat /etc/os-release | grep PRETTY_NAME'

dgx1

PRETTY_NAME="Ubuntu 20.04.5 LTS"

dgx2

PRETTY_NAME="Ubuntu 20.04.5 LTS"
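
The two syntax points above (comma-separated mounts and the # registry separator) can be sketched as small shell helpers. This is only an illustration: join_mounts and to_enroot_image are hypothetical names, not part of Pyxis or Enroot, and to_enroot_image assumes that everything before the first slash is the registry hostname. The script prints the resulting srun command instead of submitting it, so it can be tried without a Slurm cluster.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not provided by Pyxis or Enroot.

# join_mounts SRC:DST [SRC:DST ...]
# Joins its arguments with commas, the form --container-mounts expects.
join_mounts() {
    joined=""
    for mount in "$@"; do
        if [ -z "$joined" ]; then
            joined="$mount"
        else
            joined="$joined,$mount"
        fi
    done
    printf '%s\n' "$joined"
}

# to_enroot_image REGISTRY/IMAGE[:TAG]
# Replaces only the first slash with '#'.
# Assumption: the text before the first slash is the registry hostname.
# Names without a slash are returned unchanged.
to_enroot_image() {
    case "$1" in
        */*) printf '%s#%s\n' "${1%%/*}" "${1#*/}" ;;
        *)   printf '%s\n' "$1" ;;
    esac
}

mounts=$(join_mounts "/etc/os-release:/host/os-release" "$(pwd):/work")
image=$(to_enroot_image "nvcr.io/nvidia/pytorch:22.12-py3")

# Dry run: print the command rather than submitting it to Slurm.
echo "srun --container-image=$image --container-mounts=$mounts grep PRETTY /host/os-release"
```

Once the printed command looks right, it can be pasted into a shell on a login node. Keeping the helpers separate from the srun call makes it easy to check the mount list and image name before a job is queued.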

Further resources are available at these links: