Using Containers

Containers provide a way to encapsulate all the software dependencies of an application and enable it to be deployed on different systems. Containers are the preferred way to run applications on the DGX SuperPOD.

The DGX SuperPOD is deployed with two tools, Pyxis and Enroot, that simplify the secure use of containers. Pyxis extends Slurm so that jobs can be launched directly into a container with srun. Enroot is a lightweight container runtime that runs traditional container images in unprivileged mode.

Examples

Here are some example commands for working with user containers:

  • Submit a job to Slurm on a worker node.

    srun grep PRETTY /etc/os-release
    PRETTY_NAME="Ubuntu 20.04.4 LTS"
    
  • Submit a job to Slurm and launch it in a container.

    The --container-image option specifies which container image to use.

    srun --container-image=centos grep PRETTY /etc/os-release
    PRETTY_NAME="CentOS Linux 7 (Core)"
    
  • Mount a file from the host and run the command on it from inside the container.

    srun --container-image=nvcr.io/nvidia/pytorch:22.12-py3 --container-mounts=/etc/os-release:/host/os-release grep PRETTY /host/os-release
    pyxis: importing docker image: nvcr.io/nvidia/pytorch:22.12-py3
    pyxis: imported docker image: nvcr.io/nvidia/pytorch:22.12-py3
    PRETTY_NAME="Ubuntu 20.04.4 LTS"
    
  • The --container-mounts option can mount both files and directories into the container environment. Separate multiple mounts with commas.

    srun -N 2 --ntasks-per-node=1 --container-image=nvcr.io/nvidia/pytorch:22.12-py3 --container-mounts=/etc/os-release:/host/os-release grep PRETTY /host/os-release
    pyxis: imported docker image: nvcr.io/nvidia/pytorch:22.12-py3
    pyxis: imported docker image: nvcr.io/nvidia/pytorch:22.12-py3
    
  • Submit the same command across two nodes, mounting the current directory as /work in the container.

    Note that Enroot names registry images differently from Docker: it expects the registry hostname (nvcr.io in this case) to be separated from the rest of the image name by a #, not a slash (/).

    srun -N 2 --ntasks-per-node=1 \
        --container-image=nvcr.io/nvidia/pytorch:22.12-py3 --container-mounts="$(pwd):/work" \
        /bin/bash -c 'uname -n && cat /etc/os-release | grep PRETTY_NAME'
    dgx1
    PRETTY_NAME="Ubuntu 20.04.5 LTS"
    dgx2
    PRETTY_NAME="Ubuntu 20.04.5 LTS"
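
The two naming conventions mentioned in the examples above (a # between the registry hostname and the image path, and commas between multiple mounts) can be sketched as small shell helpers. This is a minimal illustration, not part of Pyxis or Enroot: the function names to_enroot_image and join_mounts are hypothetical, the registry check (a dot or colon in the first path component) is a simplifying assumption, and the script only prints the resulting srun command rather than running it. Whether a given Pyxis version accepts the # form, the slash form, or both should be checked against the installed version.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; they build strings and never
# contact Slurm or a container registry.

# Convert a Docker-style reference (registry/path:tag) to the Enroot-style
# form (registry#path:tag). Assumption: the first component is a registry
# hostname only if it contains a dot or a colon; otherwise the reference
# is returned unchanged.
to_enroot_image() {
    image=$1
    case $image in
        */*) first=${image%%/*}; rest=${image#*/} ;;
        *)   printf '%s\n' "$image"; return ;;
    esac
    case $first in
        *.*|*:*) printf '%s#%s\n' "$first" "$rest" ;;
        *)       printf '%s\n' "$image" ;;
    esac
}

# Join SRC:DST mount specifications with commas, the separator used
# between multiple mounts in --container-mounts.
join_mounts() {
    joined=
    for spec in "$@"; do
        joined=${joined:+$joined,}$spec
    done
    printf '%s\n' "$joined"
}

image=$(to_enroot_image nvcr.io/nvidia/pytorch:22.12-py3)
mounts=$(join_mounts /etc/os-release:/host/os-release "$PWD:/work")

# Dry run: print the command instead of executing it.
echo srun --container-image="$image" --container-mounts="$mounts" \
    grep PRETTY /host/os-release
```

Removing the echo on the last command would submit the job for real on a system where Slurm and Pyxis are installed.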
    

Further resources are available at these links:

  • For a tutorial on running a multi-node Pyxis/Enroot BERT container, see this guide.

  • For a hello-world tutorial on using MPI to run multi-GPU and multi-node jobs, see this guide.

  • For a tutorial on running a multi-node machine learning job using Dask on Slurm, see this guide.