NVIDIA Clara Train 4.1

Clara Train - Getting started with a Cloud Service Provider

Researchers who do not have access to a local GPU workstation or server can easily get started with the Clara Train SDK using a GPU-enabled instance from a Cloud Service Provider (CSP). The following services have been validated; other CSPs may be used with similar configurations:

  • Amazon Web Services (AWS)

  • Google Cloud Platform Services (GCP)

  • Microsoft Azure Cloud Services (Azure)

Hardware:

OS and software stack:

Network and security:

  • Ports
    • SSH - 22

    • JupyterLab - 8888 by default, or 8890 as used in the examples below

    • AIAA - 5000

    • HTTP/HTTPS

  • SSH authentication via shared key identity

AWS

An overview of NGC on AWS can be found in the NGC Docs at https://docs.nvidia.com/ngc/ngc-deploy-public-cloud/ngc-aws/index.html.

GCP

An overview of NGC on GCP can be found in the NGC Docs at https://docs.nvidia.com/ngc/ngc-deploy-public-cloud/ngc-gcp/index.html.

To use an external SSH client, enable OS Login by editing the VM instance configuration (a gcloud equivalent is sketched after this list) to add:

  • the enable-oslogin = TRUE metadata key/value pair

  • your public SSH key
    • It may be necessary to use the web-based SSH console to manually add your public key to ~/.ssh/authorized_keys

    • When the SSH key is added manually, you will also have to enable passwordless sudo access by running visudo and adding: <username> ALL=(ALL) NOPASSWD:ALL
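
For reference, a rough gcloud CLI equivalent of the console edits above (the instance name, zone, and key file path are placeholders; this assumes the gcloud CLI is installed and authenticated):

# set the OS Login metadata key on the instance
gcloud compute instances add-metadata <instance-name> \
    --zone <zone> --metadata enable-oslogin=TRUE

# associate your public SSH key with your OS Login profile
gcloud compute os-login ssh-keys add --key-file ~/.ssh/id_rsa.pub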

To open the ports necessary for Clara Train AIAA and JupyterLab, create a VPC Firewall rule allowing access on ports 5000 and 8890. This can be done in two steps (a gcloud sketch follows):

  • edit the VM instance details to add a network tag

  • under network interfaces, click to view details, then use VPC Network > Firewall to create a firewall rule for ports 5000 and 8890 targeting the network tag added above
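
As a rough sketch, the same rule can be created from the gcloud CLI (the rule name, tag, instance name, and zone below are placeholders, and the default VPC network is assumed):

gcloud compute firewall-rules create clara-train-ports \
    --allow tcp:5000,tcp:8890 --target-tags clara-train
gcloud compute instances add-tags <instance-name> --zone <zone> --tags clara-train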

Some examples in the Clara Train Getting Started notebooks require docker-compose. This can be installed following the directions here: https://docs.docker.com/compose/install/
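
On Ubuntu instances, one option is the distribution package, though the packaged version may lag behind the one described in the linked guide:

sudo apt-get update && sudo apt-get install -y docker-compose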

Note

GCP Ubuntu instances set a default umask of 0077, which can cause issues for users mounting directories in the Clara Train docker container. This can be changed by setting:

umask 0002

To make this persistent, add umask 0002 to your user ~/.bashrc configuration. If there are existing files with restrictive permissions, you may need to run:

chmod -R a+rX ~/*

If there are files with root ownership, e.g., from running other containers, you may need sudo in the above chmod command.
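
A minimal sketch of the above, making the umask persistent and relaxing permissions on existing files:

echo "umask 0002" >> ~/.bashrc
source ~/.bashrc
chmod -R a+rX ~/*    # prepend sudo if any files are root-owned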


Azure

An overview of NGC on Azure can be found in the NGC Docs at https://docs.nvidia.com/ngc/ngc-deploy-public-cloud/ngc-azure/index.html.

Note

Azure Ubuntu instances set a default umask of 0077, which can cause issues for users mounting directories in the Clara Train docker container. This can be changed by setting:

umask 0002

To make this persistent, add umask 0002 to your user ~/.bashrc configuration. If there are existing files with restrictive permissions, you may need to run:

chmod -R a+rX ~/*

If there are files with root ownership, e.g., from running other containers, you may need sudo in the above chmod command.

The default Azure OS disk does not provide enough free space for the Docker data store, so we will instead use the large scratch disk provided with the instance. In a production scenario, a more permanent disk should be used. To use this scratch space, first create a directory:

sudo mkdir /mnt/docker

And then edit /etc/docker/daemon.json to include the data-root directory:

{ "data-root": "/mnt/docker", }


Users new to the Clara Train SDK will benefit from the Jupyter Notebooks contained in the Notebooks for Clara Train SDK collection, which walks through the basic user environment and provides scripts to pull and run the Docker image. This collection is used for demonstration here. Users already experienced with the Clara Train SDK may start directly with NVIDIA Clara Train SDK on NGC.

To get started with the Clara Train Examples Notebooks, ssh to the CSP instance, clone the github repository, and start the Clara Train Docker container:

ssh -i /path/to/identity.pem ubuntu@<IP of instance>

# now on the cloud instance
git clone https://github.com/NVIDIA/clara-train-examples.git
cd clara-train-examples/PyTorch-Early-Access/NoteBooks/scripts && chmod a+x *.sh
./startDocker.sh 8890 '0' 5000

To access the services that will run in the Clara Train container, the associated ports need to be open on the cloud instance. Using the examples in the Intro collection, we will need to open port 8890 for JupyterLab and port 5000 for AIAA. These ports can be opened in the security settings for the cloud instance, for example in AWS Security Groups (a CLI sketch follows the SSH example below). Another option is using SSH to forward these ports to your local machine:

ssh -N -L localhost:8890:localhost:8890 \
    -L localhost:5000:localhost:5000 \
    <IP address of Cloud instance>
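
If you open the ports in the instance security settings instead, a hedged AWS CLI sketch is (the security group ID and source CIDR are placeholders; restrict the source to your own IP where possible):

aws ec2 authorize-security-group-ingress --group-id <sg-id> \
    --protocol tcp --port 8890 --cidr <your-ip>/32
aws ec2 authorize-security-group-ingress --group-id <sg-id> \
    --protocol tcp --port 5000 --cidr <your-ip>/32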

The PyTorch-Early-Access/NoteBooks/readMe.md provides an overview of the configuration and the basic steps required to get started. Briefly, in this example, the ./startDocker.sh 8890 '0' 5000 command above starts the container with JupyterLab configured to use port 8890, using GPU '0', with AIAA services on port 5000. Once inside the running container, the JupyterLab services are installed and started by running:

installDashBoardInDocker.sh

This will produce output during installation of dependencies and finish with a message similar to the following:

To access the notebook, open this file in a browser:
    file:///root/.local/share/jupyter/runtime/nbserver-613-open.html
Or copy and paste one of these URLs:
    http://hostname:8888/?token=43dcac32bc4b939084db6cdaa047ef8b9771c97b8455c627

Note that the preceding output displays the default JupyterLab port 8888. We have configured the container to expose this externally on port 8890. With ports opened in the instance configuration or forwarded via SSH, you can then access JupyterLab by navigating to either http://<instance IP address>:8890 or http://localhost:8890, respectively. You will then be prompted in the JupyterLab interface for the token provided in the preceding output. After successfully authenticating with this token, you can use the Welcome.ipynb notebook to get started with the Clara Train SDK.
