Power On and Provision the Cloud Nodes

Now that the required post-installation configuration is complete, it is time to power on and provision the public cloud nodes. Public cloud nodes behave slightly differently from on-premises equipment: they are not provisioned in the target public cloud until they are first powered on. In addition, the cloud director must be powered on and provisioned first; until it is fully provisioned, the public cloud nodes it manages in a region cannot be deployed. Just as with on-premises deployments, the public cloud nodes can be accessed over SSH during the installation process.

If you are unsure of a given node’s deployment state, watch the /var/log/messages and /var/log/node-installer log files to verify that provisioning is proceeding smoothly.
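
For example, both files can be followed from the head node while a node is provisioning. This is a minimal sketch and assumes the default log locations named above:

    # On the head node: stream provisioning-related log output as it arrives
    tail -f /var/log/messages /var/log/node-installer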

  1. Power on the cloud director.

    It will enter a [ PENDING ] state, then transition to [ DOWN ] (Instance has started).

    cmsh
    power on us-west-2-director
    

    The provisioning of the cloud director may take two or more hours due to the tens of gigabytes of software image data that must be synchronized to the public cloud. The process is complete when the cloud director moves to an [ UP ] state.
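
    If you prefer not to keep an interactive cmsh session open during the long synchronization, the director’s state can also be polled from the head node shell. This is a sketch that assumes cmsh batch mode (-c) and the device-mode status command behave as in current releases:

    # Check the cloud director's state non-interactively from the head node
    cmsh -c "device; status us-west-2-director"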

  2. Power on the four public cloud nodes concurrently.

    After the cloud director is fully provisioned, bringing up the other four public cloud nodes is much faster because their base images are already stored in the target region with the cloud director.

    % power on -n us-west-2-knode00[1-3],us-west-2-gpu-node001
    
  3. Run device and then list in cmsh to ensure that all public cloud nodes are in an [ UP ] state (see the example below).

    (Screenshot cloud-node-01.png: device list output showing the public cloud nodes in an [ UP ] state.)

    Disregard any trailing Status output.
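
    A minimal way to do this from the head node:

    cmsh
    device
    list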

  4. Install the NVIDIA driver on us-west-2-gpu-node001.

    SSH to it as root and run all subsequent commands from the node in AWS.

    ssh us-west-2-gpu-node001
    apt install linux-headers-$(uname -r)
    distribution=$(. /etc/os-release; echo $ID$VERSION_ID | sed -e 's/\.//g')
    wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb
    dpkg -i cuda-keyring_1.0-1_all.deb
    apt update
    apt install -y cuda-drivers --no-install-recommends
    rm cuda-keyring_1.0-1_all.deb
    nvidia-smi
    
  5. Check the output of nvidia-smi; output similar to the following indicates a successful installation.

    Expect possible variations in software versions and device utilization.

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
    | N/A   36C    P8    15W /  70W |      2MiB / 15360MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
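
    As an optional extra check while still logged in to the GPU node, the installed driver can also be asked to enumerate the GPU directly (output will vary with instance type):

    # List the GPU(s) detected by the installed driver
    nvidia-smi -L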
    
  6. Log out of the public cloud GPU node and back into the on-premises head node.

  7. Execute the following to capture the modifications made to the public cloud GPU node so that they are included in the image used to provision any additional public cloud GPU nodes in this environment.

    cmsh
    device
    use us-west-2-gpu-node001
    grabimage -w
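
    The same step can be run non-interactively from the head node shell; this sketch assumes cmsh batch mode (-c) with semicolon-separated commands:

    # Grab the GPU node's local changes back into its software image in one command
    cmsh -c "device; use us-west-2-gpu-node001; grabimage -w"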