Power On and Provision the Cloud Nodes
Now that the required post-installation configuration is complete, it is time to power on and provision the public cloud nodes. Public cloud node behavior differs slightly from on-premises equipment: the systems are not provisioned in the target public cloud until they are first powered on. Additionally, the cloud director node must be powered on and provisioned first; until it is fully provisioned, the public cloud nodes it manages in a region cannot be deployed. Just as with on-premises deployments, the public cloud nodes can be accessed through ssh during the installation process.
If you are unsure of a given node's deployment state, watch the /var/log/messages and /var/log/node-installer log files to verify that everything is proceeding smoothly.
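A simple way to follow both logs in real time, assuming the default log paths named above, is with tail:

# Follow the system and node-installer logs in real time (Ctrl+C to stop)
tail -f /var/log/messages /var/log/node-installer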
Power on the cloud director.
It will enter a [ PENDING ] state, then transition to [ DOWN ] (Instance has started).
cmsh
power on us-west-2-director
The provisioning of the cloud director may take two or more hours due to the tens of gigabytes of software image data that must be synchronized to the public cloud. The process is complete when the cloud director moves to an [ UP ] state.
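While waiting, the director's state can also be polled from the on-premises head node. The one-liner below is a minimal sketch, assuming cmsh's -c option and the device-mode status command:

# Query the cloud director's current state without an interactive cmsh session
cmsh -c "device; status us-west-2-director"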
Power on the four public cloud nodes concurrently.
After the cloud director is fully provisioned, bringing up the other four public cloud nodes is much faster because their base images are already stored in the target region with the cloud director.
% power on -n us-west-2-knode00[1-3],us-west-2-gpu-node001
Run device then list to ensure all public cloud nodes are in an [ UP ] state.
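The same check can be run non-interactively from the head node; the grep pattern below is only an assumption based on the hostnames used in this walkthrough:

# List all devices and filter for the public cloud nodes in this deployment
cmsh -c "device; list" | grep us-west-2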
Disregard any trailing Status output.
Install the NVIDIA driver on us-west-2-gpu-node001.
ssh to it as root and run all subsequent commands from the node in AWS.
ssh us-west-2-gpu-node001
apt install linux-headers-$(uname -r)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb
dpkg -i cuda-keyring_1.0-1_all.deb
apt update
apt install -y cuda-drivers --no-install-recommends
rm cuda-keyring_1.0-1_all.deb
nvidia-smi
Look for output from nvidia-smi like the following, which shows a successful installation. Expect possible variations in software versions and device utilization.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   36C    P8    15W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
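For a more script-friendly confirmation, nvidia-smi can also report only the GPU model and driver version; this is a minimal sketch using standard nvidia-smi query options rather than part of the procedure above:

# Print just the GPU model and installed driver version
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader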
Log out of the public cloud GPU node and back into the on-premises head node.
Execute the following to capture the modifications made to the public cloud GPU node, which will then be present in the image of any additional public cloud GPU nodes provisioned in this environment.
cmsh
device
use us-west-2-gpu-node001
grabimage -w