IT Administrator#
When an AI Practitioner or a Software Engineer requests a VM for experimentation or deployment, the IT Administrator creates a new VM from a VM template and serves it to the AI Practitioner. A VM template is a master copy image of a virtual machine that includes VM disks, virtual devices, and software settings. Templates save time and avoid errors when configuring settings for AI workflows. They also ensure that VMs are consistent and standardized when they are created and deployed across the Enterprise.
This section covers creating VM templates from scratch with the required NVIDIA AI Enterprise components to perform AI training and deploy inference using Triton Inference Server. The following graphic illustrates the workflow performed by the IT Administrator.
IT Administrators can follow this four-step process to serve AI-ready VMs. This guide provides detailed steps for each part of the workflow.
Before continuing with this guide, ensure the following server requirements are met:
Minimum Server Requirements
At least one NVIDIA data center GPU in a single NVIDIA-Certified server. An A100 is recommended for training and an A30 for inference.
VMware vSphere Hypervisor (ESXi) Enterprise Plus Edition 7.0 Update 2
VMware vCenter Server 7.0 Update 2
NVIDIA AI Enterprise Host Software and Guest Driver Software 12.0 or higher with NVIDIA AI Enterprise licenses
NVIDIA AI Enterprise License System
Creating an Ubuntu 20.04 Virtual Machine#
Recommended VM configuration is as follows for both AI Training and Inference use cases:
Virtual Machine Configuration

| Setting | Value |
|---|---|
| Boot | Configured for EFI |
| OS | Ubuntu Server 20.04 HWE 64-bit |
| CPU | 16 vCPU on a single socket |
| RAM | 64 GB |
| Storage | 150 GB thin-provisioned disk |
| Network | VMXNET3 NIC |
| GPU | A100-40C (as an example) |
To proceed with this guide, create a VM with the above hardware configuration.
Please refer to the Creating Your First NVIDIA AI Enterprise VM section of the NVIDIA AI Enterprise for VMware vSphere Deployment Guide, which provides detailed steps for these requirements.
GPU partitions can be a valid option for executing deep learning workloads on Ampere-based GPUs. Examples include deep learning training workflows that use smaller sequence lengths, smaller models, or smaller batch sizes. Inference workloads typically require less GPU memory than training workflows, and the model is generally quantized to run at a lower memory footprint (INT8 or FP16). vGPU with MIG partitioning allows a single GPU to be sliced into as many as seven accelerator instances. These partitions can then be leveraged by up to seven different VMs, providing optimal GPU utilization and VM density. To turn MIG on or off on the server, please refer to the Advanced GPU Configuration section of the NVIDIA AI Enterprise for VMware vSphere Deployment Guide.
Using MIG partitions for Triton Inference Server deployments provides a better ROI for many organizations. Therefore, when using this guide, the VM can be assigned a fractional MIG profile such as A100-3-20C. Additional information on MIG is located here.
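The Advanced GPU Configuration section of the deployment guide is the authoritative reference for configuring MIG. As a quick illustration only, the host-side nvidia-smi commands below sketch how MIG mode is typically enabled and how the available GPU instance profiles can be listed; the GPU index 0 is an assumption, and on ESXi these commands rely on the NVIDIA vGPU host software being installed.
# Enable MIG mode on GPU 0 (host-side; a GPU reset or host reboot may be required)
nvidia-smi -i 0 -mig 1
# List the GPU instance profiles supported by the GPU
# (for example, the 3g.20gb profile that backs a fractional vGPU type such as A100-3-20C)
nvidia-smi mig -lgip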
Install NVIDIA Driver, Docker, and NVIDIA Container Toolkit#
After the VM is created, perform the following in the VM:
Install the NVIDIA guest driver
Install Docker
Install NVIDIA Container Toolkit (see the sketch after this list)
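As an illustration only, the commands below show one common way to install Docker and the NVIDIA Container Toolkit on Ubuntu 20.04. The package names and repository URLs reflect NVIDIA's public container-toolkit instructions for this era and should be checked against the NVIDIA AI Enterprise documentation for your release.
# Install Docker from the Ubuntu repositories (one of several supported methods)
sudo apt-get update
sudo apt-get install -y docker.io
sudo systemctl enable --now docker
# Add the NVIDIA container runtime repository and install the toolkit packages
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
After the guest driver is installed, running nvidia-smi inside the VM should show the assigned vGPU profile (for example, A100-40C), which confirms the VM is ready for the container steps below.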
Additional Application Configuration#
Once the above VM prerequisites are met, the VM needs to be further configured to execute AI training and to deploy Triton Inference Server. The following sections describe the additional application-specific configuration that is necessary, as well as the required Docker container pulls for the VM. The next steps are outlined below and will be executed inside the VM:
Create a directory to hold the dataset.
Pull the appropriate Docker containers from the NVIDIA NGC Catalog.
AutoStart application-specific services.
Configuring the VM for BERT Model Training and Inference#
Since AI Practitioners will leverage this VM for AI training, a TensorFlow container and a Triton Inference Server container are pulled from the NVIDIA NGC Catalog. This section contains detailed steps for building a BERT container on top of the TensorFlow container. We will also create a dataset folder inside the home directory of the VM and set up a systemd service that restarts the Jupyter notebook server whenever the VM is cloned or rebooted. This ensures the AI Practitioner can quickly leverage the VM, since the Jupyter notebook server will already be up and running.
Execute the following workflow steps within the VM in order to pull the containers.
Generate or use an existing API key.
Access the NVIDIA NGC Catalog.
Create a triton directory inside the VM for the AI Practitioner to host the model.
mkdir ~/triton
Pull the appropriate NVIDIA AI Enterprise containers.
Important
You will need access to NVIDIA NGC in order to pull the container images called out below.
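If the VM has not yet been authenticated against NGC, a one-time docker login is needed before the pulls will succeed. A minimal sketch, using your NGC API key as the password; the literal username $oauthtoken is how NGC expects API-key logins:
# Log in to the NGC container registry
sudo docker login nvcr.io
# Username: $oauthtoken
# Password: <your NGC API key>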
sudo docker pull nvcr.io/nvaie/tensorflow-<NVAIE-MAJOR-VERSION>:<NVAIE-CONTAINER-TAG>
sudo docker pull nvcr.io/nvaie/tritonserver-<NVAIE-MAJOR-VERSION>:<NVAIE-CONTAINER-TAG>
Note
For most AI training use cases the TensorFlow base container is sufficient. However, since we are going to use an NVIDIA pre-trained model to create a custom Conversational AI model that will be further trained on your data, additional libraries are needed. We will therefore build a container with the extra libraries on top of the NVIDIA AI Enterprise container.
Clone the repository below.
git clone https://github.com/NVIDIA/DeepLearningExamples.git
Change to the directory.
cd DeepLearningExamples/TensorFlow/LanguageModeling/BERT
Finally, build the custom Docker container.
docker build -t bert_container .
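The Dockerfile in the BERT directory defaults to a public NGC TensorFlow base image. If you want the custom image built on top of the NVIDIA AI Enterprise TensorFlow container pulled earlier, and the Dockerfile exposes a base-image build argument (many DeepLearningExamples Dockerfiles use an ARG named FROM_IMAGE_NAME, so verify the name in the Dockerfile you cloned), the build can be pointed at it. A hedged sketch:
# Build against the NVIDIA AI Enterprise TensorFlow image instead of the default base.
# FROM_IMAGE_NAME is an assumption; check the ARG name in the cloned Dockerfile.
sudo docker build --build-arg FROM_IMAGE_NAME=nvcr.io/nvaie/tensorflow-<NVAIE-MAJOR-VERSION>:<NVAIE-CONTAINER-TAG> -t bert_container .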
Create a script to run the TensorFlow and Triton Inference Server containers automatically when a template clone is deployed or when the VM reboots.
touch ~/triton-startup.sh
Add the startup commands by editing the script.
vim ~/triton-startup.sh
Add the following contents to the file.
#!/bin/bash
# Adjust /home/temp/triton to the triton directory created in your home path
docker run -d --gpus=all -v /home/temp/triton:/triton --net=host bert_container jupyter-notebook --ip='0.0.0.0' --NotebookApp.token='' --NotebookApp.base_url='/notebook/'
Make the script executable.
chmod +x ~/triton-startup.sh
Create a systemd service for automatic startup.
sudo vim /etc/systemd/system/jupyter.service
Add the following content to the service file.
[Unit]
Description=Starts Jupyter server

[Service]
# Use your home path for the script location
ExecStart=/home/nvidia/triton-startup.sh

[Install]
WantedBy=multi-user.target
Start and enable the service on reboot.
sudo systemctl start jupyter.service
sudo systemctl enable jupyter.service
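As a quick sanity check (illustrative only; the port assumes the Jupyter default of 8888, since the startup script does not override it), the service and notebook server can be verified from inside the VM:
# Confirm the service started and the container is running
sudo systemctl status jupyter.service
sudo docker ps        # bert_container should appear in the list
# The notebook server is served under the /notebook/ base URL on the Jupyter default port
curl -I http://localhost:8888/notebook/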
Create a template from the VM#
Now that the VM has been appropriately configured for AI training and for deploying inference, the final workflow step for the IT Administrator is to create a VM template that can be used to rapidly deploy VMs in the future. The IT Administrator creates a template from the VM and then clones the template to serve multiple AI Practitioners/Engineers. For this guide we will create a template from the VM, but organizations may also choose to create templates using an OVF file.
Create a Guest Customization Specification#
Guest customization specifications can be created in vCenter. These specifications are essentially XML files that contain guest operating system settings for virtual machines. When you apply a specification to the guest operating system during virtual machine cloning or deployment, you prevent conflicts that might result from deploying virtual machines with identical settings, such as duplicate DNS computer names.
Follow the VMware documentation to create a customization specification for Linux.
Create the Virtual Machine Template#
In vCenter, right-click the newly created VM -> select Clone -> select Clone to Template.
Enter a name and select a folder -> select the compute resource -> select storage -> select the guest customization specification that you created -> click Finish.
Some enterprises have both IT Administrators and DevOps Engineers, while others do not; this depends on the size of the enterprise. In enterprises without dedicated DevOps Engineers, the IT Administrator or AI Practitioner may need to proceed with the following DevOps section to deploy the model to Triton Inference Server.
Note
For large-scale production inference deployments, please refer to the Appendix – Scaling Triton Inference Server. IT Administrators can either use a traditional approach with a load balancer or use Kubernetes to deploy and autoscale Triton.