Microservices Setup#
The Fine-Tuning Microservice (FTMS) API can run on any Docker or Kubernetes platform. This section describes how to install the prerequisites for the platform setup.
Warning
Protect sensitive information by implementing robust platform access controls, including storage encryption, VPC, and firewall configuration, on both CSP and local (bare-metal) deployments. Limit infrastructure access, such as AWS account access and NVIDIA NVCF Admin access, to a small set of people. Use Vault to secure secrets on CSP or local deployments; otherwise, platform access control is the only layer of protection. The user who deploys the platform is responsible for access logs, platform usage, and cost monitoring.
Hardware#
Minimum Requirements#
1 or more GPU node(s) where all GPUs within a given node match.
32 GB of system RAM
32 GB of GPU RAM
8-core CPU
One or more NVIDIA Discrete GPU(s): Volta, Turing, Ampere, Hopper, or Blackwell architecture
60 GB of free disk space
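As a rough pre-flight check, the host-side minimums above (system RAM, CPU cores, free disk space) can be compared against what the machine reports. This is a minimal sketch for a Linux host with GNU coreutils, not part of FTMS: the function names are illustrative, the disk check assumes the root filesystem is the one that matters, and GPU memory is not checked because that requires the driver to be installed first.

```shell
#!/bin/sh
# Thresholds taken from the minimum requirements list.
MIN_RAM_GB=32
MIN_CORES=8
MIN_DISK_GB=60

# meets_minimum NAME ACTUAL REQUIRED
# Prints one status line; returns 0 only when ACTUAL >= REQUIRED.
meets_minimum() {
  if [ "$2" -ge "$3" ] 2>/dev/null; then
    echo "OK   $1: $2 (minimum $3)"
  else
    echo "FAIL $1: ${2:-unknown} (minimum $3)"
    return 1
  fi
}

# Installed RAM in GiB, rounded up because the kernel reserves part of it.
ram_gb() {
  awk '/^MemTotal:/ { printf "%d\n", ($2 + 1048575) / 1048576 }' /proc/meminfo
}

# Free space in GiB on the filesystem holding the given path (default: /).
free_disk_gb() {
  df -BG --output=avail "${1:-/}" | tail -n 1 | tr -dc '0-9'
}

status=0
meets_minimum "system RAM (GB)" "$(ram_gb)" "$MIN_RAM_GB" || status=1
meets_minimum "CPU cores" "$(nproc)" "$MIN_CORES" || status=1
meets_minimum "free disk (GB)" "$(free_disk_gb /)" "$MIN_DISK_GB" || status=1

if [ "$status" -eq 0 ]; then
  echo "All host checks passed"
else
  echo "One or more host checks failed"
fi
```

If Docker stores its data on a different filesystem, pass that path to `free_disk_gb` instead of `/`.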
Software#
NVIDIA GPU Driver (version 570)
Docker
NVIDIA Container Toolkit
AWS CLI
NGC API Keys
OS Support#
FTMS requires a Linux-based operating system. The recommended OS is Ubuntu 22.04.
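To confirm which distribution a host is running, the standard `/etc/os-release` file can be read. The following is a small sketch; the function name `is_recommended_os` is illustrative, and a non-matching result only means the host differs from the recommended OS, not that FTMS cannot run on it.

```shell
#!/bin/sh
# is_recommended_os ID VERSION_ID
# Returns 0 only for the recommended OS, Ubuntu 22.04.
is_recommended_os() {
  [ "$1" = "ubuntu" ] && [ "$2" = "22.04" ]
}

if [ -r /etc/os-release ]; then
  # Source the file in a subshell so its variables do not leak into this shell.
  os_id=$(. /etc/os-release && echo "$ID")
  os_version=$(. /etc/os-release && echo "$VERSION_ID")
  if is_recommended_os "$os_id" "$os_version"; then
    echo "Recommended OS detected: $os_id $os_version"
  else
    echo "Detected $os_id $os_version; the recommended OS is Ubuntu 22.04"
  fi
else
  echo "/etc/os-release not found; is this a Linux host?"
fi
```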
Installing Prerequisites#
The following steps assume that you are using Ubuntu 22.04 and NVIDIA GPU Driver version 570. For other Linux distributions or NVIDIA GPU driver versions, refer to the corresponding documentation.
Install the NVIDIA GPU driver. Refer to Driver Installation Guide.
# Install the NVIDIA GPU driver
sudo apt-get update && sudo apt-get install nvidia-driver-570
sudo reboot
Note
These commands install or upgrade the NVIDIA GPU Driver, then reboot the machine.
If you have a multi-GPU machine that is NVSwitch-based, you may need to also install the NVIDIA Fabric Manager.
(Optional) Verify that the NVIDIA GPU Driver is installed correctly.
nvidia-smi
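Beyond eyeballing the `nvidia-smi` table, its query mode can report the driver version and per-GPU memory in a form that is easy to compare against the requirements. This is a sketch only: `EXPECTED_MAJOR` and `driver_major` are illustrative names, and the check simply compares the leading component of the version string with the 570 branch assumed in this guide.

```shell
#!/bin/sh
EXPECTED_MAJOR=570   # driver branch assumed in this guide

# driver_major VERSION: prints the part before the first dot.
driver_major() {
  echo "${1%%.*}"
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # One line per GPU: name, driver version, total memory.
  nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
  version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
  if [ "$(driver_major "$version")" = "$EXPECTED_MAJOR" ]; then
    echo "Driver $version matches the $EXPECTED_MAJOR branch"
  else
    echo "Driver $version is not from the $EXPECTED_MAJOR branch"
  fi
else
  echo "nvidia-smi not found; the driver may not be installed"
fi
```

The `memory.total` column can be compared by hand against the 32 GB GPU RAM minimum.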
Install Docker. Refer to Install Docker Engine.
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

# Install Docker
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Add the current user to the docker group
sudo usermod -aG docker $USER
newgrp docker
Install NVIDIA Container Toolkit. Refer to Installing the NVIDIA Container Toolkit.
# Install the NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
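After the toolkit is configured and Docker has restarted, it can be worth confirming that Docker actually sees the `nvidia` runtime before moving on. The sketch below is one way to do that; `lists_nvidia_runtime` and the `RUN_GPU_CONTAINER_TEST` variable are illustrative names, not part of Docker or the toolkit. The end-to-end container test is opt-in because it pulls the `ubuntu` image.

```shell
#!/bin/sh
# lists_nvidia_runtime TEXT: returns 0 when TEXT (for example, the runtime
# list reported by Docker) mentions the nvidia runtime.
lists_nvidia_runtime() {
  case "$1" in
    *nvidia*) return 0 ;;
    *) return 1 ;;
  esac
}

if command -v docker >/dev/null 2>&1; then
  runtimes=$(docker info --format '{{json .Runtimes}}' 2>/dev/null)
  if lists_nvidia_runtime "$runtimes"; then
    echo "Docker lists the nvidia runtime"
  else
    echo "nvidia runtime not listed; re-run 'nvidia-ctk runtime configure' and restart Docker"
  fi
  # Opt-in end-to-end test: runs nvidia-smi inside a container with all GPUs.
  if [ "${RUN_GPU_CONTAINER_TEST:-0}" = "1" ]; then
    docker run --rm --gpus all ubuntu nvidia-smi
  fi
else
  echo "docker not found on PATH"
fi
```

Running the script with `RUN_GPU_CONTAINER_TEST=1` set should print the same GPU table inside the container that `nvidia-smi` prints on the host.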
Install AWS CLI. Refer to AWS CLI documentation.
# Install the AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
Obtain API keys from NGC.
Note
You must generate both an NGC personal key and an NGC legacy API key. You must generate the personal key from the NGC organization that you will be using with FTMS.
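Once the installation steps are done, a quick way to confirm that the command-line tools from the steps above are available is to check that each one resolves on `PATH`. This is a minimal sketch; `check_cmd` is an illustrative helper, and the check says nothing about the NGC keys, which have to be verified separately.

```shell
#!/bin/sh
# check_cmd NAME: reports whether NAME resolves on PATH; returns 1 if not.
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found    $1"
  else
    echo "missing  $1"
    return 1
  fi
}

missing=0
for tool in nvidia-smi docker nvidia-ctk aws; do
  check_cmd "$tool" || missing=$((missing + 1))
done
echo "$missing prerequisite tool(s) missing"
```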
Cloud Storage Setup#
Cloud storage is optional. FTMS can use it to upload model checkpoints, logs, and other training artifacts. You can also bring your datasets to FTMS via local storage.
AWS S3#
Create an S3 bucket. Refer to Getting started with Amazon S3.
Create a user with access to the S3 bucket. Refer to Create a user with administrative access.
Store the user credentials, S3 bucket region, and S3 bucket name securely. These will be used to create a cloud workspace for FTMS.
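The bucket can also be created from the command line with the AWS CLI installed earlier. The sketch below assumes the CLI is already configured with credentials; the bucket name, region, and the `DRY_RUN` switch are hypothetical placeholders. By default it only prints the command it would run. The special case for `us-east-1` reflects that this region is the default location and is not passed as a `LocationConstraint`.

```shell
#!/bin/sh
# Hypothetical values: replace with your own bucket name and region.
BUCKET_NAME="${BUCKET_NAME:-my-ftms-artifacts}"
BUCKET_REGION="${BUCKET_REGION:-us-west-2}"

# create_bucket_cmd BUCKET REGION: prints the create-bucket command.
# us-east-1 is the default location and takes no LocationConstraint.
create_bucket_cmd() {
  cmd="aws s3api create-bucket --bucket $1 --region $2"
  if [ "$2" != "us-east-1" ]; then
    cmd="$cmd --create-bucket-configuration LocationConstraint=$2"
  fi
  echo "$cmd"
}

bucket_cmd=$(create_bucket_cmd "$BUCKET_NAME" "$BUCKET_REGION")
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "Would run: $bucket_cmd"
else
  # Create the bucket, then list it to confirm that access works.
  $bucket_cmd && aws s3 ls "s3://$BUCKET_NAME"
fi
```

Set `DRY_RUN=0` to actually create the bucket. The region and bucket name used here are the same values that need to be stored for the FTMS cloud workspace.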
Azure Blob Storage#
Create a storage account. Refer to Introduction to Azure Blob Storage.
Create a user with access to the storage account. Refer to Create an Azure storage account.
Store the user credentials, storage account region, and storage account name securely. These will be used to create a cloud workspace for FTMS.
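For a command-line alternative, the storage account can be created with the Azure CLI. Note the assumptions: the Azure CLI (`az`) is not among the prerequisites listed in this guide and would need to be installed and logged in separately, and the account name, resource group, location, and `DRY_RUN` switch below are hypothetical placeholders. The sketch validates the account name first, since storage account names must be 3 to 24 characters of lowercase letters and digits.

```shell
#!/bin/sh
# Hypothetical values: replace with your own.
ACCOUNT_NAME="${ACCOUNT_NAME:-ftmsartifacts}"
RESOURCE_GROUP="${RESOURCE_GROUP:-ftms-rg}"
LOCATION="${LOCATION:-westus2}"

# valid_account_name NAME: returns 0 when NAME is 3-24 characters long
# and contains only lowercase letters and digits.
valid_account_name() {
  printf '%s' "$1" | LC_ALL=C grep -Eq '^[a-z0-9]{3,24}$'
}

if ! valid_account_name "$ACCOUNT_NAME"; then
  echo "Invalid storage account name: $ACCOUNT_NAME"
elif [ "${DRY_RUN:-1}" = "1" ]; then
  echo "Would run: az storage account create --name $ACCOUNT_NAME --resource-group $RESOURCE_GROUP --location $LOCATION --sku Standard_LRS"
else
  az storage account create --name "$ACCOUNT_NAME" --resource-group "$RESOURCE_GROUP" \
    --location "$LOCATION" --sku Standard_LRS
fi
```

Set `DRY_RUN=0` to run the command against an existing resource group.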