Amazon Web Services (AWS)#
Steps to use OneClick script to deploy on AWS.
The cloud deployment uses the same helm chart and the default topology as detailed at Default Deployment Topology and Models in Use.
Note
This release of the the OneClick scripts supports single-node deployments to AWS as documented in this page. Future releases will support more configurability as well as deployment to other Cloud Service Providers.
Prerequisites#
Host system Prerequisites#
Ubuntu 22.04
No GPU required
The below tools need to be installed on the OS of the host system for the scripts to execute
Install jq
sudo apt update
sudo apt-get install -y jq
Install yq
sudo wget https://github.com/mikefarah/yq/releases/download/v4.34.1/yq_linux_amd64 -O /usr/bin/yq
sudo chmod +x /usr/bin/yq
Install python3, venv and pip
sudo apt update
sudo apt-get install python3.10 python3-venv python3-pip -y
Install OpenTofu
sudo snap install --classic opentofu
Note
Minimum version of OpenTofu must be 1.9.0.
AWS pre-requisites#
AWS access credentials
On your AWS account, procure access key ID and secret access key for programmatic access to your AWS resources.
Prefer to obtain a non-root IAM user with administrator access.
Refer to the AWS documentation to create access key.
S3 Bucket for Backend
This script uses S3 buckets to store the references to the resources that it spins up.
Create an S3 bucket to be used to store the deployment state.
Ensure the bucket is not public accessible but rather only to your account (such as using the keys procured in the previous step).
Refer to the AWS documentation to create an S3 bucket.
DynamoDB Table for Backend
This script uses DynamoDB tables to prevent concurrent access to the same deployment as they are being spun up.
Create a DynamoDB table to be used to manage access to the deployment state.
Define the Partition key as LockID and type String.
The Sort key need not be defined.
Refer to the AWS documentation to create a dynamo db table.
1. Download OneClick deployment package#
1.1 OneClick deployment package: deploy-aws-cns.tar.gz
Obtain the NGC API Key and run the following commands.
Note
Please generate and use the Legacy NGC API Key to use curl command and download the package as instructed below.
API_KEY=<your-NGC-api-key>
TOKEN=$(curl -s -u "\$oauthtoken":"$API_KEY" -H 'Accept:application/json' 'https://authn.nvidia.com/token?service=ngc&scope=group/ngc:nvidia&group/ngc:nvidia/blueprint' | jq -r '.token')
curl -LO 'https://api.ngc.nvidia.com/v2/org/nvidia/team/blueprint/resources/vss-deployment-scripts/versions/2.3.0-aws-cns/files/deploy-aws-cns.tar.gz' -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json"
Alternatively, you can download the tar from the browser by navigating to https://catalog.ngc.nvidia.com/orgs/nvidia/teams/blueprint/resources/vss-deployment-scripts and clicking on the Download dropdown on the top right of the page and then selecting Browser (Direct Download).
1.2 Untar the package
tar -xvf deploy-aws-cns.tar.gz
Note
If you face an error like tar: This does not look like a tar archive
while untarring the package,
make sure you have access to NGC org/team <details> to download the 1-click deployment package.
Please check FAQ 1-click deploy package deploy-aws-cns.tar.gz untar failure for more info.
2. Prepare env variables#
2.1 Prepare a file, via-env-template.txt, to hold required env variables and their values:
via-env-template.txt
#Env: AWS secrets
#Env: AWS secrets
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
#Env: Nvidia Secrets:
export NGC_API_KEY=
#Env: OpenAI Secrets [Optional]:
export OPENAI_API_KEY=
#Env: App secrets
export VIA_DB_PASSWORD=password
#Non secrets:
#AWS Resources created above in Section: AWS pre-requisites
export VIA_DEPLOY_AWS_DYT='dyt-table-name'
export VIA_DEPLOY_AWS_S3B='s3-bucket-name'
export VIA_DEPLOY_AWS_S3BR='us-west-2'
#Unique name for the VSS deployment
export VIA_DEPLOY_ENV='vss-deployment'
Note
NGC_API_KEY
: refer to Obtain NGC API Key.OPENAI_API_KEY
: refer to Obtain OpenAI API Key.You may consider updating
VIA_DEPLOY_ENV
to something other than default to identify the deployment. E.g.VIA_DEPLOY_ENV='vss-demo'
Make sure to update
VIA_DEPLOY_AWS_DYT
,VIA_DEPLOY_AWS_S3B
,VIA_DEPLOY_AWS_S3BR
to reflect the AWS Resources created above in Section: AWS pre-requisites.
2.2 Load these env variables into your current shell session:
source via-env-template.txt
3. Prepare config file#
Make a copy of config-template.yml
of your own choice, eg config.yml.
Or you can populate the config file as based on definition of each attribute.
example config.yml
schema_version: '0.0.10'
name: "via-aws-cns-{{ lookup('env', 'VIA_DEPLOY_ENV') }}"
spec:
infra:
csp: 'aws'
backend:
access_key: "{{ lookup('env', 'AWS_ACCESS_KEY_ID') }}"
secret_key: "{{ lookup('env', 'AWS_SECRET_ACCESS_KEY') }}"
dynamodb_table: "{{ lookup('env', 'VIA_DEPLOY_AWS_DYT') }}"
bucket: "{{ lookup('env', 'VIA_DEPLOY_AWS_S3B') }}"
region: "{{ lookup('env', 'VIA_DEPLOY_AWS_S3BR') }}"
encrypt: true
provider:
access_key: "{{ lookup('env', 'AWS_ACCESS_KEY_ID') }}"
secret_key: "{{ lookup('env', 'AWS_SECRET_ACCESS_KEY') }}"
configs:
cns:
version: 12.2
git_ref: ebf03a899f13a0adaaa3d9f31a299cff6ab33eb3
override_values:
cns_nvidia_driver: yes
gpu_driver_version: '535.216.03'
access_cidrs:
- 'my-org-ip-cidr' ### Make sure to update this e.g. 'xxx.xxx.xxx.xxx/32' for a single IP address.
region: 'us-west-2' ### Update this to change the deployment region
ssh_public_key: "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/id_rsa.pub') }}"
ssh_private_key_path: "{{ lookup('env', 'HOME') + '/.ssh/id_rsa' }}"
additional_ssh_public_keys:
### Add public key of another system e.g. your colleague's if you are working in a team else add your own.
- "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/id_rsa.pub') }}"
clusters:
app:
private_instance: false
master:
### 8xH100 HBM3. Modify this to change the GPU type. Alternatives:
### - p4d.24xlarge (8 x A100)
### - g6e.48xlarge (8 x L40S)
type: 'p5.48xlarge'
# az: 'us-west-2c' ### Update this to change the availability zone for the deployment.
labels: {}
taints: []
# capacity_reservation_id: 'cr-foobar' ### Add this if you are using a capacity reservation.
# is_capacity_block: false ### Set this to true if using capacity_reservation_id and reservation is of type "Capacity Block for ML"
# nodes:
# <key>:
# type: 'p5.48xlarge'
# az: 'us-west-2c'
# labels: {}
# taints: []
# capacity_reservation_id: 'cr-foobar'
# is_capacity_block: false
ports:
backend:
port: 30081
frontend:
port: 30082
features:
cns: true
platform: true
app: true
platform:
configs:
namespace: 'default'
app:
configs:
namespace: 'default'
backend_port: 'backend'
frontend_port: 'frontend'
ngc_api_key: "{{ lookup('env', 'NGC_API_KEY') }}"
openai_api_key: "{{ lookup('env', 'OPENAI_API_KEY') }}"
db_username: 'neo4j'
db_password: "{{ lookup('env', 'VIA_DB_PASSWORD') | default('password') }}"
vss_chart:
repo:
name: 'nvidia-blueprint'
url: 'https://helm.ngc.nvidia.com/nvidia/blueprint'
chart: 'nvidia-blueprint-vss' # repo should be removed/commented-out when using local charts
version: '2.3.0'
#override_values_file_absolute_path: '/home/user/vss-values.yml'
Note
The above is just a reference. In case you face issues, please make a copy of config-template.yml
and update it as required.
Note
Make sure to update config sections the (spec/infra/configs/clusters/app/master ; and /nodes for multi-node deployments):
region
(spec/infra/configs/), the node type
, capacity_reservation_id
,
is_capacity_block
is true if using Capacity Blocks for ML type capacity reservation, else false/undefined,
az
(az
is optional; Needed when using capacity_reservation and it is bound to a particular availability zone)
access_cidrs
: run echo `curl ifconfig.me`/32
to get user machine’s IP range.
override_values_file_absolute_path
(optional) in the config file: To use the overrides.yaml file as discussed in sections like Configuration Options.
To use a helm overrides values file to customize the various parts of the VSS blueprint deployment:
Uncomment and update
override_values_file_absolute_path
shown above inconfig.yml
to set the actual path to the overrides file.Uncomment line
- "{{ configs.vss_chart.override_values_file_absolute_path }}"
indist/app-tasks.yml
near the end of the file.More information on the VSS helm overrides file can be found in Section Configuration Options.
Attributes of the config-template.yml
Attribute |
Optional |
Description |
---|---|---|
name |
A unique name to identify the infrastructure resources being created by. |
|
spec > infra > backend > access_key |
AWS access key ID used to access the backend bucket and table. |
|
spec > infra > backend > secret_key |
AWS secret access key used to access the backend bucket and table. |
|
spec > infra > backend > dynamodb_table |
Name of the AWS Dynamo DB table used to manage concurrent access to the state. |
|
spec > infra > backend > bucket |
Name of the AWS S3 bucket in which state of the resources provisioned is stored. |
|
spec > infra > backend > region |
AWS region where state S3 bucket and Dynamo DB table are created. |
|
spec > infra > backend > encrypt |
Whether to encrypt the state while stored in S3 bucket. |
|
spec > infra > provider > access_key |
AWS access key ID used to provision resources. |
|
spec > infra > provider > secret_key |
AWS secret access key used to provision resources. |
|
spec > infra > configs > cns |
yes |
CNS configurations. |
spec > infra > configs > cns > version |
yes |
The version of CNS to install on the clusters. Defaults to 11.0. |
spec > infra > configs > cns > override_values |
yes |
CNS values to override while setting up a cluster. |
spec > infra > configs > cns > override_values > cns_value |
yes |
The value of the cns_value found in cns_values.yaml. |
spec > infra > configs > access_cidrs |
List of CIDRs from which app will be accessible.
1) |
|
spec > infra > configs > region |
AWS region in which to bring up the resources. |
|
spec > infra > configs > ssh_private_key_path |
Absolute path of the private key to be used to SSH the hosts. |
|
spec > infra > configs > ssh_public_key |
Content of the public counterpart of the private key used to SSH the hosts. |
|
spec > infra > configs > additional_ssh_public_keys |
yes |
List of contents of public counterparts to the additional keys that will be used to SSH the hosts. |
spec > infra > configs > bastion |
yes |
Details of the AWS instance to be used as a bastion host in case of private clusters. |
spec > infra > configs > bastion > type |
yes |
AWS instance type for the bastion node (if required). Defaults to t3.small. |
spec > infra > configs > bastion > az |
yes |
AWS availability zone in the region for the bastion node (if required). Defaults to the first (alphabetically) AZ of the region. |
spec > infra > configs > bastion > disk_size_gb |
yes |
Root volume disk size for the bastion node. Defaults to 128. |
spec > infra > configs > clusters |
Definitions of clusters to be created. |
|
spec > infra > configs > clusters > cluster |
Unique key to identify a cluster. There can be 1 or more clusters. |
|
spec > infra > configs > clusters > cluster > private_instance |
yes |
If true, creates the cluster instances within a private subnet. Defaults to false |
spec > infra > configs > clusters > cluster > master |
Definitions of the master node of the cluster. |
|
spec > infra > configs > clusters > cluster > master > type |
yes |
AWS instance type for the master node. Defaults to p5.48xlarge. |
spec > infra > configs > clusters > cluster > master > az |
yes |
AWS availability zone in the region for the master node. Defaults to the first (alphabetically) AZ of the region. |
spec > infra > configs > clusters > cluster > master > disk_size_gb |
yes |
Root volume disk size for the master node. Defaults to 1024. |
spec > infra > configs > clusters > cluster > master > labels |
yes |
Labels to apply to the master node. Defaults to {}. |
spec > infra > configs > clusters > cluster > master > taints |
yes |
Taints to apply to the master node. Defaults to []. |
spec > infra > configs > clusters > cluster > master > capacity_reservation_id |
yes |
The capacity reservation ID to use to bring up the master node. |
spec > infra > configs > clusters > cluster > master > is_capacity_block |
yes |
Whether the provided capacity reservation ID for master node is from a Capacity Block. |
spec > infra > configs > clusters > cluster > nodes |
yes |
Definitions of nodes of the cluster. Set to {} if no extra nodes other than master needed. |
spec > infra > configs > clusters > cluster > nodes > node |
Unique key to identify a node. There can be 0 or more nodes. |
|
spec > infra > configs > clusters > cluster > nodes > node > type |
yes |
AWS instance type for the node node. Defaults to p5.48xlarge. |
spec > infra > configs > clusters > cluster > nodes > node > az |
yes |
AWS availability zone in the region for the node node. Defaults to the first (alphabetically) AZ of the region. |
spec > infra > configs > clusters > cluster > nodes > node > disk_size_gb |
yes |
Root volume disk size for the node node. Defaults to 1024. |
spec > infra > configs > clusters > cluster > nodes > node > labels |
yes |
Labels to apply to the node node. Defaults to {}. |
spec > infra > configs > clusters > cluster > nodes > node > taints |
yes |
Taints to apply to the node node. Defaults to []. |
spec > infra > configs > clusters > cluster > nodes > node > capacity_reservation_id |
yes |
The capacity reservation ID to use to bring up the node node. |
spec > infra > configs > clusters > cluster > nodes > node > is_capacity_block |
yes |
Whether the provided capacity reservation ID for node node is from a Capacity Block. |
spec > infra > configs > clusters > cluster > ports |
yes |
Definitions of ports of the cluster. Set to {} if no ports are exposed by the cluster. |
spec > infra > configs > clusters > cluster > ports > port |
Unique key to identify a port. There can be 0 or more ports. |
|
spec > infra > configs > clusters > cluster > ports > port > port |
The port number of the port. |
|
spec > infra > configs > clusters > cluster > ports > port > protocol |
yes |
The protocol of the port. Defaults to http. |
spec > infra > configs > clusters > cluster > ports > port > path |
yes |
The path of the application on the port for the landing URL. Defaults to /. |
spec > infra > configs > clusters > cluster > features |
yes |
Definitions of features of the cluster. Set to {} if no features defined for the cluster. |
spec > infra > configs > clusters > cluster > features > feature |
Key to identify a feature and value represents enabled/disabled by setting it to true/false. There can be 0 or more features. |
|
spec > platform > configs > namespace |
yes |
Namespace to deploy the platform components in. Defaults to default. |
spec > app > configs > namespace |
yes |
Namespace to deploy the app in. Defaults to default. |
spec > app > configs > backend_port |
Identifier of the port in the cluster to expose the api over. |
|
spec > app > configs > frontend_port |
Identifier of the port in the cluster to expose the ui over. |
|
spec > app > configs > ngc_api_key |
NGC API key used to download application charts, models and containers. |
|
spec > app > configs > openai_api_key |
OPENAI API key used by the application. |
|
spec > app > configs > db_username |
The username used to access the DB. |
|
spec > app > configs > db_password |
The password used to access the DB. |
|
spec > app > configs > vss_chart |
Configuration details of the VSS chart. |
|
spec > app > configs > vss_chart > repo |
yes |
Helm repo details of the chart. Can be ignored if using a local chart. |
spec > app > configs > vss_chart > repo > name |
Name provided to refer the added helm repo. |
|
spec > app > configs > vss_chart > repo > url |
Url of the helm repo containing the chart. |
|
spec > app > configs > vss_chart > chart |
The name of the chart in case of a remote repo source. The absolute path of a local chart. |
|
spec > app > configs > vss_chart > version |
The version of the chart. |
4. Run OneClick script to deploy on AWS#
4.1 Make sure the host machine to run OneClick script has rsa keys generated. If not, use the following command to generate
sudo apt-get install -y openssh-client ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
4.2 Deployment
Use config from 2. Prepare env variables. Place it in the dist
directory.
4.2.1 Choose alternate instance type
Default instance type is: 8xH100 HBM3. Users may change this to other configurations. This change can be made in the config.yaml file available in Section: 2. Prepare env variables.
### Default is: 8xH100 HBM3.
### Modify this to change the GPU type.
### Alternatives:
### - p4d.24xlarge (8 x A100)
### - g6e.48xlarge (8 x L40S)
type: 'p5.48xlarge'
More info on the instance types available can be found in the AWS EC2 documentation here.
4.2.2 Deploy
cd dist/
./envbuild.sh install -f config.yml -c all
Note
In case you face an error like could not process config file: ...
while restarting/redeploying,
try removing the temporary directory that is shown in the error logs. Example: rm -rf <dist-directory>/tmp.dZM7is5HUC
Please also confirm SSH key was generated/present on the path mentioned in the config for keys - ssh_public_key, ssh_private_key_path. If there is no need for additional_ssh_public_keys, please comment that out in the config file. Its by default enabled to use $HOME/.ssh/my-colleague-1.pub
Note
With AWS deployments, if you face authentication errors like api error AuthFailure: AWS was not able to validate the provided access credentials
:
Please check your system clock - and make sure its in synch with NTP. If the synch is off, AWS authentication errors like this is observed.
Note
This project downloads and installs additional third-party open source software projects. Review the license terms of these open source projects before use.
5. Access the deployment#
Once successful, the above command will provide the logs similar to the following.
access_urls:
app:
backend: http://<NODE-IP>:30081/
frontend: http://<NODE-IP>:30082/
ssh_command:
app:
master: ssh -i $HOME/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ubuntu@<NODE-IP>
Note
You must wait till the deployment installation is fully complete before trying to access the nodes.
We can also get this info after successful deployment on demand using command:
cd dist/
./envbuild.sh -f config.yml info
Next, we need to wait for all pods and services to be up. Log in to the node using the ssh command
shown above and check pod status using kubectl get pod
.
ssh -i $HOME/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ubuntu@<NODE-IP>
kubectl get pod
Make sure all pods are in Running or Completed STATUS and shows 1/1 as READY as shown below.
Note
The terraform scripts will install the kubectl
utility. Users must not install kubernetes
or kubectl manually.
Additionally, to make sure the VSS API and UI are ready and accessible, please check logs for deployment using command:
kubectl logs vss-vss-deployment-POD-NAME
Please make sure the below logs are present and user does not see any errors:
Application startup complete.
Uvicorn running on http://0.0.0.0:9000
The VSS API and UI are now ready to be accessed at http://<NODE-IP>:30081 and http://<NODE-IP>:30082 respectively. Test the deployment by summarizing a sample video.
6. Teardown#
Un-installing
$ cd dist/
$ ./envbuild.sh uninstall -f config.yml -c all
Common Issues#
vCPU Limit Exceeded#
If the AWS account has a lower vCPU quota than required by the instance type requested, you may see a “vCPU Limit Exceeded” error. Please refer to https://repost.aws/knowledge-center/ec2-on-demand-instance-vcpu-increase for increasing the vCPU quota. Refer to https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html#quota-cli-increase for AWS CLI instructions.
VSS pod is failing and restarting on L40/L40S node#
The VSS container startup might be timing out an L40/L40S node when VILA1.5 is used as the VLM (default). Try increasing the startup timeout by using an overides file with following values:
vss:
applicationSpecs:
vss-deployment:
containers:
vss:
startupProbe:
failureThreshold: 360
Error acquiring state lock#
If the deployment process is interrupted and you attempt to run the deployment script again, you may encounter the following error:
│ Error: Error acquiring the state lock
│
│ Error message: operation error DynamoDB: PutItem, https response error
│ StatusCode: 400, RequestID:
│ 7RBDR4H3BTG8EHNIQE1DSPQ6BFVV4KQNSO5AEMVJF66Q9ASUAAJG,
│ ConditionalCheckFailedException: The conditional request failed
│ Lock Info:
│ ID: <some-id>
│ Path: <some-path>
│ Operation: OperationTypeApply
│ Who: <some-user>
│ Version: <some-version>
│ Created: <some-timestamp>
│ Info:
│
│ OpenTofu acquires a state lock to protect the state from being written
│ by multiple users at the same time. Please resolve the issue above and try
│ again. For most commands, you can disable locking with the "-lock=false"
│ flag, but this is not recommended.
To resolve this issue, modify the destroy_tf_shape function in the envbuild.sh script by adding a force-unlock command. The updated function should look like this:
function destroy_tf_shape() {
echo "Destroying tf shape"
abort_option
switch_to_script_dir
debug_out "Initializing tf"
open_debug_banner
$(tf_binary) -chdir="${tmp_dir}/iac" init -reconfigure -backend-config=tf-backend.json \
>&$output_debug 2>&1
# Add this line with the updated value for <lock-id-as-seen-in-error>
echo "yes" | $(tf_binary) -chdir="${tmp_dir}/iac" force-unlock "<lock-id-as-seen-in-error>" >&$output_info 2>&1
# rest of the function remains the same
Note
The lock-id-as-seen-in-error is the ID of the lock that is causing the error.
After modifying the script, follow the un-installation steps to force-unlock and destroy the deployment. Once the un-installation is complete, revert the changes you made to the script and re-run the deployment script to create a fresh deployment.