Amazon Web Services (AWS)#

This page describes how to use the OneClick script to deploy on AWS.

The cloud deployment uses the same Helm chart and default topology described in Default Deployment Topology and Models in Use.

Note

This release of the OneClick scripts supports single-node deployments to AWS as documented on this page. Future releases will support more configurability as well as deployment to other Cloud Service Providers.

Prerequisites#

Host system Prerequisites#

  • Ubuntu 22.04

  • No GPU required

The following tools must be installed on the host system for the scripts to run:

  • Install jq

sudo apt update
sudo apt-get install -y jq
  • Install yq

sudo wget https://github.com/mikefarah/yq/releases/download/v4.34.1/yq_linux_amd64 -O /usr/bin/yq
sudo chmod +x /usr/bin/yq
  • Install python3, venv and pip

sudo apt update
sudo apt-get install python3.10 python3-venv python3-pip -y
  • Install terraform

curl --silent -L https://raw.githubusercontent.com/versus/terraform-switcher/release/install.sh | sudo bash
mkdir -p "${HOME}/bin"
tfswitch -qb "${HOME}/bin/terraform" 1.5.7
echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
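
Before moving on, it can help to confirm that every tool is actually reachable on the PATH. The following is a minimal sketch and not part of the OneClick package: missing_tools is a hypothetical helper name, and the command names passed to it (pip3 in particular) are assumptions about how the packages above expose their binaries.

```shell
# Print each command from the argument list that is NOT found on PATH.
# Returns non-zero if at least one command is missing.
missing_tools() {
  local rc=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}
```

For example, missing_tools jq yq python3 pip3 terraform prints nothing and returns 0 when all five tools are installed.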

AWS pre-requisites#

  • AWS access credentials

  1. In your AWS account, obtain an access key ID and secret access key for programmatic access to your AWS resources.

  2. Prefer a non-root IAM user with administrator access.

  3. Refer to the AWS documentation to create an access key.

  • S3 Bucket for Backend

  1. This script uses S3 buckets to store the references to the resources that it spins up.

  2. Create an S3 bucket to be used to store the deployment state.

  3. Ensure the bucket is not publicly accessible and can be reached only from your account (for example, using the keys obtained in the previous step).

  4. Refer to the AWS documentation to create an S3 bucket.

  • DynamoDB Table for Backend

  1. This script uses a DynamoDB table to prevent concurrent access to the same deployment while it is being spun up.

  2. Create a DynamoDB table to be used to manage access to the deployment state.

  3. Define the Partition key as LockID and type String.

  4. The Sort key need not be defined.

  5. Refer to the AWS documentation to create a DynamoDB table.
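
The bucket and table described above can also be created from the command line. The sketch below assumes the AWS CLI is installed and configured with the access keys from the first step; the AWS CLI is not listed among the host prerequisites, so treat this as one possible route rather than the documented one. backend_commands is a hypothetical helper that only prints the commands for review, and the PAY_PER_REQUEST billing mode is an assumption, not a requirement of the scripts.

```shell
# Print (do not run) AWS CLI commands that create the backend resources.
# Arguments: <bucket-name> <table-name> <region>
backend_commands() {
  local bucket="$1" table="$2" region="$3" loc=""
  # us-east-1 rejects an explicit LocationConstraint, so add it only elsewhere.
  if [ "$region" != "us-east-1" ]; then
    loc=" --create-bucket-configuration LocationConstraint=$region"
  fi
  printf '%s\n' "aws s3api create-bucket --bucket $bucket --region $region$loc"
  # Block every form of public access to the state bucket.
  printf '%s\n' "aws s3api put-public-access-block --bucket $bucket --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
  # Partition key LockID of type String, no sort key.
  printf '%s\n' "aws dynamodb create-table --table-name $table --region $region --attribute-definitions AttributeName=LockID,AttributeType=S --key-schema AttributeName=LockID,KeyType=HASH --billing-mode PAY_PER_REQUEST"
}
```

Running backend_commands s3-bucket-name dyt-table-name us-west-2 prints three commands that can be reviewed and then pasted into a shell.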

1. Download OneClick deployment package#

1.1 OneClick deployment package: deploy-aws-cns.tar.gz

Obtain the NGC API Key and run the following commands:

API_KEY=<your-NGC-api-key>
TOKEN=$(curl -s -u "\$oauthtoken":"$API_KEY" -H 'Accept:application/json' 'https://authn.nvidia.com/token?service=ngc&scope=group/ngc:nvidia&group/ngc:nvidia/blueprint' | jq -r '.token')
curl -LO 'https://api.ngc.nvidia.com/v2/org/nvidia/team/blueprint/resources/vss-deployment-scripts/versions/2.1.0-aws-cns/files/deploy-aws-cns.tar.gz' -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json"

Alternatively, download the tar file in a browser: navigate to https://catalog.ngc.nvidia.com/orgs/nvidia/teams/blueprint/resources/vss-deployment-scripts, open the Download dropdown at the top right of the page, and select Browser (Direct Download).

1.2 Untar the package

tar -xvf deploy-aws-cns.tar.gz

2. Prepare env variables#

2.1 Prepare a file, via-env-template.txt, to hold required env variables and their values:

via-env-template.txt

#Env: AWS secrets
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=

#Env: Nvidia Secrets:
export NGC_API_KEY=

#Env: OpenAI Secrets [Optional]:
export OPENAI_API_KEY=

#Env: App secrets
export VIA_DB_PASSWORD=password

#Non secrets:

#AWS Resources created above in Section: AWS pre-requisites
export VIA_DEPLOY_AWS_DYT='dyt-table-name'
export VIA_DEPLOY_AWS_S3B='s3-bucket-name'
export VIA_DEPLOY_AWS_S3BR='us-west-2'

#Unique name for the VSS deployment
export VIA_DEPLOY_ENV='vss-deployment'

Note

  • NGC_API_KEY: refer to Obtain NGC API Key.

  • OPENAI_API_KEY: refer to Obtain OpenAI API Key.

  • Consider changing VIA_DEPLOY_ENV from the default to a name that identifies your deployment, for example VIA_DEPLOY_ENV='vss-demo'.

2.2 Load these env variables into your current shell session:

source via-env-template.txt
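
After sourcing the file, a quick check can catch values that were left empty in the template. This is a minimal sketch, not part of the OneClick package; unset_vars is a hypothetical helper name, and the choice of which variables to treat as required follows the template above (OPENAI_API_KEY is marked optional there, so it is left out of the example call).

```shell
# Print the name of every listed variable that is unset or empty.
# Returns non-zero if at least one is missing.
unset_vars() {
  local rc=0 name val
  for name in "$@"; do
    # Indirect lookup of the variable called $name (portable across sh and bash).
    eval "val=\${$name:-}"
    if [ -z "$val" ]; then
      echo "$name"
      rc=1
    fi
  done
  return "$rc"
}
```

For example: unset_vars AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY NGC_API_KEY VIA_DB_PASSWORD VIA_DEPLOY_AWS_DYT VIA_DEPLOY_AWS_S3B VIA_DEPLOY_AWS_S3BR VIA_DEPLOY_ENV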

3. Prepare config file#

Make a copy of config-template.yml under a name of your choice, for example config.yml. Alternatively, populate the config file based on the definition of each attribute.

Example config.yml

schema_version: '0.0.9'
name: "via-aws-cns-{{ lookup('env', 'VIA_DEPLOY_ENV') }}"
spec:
  infra:
    csp: 'aws'
    backend:
      access_key: "{{ lookup('env', 'AWS_ACCESS_KEY_ID') }}"
      secret_key: "{{ lookup('env', 'AWS_SECRET_ACCESS_KEY') }}"
      dynamodb_table: "{{ lookup('env', 'VIA_DEPLOY_AWS_DYT') }}"
      bucket: "{{ lookup('env', 'VIA_DEPLOY_AWS_S3B') }}"
      region: "{{ lookup('env', 'VIA_DEPLOY_AWS_S3BR') }}"
      encrypt: true
    provider:
      access_key: "{{ lookup('env', 'AWS_ACCESS_KEY_ID') }}"
      secret_key: "{{ lookup('env', 'AWS_SECRET_ACCESS_KEY') }}"
    configs:
      cns:
        version: 12.2
        git_ref: 4d97cb7e8ca6e45fe9252888b7a918b2677f1fc9
        override_values:
          cns_nvidia_driver: yes
          gpu_driver_version: '535.216.03'
      access_cidrs:
      - 'xxx.xxx.xxx.xxx/32' ### Make sure to update this
      region: 'us-west-2'  ### Update this to change the deployment region
      ssh_public_key: "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/id_rsa.pub') }}"
      ssh_private_key_path: "{{ lookup('env', 'HOME') + '/.ssh/id_rsa' }}"
      additional_ssh_public_keys:
      - "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/id_rsa.pub') }}"
      clusters:
        app:
          private_instance: false
          master:
            ### 8xH100 HBM3. Modify this to change the GPU type. Alternatives:
            ### - p4d.24xlarge (8 x A100)
            ### - g6e.48xlarge (8 x L40S)
            type: 'p5.48xlarge'
            # az: 'us-west-2c'  ### Update this to change the availability zone for the deployment.
            labels: {}
            taints: []
            # capacity_reservation_id: 'cr-3b7e4c9f1a6d8e2b'
#         nodes:
#           <key>:
#             type: 'p5.48xlarge'
#             az: 'us-west-2c'
#             labels: {}
#             taints: []
#             capacity_reservation_id: 'cr-foobar'
          ports:
            backend:
              port: 30081
            frontend:
              port: 30082
          features:
            cns: true
            platform: true
            app: true
  platform:
    configs:
      namespace: 'default'
  app:
    configs:
      namespace: 'default'
      backend_port: 'backend'
      frontend_port: 'frontend'
      ngc_api_key: "{{ lookup('env', 'NGC_API_KEY') }}"
      openai_api_key: "{{ lookup('env', 'OPENAI_API_KEY') }}"
      db_username: 'neo4j'
      db_password: "{{ lookup('env', 'VIA_DB_PASSWORD') | default('password') }}"
      vss_chart:
        repo:
          name: 'nvidia-blueprint'
          url: 'https://helm.ngc.nvidia.com/nvidia/blueprint'
        chart: 'nvidia-blueprint-vss' # repo should be removed/commented-out when using local charts
        version: '2.1.0'
        #override_values_file_absolute_path: '/home/nvidia/aws/dist/override.yaml'

Note

The config above is only a reference. If you run into issues, make a copy of config-template.yml and update it as required.

Note

Make sure to update the region, the node type, capacity_reservation_id (optional), access_cidrs, and override_values_file_absolute_path in the config file. For access_cidrs, run echo `curl ifconfig.me`/32 to get the /32 CIDR for your machine's public IP address.
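
Because a malformed access_cidrs entry is easy to miss, the IP-to-CIDR step can be wrapped in a small validation. The sketch below is illustrative only; ip_to_cidr32 is a hypothetical helper name and it handles IPv4 addresses only.

```shell
# Validate a dotted-quad IPv4 address and print it as a /32 CIDR.
# Returns non-zero (and prints nothing) if the input is not a valid IPv4 address.
ip_to_cidr32() {
  local ip="${1:-}" octet
  # Shape check: four dot-separated groups of one to three digits.
  printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || return 1
  # Range check: split on dots and make sure each octet is at most 255.
  local IFS=.
  for octet in $ip; do
    [ "$octet" -le 255 ] || return 1
  done
  printf '%s/32\n' "$ip"
}
```

A possible use is ip_to_cidr32 "$(curl -s ifconfig.me)", which prints the value to place in access_cidrs or fails if the lookup returned something other than an IPv4 address.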

To use an override values file to customize parts of the VSS blueprint deployment:

  • Uncomment and update override_values_file_absolute_path shown above in config.yml to set the actual path to the overrides file.

  • Uncomment line - "{{ configs.vss_chart.override_values_file_absolute_path }}" in dist/app-tasks.yml near the end of the file.

Attributes of the config-template.yml

| Attribute | Optional | Description |
| --- | --- | --- |
| name |  | A unique name to identify the infrastructure resources being created. |
| spec > infra > backend > access_key |  | AWS access key ID used to access the backend bucket and table. |
| spec > infra > backend > secret_key |  | AWS secret access key used to access the backend bucket and table. |
| spec > infra > backend > dynamodb_table |  | Name of the AWS DynamoDB table used to manage concurrent access to the state. |
| spec > infra > backend > bucket |  | Name of the AWS S3 bucket in which the state of the provisioned resources is stored. |
| spec > infra > backend > region |  | AWS region where the state S3 bucket and DynamoDB table are created. |
| spec > infra > backend > encrypt |  | Whether to encrypt the state while it is stored in the S3 bucket. |
| spec > infra > provider > access_key |  | AWS access key ID used to provision resources. |
| spec > infra > provider > secret_key |  | AWS secret access key used to provision resources. |
| spec > infra > configs > cns | yes | CNS configurations. |
| spec > infra > configs > cns > version | yes | The version of CNS to install on the clusters. Defaults to 11.0. |
| spec > infra > configs > cns > override_values | yes | CNS values to override while setting up a cluster. |
| spec > infra > configs > cns > override_values > cns_value | yes | The value of the cns_value found in cns_values.yaml. |
| spec > infra > configs > access_cidrs |  | List of CIDRs from which the app will be accessible. (1) The output of echo `curl ifconfig.me`/32 on the user's controller machine is automatically added to access_cidrs; add the equivalent entry for any other machine that needs access to the deployment. If that outbound IP changes, the entry stops working. (2) The recommended approach is to ask your organization's IT admins for the outbound IPs of users other than the one running the deployment. Note: the result may or may not be a /32 CIDR. |
| spec > infra > configs > region |  | AWS region in which to bring up the resources. |
| spec > infra > configs > ssh_private_key_path |  | Absolute path of the private key used to SSH into the hosts. |
| spec > infra > configs > ssh_public_key |  | Content of the public counterpart of the private key used to SSH into the hosts. |
| spec > infra > configs > additional_ssh_public_keys | yes | List of contents of the public counterparts of additional keys that will be used to SSH into the hosts. |
| spec > infra > configs > bastion | yes | Details of the AWS instance to be used as a bastion host in case of private clusters. |
| spec > infra > configs > bastion > type | yes | AWS instance type for the bastion node (if required). Defaults to t3.small. |
| spec > infra > configs > bastion > az | yes | AWS availability zone in the region for the bastion node (if required). Defaults to the first (alphabetically) AZ of the region. |
| spec > infra > configs > bastion > disk_size_gb | yes | Root volume disk size for the bastion node. Defaults to 128. |
| spec > infra > configs > clusters |  | Definitions of clusters to be created. |
| spec > infra > configs > clusters > cluster |  | Unique key to identify a cluster. There can be 1 or more clusters. |
| spec > infra > configs > clusters > cluster > private_instance | yes | If true, creates the cluster instances within a private subnet. Defaults to false. |
| spec > infra > configs > clusters > cluster > master |  | Definitions of the master node of the cluster. |
| spec > infra > configs > clusters > cluster > master > type | yes | AWS instance type for the master node. Defaults to p5.48xlarge. |
| spec > infra > configs > clusters > cluster > master > az | yes | AWS availability zone in the region for the master node. Defaults to the first (alphabetically) AZ of the region. |
| spec > infra > configs > clusters > cluster > master > disk_size_gb | yes | Root volume disk size for the master node. Defaults to 1024. |
| spec > infra > configs > clusters > cluster > master > labels | yes | Labels to apply to the master node. Defaults to {}. |
| spec > infra > configs > clusters > cluster > master > taints | yes | Taints to apply to the master node. Defaults to []. |
| spec > infra > configs > clusters > cluster > master > capacity_reservation_id | yes | The capacity reservation ID to use to bring up the master node. |
| spec > infra > configs > clusters > cluster > nodes | yes | Definitions of nodes of the cluster. Set to {} if no nodes other than the master are needed. |
| spec > infra > configs > clusters > cluster > nodes > node |  | Unique key to identify a node. There can be 0 or more nodes. |
| spec > infra > configs > clusters > cluster > nodes > node > type | yes | AWS instance type for the node. Defaults to p5.48xlarge. |
| spec > infra > configs > clusters > cluster > nodes > node > az | yes | AWS availability zone in the region for the node. Defaults to the first (alphabetically) AZ of the region. |
| spec > infra > configs > clusters > cluster > nodes > node > disk_size_gb | yes | Root volume disk size for the node. Defaults to 1024. |
| spec > infra > configs > clusters > cluster > nodes > node > labels | yes | Labels to apply to the node. Defaults to {}. |
| spec > infra > configs > clusters > cluster > nodes > node > taints | yes | Taints to apply to the node. Defaults to []. |
| spec > infra > configs > clusters > cluster > nodes > node > capacity_reservation_id | yes | The capacity reservation ID to use to bring up the node. |
| spec > infra > configs > clusters > cluster > ports | yes | Definitions of ports of the cluster. Set to {} if no ports are exposed by the cluster. |
| spec > infra > configs > clusters > cluster > ports > port |  | Unique key to identify a port. There can be 0 or more ports. |
| spec > infra > configs > clusters > cluster > ports > port > port |  | The port number of the port. |
| spec > infra > configs > clusters > cluster > ports > port > protocol | yes | The protocol of the port. Defaults to http. |
| spec > infra > configs > clusters > cluster > ports > port > path | yes | The path of the application on the port for the landing URL. Defaults to /. |
| spec > infra > configs > clusters > cluster > features | yes | Definitions of features of the cluster. Set to {} if no features are defined for the cluster. |
| spec > infra > configs > clusters > cluster > features > feature |  | Key to identify a feature; the value enables or disables it (true/false). There can be 0 or more features. |
| spec > platform > configs > namespace | yes | Namespace to deploy the platform components in. Defaults to default. |
| spec > app > configs > namespace | yes | Namespace to deploy the app in. Defaults to default. |
| spec > app > configs > backend_port |  | Identifier of the port in the cluster over which to expose the API. |
| spec > app > configs > frontend_port |  | Identifier of the port in the cluster over which to expose the UI. |
| spec > app > configs > ngc_api_key |  | NGC API key used to download application charts, models, and containers. |
| spec > app > configs > openai_api_key |  | OpenAI API key used by the application. |
| spec > app > configs > db_username |  | The username used to access the DB. |
| spec > app > configs > db_password |  | The password used to access the DB. |
| spec > app > configs > vss_chart |  | Configuration details of the VSS chart. |
| spec > app > configs > vss_chart > repo | yes | Helm repo details of the chart. Can be ignored if using a local chart. |
| spec > app > configs > vss_chart > repo > name |  | Name used to refer to the added Helm repo. |
| spec > app > configs > vss_chart > repo > url |  | URL of the Helm repo containing the chart. |
| spec > app > configs > vss_chart > chart |  | The name of the chart when using a remote repo, or the absolute path of a local chart. |
| spec > app > configs > vss_chart > version |  | The version of the chart. |

4. Run OneClick script to deploy on AWS#

4.1 Make sure the host machine that runs the OneClick script has an RSA key pair. If it does not, generate one with the following commands:

sudo apt-get install -y openssh-client
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
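
If a key may already exist, a guard avoids the overwrite prompt that ssh-keygen shows for an existing file. This is a minimal sketch under that assumption; ensure_ssh_key is a hypothetical helper name, and the default path mirrors the one referenced in config.yml.

```shell
# Create an RSA key pair only if none exists at the given path.
# Argument: private key path (defaults to ~/.ssh/id_rsa).
ensure_ssh_key() {
  local key="${1:-$HOME/.ssh/id_rsa}"
  if [ -f "$key" ] && [ -f "$key.pub" ]; then
    echo "exists"
    return 0
  fi
  # A lone private or public key is ambiguous; leave it for manual cleanup.
  if [ -e "$key" ] || [ -e "$key.pub" ]; then
    echo "incomplete key pair at $key, fix it manually" >&2
    return 1
  fi
  mkdir -p "$(dirname "$key")"
  ssh-keygen -q -t rsa -N "" -f "$key" && echo "created"
}
```

Calling ensure_ssh_key with no arguments checks ~/.ssh/id_rsa and generates the pair only when both files are absent.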

4.2 Installing

Use the config file from 3. Prepare config file and place it in the dist directory.

cd dist/
./envbuild.sh install -f config.yml -c all

Note

In case you face an error like could not process config file: ... while restarting/redeploying, try removing the temporary directory that is shown in the error logs. Example: rm -rf <dist-directory>/tmp.dZM7is5HUC

Note

This project downloads and installs additional third-party open source software projects. Review the license terms of these open source projects before use.

5. Access the deployment#

On success, the install command prints output similar to the following.

access_urls:
  app:
    backend: http://<NODE-IP>:30081/
    frontend: http://<NODE-IP>:30082/
ssh_command:
  app:
    master: ssh -i $HOME/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ubuntu@<NODE-IP>

Note

You must wait till the deployment installation is fully complete before trying to access the nodes.

You can also retrieve this information on demand after a successful deployment:

cd dist/
./envbuild.sh -f config.yml info

Next, we need to wait for all pods and services to be up. Log in to the node using the ssh command shown above and check pod status using kubectl get pod.

ssh -i $HOME/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ubuntu@<NODE-IP>
kubectl get pod

Make sure all pods have a STATUS of Running or Completed and show 1/1 under READY.
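
The pod status check can be scripted by filtering the kubectl get pod output. The sketch below is one interpretation of the rule above: it accepts Completed pods regardless of their READY count (an assumption, since finished pods normally report 0/1) and requires Running pods to have all containers ready. not_ready_pods is a hypothetical helper name.

```shell
# Read `kubectl get pod` output on stdin and print the names of pods
# that are neither Completed nor Running with all containers ready.
not_ready_pods() {
  awk 'NR > 1 {
    # $2 is the READY column, for example "1/1".
    split($2, ready, "/")
    ok = ($3 == "Completed") || ($3 == "Running" && ready[1] == ready[2])
    if (!ok) print $1
  }'
}
```

On the node, kubectl get pod | not_ready_pods prints nothing once every pod has settled.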

Note

The Terraform scripts install the kubectl utility. Do not install Kubernetes or kubectl manually.

To confirm that the VSS API and UI are ready and accessible, check the deployment logs with the following command:

kubectl logs vss-vss-deployment-POD-NAME

Make sure the following log lines are present and that the logs show no errors:

Application startup complete.
Uvicorn running on http://0.0.0.0:9000

The VSS API and UI are now ready to be accessed at http://<NODE-IP>:30081 and http://<NODE-IP>:30082 respectively. Test the deployment by summarizing a sample video.
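
The log check described above can be automated by searching the pod logs for both startup lines. This is a minimal sketch; vss_ready is a hypothetical helper name, and it matches only the two strings quoted above without looking for error messages.

```shell
# Read pod logs on stdin; succeed only if both startup markers are present.
vss_ready() {
  local logs
  logs=$(cat)
  printf '%s\n' "$logs" | grep -qF 'Application startup complete.' &&
    printf '%s\n' "$logs" | grep -qF 'Uvicorn running on'
}
```

For example, kubectl logs vss-vss-deployment-POD-NAME | vss_ready && echo "VSS is ready" prints the message only after both lines have appeared.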

6. Teardown#

Uninstalling

cd dist/
./envbuild.sh uninstall -f config.yml -c all

Common Issues#

vCPU Limit Exceeded#

If the AWS account has a lower vCPU quota than required by the instance type requested, you may see a “vCPU Limit Exceeded” error. Please refer to https://repost.aws/knowledge-center/ec2-on-demand-instance-vcpu-increase for increasing the vCPU quota. Refer to https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html#quota-cli-increase for AWS CLI instructions.

VSS pod is failing and restarting on L40/L40S node#

The VSS container startup might time out on an L40/L40S node when VILA1.5 is used as the VLM (default). Try increasing the startup timeout by using an overrides file with the following values:

vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          startupProbe:
            failureThreshold: 360