Amazon Web Services (AWS)#
Steps to use the OneClick script to deploy VSS on AWS.
The cloud deployment uses the same Helm Chart and the default topology as detailed in Default Deployment Topology and Models in Use.
Note
This release of the OneClick scripts supports single-node deployments to AWS, as documented on this page.
Prerequisites#
Host System Prerequisites#
Ubuntu 22.04
No GPU required
The following tools must be installed on the OS of the host system for the scripts to execute:
Install jq:

sudo apt update
sudo apt-get install -y jq

Install yq:

sudo wget https://github.com/mikefarah/yq/releases/download/v4.34.1/yq_linux_amd64 -O /usr/bin/yq
sudo chmod +x /usr/bin/yq

Install python3, venv, and pip:

sudo apt update
sudo apt-get install python3.10 python3-venv python3-pip -y

Install OpenTofu:

sudo snap install --classic opentofu
Note
The minimum required version of OpenTofu is 1.9.0.
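Optionally, verify that the tools are installed and meet the required versions before proceeding (a quick check; the OpenTofu snap provides the tofu binary):

jq --version
yq --version
python3 --version
tofu --version   # must report 1.9.0 or newer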
AWS Prerequisites#
AWS access credentials
On your AWS account, procure an access key ID and secret access key for programmatic access to your AWS resources.
Prefer using a non-root IAM user with administrator access rather than root credentials.
Refer to the AWS documentation to create an access key.
S3 Bucket for Backend
This script uses an S3 bucket to store references to the resources that it spins up.
Create an S3 bucket to be used to store the deployment state.
Ensure the bucket is not publicly accessible and is accessible only to your account (for example, using the keys procured in the previous step).
Refer to the AWS documentation to create an S3 bucket.
DynamoDB Table for Backend
This script uses a DynamoDB table to prevent concurrent access to the same deployment while it is being spun up.
Create a DynamoDB table to be used to manage access to the deployment state.
Define the partition key as LockID with type String.
The sort key does not need to be defined.
Refer to the AWS documentation to create a DynamoDB table. A CLI sketch for creating both backend resources follows this list.
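If you prefer the command line, the following is a minimal sketch for creating both backend resources with the AWS CLI. It assumes the AWS CLI is installed and configured with the access key from the previous step; the bucket and table names are placeholders that must match the values exported later in VIA_DEPLOY_AWS_S3B and VIA_DEPLOY_AWS_DYT.

# Create the state bucket (LocationConstraint is required outside us-east-1)
aws s3api create-bucket \
  --bucket s3-bucket-name \
  --region us-west-2 \
  --create-bucket-configuration LocationConstraint=us-west-2

# Block all public access to the bucket
aws s3api put-public-access-block \
  --bucket s3-bucket-name \
  --public-access-block-configuration \
  "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

# Create the lock table with LockID (String) as the partition key; no sort key
aws dynamodb create-table \
  --table-name dyt-table-name \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-west-2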
Install OneClick Deployment Package#
Clone the NVIDIA AI Blueprint: Video Search and Summarization repository.
git clone https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization.git
Untar the package:
cd video-search-and-summarization/deploy/one_click
tar -xvf deploy-aws-cns.tar.gz
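Optionally, confirm that the package extracted as expected. This is a quick check assuming the archive unpacks into a dist/ directory, which the deployment steps below reference:

ls dist/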
Prepare env Variables#
Prepare a file, via-env-template.txt, to hold the required env variables and their values:

#Env: AWS secrets
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=

#Env: Nvidia Secrets:
export NGC_API_KEY=

#Env: OpenAI Secrets [Optional]:
export OPENAI_API_KEY=

#Env: App secrets
export VIA_DB_PASSWORD=password
export ARANGO_DB_PASSWORD=password
export MINIO_SECRET_KEY=minio123

#Non secrets:
#AWS Resources created above in Section: AWS Prerequisites
export VIA_DEPLOY_AWS_DYT='dyt-table-name'
export VIA_DEPLOY_AWS_S3B='s3-bucket-name'
export VIA_DEPLOY_AWS_S3BR='us-west-2'

#Unique name for the VSS deployment
export VIA_DEPLOY_ENV='vss-deployment'
Note
NGC_API_KEY: refer to Obtain NGC API Key.
OPENAI_API_KEY: API key from https://platform.openai.com/api-keys.
Consider updating VIA_DEPLOY_ENV to something other than the default to identify the deployment, for example VIA_DEPLOY_ENV='vss-demo'.
Make sure to update VIA_DEPLOY_AWS_DYT, VIA_DEPLOY_AWS_S3B, and VIA_DEPLOY_AWS_S3BR to reflect the AWS resources created in AWS Prerequisites.
Load these env variables into your current shell session:

source via-env-template.txt
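Optionally, confirm the required variables are now set in the shell. This is a minimal sketch; the variable list mirrors via-env-template.txt above:

for v in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY NGC_API_KEY \
         VIA_DEPLOY_AWS_DYT VIA_DEPLOY_AWS_S3B VIA_DEPLOY_AWS_S3BR VIA_DEPLOY_ENV; do
  # ${!v} is bash indirect expansion: the value of the variable named by $v
  [ -n "${!v}" ] && echo "$v is set" || echo "WARNING: $v is empty"
done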
Prepare the config File#
Make a copy of config-template.yml with a name of your choice, for example, config.yml.
Populate the config file based on the definition of each attribute (see the reference below and the attribute table that follows).
schema_version: '0.0.10'
name: "via-aws-cns-{{ lookup('env', 'VIA_DEPLOY_ENV') }}"
spec:
infra:
csp: 'aws'
backend:
access_key: "{{ lookup('env', 'AWS_ACCESS_KEY_ID') }}"
secret_key: "{{ lookup('env', 'AWS_SECRET_ACCESS_KEY') }}"
dynamodb_table: "{{ lookup('env', 'VIA_DEPLOY_AWS_DYT') }}"
bucket: "{{ lookup('env', 'VIA_DEPLOY_AWS_S3B') }}"
region: "{{ lookup('env', 'VIA_DEPLOY_AWS_S3BR') }}"
encrypt: true
provider:
access_key: "{{ lookup('env', 'AWS_ACCESS_KEY_ID') }}"
secret_key: "{{ lookup('env', 'AWS_SECRET_ACCESS_KEY') }}"
configs:
cns:
version: 12.2
git_ref: ebf03a899f13a0adaaa3d9f31a299cff6ab33eb3
override_values:
cns_nvidia_driver: yes
gpu_driver_version: '580.65.06'
access_cidrs:
- 'my-org-ip-cidr' ### Make sure to update this e.g. 'xxx.xxx.xxx.xxx/32' for a single IP address.
region: 'us-west-2' ### Update this to change the deployment region
ssh_public_key: "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/id_rsa.pub') }}"
ssh_private_key_path: "{{ lookup('env', 'HOME') + '/.ssh/id_rsa' }}"
additional_ssh_public_keys:
### Add public key of another system e.g. your colleague's if you are working in a team else add your own.
- "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/id_rsa.pub') }}"
clusters:
app:
private_instance: false
master:
### 8xH100 HBM3. Modify this to change the GPU type. Alternatives:
### - p4d.24xlarge (8 x A100)
### - g6e.48xlarge (8 x L40S)
type: 'p5.48xlarge'
# az: 'us-west-2c' ### Update this to change the availability zone for the deployment.
labels: {}
taints: []
# capacity_reservation_id: 'cr-foobar' ### Add this if you are using a capacity reservation.
# is_capacity_block: false ### Set this to true if using capacity_reservation_id and reservation is of type "Capacity Block for ML"
# nodes:
# <key>:
# type: 'p5.48xlarge'
# az: 'us-west-2c'
# labels: {}
# taints: []
# capacity_reservation_id: 'cr-foobar'
# is_capacity_block: false
ports:
backend:
port: 30081
frontend:
port: 30082
features:
cns: true
platform: true
app: true
platform:
configs:
namespace: 'default'
app:
configs:
namespace: 'default'
backend_port: 'backend'
frontend_port: 'frontend'
ngc_api_key: "{{ lookup('env', 'NGC_API_KEY') }}"
openai_api_key: "{{ lookup('env', 'OPENAI_API_KEY') }}"
db_username: 'neo4j'
db_password: "{{ lookup('env', 'VIA_DB_PASSWORD') | default('password') }}"
arango_db_username: 'root'
arango_db_password: "{{ lookup('env', 'ARANGO_DB_PASSWORD') | default('password') }}"
minio_access_key: 'minio'
minio_secret_key: "{{ lookup('env', 'MINIO_SECRET_KEY') | default('minio123') }}"
vss_chart:
repo:
name: 'nvidia-blueprint'
url: 'https://helm.ngc.nvidia.com/nvidia/blueprint'
chart: 'nvidia-blueprint-vss' # repo should be removed/commented-out when using local charts
version: '2.4.0'
#override_values_file_absolute_path: '/home/user/vss-values.yml'
Note
The above is just a reference. Make a copy of config-template.yml and update it as required.
Make sure to update the following config sections (spec/infra/configs/clusters/app/master, and /nodes for multi-node deployments):
region (under spec/infra/configs), the node type, and capacity_reservation_id.
is_capacity_block: true if using a "Capacity Block for ML" type capacity reservation; otherwise false or undefined.
az (optional): needed when using a capacity_reservation that is bound to a particular availability zone.
access_cidrs: run echo `curl ifconfig.me`/32 to get the machine's IP range.
override_values_file_absolute_path (optional): to use the overrides.yaml file as discussed in sections like Configuration Options.
To use a Helm overrides values file to customize the various parts of the VSS blueprint deployment:
Uncomment and update override_values_file_absolute_path shown above in config.yml to set the actual path to the overrides file.
Uncomment the line - "{{ configs.vss_chart.override_values_file_absolute_path }}" in dist/app-tasks.yml near the end of the file.
For more information on the VSS Helm overrides file, see Configuration Options.
A quick sanity check of the populated config is sketched after this note.
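As a quick sanity check before deploying, you can read back a few values from the populated config with yq (installed in the prerequisites). This is a sketch; the paths follow the structure of the reference config shown above:

yq '.name' config.yml
yq '.spec.infra.configs.region' config.yml
yq '.spec.infra.configs.access_cidrs' config.yml   # should no longer contain the placeholder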
Attributes of the config-template.yml:

| Attribute | Optional | Description |
|---|---|---|
| name | | A unique name to identify the infrastructure resources being created by the script. |
| spec > infra > backend > access_key | | AWS access key ID used to access the backend bucket and table. |
| spec > infra > backend > secret_key | | AWS secret access key used to access the backend bucket and table. |
| spec > infra > backend > dynamodb_table | | Name of the AWS DynamoDB table used to manage concurrent access to the state. |
| spec > infra > backend > bucket | | Name of the AWS S3 bucket in which the state of the provisioned resources is stored. |
| spec > infra > backend > region | | AWS region where the state S3 bucket and DynamoDB table are created. |
| spec > infra > backend > encrypt | | Whether to encrypt the state while stored in the S3 bucket. |
| spec > infra > provider > access_key | | AWS access key ID used to provision resources. |
| spec > infra > provider > secret_key | | AWS secret access key used to provision resources. |
| spec > infra > configs > cns | yes | CNS configurations. |
| spec > infra > configs > cns > version | yes | The version of CNS to install on the clusters. Defaults to 11.0. |
| spec > infra > configs > cns > override_values | yes | CNS values to override while setting up a cluster. |
| spec > infra > configs > cns > override_values > cns_value | yes | The value of the cns_value found in cns_values.yaml. |
| spec > infra > configs > access_cidrs | | List of CIDRs from which the app will be accessible. |
| spec > infra > configs > region | | AWS region in which to bring up the resources. |
| spec > infra > configs > ssh_private_key_path | | Absolute path of the private key to be used to SSH into the hosts. |
| spec > infra > configs > ssh_public_key | | Content of the public counterpart of the private key used to SSH into the hosts. |
| spec > infra > configs > additional_ssh_public_keys | yes | List of contents of public counterparts of additional keys that will be used to SSH into the hosts. |
| spec > infra > configs > bastion | yes | Details of the AWS instance to be used as a bastion host for private clusters. |
| spec > infra > configs > bastion > type | yes | AWS instance type for the bastion node (if required). Defaults to t3.small. |
| spec > infra > configs > bastion > az | yes | AWS availability zone in the region for the bastion node (if required). Defaults to the first (alphabetically) AZ of the region. |
| spec > infra > configs > bastion > disk_size_gb | yes | Root volume disk size for the bastion node. Defaults to 128. |
| spec > infra > configs > clusters | | Definitions of clusters to be created. |
| spec > infra > configs > clusters > cluster | | Unique key to identify a cluster. There can be 1 or more clusters. |
| spec > infra > configs > clusters > cluster > private_instance | yes | If true, creates the cluster instances within a private subnet. Defaults to false. |
| spec > infra > configs > clusters > cluster > master | | Definition of the master node of the cluster. |
| spec > infra > configs > clusters > cluster > master > type | yes | AWS instance type for the master node. Defaults to p5.48xlarge. |
| spec > infra > configs > clusters > cluster > master > az | yes | AWS availability zone in the region for the master node. Defaults to the first (alphabetically) AZ of the region. |
| spec > infra > configs > clusters > cluster > master > disk_size_gb | yes | Root volume disk size for the master node. Defaults to 1024. |
| spec > infra > configs > clusters > cluster > master > labels | yes | Labels to apply to the master node. Defaults to {}. |
| spec > infra > configs > clusters > cluster > master > taints | yes | Taints to apply to the master node. Defaults to []. |
| spec > infra > configs > clusters > cluster > master > capacity_reservation_id | yes | The capacity reservation ID to use to bring up the master node. |
| spec > infra > configs > clusters > cluster > master > is_capacity_block | yes | Whether the provided capacity reservation ID for the master node is from a Capacity Block. |
| spec > infra > configs > clusters > cluster > nodes | yes | Definitions of additional nodes of the cluster. Set to {} if no extra nodes other than the master are needed. |
| spec > infra > configs > clusters > cluster > nodes > node | | Unique key to identify a node. There can be 0 or more nodes. |
| spec > infra > configs > clusters > cluster > nodes > node > type | yes | AWS instance type for the node. Defaults to p5.48xlarge. |
| spec > infra > configs > clusters > cluster > nodes > node > az | yes | AWS availability zone in the region for the node. Defaults to the first (alphabetically) AZ of the region. |
| spec > infra > configs > clusters > cluster > nodes > node > disk_size_gb | yes | Root volume disk size for the node. Defaults to 1024. |
| spec > infra > configs > clusters > cluster > nodes > node > labels | yes | Labels to apply to the node. Defaults to {}. |
| spec > infra > configs > clusters > cluster > nodes > node > taints | yes | Taints to apply to the node. Defaults to []. |
| spec > infra > configs > clusters > cluster > nodes > node > capacity_reservation_id | yes | The capacity reservation ID to use to bring up the node. |
| spec > infra > configs > clusters > cluster > nodes > node > is_capacity_block | yes | Whether the provided capacity reservation ID for the node is from a Capacity Block. |
| spec > infra > configs > clusters > cluster > ports | yes | Definitions of ports of the cluster. Set to {} if no ports are exposed by the cluster. |
| spec > infra > configs > clusters > cluster > ports > port | | Unique key to identify a port. There can be 0 or more ports. |
| spec > infra > configs > clusters > cluster > ports > port > port | | The port number of the port. |
| spec > infra > configs > clusters > cluster > ports > port > protocol | yes | The protocol of the port. Defaults to http. |
| spec > infra > configs > clusters > cluster > ports > port > path | yes | The path of the application on the port for the landing URL. Defaults to /. |
| spec > infra > configs > clusters > cluster > features | yes | Definitions of features of the cluster. Set to {} if no features are defined for the cluster. |
| spec > infra > configs > clusters > cluster > features > feature | | Key identifying a feature; the value enables or disables it (true/false). There can be 0 or more features. |
| spec > platform > configs > namespace | yes | Namespace to deploy the platform components in. Defaults to default. |
| spec > app > configs > namespace | yes | Namespace to deploy the app in. Defaults to default. |
| spec > app > configs > backend_port | | Identifier of the port in the cluster to expose the API over. |
| spec > app > configs > frontend_port | | Identifier of the port in the cluster to expose the UI over. |
| spec > app > configs > ngc_api_key | | NGC API key used to download application charts, models, and containers. |
| spec > app > configs > openai_api_key | | OpenAI API key used by the application. |
| spec > app > configs > db_username | | The username used to access the DB. |
| spec > app > configs > db_password | | The password used to access the DB. |
| spec > app > configs > arango_db_username | | The username used to access the Arango DB. |
| spec > app > configs > arango_db_password | | The password used to access the Arango DB. |
| spec > app > configs > minio_access_key | | The access key used to access MinIO. |
| spec > app > configs > minio_secret_key | | The secret key used to access MinIO. |
| spec > app > configs > vss_chart | | Configuration details of the VSS chart. |
| spec > app > configs > vss_chart > repo | yes | Helm repo details of the chart. Can be ignored if using a local chart. |
| spec > app > configs > vss_chart > repo > name | | Name used to refer to the added Helm repo. |
| spec > app > configs > vss_chart > repo > url | | URL of the Helm repo containing the chart. |
| spec > app > configs > vss_chart > chart | | The name of the chart for a remote repo source, or the absolute path of a local chart. |
| spec > app > configs > vss_chart > version | | The version of the chart. |
Run OneClick Script to Deploy on AWS#
Make sure the host machine where the OneClick script will run has RSA keys generated. If not, use the following command to generate them:
sudo apt-get install -y openssh-client
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
Deployment
Use the config file prepared in Prepare the config File and place it in the dist directory.
Choose an alternate instance type (optional):
The default instance type is p5.48xlarge (8 x H100 HBM3). You can change this to another configuration in the config.yml file described in Prepare the config File.
### Default is: 8xH100 HBM3.
### Modify this to change the GPU type.
### Alternatives:
### - p4d.24xlarge (8 x A100)
### - g6e.48xlarge (8 x L40S)
type: 'p5.48xlarge'
More information on the available instance types can be found in the AWS EC2 documentation.
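Before deploying, you can optionally confirm that the chosen instance type is offered in the target region and see which availability zones carry it. This is a sketch that assumes the AWS CLI is installed and configured:

aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters Name=instance-type,Values=p5.48xlarge \
  --region us-west-2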
Deploy
cd dist/
./envbuild.sh install -f config.yml -c all
Note
For an error like could not process config file: ... while restarting or redeploying, try removing the temporary directory shown in the error logs, for example: rm -rf <dist-directory>/tmp.dZM7is5HUC.
Confirm that the SSH keys were generated and are present at the paths mentioned in the config for ssh_public_key and ssh_private_key_path. If there is no need for additional_ssh_public_keys, comment it out in the config file. By default it is enabled and uses $HOME/.ssh/my-colleague-1.pub.
For AWS authentication errors like api error AuthFailure: AWS was not able to validate the provided access credentials, check your system clock and make sure it is in sync with NTP. If the clock is out of sync, AWS authentication errors like this are observed (see the sketch after this note).
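A minimal sketch for Ubuntu 22.04 to check clock synchronization and enable NTP if it is off:

timedatectl status                 # look for "System clock synchronized: yes"
sudo timedatectl set-ntp true      # enable NTP synchronization if it was disabled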
This project downloads and installs additional third-party open source software projects. Review the license terms of these open source projects before use.
Access the Deployment#
After success, verify that the logs are similar to the following:
access_urls:
  app:
    backend: http://<NODE-IP>:30081/
    frontend: http://<NODE-IP>:30082/
ssh_command:
  app:
    master: ssh -i $HOME/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ubuntu@<NODE-IP>
Note
You must wait until the deployment installation is fully complete before trying to access the nodes.
You can retrieve this information on demand after a successful deployment using:

cd dist/
./envbuild.sh -f config.yml info
Wait for all pods and services to be up. Log in to the node using the ssh command shown above and check the pod status using kubectl get pod.

ssh -i $HOME/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ubuntu@<NODE-IP>
kubectl get pod
Make sure all pods are in Running or Completed STATUS and show 1/1 as READY.
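A minimal sketch for monitoring pod status from the node; the field selector simply filters out pods that are already Running or Completed:

# Re-check every 30 seconds until the list settles; first-time model downloads can take a while.
watch -n 30 kubectl get pod

# List only pods that are not yet Running or Completed (Succeeded).
kubectl get pod --field-selector=status.phase!=Running,status.phase!=Succeeded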
Note
The Terraform scripts install the kubectl utility. You must not install kubernetes or kubectl manually.
To make sure that the VSS API and UI are ready and accessible, check the deployment logs using:
kubectl logs vss-vss-deployment-POD-NAME
Make sure that the following logs are present and that you do not observe any errors:
VIA Server loaded
Backend is running at http://0.0.0.0:8000
Frontend is running at http://0.0.0.0:9000
The VSS API and UI are now ready to be accessed at http://<NODE-IP>:30081 and http://<NODE-IP>:30082, respectively.
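As a quick reachability check from your workstation (a sketch; no specific API endpoints are assumed, and any HTTP status code indicates the service answered):

curl -s -o /dev/null -w "API -> HTTP %{http_code}\n" http://<NODE-IP>:30081/
curl -s -o /dev/null -w "UI  -> HTTP %{http_code}\n" http://<NODE-IP>:30082/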
Test the deployment by summarizing a sample video.
Teardown#
Uninstalling:
$ cd dist/
$ ./envbuild.sh uninstall -f config.yml -c all
Common Issues#
vCPU Limit Exceeded#
If the AWS account has a lower vCPU quota than required by the instance type requested, you might observe a “vCPU Limit Exceeded” error.
Refer to https://repost.aws/knowledge-center/ec2-on-demand-instance-vcpu-increase for increasing the vCPU quota.
Refer to https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html#quota-cli-increase for AWS CLI instructions.
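The following AWS CLI sketch shows how to locate the relevant On-Demand vCPU quota and request an increase; the quota name filter and the desired value (192 vCPUs, matching p5.48xlarge) are examples to adapt to your account and region:

# Find the quota code for On-Demand P instances
aws service-quotas list-service-quotas \
  --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'On-Demand P instances')].[QuotaName,QuotaCode,Value]"

# Request an increase using the quota code returned above
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code <quota-code-from-previous-command> \
  --desired-value 192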
VSS Pod is Failing and Restarting on L40/L40S Node#
The VSS container startup might be timing out on an L40/L40S node when VILA1.5 is used as the VLM (the default). Try increasing the startup timeout by using an overrides file with the following values:
vss:
applicationSpecs:
vss-deployment:
containers:
vss:
startupProbe:
failureThreshold: 360
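To apply these values, save them to a file and point the deployment at it. This sketch assumes the overrides-file mechanism described in Prepare the config File and re-runs the install command from the deploy step:

# 1. Save the values above to a file, for example /home/ubuntu/vss-overrides.yml.
# 2. Set override_values_file_absolute_path in config.yml to that path and uncomment
#    the corresponding line in dist/app-tasks.yml (see "Prepare the config File").
# 3. Re-run the install:
cd dist/
./envbuild.sh install -f config.yml -c all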
Error Acquiring State Lock#
If the deployment process is interrupted and you attempt to run the deployment script again, you might encounter the following error:
│ Error: Error acquiring the state lock
│
│ Error message: operation error DynamoDB: PutItem, https response error
│ StatusCode: 400, RequestID:
│ 7RBDR4H3BTG8EHNIQE1DSPQ6BFVV4KQNSO5AEMVJF66Q9ASUAAJG,
│ ConditionalCheckFailedException: The conditional request failed
│ Lock Info:
│ ID: <some-id>
│ Path: <some-path>
│ Operation: OperationTypeApply
│ Who: <some-user>
│ Version: <some-version>
│ Created: <some-timestamp>
│ Info:
│
│ OpenTofu acquires a state lock to protect the state from being written
│ by multiple users at the same time. Resolve the issue above and try
│ again. For most commands, you can disable locking with the "-lock=false"
│ flag, but this is not recommended.
To resolve this issue, modify the destroy_tf_shape function in the envbuild.sh script by adding a force-unlock command. Verify that the updated function looks like this:
function destroy_tf_shape() {
echo "Destroying tf shape"
abort_option
switch_to_script_dir
debug_out "Initializing tf"
open_debug_banner
$(tf_binary) -chdir="${tmp_dir}/iac" init -reconfigure -backend-config=tf-backend.json \
>&$output_debug 2>&1
# Add this line with the updated value for <lock-id-as-seen-in-error>
echo "yes" | $(tf_binary) -chdir="${tmp_dir}/iac" force-unlock "<lock-id-as-seen-in-error>" >&$output_info 2>&1
# rest of the function remains the same
Note
The <lock-id-as-seen-in-error> is the ID of the lock causing the error, shown under Lock Info > ID in the error output.
After modifying the script, follow the un-installation steps to force-unlock and destroy the deployment. After the un-installation is complete, revert the changes you made to the script and re-run the deployment script to create a fresh deployment.