Bare Metal#

Warning

The instructions on this page are focused on helping developers bring up a Tokkio environment quickly on bare metal for development and testing.

Introduction#

Purpose#

This document serves as a comprehensive resource for deploying the Tokkio Pipeline on a bare-metal machine using the OneClick scripts. It aims to streamline the deployment process by providing detailed instructions, from preparing the necessary configuration templates to invoking the OneClick scripts.

Scope#

While there are several possible ways to set up the Tokkio Pipeline on bare metal, this document covers setting it up on a single bare-metal machine equipped with the necessary GPU hardware.

Prerequisites#

Hardware#

Controller instance#

The Controller instance is where you will launch OneClick scripts from. Here are the necessary steps and requirements:

  • Operating System: Ensure that the instance is running Ubuntu 22.04.

  • SSH key pair: Generate an SSH key pair to be used in a later phase of the OneClick scripts. You can follow these steps:

    • Open a terminal on your Controller instance

    • Run ssh-keygen to generate a new SSH key pair. You can specify the bit size for added security. e.g., ssh-keygen -b 4096 for a 4096-bit key.

    • Save the keys to the default location (~/.ssh).
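The key-generation steps above can be sketched as a single non-interactive command. The file name and comment below are illustrative (the default `~/.ssh/id_rsa` from the interactive flow works equally well):

```shell
# Create ~/.ssh with the permissions sshd expects (no-op if it already exists).
mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"

# Generate a 4096-bit RSA key pair non-interactively.
# -N "" sets an empty passphrase; drop -f/-N to be prompted as in the steps above.
ssh-keygen -t rsa -b 4096 -N "" -C "tokkio-controller" -f "$HOME/.ssh/id_rsa_tokkio"

# Two files are produced: the private key and its .pub counterpart.
ls -l "$HOME/.ssh/id_rsa_tokkio" "$HOME/.ssh/id_rsa_tokkio.pub"
```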

  • Passwordless sudo access: Ensure the user on the Controller instance has passwordless sudo access.

    • You can test this by running sudo ls /root; the command should not prompt you for a password.

    • If this is not set up, ask your system administrator to enable it.
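If passwordless sudo is not configured, one common approach (subject to your organization's policy, and typically done by the administrator) is a drop-in sudoers entry; the username and file name below are placeholders:

```
# /etc/sudoers.d/tokkio-user  (create/edit with: visudo -f /etc/sudoers.d/tokkio-user)
<your-username> ALL=(ALL) NOPASSWD:ALL
```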

Application instance#

The Application instance (App instance) is where the Tokkio Pipeline will run; it requires specific hardware and software configurations.

Requirements#

  • GPU: Refer to the Reference Workflows for the flavor of the Tokkio Pipeline you intend to run.

  • Operating System: Ubuntu 22.04

  • /opt folder size: 300 GB

  • /var/lib/containerd folder size: 400 GB

  • Kubernetes Considerations: This instance should not have any prior Kubernetes installations running before you run the OneClick scripts for the first time.

  • Passwordless sudo access: Ensure the user on the App instance has passwordless sudo access.

    • You can test this by running sudo ls /root; the command should not prompt you for a password.

    • If this is not set up, ask your system administrator to enable it.

Access#

Access to Tokkio artifacts#

Ensure that you have access to all the artifacts used during bring-up of the Tokkio Pipeline application, e.g., the Tokkio Application Helm chart on NGC.

Essential Skills and Background#

Familiarity with Command-Line-Interface (CLI)#

  • Basic Commands: Users should be comfortable with basic command-line operations, such as navigating directories, executing scripts, and managing files.

  • Environment Configuration: Understanding how environment variables and the PATH are set up on Linux will greatly help in operating the OneClick script.

  • Scripting Basics: Basic scripting knowledge (e.g., shell scripting) is beneficial for understanding how the OneClick script operates and for troubleshooting any issues that may arise.

Familiarity with YAML#

  • YAML Syntax and Structure: YAML is often used for configuration files in cloud-native applications due to its readability and flexibility. The configuration templates used by the OneClick script are in YAML format, so users should be familiar with YAML syntax and structure.

Familiarity with the Kubernetes ecosystem#

The Tokkio pipeline is a cloud-native application and uses technologies such as containerization, Kubernetes, and Helm. Users need to be familiar with these to get the best results from the OneClick scripts and the app.

  • Kubernetes Basics: Users should have a basic understanding of Kubernetes core concepts such as pods, services, and deployments.

  • kubectl: Familiarity with kubectl, the command-line tool used to interact with Kubernetes clusters, including querying the status or logs of running application pods.

  • Helm: Understanding Helm, the package manager for Kubernetes that simplifies application deployment by managing charts (collections of pre-configured Kubernetes resource definitions), and how to use Helm with override values will help in configuring the templates appropriately.
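As an illustration of the override-values concept, a Helm values override file is simply a YAML fragment whose keys shadow the chart's defaults; the keys below are hypothetical and not taken from the Tokkio chart:

```yaml
# my-overrides.yaml -- hypothetical keys, for illustration only
replicaCount: 2
image:
  tag: "1.2.3"
```

Passing such a file via helm install <release> <chart> -f my-overrides.yaml (or listing it under user_value_override_files in the OneClick config) merges it over the chart's default values.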

General troubleshooting techniques#

  • Log analysis & troubleshooting: Users should be able to analyze the logs generated by the OneClick scripts to identify errors or warnings and remediate issues.

Additional considerations#

  • SSH key-based login to the App instance: Make sure the Controller instance and the Application instance are properly networked, and that the Controller instance is configured to communicate with the App instance using passwordless SSH login.

    • On the Controller instance, run ssh-copy-id <app-instance-username>@<app-instance-ip> to copy the Controller instance's public key to the App instance.

Overall Security considerations#

The security of Tokkio in production environments is the responsibility of the end users deploying it. When deploying in a production environment, have security experts review potential risks and threats; define the trust boundaries; secure the communication channels; integrate AuthN and AuthZ with appropriate access controls; keep the deployment, including the containers, up to date; and ensure the containers are secure and free of vulnerabilities.

Tokkio pipeline installation#

Download Oneclick scripts#

  • Once you clone the ACE GitHub repo from NVIDIA/ACE.git, navigate to the baremetal directory.

    $ cd workflows/tokkio/scripts/one-click/baremetal

You should be able to see the envbuild.sh file at the root of this directory. We will use this command to interact with the OneClick scripts. The general options of this command can be seen by running ./envbuild.sh

$ ./envbuild.sh
Usage: ./envbuild.sh (-v|--version)
   or: ./envbuild.sh (-h|--help)
   or: ./envbuild.sh (install/uninstall) (-c|--component <component>) [options]
   or: ./envbuild.sh (info) [options]

install/uninstall components:
-c, --component        one or more of all/infra/platform/app, pass arg multiple times for more than one

install/uninstall options:
-f, --config-file      path to file containing config overrides, defaults to config.yml
-i, --skip-infra       skip install/uninstall of infra component
-p, --skip-platform    skip install/uninstall of platform component
-a, --skip-app         skip install/uninstall of app component
-d, --dry-run          don't make any changes, instead, try to predict some of the changes that may occur
-h, --help             provide usage information

info options:
-f, --config-file      path to file containing config overrides, defaults to config.yml
-h, --help             provide usage information

Note

envbuild.sh with the --component all option installs the infra, platform, and app components.

  • The infra component is responsible for:

    • Installation and configuration of the Kubernetes cluster on the app host and the TURN server flavor (coturn/Twilio) on the turn host.

  • The platform component is responsible for:

    • Installing the local-path-provisioner chart.

    • Installing metrics- and logging-related charts.

  • The app component is responsible for:

    • Installing the Kubernetes namespace required for the tokkio chart.

    • Installing the Kubernetes secrets required for the tokkio chart.

    • Installing the tokkio chart.

    • Installing the Tokkio UI as a static website using an Nginx server on the app instance.

With the help of envbuild.sh, you can uninstall and re-install the app component using the commands below.

#Uninstall app component using below command
./envbuild.sh uninstall --component app --config-file ./<my-l40-config.yml>

#Install app component using below command
./envbuild.sh install --component app --config-file ./<my-l40-config.yml>

Prepare config-template file#

Make a copy of config-template.yml with a name of your choice, e.g., cp config-template.yml my-l40-config.yml. You can then populate the config file based on the definition of each attribute.

All the attributes of config-template.yml are explained below.

Config template#

Each attribute is listed as parameter name (type, optional if applicable): description.

  • schema_version (string): Config-template schema version.
  • name (string): A unique name to identify the infrastructure resources being created.
  • spec (map): Infrastructure and application configuration.
  • spec > infra (map): Infrastructure configuration.
  • spec > infra > csp (string): Cloud service provider name; in this case, bm.
  • spec > infra > backend (map): Terraform backend configuration to store the state of the infrastructure; for bm it is managed locally.
  • spec > infra > configs (map): Additional infrastructure configuration.
  • spec > infra > configs > cns (map, optional): Nvidia Cloud Native Stack configuration. More details on Cloud Native Stack can be found at NVIDIA/cloud-native-stack.
  • spec > infra > configs > cns > version (string, optional): The version of Nvidia Cloud Native Stack to install on the clusters. Defaults to 12.2.
  • spec > infra > configs > cns > git_ref (string, optional): The git commit hash of Nvidia Cloud Native Stack; defaults to the master branch's latest commit hash.
  • spec > infra > configs > cns > override_values (map, optional): Nvidia Cloud Native Stack values to override while setting up a cluster.
  • spec > infra > configs > cns > override_values > cns_nvidia_driver (bool, optional): Set to yes to install the NVIDIA driver using the runfile method; otherwise no.
  • spec > infra > configs > cns > override_values > gpu_driver_version (string, optional): Overrides gpu_driver_version while installing Nvidia Cloud Native Stack.
  • spec > infra > configs > ssh_private_key_path (string): Absolute path of the private key used to SSH to the hosts.
  • spec > infra > configs > ssh_public_key (string): Absolute path of the public-key counterpart of the private key used to SSH to the hosts.
  • spec > infra > configs > additional_ssh_public_keys (list, optional): Contents of the public counterparts of additional keys that will be used to SSH to the hosts.
  • spec > infra > configs > clusters (map): Definition of the clusters to be created.
  • spec > infra > configs > clusters > app (map): Definition of the app cluster to be created.
  • spec > infra > configs > clusters > app > master (map): Definition of the master node of the app cluster.
  • spec > infra > configs > clusters > app > master > user (string): SSH username of the master node of the app cluster.
  • spec > infra > configs > clusters > app > master > host (string): IP address of the master node of the app cluster.
  • spec > infra > configs > clusters > app > ports (map): Definitions of the ports to be exposed from the app.
  • spec > infra > configs > clusters > app > ports > app (map): Definition of the app port.
  • spec > infra > configs > clusters > app > ports > app > port (number): Port number where the app is running.
  • spec > infra > configs > clusters > app > ports > grafana (map): Definition of the Grafana port.
  • spec > infra > configs > clusters > app > ports > grafana > port (number): Port number where Grafana is running.
  • spec > infra > configs > clusters > app > ports > grafana > path (string): Grafana path for the landing URL.
  • spec > infra > configs > clusters > app > ports > prometheus (map): Definition of the Prometheus port.
  • spec > infra > configs > clusters > app > ports > prometheus > port (number): Port number where Prometheus is running.
  • spec > infra > configs > clusters > app > ports > prometheus > path (string): Prometheus path for the landing URL.
  • spec > infra > configs > clusters > app > ports > kibana (map): Definition of the Kibana port.
  • spec > infra > configs > clusters > app > ports > kibana > port (number): Port number where Kibana is running.
  • spec > infra > configs > clusters > app > ports > kibana > path (string): Kibana path for the landing URL.
  • spec > infra > configs > clusters > app > features (map): Feature flags of the app cluster.
  • spec > infra > configs > clusters > app > features > cns (bool): Always true; used to install Nvidia Cloud Native Stack.
  • spec > infra > configs > clusters > app > features > app (bool): Always true; used to install the Tokkio app and other components.
  • spec > infra > configs > clusters > turn (map): Definition of the turn cluster to be created.
  • spec > infra > configs > clusters > turn > master (map): Definition of the master node of the turn cluster.
  • spec > infra > configs > clusters > turn > master > user (string): SSH username of the turn node.
  • spec > infra > configs > clusters > turn > master > host (string): IP address of the turn node.
  • spec > infra > configs > clusters > turn > features (map): Feature flags of the turn cluster.
  • spec > infra > configs > clusters > turn > features > coturn (bool): Always set to true.
  • spec > platform (map): Configuration to change the default foundational config to be used.
  • spec > platform > configs (map): Foundational configuration.
  • spec > platform > configs > k8s_namespace (string, optional): Kubernetes namespace where the foundational charts are deployed; defaults to platform.
  • spec > platform > configs > k8s_secrets (list): Kubernetes secrets needed for the foundational charts.
  • spec > platform > secrets > ngc_cli_api_key (string): NGC CLI API key used to download the Helm charts.
  • spec > app > configs > app_settings (map): Configuration to change the default app settings to be used.
  • spec > app > configs > app_settings > k8s_namespace (string, optional): Kubernetes namespace where the app chart is deployed; defaults to app.
  • spec > app > configs > app_settings > helm_chart (map, optional): Helm chart config for the app chart to be deployed.
  • spec > app > configs > app_settings > helm_chart > repo (map, optional): Configuration of the remote repo used for the app Helm chart.
  • spec > app > configs > app_settings > helm_chart > repo > enable (bool, optional): Flag to use the app Helm chart from a remote repo; defaults to true.
  • spec > app > configs > app_settings > helm_chart > repo > repo_url (string, optional): Repo URL for the app Helm chart; defaults to https://helm.ngc.nvidia.com/nvidia/ace.
  • spec > app > configs > app_settings > helm_chart > repo > chart_name (string, optional): App Helm chart name to fetch from the remote repo; defaults to ucs-tokkio-app-base-3-stream-llm-rag-3d-ov.
  • spec > app > configs > app_settings > helm_chart > repo > chart_version (string, optional): App Helm chart version to fetch from the remote repo; defaults to 4.1.4.
  • spec > app > configs > app_settings > helm_chart > repo > release_name (string, optional): Release name for the app deployed via the Helm chart; defaults to tokkio-app.
  • spec > app > configs > app_settings > helm_chart > repo > user_value_override_files (list, optional): Absolute paths of user override values.yml files used for the app chart deployment.
  • spec > app > configs > app_settings > helm_chart > local (map, optional): Configuration for deploying the app Helm chart from a locally present chart.
  • spec > app > configs > app_settings > helm_chart > local > enable (bool, optional): Set to true to use a locally present app Helm chart.
  • spec > app > configs > app_settings > helm_chart > local > path (string, optional): Absolute path of the locally present Helm chart.
  • spec > app > configs > app_settings > helm_chart > local > release_name (string, optional): Release name for the app deployed via the Helm chart; defaults to tokkio-app.
  • spec > app > configs > app_settings > helm_chart > local > user_value_override_files (list, optional): Absolute paths of user override values.yml files used for the app chart deployment.
  • spec > app > configs > app_settings > k8s_secrets (list): Kubernetes secrets to be deployed.
  • spec > app > configs > turn_server_settings (map, optional): Configuration of the TURN server to be used by the app.
  • spec > app > configs > turn_server_settings > coturn (map, optional): Configuration details for coturn as the TURN server.
  • spec > app > configs > turn_server_settings > coturn > username (string, optional): Coturn server username used while setting up coturn; defaults to foo.
  • spec > app > configs > turn_server_settings > coturn > password (string, optional): Coturn server password used while setting up coturn; defaults to bar.
  • spec > app > configs > turn_server_settings > coturn > realm (string, optional): Realm name for the coturn server; defaults to mydummyt.org.
  • spec > app > configs > turn_server_settings > twilio (map, optional): Configuration details for Twilio as the TURN server.
  • spec > app > configs > turn_server_settings > twilio > account_sid (string, optional): account_sid from the Twilio account; defaults to an empty string.
  • spec > app > configs > turn_server_settings > twilio > auth_token (string, optional): auth_token from the Twilio account; defaults to an empty string.
  • spec > app > configs > ui_settings (map, optional): Configuration to override the default UI.
  • spec > app > configs > ui_settings > resource (map, optional): Configuration for the UI resource to be used.
  • spec > app > configs > ui_settings > resource > ngc (map, optional): Configuration of NGC to download the UI resource from.
  • spec > app > configs > ui_settings > resource > ngc > org (string, optional): NGC organization of the UI resource to be used.
  • spec > app > configs > ui_settings > resource > ngc > team (string, optional): NGC team of the UI resource to be used.
  • spec > app > configs > ui_settings > resource > ngc > name (string, optional): NGC resource name of the UI resource to be used.
  • spec > app > configs > ui_settings > resource > ngc > version (string, optional): NGC resource version of the UI resource to be used.
  • spec > app > configs > ui_settings > resource > ngc > file (string, optional): NGC resource file name of the UI resource to be used.
  • spec > app > configs > ui_settings > user_env_vars (map, optional): Configuration to override default UI settings.
  • spec > app > secrets > ngc_cli_api_key (string): NGC CLI API key used to download the UI resource and Helm chart.

Note

It is recommended to set spec > infra > configs > cns > override_values > cns_nvidia_driver to yes to support installation of the NVIDIA driver on the latest kernel versions.
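Putting the main attributes together, a bare-metal config might look roughly like the outline below. This is an illustrative sketch only (placeholder values, several sections elided); always start from the actual config-template.yml, or the samples under dist/config-template-examples, as the authoritative structure:

```yaml
# Illustrative outline only -- start from config-template.yml
schema_version: "<as-per-template>"
name: tokkio-bm
spec:
  infra:
    csp: bm
    backend: {}                      # state is managed locally for bm
    configs:
      cns:
        override_values:
          cns_nvidia_driver: yes     # recommended, see the note above
      ssh_private_key_path: /home/my-user/.ssh/id_rsa
      ssh_public_key: /home/my-user/.ssh/id_rsa.pub
      clusters:
        app:
          master:
            user: "{{ lookup('env', 'APP_HOST_SSH_USER') }}"
            host: "{{ lookup('env', 'APP_HOST_IPV4_ADDR') }}"
          features:
            cns: true
            app: true
        turn:
          master:
            user: "{{ lookup('env', 'COTURN_HOST_SSH_USER') }}"
            host: "{{ lookup('env', 'COTURN_HOST_IPV4_ADDR') }}"
          features:
            coturn: true
  app:
    configs:
      app_settings: {}               # chart repo/version overrides go here
    secrets:
      ngc_cli_api_key: "{{ lookup('env', 'NGC_CLI_API_KEY') }}"
```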

Prepare environment variables#

The config template YAML file contains several inputs about the infrastructure and the application's needs. For ease of use, some of these are wired to look up environment variables; for example, {{ lookup('env', 'NGC_CLI_API_KEY') }} reads the NGC_CLI_API_KEY environment variable using the lookup function. This means you can set an environment variable and the OneClick script will pick up its value automatically.

Prepare a file to hold these environment variables and their values (e.g., vi my-env-file.env) and populate them with actual values. An example is shown below.

cat my-env-file.env
export OPENAI_API_KEY="<replace-with-actual-value>"
export NGC_CLI_API_KEY="<replace-with-actual-value>"
export NVIDIA_API_KEY="<replace-with-actual-value>"
export APP_HOST_IPV4_ADDR="<replace-with-actual-value>"
export APP_HOST_SSH_USER="<replace-with-actual-value>"
export COTURN_HOST_IPV4_ADDR="<replace-with-actual-value>"
export COTURN_HOST_SSH_USER="<replace-with-actual-value>"

Note

Currently, installation of the TURN server (coturn/Twilio) is supported on the same machine as the App instance, so only one machine is needed. The value of APP_HOST_IPV4_ADDR is therefore the same as COTURN_HOST_IPV4_ADDR, and APP_HOST_SSH_USER is the same as COTURN_HOST_SSH_USER. You can refer to the sample config templates in the dist/config-template-examples folder.

Use the source command to load these variables into your current shell session. The source command reads and executes commands from the specified file in the current shell environment, making the variables defined in the file available in the shell.

Caution

If you modify your <my-env-file.env> file or start a new shell, you will have to run source <my-env-file.env> again before running the ./envbuild.sh command.
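The sourcing step can be sketched as below. The snippet recreates a minimal one-line env file so it is self-contained; in practice, source the my-env-file.env you prepared earlier, and the value shown is a placeholder:

```shell
# Recreate a minimal env file so this snippet is self-contained
# (in practice, use the my-env-file.env you prepared earlier).
printf 'export NGC_CLI_API_KEY="replace-me"\n' > my-env-file.env

# 'source' executes the file in the CURRENT shell, so the exported
# variables remain visible when you later run ./envbuild.sh.
source my-env-file.env

# Spot-check that the variable is now set.
echo "${NGC_CLI_API_KEY}"   # -> replace-me
```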

Installing#

  • Running OneClick script

    ./envbuild.sh install --component all --config-file ./<my-l40-config.yml>
    
  • Once the script installation completes, capture the access_urls section for future reference. This section lists the various URLs configured as part of the installation. Example output is shown below.

    access_urls:
    app:
       app: http://<app-instance-ip-address>:80/
       grafana: http://<app-instance-ip-address>:32300/login
       kibana: http://<app-instance-ip-address>:31565/app/kibana
       prometheus: http://<app-instance-ip-address>:30090/graph
    turn: {}
    ssh_command:
    app:
       master: ssh -i /home/my-user/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null <username>@<app-instance-ip-address>
    turn:
       master: ssh -i /home/my-user/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null <username>@<app-instance-ip-address>
    
  • Verifying installation: After the installation steps complete, it may take a while (up to ~60 minutes) for the application to come up, depending on model initialization, other application-specific initialization activities, and network speed. Use the steps below to check whether the application is up before accessing the UI. On the Application instance, run kubectl get pods -n <application-namespace>. Example output of this command is shown below.

    $ kubectl get po -n app
    NAME                                                        READY   STATUS    RESTARTS      AGE
    a2f-a2f-deployment-98c7fb777-x6rz5                          1/1     Running   0             1h10m
    ace-agent-chat-controller-deployment-0                      1/1     Running   0             1h10m
    ace-agent-chat-engine-deployment-784444785b-7kc72           1/1     Running   0             1h10m
    ace-agent-plugin-server-deployment-65956c5d5d-jlxfk         1/1     Running   0             1h10m
    anim-graph-sdr-envoy-sdr-deployment-5b7cc55b6b-7rj95        3/3     Running   0             1h10m
    chat-controller-sdr-envoy-sdr-deployment-78b54b7f86-spshn   3/3     Running   0             1h10m
    ds-sdr-envoy-sdr-deployment-85bbfdb4c4-8lzch                3/3     Running   0             1h10m
    ds-visionai-ds-visionai-deployment-0                        1/1     Running   0             1h10m
    ia-animation-graph-microservice-deployment-0                1/1     Running   0             1h10m
    ia-omniverse-renderer-microservice-deployment-0             1/1     Running   0             1h10m
    mongodb-mongodb-666765487c-msh74                            1/1     Running   0             1h10m
    occupancy-alerts-api-app-84576db5c9-7z4wm                   1/1     Running   0             1h10m
    occupancy-alerts-app-5cfcc9f75-zpmtz                        1/1     Running   0             1h10m
    redis-redis-79c99cdd97-7wk5b                                1/1     Running   0             1h10m
    redis-timeseries-redis-timeseries-69bb884965-xmppg          1/1     Running   0             1h10m
    renderer-sdr-envoy-sdr-deployment-99f99d458-4hcql           3/3     Running   0             1h10m
    riva-speech-547fb9b8c5-7rzn6                                1/1     Running   0             1h10m
    tokkio-ingress-mgr-deployment-86897998cc-gn4np              3/3     Running   0             1h10m
    tokkio-ui-server-deployment-7f4bc5c5ff-td2g2                1/1     Running   0             1h10m
    tokkio-umim-action-server-deployment-674cccc898-f6zxn       1/1     Running   0             1h10m
    triton0-bbd77d78f-w496x                                     1/1     Running   0             1h10m
    vms-vms-67876bcb9b-vtm8h                                    1/1     Running   0             1h10m
    

Validating#

  • Browser settings: This installation does not set up any CA certificates on the bare-metal installation. To overcome browser restrictions, we need to enable some flags.

    • On the machine where you want to access the UI, open Chrome browser and navigate to chrome://flags. Follow equivalent steps if you are using any other browser.

    • Search for Insecure origins treated as secure setting

    • Select Enable

    • Set/add the following origins under it: http://<your-app-instance-ip>:80, http://<your-app-instance-ip>:30888, ws://<your-app-instance-ip>:30888

    • Follow the Chrome instructions to Relaunch browser and complete this step.

  • Access the app: Once all the pods have come to Ready status, you can access the Tokkio pipeline UI at http://<your-app-instance-ip>:80.

    • Handling the Site is not secure warning: this warning arises because we are not using a CA-signed certificate. Confirm the IP address and port, then take the appropriate action to continue to the site.

    • Granting permissions to the browser: on first use, the browser should prompt for permissions such as microphone, speaker, or camera, which are necessary for the UI to operate. Upon accepting the permissions, the UI should load.

Application UI

Un-installing#

If you choose to uninstall the application and UI that the OneClick script installed, run the uninstall command with the appropriate options. The example below uninstalls all components.

./envbuild.sh uninstall --component all --config-file ./my-l40-config.yml

Known-Issues#

  1. If you set spec.app.configs.app_settings.k8s_namespace = default, uninstallation of the app component using envbuild.sh throws the error below.

    • fatal: [app-master]: FAILED! => {"changed": false, "error": 403, "msg": "Namespace default: Failed to delete object: b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"namespaces \\\\\"default\\\\\" is forbidden: this namespace may not be deleted\",\"reason\":\"Forbidden\",\"details\":{\"name\":\"default\",\"kind\":\"namespaces\"},\"code\":403}\\n'", "reason": "Forbidden", "status": 403}

    To prevent this, avoid using the default Kubernetes namespace for spec.app.configs.app_settings.k8s_namespace.

  2. Reinstallation of the tokkio chart on the same cluster is not currently supported.