This section describes how to deploy Cluster Bring-Up Web on a Linux machine.

Prerequisites

Python 3.6 or greater is required on the host where the framework is to be deployed.

System Requirements

The system that runs the cluster bring-up framework must satisfy the following requirements:

At least 4GB of memory
At least 2 CPU cores
At least 30GB of disk space
Running Kubernetes
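
The hardware minimums above can be checked with a short script before starting the deployment. The following is a minimal sketch, not part of COT: the function names `unmet_requirements` and `probe_host` are hypothetical, the probe assumes a Linux host, and it does not verify that Kubernetes is running.

```python
import os
import shutil

# Minimums taken from the requirements list above.
MIN_MEMORY_GB = 4
MIN_CPU_CORES = 2
MIN_DISK_GB = 30


def unmet_requirements(memory_gb, cpu_cores, disk_gb):
    """Return one message per minimum that the given values do not meet."""
    problems = []
    if memory_gb < MIN_MEMORY_GB:
        problems.append(f"memory: {memory_gb:.1f} GB < {MIN_MEMORY_GB} GB")
    if cpu_cores < MIN_CPU_CORES:
        problems.append(f"CPU cores: {cpu_cores} < {MIN_CPU_CORES}")
    if disk_gb < MIN_DISK_GB:
        problems.append(f"disk: {disk_gb:.1f} GB < {MIN_DISK_GB} GB")
    return problems


def probe_host(path="/"):
    """Read total memory, CPU count, and free disk space of a Linux host."""
    gib = 1024 ** 3
    memory = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / gib
    cpus = os.cpu_count() or 0
    disk = shutil.disk_usage(path).free / gib
    return memory, cpus, disk
```

Calling `unmet_requirements(*probe_host())` returns an empty list when all three minimums are met. A machine sold as "4GB" may report slightly less usable memory, so treat a near miss on the memory check as a prompt to look closer rather than a hard failure.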

Supported Operating Systems

CentOS 8 or later 64-bit (x86)
Red Hat Enterprise Linux 8.2 or later 64-bit (x86)
Ubuntu 20.04 or later 64-bit (x86)

Supported Deployment Platforms

NVIDIA currently supports running the cluster bring-up framework as a containerized application using Docker images deployed to a Kubernetes cluster.

In the following sections, you'll find deployment details and instructions for a Kubernetes platform.

Deploying via Kubernetes

This section describes how to deploy Cluster Bring-Up WEB in a Kubernetes cluster.

Installation

The installation is performed by using a virtual machine (VM) image which includes COT with all its dependencies.

This section shows how to install and deploy the cluster bring-up framework in offline mode, which requires the user to download and restore a machine image with most of the dependencies already present on the machine.

Prerequisites

The following is a list of requirements that must be met:

Clonezilla version 3.0.1-8

Installation Steps

For offline installation, perform the following steps:

Download the tar image file located here.
Move the downloaded file to the data center and untar it.
Restore the image on your machine via Clonezilla. See section "Restore Image" for the procedure.
Log into the installation machine as the root user with the password "password".

Make sure Kubernetes is running in Ready status:

$ kubectl get nodes
NAME                 STATUS   ROLES                  AGE   VERSION
ib-node-01-cot       Ready    control-plane,master   39m   v1.24.2+k3s1

Change directory to the location of the installation script located under /cot:

$ pwd
/cot
$ ls -la
total 24
drwxr-xr-x  4 root root 4096 Jul 21 14:36 .
drwxr-xr-x 21 root root 4096 Jul 21 15:03 ..
drwxr-xr-x  3 root root 4096 Jul 21 14:35 ansible
drwxr-xr-x  5 root root 4096 Jul 21 14:36 installer
-rwxr-xr-x  1 root root  583 Jul 21 14:36 install.sh
-rwxr-xr-x  1 root root  393 Jul 21 14:36 uninstall.sh

Run the installation script with the --offline-mode flag:

$ ./install.sh --offline-mode
Installing cluster-bringup-service
Installing awx-operator
Installing awx-cluster-bringup
AWX is currently upgrading
Importing AWX resources
Installation finished successfully
AWX interface URL: http://10.43.144.44:80
AWX username: admin
AWX password: SxFLNsjpjAuoUICJDl0XUvdjDDmQmBWf
AWX OAuth token: jacb63Ac3bzyXXTTzsYbzAdA1mymaP
API URL: http://cluster-bringup:5000/api
PyPI URL: http://cluster-bringup:5000/pypi/simple/
Downloads URL: http://cluster-bringup:5000/downloads/
Files folder: /opt/nvidia/cot/files
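
The readiness check on `kubectl get nodes` in the steps above can also be scripted. The following is a minimal sketch that parses the tabular text output shown earlier; the function name `not_ready_nodes` is hypothetical, and the parser assumes the default column layout with a `STATUS` header.

```python
def not_ready_nodes(kubectl_output):
    """Return names of nodes whose STATUS column is not exactly 'Ready'."""
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    if not lines:
        return []
    header, rows = lines[0].split(), lines[1:]
    status_col = header.index("STATUS")
    # The first column of every row is the node name.
    return [row.split()[0] for row in rows if row.split()[status_col] != "Ready"]


SAMPLE = """\
NAME                 STATUS   ROLES                  AGE   VERSION
ib-node-01-cot       Ready    control-plane,master   39m   v1.24.2+k3s1
"""

print(not_ready_nodes(SAMPLE))  # prints []
```

An empty list means every listed node reports Ready. In practice, the text would come from running `kubectl get nodes` on the installation machine.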

Restore Image

As part of the installation process, an image with Kubernetes and AWX-Operator already present must be restored on a machine. To restore, the Clonezilla software must be utilized.

Restore VM Using Hypervisor

The Clonezilla restoration procedure can also be used for virtualization.

The following subsections provide the list of virtualization solutions that are supported.

KVM

Kernel-based Virtual Machine, or KVM, is a full virtualization solution for Linux on x86 hardware containing virtualization extensions. Using KVM, users can run multiple VMs running unmodified Linux or Windows images. Each VM has private virtualized hardware: a network card, disk, graphics adapter, etc.

Dependencies

The following is a list of required dependencies:

virt-manager application

Restoration Steps

Follow these steps to restore the image on a VM. Each step begins with the name of the machine on which the action is performed:

On the machine running a hypervisor, check if there is enough space in the root and /images directories.

$ df -h
Filesystem        Size  Used Avail Use% Mounted on
devtmpfs           91G     0   91G   0% /dev
tmpfs              91G     0   91G   0% /dev/shm
tmpfs              91G   35M   91G   1% /run
tmpfs              91G     0   91G   0% /sys/fs/cgroup
/dev/sda2          44G  8.1G   33G  20% /
/dev/sda1         2.0G  226M  1.6G  13% /boot
/dev/sda5         392G   99G  274G  27% /images
tmpfs              19G  4.0K   19G   1% /run/user/0
l1:/vol/s1        2.9T  2.0T  931G  69% /auto/s1

On the machine running the hypervisor, download Clonezilla ISO and move it to the /tmp directory.

$ ll /tmp
total 396484
-rw-r--r--   1 qemu qemu 379584512 Jul 17 16:06 clonezilla-live-3.0.1-8-amd64.iso

On the machine running the hypervisor, create a new directory in the /images directory with the name of the newly created machine.

On the machine running the hypervisor, create a 65G disk image.

$ qemu-img create -f raw /images/<machine_name>/<machine_name>-bk-disk.img 65G
Formatting '/images/<machine_name>/<machine_name>-bk-disk.img', fmt=raw size=69793218560

On the machine running the hypervisor, open the Virtual Manager GUI.

$ virt-manager

In the Virtual Manager GUI, click the "Create a virtual machine" icon on the top left.
Create a new VM (5 steps):
1. Select "Local install media".
2. For "Choose ISO", select the Clonezilla ISO placed in /tmp, uncheck "Automatically detect from the installation media", then type and select the OS of choice (must be supported).
3. Memory: 4096; CPUs: 2
4. Choose "Select or create custom storage" and browse to the disk image created earlier.
5. Type in a unique machine name and check the "Customize configuration before install" box
6. Click "Finish".
In the Virtual Manager GUI, change the boot order:
1. Open the settings of the VM you are restoring on.
2. Boot Options.
3. Check the "Clonezilla CDROM" box which is linked to the Clonezilla ISO from step 2 above.
4. Click the up arrow to move it up in the boot order.
5. Click "Apply".
6. Click "Begin Installation".
After restarting the machine, the Clonezilla software will boot. Follow these steps to successfully restore the image:
1. Clonezilla live.
2. English.
3. Keep.
4. Start.
5. device-image.
6. ssh_server.
7. dhcp.
8. Type the IP address of the machine which stores the untarred file from step 2 of section "Installation Steps".
9. Port stays at "22" (default ssh).
10. Keep "root" as user.
11. Type the directory path which stores the untarred file from step 2 of section "Installation Steps".
12. Type the password of the root user.
13. Mode: Beginner.
14. restoredisk.
15. Select the name of your image.
16. Select the name of your storage.
17. Yes, check.
18. Power off.
In the Virtual Manager GUI, select "Change Boot Order". Then move the disk image created in step 4 to the top of the list, ahead of Clonezilla (CDROM).
In the Virtual Manager GUI, select "Force off" and "Start VM".
After booting, log in as root user with the password "password".

(Restore) Change the name of the machine since it has the cloned machine name configured.

$ vi /etc/hostname
$ vi /etc/hosts

(Restore) If no Internet access is available on the machine, change the network interface in use.

$ ifconfig -a
$ ethtool ens3 # Link detected: no
$ dhclient 
$ ethtool ens3
$ ifconfig -a
$ vim /etc/netplan/00-installer-config.yaml

(Restore) Reboot machine → reboot.
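
The 65G size passed to `qemu-img create` in the steps above is reported back as 69793218560 bytes because qemu-img reads the `G` suffix as a binary gigabyte (1024^3 bytes). The short sketch below reproduces that arithmetic; the helper name `qemu_size_bytes` is hypothetical.

```python
GIB = 1024 ** 3  # bytes in one binary gigabyte, the unit behind the "G" suffix


def qemu_size_bytes(size_gib):
    """Convert a size given with the G suffix into bytes."""
    return size_gib * GIB


print(qemu_size_bytes(65))  # prints 69793218560
```

The same conversion is useful when comparing the image size against the available space reported for the /images directory.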

Restore on Bare Metal

This section explains how to restore the image on a physical computer server.

Supported Hardware

ProLiant DL380p Gen8

Restoration Steps

Connect to the machine's remote management interface (iLO for HPE).
Mount/add Clonezilla ISO via: Virtual Drives → Image File CDROM → Select Clonezilla ISO
Reset the machine: Power Switch → Reset.
Boot via Clonezilla ISO: Press F11 on startup → select CDROM Clonezilla ISO for boot.
Continue from step 9 of section "Restore VM Using Hypervisor" to the end.

Warning

For additional information on HPE's remote management, visit HPE's support website.

Installation Script

The installation script, install.sh, performs the following operations:

Creates a new virtual environment for installation
Ensures the dependencies for the installer are installed
Deploys cluster bring-up WEB framework on Kubernetes platform
Deploys cluster bring-up AWX framework on Kubernetes platform
Configures AWX resources for cluster orchestration

Usage:

Warning

Make sure you are in the folder of the installation script (under /cot).

./install.sh [OPTIONS]

The following options are available for the installation script:

Option	Description
`--hostfile`	Specify path to hosts file that contains hostnames for the inventory
`--hostname`	Specify end-host list expression that represents hostnames for the inventory
`--ib-host-manager`	Specify hostname to be a member of the `ib_host_manager` group
`--username`	Specify username to authenticate against the hosts
`--password`	Specify password (encoded in base64) to authenticate against the hosts
`--offline-mode`	Specify to run the installation script in offline mode. Supported only when using COT image.
`--config_file`	Specify the path to the configuration file to incorporate into the installation
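
The `--password` option expects a base64-encoded value. The following is a minimal sketch of producing one, assuming the standard base64 alphabet and a UTF-8 password (the table above only states "encoded in base64"); the helper name `encode_password` is hypothetical.

```python
import base64


def encode_password(plain_text):
    """Return the base64 encoding of a UTF-8 password as an ASCII string."""
    return base64.b64encode(plain_text.encode("utf-8")).decode("ascii")


print(encode_password("password"))  # prints cGFzc3dvcmQ=
```

Note that base64 is an encoding, not encryption: anyone who can read the encoded value can recover the password.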

For example:

$ ./install.sh --hostname ib-node-0[1-2,5] --ib-host-manager ib-node-01
Installing cluster-bringup-web
Installing awx-operator
Installing awx-cluster-bringup
AWX is currently upgrading
Importing AWX resources
Installation finished successfully
AWX interface URL: http://cluster-bringup:31873
AWX username: admin
AWX password: NDaXP7ULFjoHdxNwEYxLPRYx6PNWxwoX
AWX OAuth token: ihj219yX6w5cpmgqvHy923nyQTjuoB
API URL: http://cluster-bringup:5000/api
PyPI URL: http://cluster-bringup:5000/pypi/simple/
Downloads URL: http://cluster-bringup:5000/downloads/
Files folder: /opt/nvidia/cot/files

In this example, 3 hosts named ib-node-01, ib-node-02, and ib-node-05 are added to the inventory.

In addition, the ib-node-01 host is configured to be a member of the ib_host_manager group for the In-Band operations.
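
To illustrate how a host-list expression such as `ib-node-0[1-2,5]` maps to individual hostnames, the sketch below expands a single bracketed group. It is an illustrative re-implementation, not the parser used by install.sh; the function name `expand_hosts` is hypothetical, and only one bracket group per expression is handled.

```python
import re


def expand_hosts(expression):
    """Expand one bracketed range expression, e.g. 'ib-node-0[1-2,5]'."""
    match = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", expression)
    if not match:
        return [expression]  # plain hostname, nothing to expand
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a single number is a range of one
        width = len(start)  # keep zero padding such as '01' inside brackets
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


print(expand_hosts("ib-node-0[1-2,5]"))
# prints ['ib-node-01', 'ib-node-02', 'ib-node-05']
```

Running the expansion before the installation is a quick way to confirm that an expression covers exactly the intended hosts.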

Configuration File

This section provides the information required to build a YAML configuration file and apply it to COT.

The YAML configuration file consists of key-value pairs and can be applied by providing it to the installation script when installing COT, or by using the COT API on an installed COT environment.

A full configuration file with defaults and notes is provided with the COT and is located in the main COT directory.

Example for using the configuration file flag with the installation script:

$ ./install.sh --config_file /PATH/TO/YAML

For applying the configuration file to an existing COT environment, see section "COT API Apply".

Pass-Fail Criteria

To define specific pass/fail criteria, use the pass_fail_criteria variable. Its value must be a dictionary that maps a job template (playbook name) to its user-defined criteria (also a dictionary). The criteria dictionary should contain two special keys, max_fail_percentage and action:

max_fail_percentage expects an integer from 0-100 as its value. The value represents the percentage of failures which are acceptable during the execution of the supported job template. Its default value is 0, which means that in the case of any failures (one host or more) the job template fails.
action defines the operation to perform if the actual failure percentage is greater than the max_fail_percentage value

Supported job template actions (operation types):

Action/Operation	Description
`stop`	Fails the execution of the job

Playbook name (key names supported for pass_fail_criteria) to job template name mapping:

Playbook Name	Job Template Name
`hca_fw_update`	HCA Firmware Update
`ib_hca_fw_update`	IB HCA Firmware Update
`ib_cable_fw_update`	IB Cable Firmware Update
`ib_switch_fw_update`	IB Externally Managed Switch Firmware Update
`mlnxos_configure`	MLNX-OS Configure
`mlnxos_upgrade`	MLNX-OS Upgrade
`ib_router_configuration`	IB NDR Router Configuration

Example of the pass_fail_criteria variable:

pass_fail_criteria:
  hca_fw_update:
    max_fail_percentage: 40
    action: stop
  ib_switch_fw_update:
    max_fail_percentage: 80
    action: stop

In this example, the user provides criteria for two job templates: HCA Firmware Update (hca_fw_update) and IB Externally Managed Switch Firmware Update (ib_switch_fw_update).

For the hca_fw_update job template, max_fail_percentage is set to 40. With 3 hosts in total, if only one host fails, the job template passes (33% actual failure, which is smaller than 40%). If two hosts fail, the job template fails (67% actual failure, which is greater than 40%).
For the ib_switch_fw_update job template, max_fail_percentage is set to 80. For this job template to fail, over 80% of the hosts must fail.
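
The pass/fail arithmetic described above can be sketched as follows. This is an illustration of the rule as stated in this section (the job fails only when the actual failure percentage is greater than max_fail_percentage), not COT's own code; the function name `job_template_passes` is hypothetical.

```python
def job_template_passes(failed_hosts, total_hosts, max_fail_percentage=0):
    """Pass when the actual failure percentage does not exceed the maximum."""
    if total_hosts <= 0:
        raise ValueError("total_hosts must be positive")
    actual = 100 * failed_hosts / total_hosts
    return actual <= max_fail_percentage


print(job_template_passes(1, 3, 40))  # prints True  (33% <= 40%)
print(job_template_passes(2, 3, 40))  # prints False (67% > 40%)
print(job_template_passes(1, 3))      # prints False (default maximum is 0)
```

With a maximum of 80, four failed hosts out of five (exactly 80%) still pass, because the failure percentage must be over the maximum for the action to trigger.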

COT Summary Report Configuration

The COT Summary Report can be configured using the summary_report variable.

The following are the options for configuring the COT Summary Report:

Key	Variable Name	Description	Type	Default
`analysis_section`	`high_ber_table_limit`	The maximum number of ports (rows) to display in the High BER table.	Integer	10

Example:

summary_report:
  analysis_section:
    high_ber_table_limit: 12

COT Thresholds

This section describes how to configure thresholds for COT, using the thresholds variable.

By default, common thresholds are configured in the provided configuration file.

To add/modify thresholds, refer to the following table.

Each threshold is described by a combination of the following keys:

Name	Description	Options
ASIC	ASIC of the device	Common ASIC types: `7nm`, `16nm`
Link active speed	The ACTIVE link speed	Common speed types: `NDR`, `HDR`, `EDR`
FEC Type	Name of the FEC type	Any string that represents an FEC type (as presented in CollectX), or `ALL_FEC` to set thresholds for all FEC types. Common FEC types: `No_FEC`, `STD_RS-FEC_RS_528_514`, `STD_LL_RS-FEC_RS_271_257`, `STD_RS-FEC_RS_544_514`, `RS-FEC_544_514_PLR`, `LL-FEC_271_257_PLR`, `ETH_Consortium_LL_50G_RS_FEC_PLR_272_257+1`, `ALL_FEC`
Cable technology	Technology of the cable (copper/optic)	`DACs` (for copper), `ACC` (for optic), `Active` (for optic), `ALL_CABLES`
Threshold type	Type of the threshold	`raw_ber`, `effective_ber`, `symbol_ber`, `effective_error`, `symbol_error`

The following is an example for a thresholds configuration section:

thresholds:
  7nm:
    NDR:
      STD_RS-FEC_RS_544_514:
        DACs:
          raw_ber: 1.00E-07
          effective_ber: 1.00E-13
          symbol_ber: 1.00E-13
          symbol_error: 0
        ACC:
          raw_ber: 1.00E-07
          effective_ber: 1.00E-13
          symbol_ber: 1.00E-13
          symbol_error: 0
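
The nesting of the thresholds section (ASIC, link speed, FEC type, cable technology, threshold type) can be pictured as a chain of dictionary lookups. The sketch below mirrors part of the example above as a Python dictionary. The function name `lookup_threshold` is hypothetical, and the fallback order (an exact key first, then the `ALL_FEC` or `ALL_CABLES` wildcard) is an assumption made for illustration; COT's actual resolution order is not documented here.

```python
THRESHOLDS = {
    "7nm": {"NDR": {"STD_RS-FEC_RS_544_514": {
        "DACs": {"raw_ber": 1.00e-07, "effective_ber": 1.00e-13,
                 "symbol_ber": 1.00e-13, "symbol_error": 0},
    }}},
}


def lookup_threshold(thresholds, asic, speed, fec, cable, kind):
    """Walk the nested mapping; return None when no threshold is defined."""
    by_fec = thresholds.get(asic, {}).get(speed, {})
    # Assumed precedence: exact FEC type, then the ALL_FEC wildcard.
    by_cable = by_fec.get(fec, by_fec.get("ALL_FEC", {}))
    # Assumed precedence: exact cable type, then the ALL_CABLES wildcard.
    limits = by_cable.get(cable, by_cable.get("ALL_CABLES", {}))
    return limits.get(kind)


print(lookup_threshold(THRESHOLDS, "7nm", "NDR",
                       "STD_RS-FEC_RS_544_514", "DACs", "raw_ber"))
# prints 1e-07
```

A lookup for a combination that is not configured, such as an `HDR` link in this dictionary, returns `None`.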

Upgrading Framework Script

The upgrade.sh script upgrades the COT containers and configuration files, including the COT API itself, while preserving the existing data.

To upgrade the COT:

Download the tar.gz upgrade file from the COT download center.
Extract the upgrade file.
Run the upgrade.sh script located in the extracted folder.

Example:

root@cot-server:/cot/upgrade_example# ./upgrade.sh
Upgrading COT API
Building COT snapshot
Snapshot built successfully. Path: /tmp/cot_snapshot_26-03-23_08-28.tar.gz
Removing cluster-bringup-service
Removing local registry
Removing awx-cluster-bringup
Removing awx-operator
Installing awx-operator
Installing local registry
Installing awx-cluster-bringup
AWX is currently upgrading
Installing cluster-bringup-service
Importing snapshot
Removing snapshot /tmp/cot_snapshot_26-03-23_08-28.tar.gz
Successfully upgraded using /cot/upgrade_example/upgrade_data

COT API

This section details the operations that can be performed once the installation process concludes.

The following code block demonstrates all the available actions:

$ cot [-h] [-v] {install,update,show,uninstall}

Warning

The install and uninstall operations must be utilized via the install.sh and uninstall.sh scripts.

Update

The update command allows updating certain components of the Cluster Bring-up Tool.

$ cot update [-h] --cot_dir <PATH> {playbooks,awx_templates,cot_client}

Warning

The update command relies on the cot_dir argument, which refers to the path of the folder extracted from the provided tar.gz file.

Mandatory arguments:

Arguments	Description
`--cot_dir`	Specify the path of the folder extracted from the new `tar.gz` file. The tool uses the data inside the folder as the new data for the update operation.

Optional arguments:

Arguments	Description
`playbooks`	Update the Ansible playbooks
`awx_templates`	Update the AWX templates (job templates and workflows). This updates the Ansible playbooks as a pre-task.
`cot_client`	Update the COT client (on the `ib_host_manager` specified host)

Show

Usage:

$ cot show [-h] [--awx_info] [--file_server_info] [--api_url]

Options:

Option	Description
`--awx_info`	Get AWX URL and credentials
`--file_server_info`	Get file server URL and files folder
`--api_url`	Get the REST API URL

Export

The export operation allows creating a snapshot of the data within an existing COT environment. This may be used to transport the data between environments.

Usage:

cot export [-h] [--dest_path PATH] [--components {all,playbooks,file_server,database,awx} [{all,playbooks,file_server,database,awx} ...]]

Options:

Option	Description
`--dest_path`	Directory path to save the snapshot. Default: `/tmp`.
`--components`	List of components to export, separated by spaces. Default: `all`.

Example:

root@cot-server:/ # cot export --dest_path /tmp/example/ --components playbooks database

This command builds a snapshot containing the playbooks and the database of the current COT environment. The .tar.gz snapshot file produced is saved to /tmp/example/<snapshot_name>.

Output:

Exporting playbooks
Exporting database
Wrapping
Finished Export. File located at: /tmp/example/cot_playbooks_database_22-03-23_12-22.tar.gz
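
When the export is driven from a script, it can help to validate the component names before invoking the command. The following is a minimal sketch that assembles the argument list from the usage line above; the function name `build_export_command` is hypothetical, and the sketch only builds the command without running it.

```python
# Component names accepted by --components, as listed in the usage line.
VALID_COMPONENTS = ("all", "playbooks", "file_server", "database", "awx")


def build_export_command(dest_path="/tmp", components=("all",)):
    """Return the 'cot export' argument list, rejecting unknown components."""
    unknown = [name for name in components if name not in VALID_COMPONENTS]
    if unknown:
        raise ValueError(f"unsupported components: {unknown}")
    return ["cot", "export", "--dest_path", dest_path,
            "--components", *components]


command = build_export_command("/tmp/example/", ["playbooks", "database"])
print(" ".join(command))
# prints cot export --dest_path /tmp/example/ --components playbooks database
```

The returned list can be passed as-is to `subprocess.run` on a machine where the `cot` command is installed.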

Import

The import operation allows importing data of a given snapshot into an existing COT environment.

Warning

AWX credentials should be updated by the user after importing any snapshot that includes AWX data. Please refer to the "Credentials" section for instructions.

Usage:

cot import [-h] -s PATH [-f] [--merge_file_server_files] [--components {all,playbooks,file_server,database,awx} [{all,playbooks,file_server,database,awx} ...]]

Options:

Option	Description
`--merge_file_server_files`	Adds the file server files from the snapshot to the existing files in the file server of the COT environment. Warning: Without this flag, the files in the file server are overwritten.
`-s`	Path to snapshot file.
`--components`	List of components to import, separated by spaces. Warning: If not provided, the command imports the data of all the components contained in the snapshot.

Example:

root@cot-server:/# cot import -s /tmp/cot_snapshot_22-03-23_12-24.tar.gz --merge_file_server_files --components file_server database

This command imports the file server files and the database content from the snapshot into the COT environment. The file server files from the snapshot are added to the files that already exist in the file server.

Output:

Importing File Server
Importing database
Import finished successfully from snapshot: /tmp/cot_snapshot_22-03-23_12-24.tar.gz

Apply

The apply operation applies a new configuration to an existing COT environment.

Example:

root@cot-server:/# cot apply -f cot_config.yaml

For more details regarding the COT configuration file structure, please refer to section "Configuration File".

Deploying via Docker

This section describes how to deploy the cluster bring-up framework utilizing Docker.

The installation is performed using a single Docker image including COT and all its dependencies.

Loading Docker Image

The Docker image is provided as a tar archive file. Use the docker load command to load the image into the Docker system.

Example:

root@cot-server:/# docker load -i <image_tar_file>

Make sure the image is present when running docker images, under the name cot/bootstrap.

Running Docker Image

The Docker container must have access to the Docker daemon on the host. For this purpose, the Docker socket must be mounted as a volume.

Use the following command to start the container:

root@cot-server:/# docker run -v /var/run/docker.sock:/var/run/docker.sock -v /root/.k3d:/root/.k3d -it cot/bootstrap:3.0.2 bash run.sh <options>

For information regarding the available options, refer to section "Installation Script".
