Software Management

Deploying UFM Telemetry

UFM Telemetry can be deployed in the following modes:

Bare Metal - Bringup Mode
Docker Container Mode
Docker Container Mode - High Availability
Bare Metal Mode
Bare Metal Mode - High Availability

Bare Metal - Bringup Mode

NVIDIA UFM Telemetry can be obtained as a tarball for installation on a Linux machine with all prerequisites installed.

To deploy the UFM Telemetry in Bringup mode, perform the following steps:

Make sure the following prerequisites are installed:
1. Python3
2. Python3-venv
3. Supervisor
Copy the tarball package to the targeted location.

Extract the package.

            tar -xf ufm_telemetry-<version>.tar.gz

Start collection.

            ./bin/run_bringup.sh
 
CollectX: collection_start
 
This collects port counter and cable data every minute for a period of 24 hours, uses HCA mlx5_0, and writes the data to ./collection_data/clx-bringup-X.
 
CollectX: help collection_start
 
Usage:
 
                        options                          defaults      Description
                        -------                          --------      -----------
      collection_start  time|duration=n [s|m|h|d]        24h           Session duration
                        sample_rate=n [s|m|h|d]          60 seconds    Data sample rate
                        guids=[guid_list|guid_file]      None          Target devices guid
                        counter_set=[file.xcset]         None          Counter list to be collected
                        hca=hca_name                     mlx5_0        Device to access the fabric
                        cable|cable_info=[yes|no|once]   yes           Collect cable info
                        nvlink_info=[yes|no]             no            Collect NVLink info
                        disconnected_cable=[yes|no]      no            Collect disconnected cables info
                        reset_counters=t                 false         Reset counters of fabric devices
                        compress_data=[yes|no]           yes           Compress files (if write_files=true)
                        mads_retries=n                   2             Set number of retries for MADs
                        mads_timeout=n (msec)            500           Set timeout for MADs
                        force_hca=t                      f             Avoid HCA state check

Docker Container Mode

NVIDIA UFM Telemetry is packaged as a Docker image that is loaded and deployed on a Linux machine with Docker installed. This section describes how to deploy the UFM Telemetry Docker image on a Linux machine.

To deploy the UFM telemetry, perform the following steps:

Make sure that docker is installed on the Linux machine.

            [root@r-ufm ~]# docker --version

Start the docker service.

            [root@r-ufm ~]# sudo service docker start

Pull the image.

            [root@r-ufm ~]# export image=mellanox/ufm-telemetry:<version>
[root@r-ufm ~]# sudo docker pull $image

Create the default .ini files and place them in the local directory mapped to /config in the container and initialize the container configuration.
```
[root@r-ufm ~]# sudo docker run -v /opt/ufm-telemetry/conf:/config --rm -d $image /get_collectx_configs.sh "sample_rate=300;hca=mlx5_0;cable_info_schedule=1/00:00,3/00:00,5/00:00"
```
Note
This collects port counter data every 5 minutes and uses HCA mlx5_0. It also collects cable info on the 1st, 3rd, and 5th day of the week at midnight, where:
- sample_rate: Frequency of collecting port counters
- hca: Card to use
- cable_info_schedule: Time of collecting cable info data (optional)
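
The configuration string passed to /get_collectx_configs.sh is a semicolon-separated list of key=value pairs. As a minimal sketch, a small shell helper could assemble that string from separate values before it is handed to the container. The function name build_collectx_config and its validation rules are hypothetical and are not part of UFM Telemetry.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with UFM Telemetry): assemble the
# "key=value;key=value" string that is passed to /get_collectx_configs.sh.
build_collectx_config() {
    sample_rate=$1   # seconds between port counter samples
    hca=$2           # HCA used to access the fabric, e.g. mlx5_0
    schedule=$3      # optional cable_info_schedule value

    # Assumed rule: sample_rate must be a positive integer.
    case $sample_rate in
        ''|*[!0-9]*|0) echo "sample_rate must be a positive integer" >&2; return 1 ;;
    esac
    [ -n "$hca" ] || { echo "hca must not be empty" >&2; return 1; }

    config="sample_rate=${sample_rate};hca=${hca}"
    # cable_info_schedule is optional, so append it only when it is given.
    if [ -n "$schedule" ]; then
        config="${config};cable_info_schedule=${schedule}"
    fi
    printf '%s\n' "$config"
}

# Reproduces the string used in the command above.
build_collectx_config 300 mlx5_0 "1/00:00,3/00:00,5/00:00"
```

The resulting string can then be quoted and passed as the last argument of the docker run command, which keeps the semicolons from being interpreted by the host shell.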

Create a container of UFM telemetry.

            [root@r-ufm ~]# sudo docker run --net=host --uts=host --ipc=host \
              --ulimit stack=67108864 --ulimit memlock=-1 \
              --security-opt seccomp=unconfined --cap-add=SYS_ADMIN \
              --device=/dev/infiniband/ \
              -v "/opt/ufm-telemetry/conf:/config" \
              -v "/tmp/data:/data" \
              -v "/opt/ufm/files/licenses:/opt/ufm/files/licenses/" \
              --rm --name ufm-telemetry -d $image

Verify that UFM Telemetry is running.
1. Make sure the UFM Telemetry container is up.
```
[root@r-ufm ~]# docker ps
```
2. If the container is listed, access the shell of the container.
```
[root@r-ufm ~]# docker exec -it ufm-telemetry bash
```
3. Review your configurations under /config/launch_ibdiagnet_config.ini.
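
The first verification step can also be scripted. The sketch below assumes the container was started with --name ufm-telemetry as shown earlier; container_is_listed is a hypothetical helper that checks a list of container names read from standard input, so it can be fed from docker ps --format '{{.Names}}' on a host with a running Docker daemon.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the given name appears, as a whole line,
# in a list of container names read from standard input.
container_is_listed() {
    grep -Fxq -- "$1"
}

# Intended use on the host (requires a running Docker daemon):
#   docker ps --format '{{.Names}}' | container_is_listed ufm-telemetry

# Self-contained demonstration with a canned list of names:
if printf '%s\n' other-container ufm-telemetry | container_is_listed ufm-telemetry; then
    echo "ufm-telemetry is up"
else
    echo "ufm-telemetry is not running"
fi
```

Matching whole lines with fixed strings avoids false positives from containers whose names merely contain "ufm-telemetry".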

View the UFM Telemetry configuration files.

            [root@r-ufm ~]# ls -l /config/
-rw-r--r-- 1 3478 101  396 Apr 15 21:04 clx_config.ini
-rw-r--r-- 1 3478 101 2987 Apr 15 21:04 collectx.ini
-rw-r--r-- 1 3478 101 4257 Apr 15 21:04 launch_ibdiagnet_config.ini
-rw-r--r-- 1 3478 101 1912 Apr 16 12:03 supervisord.conf

To review the execution of the various components, check the log files under /var/log. Each component has a dedicated log file. Running the "ls -l" command displays all files under the folder; the following output shows only the relevant log files (other files have been omitted).

            [root@r-ufm ~]# ls -l /var/log
-rw-r--r-- 1 root root 128393 Apr  3 10:49 launch_cableinfo.log
-rw-r--r-- 1 root root    467 Apr  3 09:35 launch_compression.log
-rw-r--r-- 1 root root 194566 Apr  3 10:49 launch_ibdiagnet.log
-rw-r--r-- 1 root root    798 Apr  3 09:35 launch_retention.log
-rw-r--r-- 1 root root   1729 Apr  3 09:56 supervisord.log
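
For a quick overview of the per-component logs, a short loop can count suspicious lines in each file. This is only a sketch: count_log_errors is a hypothetical helper, and the assumption that problems are logged with the word "error" is not taken from the UFM Telemetry documentation.

```shell
#!/bin/sh
# Hypothetical helper: for every *.log file in a directory, print the file
# name and the number of lines that contain "error" (case-insensitive).
count_log_errors() {
    for log in "$1"/*.log; do
        [ -f "$log" ] || continue   # skip the unexpanded pattern if no logs exist
        printf '%s %s\n' "$(basename "$log")" "$(grep -ci 'error' "$log")"
    done
}

# Intended use inside the container:
#   count_log_errors /var/log
```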

To exit the UFM Telemetry docker context, run "exit" to return to the Linux machine context.

To access the UFM Telemetry CLI, run the following command on the Linux machine:

            [root@r-ufm ~]# docker exec -it ufm-telemetry clxcli

For settings and configuration instructions, see Settings and Configuration.

Docker Container Mode - High Availability

Requirements:

An important requirement for the HA solution is a dedicated partition for DRBD to work with, for example /dev/sda4.
Install pcs and drbd-utils on both servers (using “yum” or “apt-get install”, based on your OS).

Note

On RH/CentOS, run “yum install pcs drbd84-utils kmod-drbd84”.

Procedure:

Load (pull) the latest UFM Telemetry Docker image on both servers.

            docker pull mellanox/ufm-telemetry:latest

Run the Telemetry configuration command on both servers.

            docker run --rm -i --name=config-telemetry \
-v /opt/ufm-telemetry/conf:/config \
-v /etc/systemd/system:/etc/systemd/system \
-v /var/run/docker.sock:/var/run/docker.sock \
mellanox/ufm-telemetry:latest \
/get_collectx_configs.sh \
--gen_service \
--config=ufm_telemetry

Refresh systemd on both servers:

            systemctl daemon-reload

Create the /opt/ufm-telemetry/licenses/ directory on the master server and copy the UFM Telemetry license file there.
Download UFM-HA Package on both servers from this link.
Extract the HA package to /tmp/, and from there, run the installation command on both servers as follows:

Note

In the command below, the disk partition name is assumed to be /dev/sda4.
```
./install -l /opt/ufm-telemetry/ -d /dev/sda4 -p telemetry
```
Run the UFM-HA configuration command ONLY on the master server, as follows:
```
configure_ha_nodes.sh \
  --cluster-password 12345678 \
  --master-ip 192.168.10.1 \
  --standby-ip 192.168.10.2 \
  --virtual-ip 192.168.10.5
```
Note

The cluster-password must be at least 8 characters long.

Note

Replace the values in the above command with your servers' information.
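
Before running configure_ha_nodes.sh, the values can be sanity-checked on the master server. The sketch below encodes the documented rule that the cluster password must be at least 8 characters long, plus a loose IPv4 format check. The helper names valid_cluster_password and valid_ipv4 are hypothetical, and the IP check is an added assumption rather than something the HA package is known to enforce.

```shell
#!/bin/sh
# Hypothetical pre-flight checks for the values passed to configure_ha_nodes.sh.

# The cluster password must be at least 8 characters long.
valid_cluster_password() {
    [ "${#1}" -ge 8 ]
}

# Loose IPv4 check: four dot-separated numbers, each between 0 and 255.
valid_ipv4() {
    case $1 in
        ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;   # empty, bad chars, empty octets
    esac
    old_ifs=$IFS; IFS=.
    set -- $1            # split the address into its octets
    IFS=$old_ifs
    [ "$#" -eq 4 ] || return 1
    for octet in "$@"; do
        [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
    done
}

# Demonstration with the example values from the command above.
password="12345678"
for ip in 192.168.10.1 192.168.10.2 192.168.10.5; do
    valid_ipv4 "$ip" || { echo "invalid IP address: $ip" >&2; exit 1; }
done
valid_cluster_password "$password" || { echo "cluster password is too short" >&2; exit 1; }
echo "HA parameters look valid"
```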

Start UFM Telemetry HA cluster. Run:

            ufm_ha_cluster start

Bare Metal Mode

NVIDIA® UFM® Telemetry can be obtained as a tarball for installation on a Linux machine with all prerequisites installed.

To deploy the UFM Telemetry:

Ensure the following prerequisites are installed:
1. Python3
2. Python3-venv
3. Supervisor
Copy the tarball package to the target location.

Extract the package.

            tar -xf ufm_telemetry-<version>.tar.gz

Initialize and configure.

            ./bin/initialize_telemetry.sh --telemetry-dir /tmp/ufm_telemetry --config "hca=mlx5_0;sample_rate=300;data_dir=/tmp/clx_data;plugin_env_CLX_FILE_WRITE_ENABLED=1"

Note

This collects port counter data every 5 minutes, uses HCA mlx5_0, and writes the data to /tmp/clx_data.

Start data collection.

            supervisord --config /tmp/ufm_telemetry/conf/supervisord.conf

Bare Metal Mode - High Availability

NVIDIA® UFM® Telemetry can be obtained as a tarball for installation on a Linux machine with all prerequisites installed.

To deploy the UFM Telemetry:

Ensure the following prerequisites are installed:
1. Python3
2. Python3-venv
3. Supervisor
Copy the tarball package to the target location.

Extract the package.

            tar -xf ufm_telemetry-<version>.tar.gz

Initialize and configure.

            ./bin/initialize_telemetry.sh --telemetry-dir /tmp/ufm_telemetry --config "hca=mlx5_0;sample_rate=300;data_dir=/tmp/clx_data;plugin_env_CLX_FILE_WRITE_ENABLED=1" --gen_systemd_service

Note

This collects port counter data every 5 minutes, uses HCA mlx5_0, and writes the data to /tmp/clx_data.

Download UFM-HA Package on both servers from this link.
Extract the HA package to /tmp/, and from there, run the installation command on both servers as follows:

Note

In the command below, the disk partition name is assumed to be /dev/sda4.
```
./install -l /opt/ufm-telemetry/ -d /dev/sda4 -p telemetry
```
Run the UFM-HA configuration command ONLY on the master server, as follows:
```
configure_ha_nodes.sh \
  --cluster-password 12345678 \
  --master-ip 192.168.10.1 \
  --standby-ip 192.168.10.2 \
  --virtual-ip 192.168.10.5
```
Note

The cluster-password must be at least 8 characters long.

Note

Replace the values in the above command with your servers' information.

Start UFM Telemetry HA cluster. Run:

            ufm_ha_cluster start

To check the status of your UFM Telemetry HA cluster, run:

            ufm_ha_cluster status

To perform failover, run:

            ufm_ha_cluster failover

To perform takeover, run:

            ufm_ha_cluster takeover

Upgrading UFM Telemetry Software

Upgrading UFM Telemetry requires removing the previous package, pulling the new version of the UFM telemetry package, configuring the telemetry, and starting it from the new installation package.

The upgrade procedure can be done in the following three modes:

Bare Metal - Bringup Mode
Docker Container Mode
Bare Metal Mode

Bare Metal - Bringup Mode

Stop previous collection. Run:

            ./bin/run_bringup.sh
CollectX: collection_stop

Follow the instructions described in Deploying UFM Telemetry - Bare Metal - Bringup Mode with the new UFM Telemetry version.
If needed, apply the previous configuration changes.

Docker Container Mode

Stop the previous ufm-telemetry container.

            [root@r-ufm ~]# docker stop ufm-telemetry

Pull the new UFM Telemetry image.

            [root@r-ufm ~]# export image=mellanox/ufm-telemetry:rhel7.3_x86_64_ofed5.1-2.3.7_release_1.6_latest
[root@r-ufm ~]# docker pull $image

Create a container for new UFM Telemetry.

            [root@r-ufm ~]# docker run --net=host --uts=host --ipc=host \
              --ulimit stack=67108864 --ulimit memlock=-1 \
              --security-opt seccomp=unconfined --cap-add=SYS_ADMIN \
              --device=/dev/infiniband/ -v "/opt/ufm-telemetry/conf:/config" -v "/tmp/data:/data" --rm --name ufm-telemetry -d $image

Configure the UFM Telemetry based on the new configurations.

            [root@r-ufm ~]# docker run -v /opt/ufm-telemetry/conf:/config --rm -d $image /get_collectx_configs.sh "sample_rate=300;hca=mlx5_0;cable_info_schedule=1/00:00,3/00:00,5/00:00"

Bare Metal Mode

Stop previous collection. Run:

            kill $SUPERVISORD_PID # send sigterm to the supervisord proc

Follow the instructions described in Deploying UFM Telemetry - Bare Metal Mode with the new UFM Telemetry version.
If needed, apply the previous configuration changes.
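
The stop step in this procedure relies on a SUPERVISORD_PID variable that is not set automatically. One possible way to obtain it, sketched below, is to read the pidfile setting from the supervisord configuration and then read the PID from that file. The helper name supervisord_pidfile is hypothetical, and whether the generated supervisord.conf actually sets pidfile is an assumption that should be checked on your installation.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the "pidfile" setting from a
# supervisord configuration file (first match only).
supervisord_pidfile() {
    # Keep only the value of lines such as "pidfile=/path" or "pidfile = /path",
    # then strip a trailing inline comment that starts with whitespace and ";".
    sed -n 's/^[[:space:]]*pidfile[[:space:]]*=[[:space:]]*//p' "$1" |
        sed 's/[[:space:]][[:space:]]*;.*$//' | head -n 1
}

# Intended use (configuration path taken from the Bare Metal deployment step):
#   SUPERVISORD_PID=$(cat "$(supervisord_pidfile /tmp/ufm_telemetry/conf/supervisord.conf)")
#   kill "$SUPERVISORD_PID"

# Self-contained demonstration with a temporary configuration file:
conf=$(mktemp)
printf '[supervisord]\npidfile = /tmp/supervisord.pid ; pid file location\n' > "$conf"
supervisord_pidfile "$conf"
rm -f "$conf"
```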
