Software Management
NVIDIA® UFM® Telemetry is available as a tarball that can be deployed on a Linux machine with all prerequisites installed.
To deploy the UFM Telemetry:
Connect to the Linux machine via SSH.
Ensure the following prerequisites are installed:
Python3
Python3-venv
Supervisor
Copy the tarball package to the target location.
Extract the package:
tar -xf collectx-1.8.0-*.tar.gz
Initialize and configure:
./bin/initialize_telemetry.sh --telemetry-dir /tmp/ufm_telemetry --config "hca=mlx5_0;sample_rate=300;arg_12=;data_dir=/tmp/clx_data;plugin_env_CLX_FILE_WRITE_ENABLED=1"
Warning: This collects port counter data every 5 minutes, uses HCA mlx5_0, and writes data to /tmp/clx_data.
Start data collection:
supervisord --config /tmp/ufm_telemetry/conf/supervisord.conf
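Before running the steps above, it can help to confirm that the prerequisites are actually present. The following is a minimal sketch, not part of the UFM Telemetry package: the helper name missing_commands is hypothetical, and the executable names (python3, supervisord, tar) and the module probe for Python3-venv are assumptions that may differ between distributions.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
missing_commands() {
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
        fi
    done
}

# Assumption: the prerequisites expose these executables on this system.
missing=$(
    missing_commands python3 supervisord tar
    # Python3-venv is a Python module, so probe it through the interpreter.
    # Assumption: a working "python3 -m venv" needs both venv and ensurepip.
    if command -v python3 >/dev/null 2>&1 &&
        ! python3 -c 'import venv, ensurepip' >/dev/null 2>&1; then
        echo "python3-venv"
    fi
)

if [ -n "$missing" ]; then
    echo "Missing prerequisites:"
    printf '%s\n' "$missing"
else
    echo "All prerequisites found."
fi
```

The script only reports what it finds; installing anything it lists as missing is left to the package manager of the distribution in use.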
NVIDIA® UFM® Telemetry is packaged as a Docker image that is loaded and deployed on a Linux machine with Docker installed as a prerequisite. This chapter describes how to deploy UFM Telemetry on a Linux machine.
To deploy the UFM telemetry:
Connect to the Linux machine via SSH.
Ensure Docker is installed on the Linux machine. Run:
[root@r-ufm ~]# docker --version
Start the docker service. Run:
[root@r-ufm ~]# sudo service docker start
Pull the image. Run:
[root@r-ufm ~]# export image=mellanox/ufm-telemetry:<version>
[root@r-ufm ~]# sudo docker pull mellanox/ufm-telemetry:<version>
Create the default .ini files and place them in the local directory mapped to /config in the container and initialize the container configuration. Run:
[root@r-ufm ~]# sudo docker run -v /tmp/config:/config --rm -d $image /get_collectx_configs.sh "sample_rate=300;hca=mlx5_0;cable_info_schedule=1/00:00,3/00:00,5/00:00"
This collects port counter data every 5 minutes, uses HCA mlx5_0, and collects cable info on the 1st, 3rd, and 5th day of the week at midnight.
Where:
sample_rate: Interval between port counter collections, in seconds (300 = every 5 minutes)
hca: Card to use
cable_info_schedule: Time of collecting cable info data (optional)
Create a container of UFM telemetry. Run:
[root@r-ufm ~]# sudo docker run --net=host --uts=host --ipc=host \
    --ulimit stack=67108864 --ulimit memlock=-1 \
    --security-opt seccomp=unconfined --cap-add=SYS_ADMIN \
    --device=/dev/infiniband/ -v "/tmp/config:/config" -v "/tmp/data:/data" \
    -v "/opt/ufm/files/licenses:/opt/ufm/files/licenses/" \
    --rm --name ufm-telemetry -d $image
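The configuration string passed to /get_collectx_configs.sh is a semicolon-separated list of key=value pairs. Because ";" is also the shell's command separator, the string has to be quoted as a single argument. The sketch below splits such a string for a quick sanity check before it is passed to the container. The helper names split_config and config_value are hypothetical and are not part of UFM Telemetry.

```shell
#!/bin/sh
# Print one "key=value" pair per line from a semicolon-separated string.
split_config() {
    printf '%s\n' "$1" | tr ';' '\n' | grep '='
}

# Print the value stored under one key, or nothing if the key is absent.
config_value() {
    split_config "$1" | sed -n "s/^$2=//p"
}

# Quoted, so the shell treats the whole string as one argument.
config="sample_rate=300;hca=mlx5_0;cable_info_schedule=1/00:00,3/00:00,5/00:00"

split_config "$config"

# sample_rate is in seconds; convert it to minutes for readability.
rate=$(config_value "$config" sample_rate)
echo "port counters every $((rate / 60)) minutes"
```

With the example string, the last line reports a 5-minute interval, which matches the description of sample_rate=300 above.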
Verify that UFM Telemetry is running:
Ensure the UFM telemetry container is up. Run:
[root@r-ufm ~]# docker ps
If the container name exists, access the shell of the container. Run:
[root@r-ufm ~]# sudo docker exec -it ufm-telemetry bash
Run "ps -fade" and verify that the list of running processes includes agx, clx, supervisord, agx_manager.py, agx_server.py, launch_ibdiagnet.py, launch_retention.py, launch_compression.py, launch_cableinfo.py.
[root@r-ufm workspace]# ps -fade
Review your configurations under "/config/launch_ibdiagnet_config.ini".
To view the UFM telemetry configuration files, run:
[root@r-ufm ~]# ls -l /config/
-rw-r--r-- 1 3478 101  396 Jul 15 21:04 clx_config.ini
-rw-r--r-- 1 3478 101 2987 Jul 15 21:04 collectx.ini
-rw-r--r-- 1 3478 101 4257 Jul 15 21:04 launch_ibdiagnet_config.ini
-rw-r--r-- 1 3478 101 1912 Jul 16 12:03 supervisord.conf
Each component writes its own log file under "/var/log". Review these files to follow the execution of each component:
[root@r-ufm ~]# ls -l /var/log
drwxr-xr-x 2 root root   4096 Aug 25 11:51 agx
drwxr-xr-x 2 root root   4096 Aug 25 11:48 clx
-rw-r--r-- 1 root root  63733 Sep  3 10:49 clx.log
-rw-r--r-- 1 root root  43458 Sep  3 10:49 agx_manager.log
-rw-r--r-- 1 root root 111556 Sep  3 10:49 agx_server.log
-rw-r--r-- 1 root root 128393 Sep  3 10:49 launch_cableinfo.log
-rw-r--r-- 1 root root    467 Sep  3 09:35 launch_compression.log
-rw-r--r-- 1 root root 194566 Sep  3 10:49 launch_ibdiagnet.log
-rw-r--r-- 1 root root    798 Sep  3 09:35 launch_retention.log
-rw-r--r-- 1 root root   1729 Sep  3 09:56 supervisord.log
To exit the UFM Telemetry docker context, run "exit" to return to the Linux machine context.
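The process check in the verification steps above can be scripted instead of read by eye. The following is a rough sketch under stated assumptions: the function name missing_processes is hypothetical, the expected names are copied from the list above, and whole-word matching against the listing is a simple heuristic that can report a false match if a name also appears in an unrelated path.

```shell
#!/bin/sh
# Process names taken from the verification step in this section.
EXPECTED="agx clx supervisord agx_manager.py agx_server.py launch_ibdiagnet.py launch_retention.py launch_compression.py launch_cableinfo.py"

# Read a process listing on stdin and print each expected name that is absent.
missing_processes() {
    listing=$(cat)
    for name in $EXPECTED; do
        # -F: fixed string, -w: whole word, so "agx" does not match "agx_server.py".
        if ! printf '%s\n' "$listing" | grep -qFw -- "$name"; then
            printf '%s\n' "$name"
        fi
    done
}

# Intended use from the Linux machine (not executed here):
#   sudo docker exec ufm-telemetry ps -fade | missing_processes

# Hypothetical two-line listing, used only to demonstrate the function.
printf '%s\n' "root 1 0 supervisord" "root 7 1 clx" | missing_processes
```

An empty result means every expected process name was found in the listing; any printed name is a component to look up in its log file under /var/log.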
To access the UFM Telemetry CLI, run the following on the Linux machine:
[root@r-ufm ~]# docker exec -it ufm-telemetry clxcli
For settings and configuration instructions, see Settings and Configuration.
Upgrading UFM Telemetry requires removing the previous ufm-telemetry container, pulling the new version of the UFM telemetry image, configuring the telemetry, and starting a new container from the new image.
Connect to the Linux machine via SSH.
Stop the previous ufm-telemetry container. Run:
[root@r-ufm ~]# sudo docker stop ufm-telemetry
Pull the new UFM Telemetry image. Run:
[root@r-ufm ~]# export image=mellanox/ufm-telemetry:rhel7.3_x86_64_ofed5.1-2.3.7_release_1.6_latest
[root@r-ufm ~]# sudo docker pull $image
Configure UFM Telemetry based on the new configuration. Run:
[root@r-ufm ~]# sudo docker run -v /tmp/config:/config --rm -d $image /get_collectx_configs.sh "sample_rate=300;hca=mlx5_0;cable_info_schedule=1/00:00,3/00:00,5/00:00"
Create a container for new UFM Telemetry. Run:
[root@r-ufm ~]# sudo docker run --net=host --uts=host --ipc=host \
    --ulimit stack=67108864 --ulimit memlock=-1 \
    --security-opt seccomp=unconfined --cap-add=SYS_ADMIN \
    --device=/dev/infiniband/ -v "/tmp/config:/config" -v "/tmp/data:/data" \
    --rm --name ufm-telemetry -d $image
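The upgrade steps above always run in the same order, so they can be wrapped in a small script. The following is only a sketch: the names run, upgrade_telemetry, and DRY_RUN are hypothetical, the docker commands are copied from the steps above, and the script defaults to a dry run that prints the commands without executing them.

```shell
#!/bin/sh
# Print each command; execute it only when DRY_RUN=0.
# Note: the printed form drops quoting, so it is for reading, not for copy-paste.
DRY_RUN=${DRY_RUN:-1}
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Run the upgrade steps in order. $1 is the image tag, $2 the configuration string.
# The && chain stops at the first command that fails.
upgrade_telemetry() {
    image=$1
    config=$2
    run sudo docker stop ufm-telemetry &&
    run sudo docker pull "$image" &&
    run sudo docker run -v /tmp/config:/config --rm -d "$image" \
        /get_collectx_configs.sh "$config" &&
    run sudo docker run --net=host --uts=host --ipc=host \
        --ulimit stack=67108864 --ulimit memlock=-1 \
        --security-opt seccomp=unconfined --cap-add=SYS_ADMIN \
        --device=/dev/infiniband/ -v /tmp/config:/config -v /tmp/data:/data \
        --rm --name ufm-telemetry -d "$image"
}

# Replace <version> with the tag of the new image before a real run.
upgrade_telemetry "mellanox/ufm-telemetry:<version>" \
    "sample_rate=300;hca=mlx5_0;cable_info_schedule=1/00:00,3/00:00,5/00:00"
```

Setting DRY_RUN=0 in the environment executes the commands for real. Because the chain stops on the first failure, a failed stop (for example, when no ufm-telemetry container is running) prevents the later steps from running.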