Deploying UFM Telemetry Bare Metal
NVIDIA® UFM® Telemetry can be obtained as a tarball for installation on a Linux machine with all prerequisites installed.
To deploy the UFM Telemetry:
- Ensure the following prerequisites are installed:
- Python3
- Python3-venv
- Supervisor
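The prerequisite check can be scripted before you continue. The following is a minimal sketch, not part of the UFM Telemetry package: `missing_commands` is a hypothetical helper, and the executable names `python3` and `supervisord` are assumed to be how the prerequisites appear on PATH. It does not verify that the Python venv module is installed.

```shell
#!/bin/sh
# Hypothetical helper: print every command from the argument list that
# cannot be found on PATH, and return non-zero if at least one is missing.
missing_commands() {
    _missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            _missing=1
        fi
    done
    return "$_missing"
}

# Example call (assumed executable names):
#   missing_commands python3 supervisord || echo "install the commands listed above"
```

An empty output and a zero exit status mean that every listed command was found.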
Copy the tarball package to the target location.
Extract the package.
tar -xf collectx-1.10.0-*.tar.gz
Initialize and configure.
./bin/initialize_telemetry.sh --telemetry-dir /tmp/ufm_telemetry --config "hca=mlx5_0;sample_rate=300;data_dir=/tmp/clx_data;plugin_env_CLX_FILE_WRITE_ENABLED=1"
This collects port counter data every 5 minutes, uses HCA mlx5_0, and writes the data to /tmp/clx_data.
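The value passed to --config is a semicolon-separated list of key=value settings. As an illustration only, the sketch below assembles that string from individual settings; `build_config` is a hypothetical helper, and the assumption that sample_rate is expressed in seconds is inferred from 300 corresponding to 5 minutes.

```shell
#!/bin/sh
# Hypothetical helper: join key=value settings with ';' to form the
# string given to --config.
build_config() {
    config=""
    for pair in "$@"; do
        if [ -z "$config" ]; then
            config="$pair"
        else
            config="$config;$pair"
        fi
    done
    printf '%s\n' "$config"
}

# Assumption: sample_rate is in seconds, so 5 minutes is 5 * 60 = 300.
sample_rate=$((5 * 60))

telemetry_config=$(build_config \
    "hca=mlx5_0" \
    "sample_rate=$sample_rate" \
    "data_dir=/tmp/clx_data" \
    "plugin_env_CLX_FILE_WRITE_ENABLED=1")

echo "$telemetry_config"
```

The resulting string must be quoted when it is passed to initialize_telemetry.sh, because an unquoted semicolon ends the shell command.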
Start data collection.
supervisord --config /tmp/ufm_telemetry/conf/supervisord.conf
Deploying UFM Telemetry Docker
NVIDIA UFM Telemetry is packaged as a docker image that should be loaded and deployed on a Linux machine with docker installed. This section describes how to deploy the UFM Telemetry docker image on a Linux machine.
To deploy the UFM telemetry, perform the following steps:
Make sure that docker is installed on the Linux machine.
[root@r-ufm ~]# docker --version
Start the docker service.
[root@r-ufm ~]# sudo service docker start
Pull the image.
[root@r-ufm ~]# export image=mellanox/ufm-telemetry:<version>
[root@r-ufm ~]# sudo docker pull mellanox/ufm-telemetry:<version>
Create the default .ini files and place them in the local directory mapped to /config in the container and initialize the container configuration.
[root@r-ufm ~]# sudo docker run -v /tmp/config:/config --rm -d $image /get_collectx_configs.sh "sample_rate=300;hca=mlx5_0;cable_info_schedule=1/00:00,3/00:00,5/00:00"
This collects port counter data every 5 minutes and uses HCA mlx5_0. It also collects cable info on the 1st, 3rd, and 5th day of the week at midnight, where:
- sample_rate: Frequency of collecting port counters
- hca: Card to use
- cable_info_schedule: Time of collecting cable info data (optional)
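To make the schedule format concrete, the sketch below splits a cable_info_schedule value into its comma-separated entries and prints the day and time of each. `describe_schedule` is a hypothetical helper written for this illustration, and reading each entry as day-of-week/HH:MM is based only on the example above.

```shell
#!/bin/sh
# Hypothetical helper: print one "day N at HH:MM" line for every
# comma-separated entry of a cable_info_schedule value.
describe_schedule() {
    old_ifs=$IFS
    IFS=,
    for entry in $1; do
        # Text before the first '/' is the day, text after it is the time.
        printf 'day %s at %s\n' "${entry%%/*}" "${entry#*/}"
    done
    IFS=$old_ifs
}

describe_schedule "1/00:00,3/00:00,5/00:00"
```

For the schedule used above, this prints three lines, one each for days 1, 3, and 5 at 00:00.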
Create a container of UFM telemetry.
[root@r-ufm ~]# sudo docker run --net=host --uts=host --ipc=host \
    --ulimit stack=67108864 --ulimit memlock=-1 \
    --security-opt seccomp=unconfined --cap-add=SYS_ADMIN \
    --device=/dev/infiniband/ \
    -v "/tmp/config:/config" -v "/tmp/data:/data" \
    -v "/opt/ufm/files/licenses:/opt/ufm/files/licenses/" \
    --rm --name ufm-telemetry -d $image
- Verify that UFM Telemetry is running.
Make sure the UFM Telemetry container is up.
[root@r-ufm ~]# docker ps
If the container name exists, access the shell of the container.
[root@r-ufm ~]# docker exec -it ufm-telemetry bash
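The check for the container name can also be scripted instead of reading the docker ps output by eye. This is a minimal sketch: `container_is_up` is a hypothetical helper, and it assumes the container was started with --name ufm-telemetry as shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if a running container has exactly
# the given name. `docker ps --format '{{.Names}}'` prints one name per line.
container_is_up() {
    docker ps --format '{{.Names}}' | grep -qxF -- "$1"
}

# Example call:
#   container_is_up ufm-telemetry && docker exec -it ufm-telemetry bash
```

Matching the whole line (grep -x) avoids a false positive from a container whose name merely contains the string ufm-telemetry.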
- Review your configurations under /config/launch_ibdiagnet_config.ini.
View the UFM Telemetry configuration files.
[root@r-ufm ~]# ls -l /config/
-rw-r--r-- 1 3478 101  396 Apr 15 21:04 clx_config.ini
-rw-r--r-- 1 3478 101 2987 Apr 15 21:04 collectx.ini
-rw-r--r-- 1 3478 101 4257 Apr 15 21:04 launch_ibdiagnet_config.ini
-rw-r--r-- 1 3478 101 1912 Apr 16 12:03 supervisord.conf
To watch and review the execution of the various components, check the log files under /var/log. Each component has a dedicated log file. Running the "ls -l" command displays all files in the folder. The following output shows only the relevant log files (other files have been omitted).
[root@r-ufm ~]# ls -l /var/log
-rw-r--r-- 1 root root 128393 Apr 3 10:49 launch_cableinfo.log
-rw-r--r-- 1 root root    467 Apr 3 09:35 launch_compression.log
-rw-r--r-- 1 root root 194566 Apr 3 10:49 launch_ibdiagnet.log
-rw-r--r-- 1 root root    798 Apr 3 09:35 launch_retention.log
-rw-r--r-- 1 root root   1729 Apr 3 09:56 supervisord.log
- To leave the UFM Telemetry container shell and return to the Linux machine, run "exit".
To access the UFM Telemetry CLI, run the following command on the Linux machine:
[root@r-ufm ~]# docker exec -it ufm-telemetry clxcli
- For settings and configuration instructions, see Settings and Configuration.
Deploying UFM Telemetry in Bringup Mode
NVIDIA UFM Telemetry can be obtained as a tarball for installation on a Linux machine with all prerequisites installed.
To deploy the UFM Telemetry in Bringup mode, perform the following steps:
- Make sure the following prerequisites are installed:
- Python3
- Python3-venv
- Supervisor
- Copy the tarball package to the target location.
Extract the package.
tar -xf collectx-1.10.0-*.tar.gz
Start collection.
./bin/run_bringup.sh
CollectX: collection_start
This collects port counter and cable data every minute, uses HCA mlx5_0, and writes data to ./collection_data/clx-bringup-X for a period of 24h.
CollectX: help collection_start
Usage:
options                               defaults
-------                               --------
collection_start
  time|duration=n [s|m|h|d]           24h
  sample_rate=n [s|m|h|d]             60 seconds
  guids=[guid_list|guid_file]         None
  hca=hca_name                        mlx5_0
  cable|cable_info=[yes|no|once]      yes
  reset_counters=t                    false
  mads_retries=n                      2
  mads_timeout=n (msec)               500
  force_hca=t                         f
Upgrading UFM Telemetry Software
Upgrading UFM Telemetry requires removing the previous ufm-telemetry container, pulling the new version of the UFM telemetry image, configuring the telemetry, and starting a new container from the new image.
Stop the previous ufm-telemetry container.
[root@r-ufm ~]# docker stop ufm-telemetry
Pull the new UFM Telemetry image.
[root@r-ufm ~]# export image=mellanox/ufm-telemetry:rhel7.3_x86_64_ofed5.1-2.3.7_release_1.6_latest
[root@r-ufm ~]# docker pull $image
Configure UFM Telemetry based on new configurations.
[root@r-ufm ~]# docker run -v /tmp/config:/config --rm -d $image /get_collectx_configs.sh "sample_rate=300;hca=mlx5_0;cable_info_schedule=1/00:00,3/00:00,5/00:00"
Create a container for new UFM Telemetry.
[root@r-ufm ~]# docker run --net=host --uts=host --ipc=host \
    --ulimit stack=67108864 --ulimit memlock=-1 \
    --security-opt seccomp=unconfined --cap-add=SYS_ADMIN \
    --device=/dev/infiniband/ \
    -v "/tmp/config:/config" -v "/tmp/data:/data" \
    --rm --name ufm-telemetry -d $image
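The four upgrade steps can be collected into one script. The following is a sketch under stated assumptions, not an NVIDIA-provided tool: `run` and `upgrade_telemetry` are hypothetical helpers, the configuration string and volume paths are copied from the steps above, and by default the commands are only printed (dry run) so the sequence can be reviewed before anything is executed.

```shell
#!/bin/sh
# Hypothetical helper: print the command unless DRY_RUN=0 is set,
# in which case the command is executed.
# Note: the dry-run output is not re-quoted, so it is for review only.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: stop the old container, pull the new image,
# regenerate the configuration, and start a new container.
# Stops at the first step that fails.
upgrade_telemetry() {
    image=$1
    run docker stop ufm-telemetry || return 1
    run docker pull "$image" || return 1
    run docker run -v /tmp/config:/config --rm -d "$image" \
        /get_collectx_configs.sh \
        "sample_rate=300;hca=mlx5_0;cable_info_schedule=1/00:00,3/00:00,5/00:00" \
        || return 1
    run docker run --net=host --uts=host --ipc=host \
        --ulimit stack=67108864 --ulimit memlock=-1 \
        --security-opt seccomp=unconfined --cap-add=SYS_ADMIN \
        --device=/dev/infiniband/ \
        -v "/tmp/config:/config" -v "/tmp/data:/data" \
        --rm --name ufm-telemetry -d "$image" || return 1
}

# Example call (dry run by default; set DRY_RUN=0 to execute):
#   upgrade_telemetry "$image"
```

Keeping the dry run as the default is a deliberate choice here, since the first step stops a running telemetry container.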