Software Management

NVIDIA UFM Telemetry Documentation v1.8

NVIDIA® UFM® Telemetry is available as a tarball that can be deployed on a Linux machine with all prerequisites installed.

To deploy the UFM Telemetry:

  1. Connect to the Linux machine via SSH.

  2. Ensure the following prerequisites are installed:

    1. Python3

    2. Python3-venv

    3. Supervisor

  3. Copy the tarball package to the target location.

  4. Extract the package:

    tar -xf collectx-1.8.0-*.tar.gz

  5. Initialize and configure the telemetry:

    ./bin/initialize_telemetry.sh --telemetry-dir /tmp/ufm_telemetry --config "hca=mlx5_0;sample_rate=300;arg_12=;data_dir=/tmp/clx_data;plugin_env_CLX_FILE_WRITE_ENABLED=1"

    Warning

    This configuration collects port counter data every 5 minutes, uses HCA mlx5_0, and writes data to /tmp/clx_data.

  6. Start data collection:

    supervisord --config /tmp/ufm_telemetry/conf/supervisord.conf
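
Before running the initialization script, it can help to confirm that the prerequisites from step 2 are present. The following is a minimal sketch, not part of the UFM Telemetry package: the helper names missing_cmds and preflight are hypothetical, and it assumes the prerequisites expose the binaries python3 and supervisord. The python3-venv check is a heuristic that imports the venv and ensurepip modules.

```shell
#!/bin/sh
# Print each command name from the arguments that is not found on PATH.
missing_cmds() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Hypothetical preflight helper (not shipped with UFM Telemetry).
# Assumes the prerequisites provide the binaries python3 and supervisord.
preflight() {
    missing=$(missing_cmds python3 supervisord)
    if [ -n "$missing" ]; then
        echo "Missing prerequisites:" $missing >&2
        return 1
    fi
    # Heuristic: creating virtual environments needs the venv and ensurepip
    # modules, which some distributions ship in a separate python3-venv package.
    if ! python3 -c 'import venv, ensurepip' >/dev/null 2>&1; then
        echo "Missing prerequisite: python3-venv" >&2
        return 1
    fi
    echo "All prerequisites found"
}
```

Calling preflight on the target machine prints the missing items, if any, and returns a non-zero status so that it can gate the later steps in a script.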

NVIDIA® UFM® Telemetry is also packaged as a Docker image that is loaded and deployed on a Linux machine with Docker installed as a prerequisite. This section describes how to deploy the UFM Telemetry Docker image on a Linux machine.

To deploy the UFM telemetry:

  1. Connect to the Linux machine via SSH.

  2. Ensure that Docker is installed on the Linux machine. Run:

    [root@r-ufm ~]# docker --version

  3. Start the docker service. Run:

    [root@r-ufm ~]# sudo service docker start

  4. Pull the image. Run:

    [root@r-ufm ~]# export image=mellanox/ufm-telemetry:<version>
    [root@r-ufm ~]# sudo docker pull mellanox/ufm-telemetry:<version>

  5. Create the default .ini files, place them in the local directory that is mapped to /config in the container, and initialize the container configuration. Run:

    [root@r-ufm ~]# sudo docker run -v /tmp/config:/config --rm -d $image /get_collectx_configs.sh "sample_rate=300;hca=mlx5_0;cable_info_schedule=1/00:00,3/00:00,5/00:00"

    This collects port counter data every 5 minutes, uses HCA mlx5_0, and collects cable info on the 1st, 3rd, and 5th day of the week at midnight.
    Where:

    • sample_rate: Frequency of collecting port counters

    • hca: Card to use

    • cable_info_schedule: Time of collecting cable info data (optional)

  6. Create a container of UFM telemetry. Run:

    [root@r-ufm ~]# sudo docker run --net=host --uts=host --ipc=host \
        --ulimit stack=67108864 --ulimit memlock=-1 \
        --security-opt seccomp=unconfined --cap-add=SYS_ADMIN \
        --device=/dev/infiniband/ -v "/tmp/config:/config" -v "/tmp/data:/data" \
        -v "/opt/ufm/files/licenses:/opt/ufm/files/licenses/" \
        --rm --name ufm-telemetry -d $image

  7. Verify that UFM Telemetry is running:

    1. Ensure the UFM telemetry container is up. Run:

      [root@r-ufm ~]# docker ps

    2. If the ufm-telemetry container is listed, open a shell in the container. Run:

      [root@r-ufm ~]# sudo docker exec -it ufm-telemetry bash

    3. Run "ps -fade" and verify that the list of running processes includes agx, clx, supervisord, agx_manager.py, agx_server.py, launch_ibdiagnet.py, launch_retention.py, launch_compression.py, launch_cableinfo.py.

      [root@r-ufm workspace]# ps -fade

    4. Review your configurations under "/config/launch_ibdiagnet_config.ini".

  8. View the UFM Telemetry configuration files. Run:

    [root@r-ufm ~]# ls -l /config/
    -rw-r--r-- 1 3478 101 396 Jul 15 21:04 clx_config.ini
    -rw-r--r-- 1 3478 101 2987 Jul 15 21:04 collectx.ini
    -rw-r--r-- 1 3478 101 4257 Jul 15 21:04 launch_ibdiagnet_config.ini
    -rw-r--r-- 1 3478 101 1912 Jul 16 12:03 supervisord.conf

  9. To review the execution of each component, check its log file under "/var/log". Run:

    [root@r-ufm ~]# ls -l /var/log
    drwxr-xr-x 2 root root 4096 Aug 25 11:51 agx
    drwxr-xr-x 2 root root 4096 Aug 25 11:48 clx
    -rw-r--r-- 1 root root 63733 Sep 3 10:49 clx.log
    -rw-r--r-- 1 root root 43458 Sep 3 10:49 agx_manager.log
    -rw-r--r-- 1 root root 111556 Sep 3 10:49 agx_server.log
    -rw-r--r-- 1 root root 128393 Sep 3 10:49 launch_cableinfo.log
    -rw-r--r-- 1 root root 467 Sep 3 09:35 launch_compression.log
    -rw-r--r-- 1 root root 194566 Sep 3 10:49 launch_ibdiagnet.log
    -rw-r--r-- 1 root root 798 Sep 3 09:35 launch_retention.log
    -rw-r--r-- 1 root root 1729 Sep 3 09:56 supervisord.log

  10. To exit the UFM Telemetry docker context, run "exit" to return to the Linux machine context.

  11. To access the UFM Telemetry CLI, run the following on the Linux machine:

    [root@r-ufm ~]# docker exec -it ufm-telemetry clxcli

  12. For settings and configuration instructions, see Settings and Configuration.
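
The process check in step 7 can be scripted rather than read by eye. The sketch below is an illustration only: missing_procs and EXPECTED are hypothetical names, and the match is a whole-word text search over the "ps -fade" listing, so it is a heuristic (a directory named agx or clx in some command line would also count as a match). It relies on the -w option of grep, which is widely available but not required by POSIX.

```shell
#!/bin/sh
# Read a process listing on stdin and print each expected name that is absent.
# -F treats the name as a fixed string; -w requires a whole-word match so that
# "agx" is not satisfied by "agx_manager.py".
missing_procs() {
    listing=$(cat)
    for name in "$@"; do
        printf '%s\n' "$listing" | grep -qwF -- "$name" || printf '%s\n' "$name"
    done
}

# Process names taken from the verification step above.
EXPECTED="agx clx supervisord agx_manager.py agx_server.py launch_ibdiagnet.py launch_retention.py launch_compression.py launch_cableinfo.py"

# Example usage on the Linux machine (EXPECTED is left unquoted on purpose so
# that it splits into one argument per name):
#   sudo docker exec ufm-telemetry ps -fade | missing_procs $EXPECTED
```

An empty result means every expected name appeared in the listing; any printed name points at a component whose log under "/var/log" is worth reviewing.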

Upgrading UFM Telemetry requires removing the previous ufm-telemetry container, pulling the new version of the UFM telemetry image, configuring the telemetry, and starting a new container from the new image.

  1. Connect to the Linux machine via SSH.

  2. Stop the previous ufm-telemetry container. Run:

    [root@r-ufm ~]# sudo docker stop ufm-telemetry

  3. Pull the new UFM Telemetry image. Run:

    [root@r-ufm ~]# export image=mellanox/ufm-telemetry:rhel7.3_x86_64_ofed5.1-2.3.7_release_1.6_latest
    [root@r-ufm ~]# sudo docker pull $image

  4. Configure UFM Telemetry based on the new configuration. Run:

    [root@r-ufm ~]# sudo docker run -v /tmp/config:/config --rm -d $image /get_collectx_configs.sh "sample_rate=300;hca=mlx5_0;cable_info_schedule=1/00:00,3/00:00,5/00:00"

  5. Create a container from the new UFM Telemetry image. Run:

    [root@r-ufm ~]# sudo docker run --net=host --uts=host --ipc=host \
        --ulimit stack=67108864 --ulimit memlock=-1 \
        --security-opt seccomp=unconfined --cap-add=SYS_ADMIN \
        --device=/dev/infiniband/ -v "/tmp/config:/config" -v "/tmp/data:/data" \
        --rm --name ufm-telemetry -d $image
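
The upgrade steps above can be collected into one script. This is a sketch under stated assumptions, not an official upgrade tool: the names run, upgrade_telemetry, and DRY_RUN are hypothetical, the Docker commands and options are copied from the steps above, and the configuration string is the example one used in this section. With DRY_RUN=1 the commands are printed instead of executed, which allows a review before anything on the machine changes.

```shell
#!/bin/sh
# Execute a command, or only print it when DRY_RUN=1.
# Printed arguments are joined with spaces and are not re-quoted.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the upgrade steps; $1 is the new image tag.
# Each step runs only if the previous one succeeded.
upgrade_telemetry() {
    image=$1
    # Single quotes keep the semicolons from being read as command separators.
    config='sample_rate=300;hca=mlx5_0;cable_info_schedule=1/00:00,3/00:00,5/00:00'
    run sudo docker stop ufm-telemetry &&
    run sudo docker pull "$image" &&
    run sudo docker run -v /tmp/config:/config --rm -d "$image" \
        /get_collectx_configs.sh "$config" &&
    run sudo docker run --net=host --uts=host --ipc=host \
        --ulimit stack=67108864 --ulimit memlock=-1 \
        --security-opt seccomp=unconfined --cap-add=SYS_ADMIN \
        --device=/dev/infiniband/ \
        -v /tmp/config:/config -v /tmp/data:/data \
        --rm --name ufm-telemetry -d "$image"
}

# Example usage:
#   DRY_RUN=1 upgrade_telemetry mellanox/ufm-telemetry:<version>
```

Because the steps are chained with &&, the sequence stops at the first failing command, for example when no ufm-telemetry container is running at the time of the stop step.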

© Copyright 2023, NVIDIA. Last updated on Sep 5, 2023.