NVIDIA UFM Telemetry Documentation v1.9

Software Management

NVIDIA® UFM® Telemetry can be obtained as a tarball for installation on a Linux machine with all prerequisites installed.

To deploy the UFM Telemetry:

  1. Ensure the following prerequisites are installed:

    1. Python3

    2. Python3-venv

    3. Supervisor

  2. Copy the tarball package to the target location.

  3. Extract the package.

    tar -xf collectx-1.10.0-*.tar.gz

  4. Initialize and configure.

    ./bin/initialize_telemetry.sh --telemetry-dir /tmp/ufm_telemetry --config "hca=mlx5_0;sample_rate=300;data_dir=/tmp/clx_data;plugin_env_CLX_FILE_WRITE_ENABLED=1"

    Warning

    This collects port counter data every 5 minutes (sample_rate=300), uses HCA mlx5_0, and writes the data to /tmp/clx_data.

  5. Start data collection.

    supervisord --config /tmp/ufm_telemetry/conf/supervisord.conf
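
Before running the steps above, it can help to confirm that the prerequisites from step 1 are present. The following is a minimal sketch and is not part of the UFM Telemetry package: the helper name missing_tools is hypothetical, and it assumes the prerequisites are exposed on PATH as the commands python3 and supervisord. It does not check the Python venv module.

```shell
# Hypothetical helper (not shipped with UFM Telemetry): print every
# command name from the argument list that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
    return 0
}
```

For example, running missing_tools python3 supervisord prints nothing when both commands are available, and prints the name of each command that is missing otherwise.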

NVIDIA UFM Telemetry is also packaged as a Docker image that is loaded and deployed on a Linux machine with Docker installed. This section describes how to deploy the UFM Telemetry Docker image on a Linux machine.

To deploy UFM Telemetry, perform the following steps:

  1. Make sure that Docker is installed on the Linux machine.

    [root@r-ufm ~]# docker --version

  2. Start the Docker service.

    [root@r-ufm ~]# sudo service docker start

  3. Pull the image.

    [root@r-ufm ~]# export image=mellanox/ufm-telemetry:<version>
    [root@r-ufm ~]# sudo docker pull mellanox/ufm-telemetry:<version>

  4. Initialize the container configuration: create the default .ini files and place them in the local directory that is mapped to /config in the container.

    [root@r-ufm ~]# sudo docker run -v /tmp/config:/config --rm -d $image /get_collectx_configs.sh "sample_rate=300;hca=mlx5_0;cable_info_schedule=1/00:00,3/00:00,5/00:00"

    Warning

    This collects port counter data every 5 minutes and uses HCA mlx5_0. It also collects cable info at midnight on the 1st, 3rd, and 5th day of the week. The parameters are:

    • sample_rate: Frequency of collecting port counters

    • hca: Card to use

    • cable_info_schedule: Time of collecting cable info data (optional)

  5. Create the UFM Telemetry container.

    [root@r-ufm ~]# sudo docker run --net=host --uts=host --ipc=host \
        --ulimit stack=67108864 --ulimit memlock=-1 \
        --security-opt seccomp=unconfined --cap-add=SYS_ADMIN \
        --device=/dev/infiniband/ \
        -v "/tmp/config:/config" -v "/tmp/data:/data" \
        -v "/opt/ufm/files/licenses:/opt/ufm/files/licenses/" \
        --rm --name ufm-telemetry -d $image

  6. Verify that UFM Telemetry is running.

    1. Make sure the UFM Telemetry container is up.

      [root@r-ufm ~]# docker ps

    2. If the ufm-telemetry container appears in the output, access the shell of the container.

      [root@r-ufm ~]# docker exec -it ufm-telemetry bash

    3. Review your configurations under /config/launch_ibdiagnet_config.ini.

  7. View the UFM Telemetry configuration files.

    [root@r-ufm ~]# ls -l /config/
    -rw-r--r-- 1 3478 101  396 Apr 15 21:04 clx_config.ini
    -rw-r--r-- 1 3478 101 2987 Apr 15 21:04 collectx.ini
    -rw-r--r-- 1 3478 101 4257 Apr 15 21:04 launch_ibdiagnet_config.ini
    -rw-r--r-- 1 3478 101 1912 Apr 16 12:03 supervisord.conf

  8. To review the execution of the various components, check the log files under /var/log. Each component has a dedicated log file. Running "ls -l" lists all files in that directory; the following output shows only the relevant log files (other files have been omitted).

    [root@r-ufm ~]# ls -l /var/log
    -rw-r--r-- 1 root root 128393 Apr 3 10:49 launch_cableinfo.log
    -rw-r--r-- 1 root root    467 Apr 3 09:35 launch_compression.log
    -rw-r--r-- 1 root root 194566 Apr 3 10:49 launch_ibdiagnet.log
    -rw-r--r-- 1 root root    798 Apr 3 09:35 launch_retention.log
    -rw-r--r-- 1 root root   1729 Apr 3 09:56 supervisord.log

  9. To exit the UFM Telemetry docker context, run "exit" to return to the Linux machine context.

  10. To access the UFM Telemetry CLI, run the following command on the Linux machine:

    [root@r-ufm ~]# docker exec -it ufm-telemetry clxcli

  11. For settings and configuration instructions, see Settings and Configuration.
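
The configuration string passed to /get_collectx_configs.sh in step 4 is a list of key=value pairs separated by semicolons, and cable_info_schedule is a comma-separated list of day/time entries. The sketch below shows one way to assemble and inspect such a string in the shell. The helper names build_config_string and list_schedule_entries are hypothetical and are not part of UFM Telemetry; the day/time reading of each entry follows the description in step 4.

```shell
# Hypothetical helper: join key=value arguments with ';' into one string.
# The subshell keeps the IFS change from leaking into the caller.
build_config_string() {
    (IFS=';'; printf '%s\n' "$*")
}

# Hypothetical helper: print one "day <d> at <hh:mm>" line per entry of a
# cable_info_schedule value such as 1/00:00,3/00:00,5/00:00.
list_schedule_entries() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS='/' read -r day time; do
        printf 'day %s at %s\n' "$day" "$time"
    done
}
```

Because ";" is a command separator in the shell, the assembled string has to be quoted when it is passed on the command line, as in the step 4 command.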

NVIDIA UFM Telemetry can be obtained as a tarball for installation on a Linux machine with all prerequisites installed.

To deploy the UFM Telemetry in Bringup mode, perform the following steps:

  1. Make sure the following prerequisites are installed:

    1. Python3

    2. Python3-venv

    3. Supervisor

  2. Copy the tarball package to the target location.

  3. Extract the package.

    tar -xf collectx-1.10.0-*.tar.gz

  4. Start collection.

    ./bin/run_bringup.sh

    CollectX: collection_start

    This collects port counter and cable data every minute, uses HCA mlx5_0 and writes data to ./collection_data/clx-bringup-X for a period of 24h

    CollectX: help collection_start

    Usage:

    options                                 defaults
    -------                                 --------
    collection_start
      time|duration=n [s|m|h|d]             24h
      sample_rate=n [s|m|h|d]               60 seconds
      guids=[guid_list|guid_file]           None
      hca=hca_name                          mlx5_0
      cable|cable_info=[yes|no|once]        yes
      reset_counters=t                      false
      mads_retries=n                        2
      mads_timeout=n (msec)                 500
      force_hca=t                           f
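
The duration and sample_rate options above take a number with an optional s, m, h, or d suffix. To get a feel for how much data a given setting produces, the following sketch converts such values to seconds and estimates the number of samples in a collection period. Both helper names are hypothetical and not part of CollectX, and treating a bare number as seconds is an assumption based on the "60 seconds" default shown in the usage text.

```shell
# Hypothetical helper: convert a value such as 24h, 5m, 30s, 1d, or a bare
# number (assumed to be seconds) into a number of seconds.
duration_to_seconds() {
    value=$1
    case "$value" in
        *s) number=${value%s}; factor=1 ;;
        *m) number=${value%m}; factor=60 ;;
        *h) number=${value%h}; factor=3600 ;;
        *d) number=${value%d}; factor=86400 ;;
        *)  number=$value;     factor=1 ;;
    esac
    # Reject anything that is not a plain non-negative integer.
    case "$number" in
        ''|*[!0-9]*) echo "invalid duration: $value" >&2; return 1 ;;
    esac
    echo $(( number * factor ))
}

# Hypothetical helper: approximate number of samples taken when collecting
# for duration $1 at sample interval $2.
expected_samples() {
    total=$(duration_to_seconds "$1") || return 1
    step=$(duration_to_seconds "$2") || return 1
    [ "$step" -gt 0 ] || return 1
    echo $(( total / step ))
}
```

With the defaults shown above, expected_samples 24h 60 gives 1440, that is, one sample per minute for 24 hours.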

Upgrading UFM Telemetry requires removing the previous ufm-telemetry container, pulling the new version of the UFM Telemetry image, configuring the telemetry, and starting a new container from the new image.

  1. Stop the previous ufm-telemetry container.

    [root@r-ufm ~]# docker stop ufm-telemetry

  2. Pull the new UFM Telemetry image.

    [root@r-ufm ~]# export image=mellanox/ufm-telemetry:rhel7.3_x86_64_ofed5.1-2.3.7_release_1.6_latest
    [root@r-ufm ~]# docker pull $image

  3. Configure UFM Telemetry based on new configurations.

    [root@r-ufm ~]# docker run -v /tmp/config:/config --rm -d $image /get_collectx_configs.sh "sample_rate=300;hca=mlx5_0;cable_info_schedule=1/00:00,3/00:00,5/00:00"

  4. Create a container from the new UFM Telemetry image.

    [root@r-ufm ~]# docker run --net=host --uts=host --ipc=host \
        --ulimit stack=67108864 --ulimit memlock=-1 \
        --security-opt seccomp=unconfined --cap-add=SYS_ADMIN \
        --device=/dev/infiniband/ \
        -v "/tmp/config:/config" -v "/tmp/data:/data" \
        --rm --name ufm-telemetry -d $image
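
The four upgrade steps can be reviewed as a single sequence before anything is executed. The following is a dry-run sketch only: print_upgrade_plan is a hypothetical helper, not an NVIDIA tool, and it simply prints the commands from the steps above for a given image reference, reusing the example configuration string from step 3.

```shell
# Hypothetical dry-run helper: print the upgrade commands for the given
# image reference, in order, without executing any of them.
print_upgrade_plan() {
    image=$1
    # Example configuration string taken from the upgrade steps.
    config='sample_rate=300;hca=mlx5_0;cable_info_schedule=1/00:00,3/00:00,5/00:00'
    cat <<EOF
docker stop ufm-telemetry
docker pull $image
docker run -v /tmp/config:/config --rm -d $image /get_collectx_configs.sh "$config"
docker run --net=host --uts=host --ipc=host --ulimit stack=67108864 --ulimit memlock=-1 --security-opt seccomp=unconfined --cap-add=SYS_ADMIN --device=/dev/infiniband/ -v "/tmp/config:/config" -v "/tmp/data:/data" --rm --name ufm-telemetry -d $image
EOF
}
```

Because the container is started with --rm, stopping it also removes it, so the plan contains no separate removal command.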

© Copyright 2023, NVIDIA. Last updated on Sep 5, 2023.