NVIDIA Telemetry Agent User Manual v2.7.20

Deployment

The Telemetry Agent is packaged as a Docker image that must be loaded and deployed on a supported Mellanox Spectrum® Ethernet switch. This section describes how to deploy the Docker image on the switch.

Warning

The NEO application features automated deployment of the Telemetry Agent on Mellanox Spectrum switch systems. For more information, please refer to the NEO Telemetry Agent Appendix in the NEO User Manual.

Warning

Before deploying the Telemetry Agent on the switch, make sure that the switch is Docker-enabled. For example, on Mellanox Onyx, you can check whether Docker is enabled with the "show docker" command and, if needed, enable it with the "no docker shutdown" command.

To deploy the Docker image, perform the following steps:

  1. Download the NEO Telemetry Agent from the Mellanox customer portal and copy it to a remote server.

  2. Connect to the Mellanox switch via SSH.

  3. Enter the switch CLI mode:

    switch > enable
    switch # configure terminal

  4. Copy the Docker image from the remote server, for example:

    switch (config) # image fetch scp://admin:qwerty@10.20.30.100/docker_files/docker_images/telemetry-agent_<version>.img.gz

  5. Make sure that the Docker service is running.

    switch (config) # no docker shutdown

  6. Load the image, using the docker load <image_name> command:

    switch (config) # docker load telemetry-agent_<version>.img.gz

  7. Once the image is loaded, deploy it using the following command:

    switch (config) # docker start mellanox/telemetry-agent <version> <container name> now-and-init cpus 0.5 memory 300 privileged network sdk

  8. Run the configuration write command:

    switch (config) # configuration write

  9. The Telemetry Agent must establish trust with the switch to allow telemetry on LAGs and MLAGs. Run:

    switch (config) # docker exec <docker-instance-name> "/opt/telemetry/utils/create_trust.sh"

    1. Copy the generated public key (the line starting with "ssh-rsa") printed on the screen:

      switch (config) # docker exec <docker-instance-name> /opt/telemetry/utils/create_trust.sh
      Running exec_name: [/opt/telemetry/utils/create_trust.sh]
      Generating public/private rsa key pair.
      Created directory '/root/.ssh'.
      Your identification has been saved in /root/.ssh/id_rsa.
      Your public key has been saved in /root/.ssh/id_rsa.pub.
      The key fingerprint is:
      root@switch
      The key's randomart image is:
      +---[RSA 2048]-----+
      |                  |
      |                  |
      |                  |
      |                  |
      |                  |
      |                  |
      |                  |
      |                  |
      |                  |
      |                  |
      +------------------+
      ssh-rsa Some1Random2Generated3Key4With5Random6Chars7 root@switch

    2. Run the following command, pasting the copied key between the quotes:

      switch (config) # ssh client user admin authorized-key sshv2 "ssh-rsa Some1Random2Generated3Key4With5Random6Chars7 root@switch"

  10. The Telemetry Agent is waiting for Mellanox SDK installation. Install it, using the following command from the switch prompt:

    switch (config) # docker
    switch (config) # copy-sdk <container-name> to /

  11. Once the Mellanox SDK is installed, the Telemetry Agent service should start automatically in the Docker container. To verify that the Telemetry Agent is running, do the following:

    • Make sure that the container has been loaded and started: look for the name of your newly created container in the output of the "show docker ps" command. If the container name is listed, run:

      switch (config) # docker exec <container-name> "/bin/bash"

    • This opens a standard Linux prompt inside the container. Run:

      /etc/init.d/telemetryd status

      If the service is running, the output should look like the following:

      # /etc/init.d/telemetryd status
      Telemetry agent status: Telemetry agent is running

    • To leave the Telemetry Agent container and return to the switch CLI, run the "exit" command.

  12. Run the initial telemetry configuration:

    switch (config) # docker exec <container-name> "bash /opt/telemetry/utils/telemetry_agent_init.sh 127.0.0.1 7654"

  13. Save the configuration.

    switch (config) # configuration write
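Step 9 above relies on copying the public key by hand from the create_trust.sh output into the next CLI command. The Python sketch below shows one way that hand-off might be scripted. The function names are hypothetical and are not part of the Telemetry Agent; the only assumption about the script output is that the key appears on a line starting with "ssh-rsa", as in the sample output of step 9.

```python
def extract_public_key(script_output: str) -> str:
    """Return the last line of the output that looks like an ssh-rsa public key."""
    for line in reversed(script_output.splitlines()):
        fields = line.split()
        # A public key line has at least a key type and a key body.
        if len(fields) >= 2 and fields[0] == "ssh-rsa":
            return " ".join(fields)
    raise ValueError("no ssh-rsa public key found in the output")


def authorized_key_command(script_output: str, user: str = "admin") -> str:
    """Build the switch CLI command that authorizes the extracted key.

    The command form is taken from step 9; the default user 'admin' mirrors
    the example there and may differ on your system.
    """
    key = extract_public_key(script_output)
    return f'ssh client user {user} authorized-key sshv2 "{key}"'


if __name__ == "__main__":
    # Shortened, made-up output for illustration only.
    sample = (
        "Generating public/private rsa key pair.\n"
        "Your public key has been saved in /root/.ssh/id_rsa.pub.\n"
        "ssh-rsa AAAAexamplekey root@switch\n"
    )
    print(authorized_key_command(sample))
```

The resulting string would still have to be entered at the "switch (config) #" prompt, either manually or through whatever SSH automation is already in use.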

For initial settings and configuration instructions, see Initial Settings and Configuration.

At this point the Telemetry Agent is running and waiting for a valid configuration in its config file. To set the initial configuration, open a shell in the Telemetry Agent Docker container using the following command (here the container is named neo-agent):

docker exec neo-agent /bin/bash

The path to the config file is: /opt/telemetry/conf/tm.ini.

The default structure of the tm.ini file is as follows:

[Controller]
controller_ip=127.0.0.1
controller_port=7654
enable_telemetry=false
min_polling_interval_in_ms=100
error_ack_check_interval=60
system_error_ack_timeout=60
session_error_ack_timeout=30
update_active_ports_interval=300
calc_rates=false
# max wjh packets buffer - min value: 1 max value: max_buffer_packets*max_messages_per_interval<5000
max_packets_buffer=1250
max_messages_per_interval=4

[Logging]
log_level=INFO

[OS]
switch_os=Onyx
sub_type=Ethernet
sample_down_ports=false
enable_lag_mlag_discovery=true

[Collector]
# clean json message and remove empty fields
clean_json=true
# counter chunk to limit interface counters per message
counters_chunk=false
# size of interfaces to send per message
counters_chunk_size=64
# connection timeout in seconds - min value: 1
connection_timeout=3
# max collector messages in queue - min value: 2, max value: 20
max_collector_messages=10

The configuration keys are listed in the following table:

| Section    | Key                       | Type    | Default Value | Optional Values  | Description |
|------------|---------------------------|---------|---------------|------------------|-------------|
| Controller | controller_ip             | String  | 127.0.0.1     | IPs              | Controller IP |
| Controller | controller_port           | Int     | 7654          | ports            | Controller port |
| Controller | enable_telemetry          | Boolean | false         | true/false       | Must be set to true for telemetry to start |
| Controller | calc_rates                | Boolean | false         | true/false       | Return telemetry counter as rates according to the interval which exists for the counter session instead of raw data. |
| Controller | max_packets_buffer        | Int     | 1250          | 1-5000           | Maximum WJH packets buffer. The configured value is the amount of WJH events sent per interval (i.e. max_packets_buffer*max_messages_per_interval). |
| Controller | max_messages_per_interval | Int     | 4             | -                | Max WJH messages sent per interval |
| Logging    | log_level                 | String  | INFO          | INFO/DEBUG/ERROR | You can view the logs in /opt/telemetry/log/telemetry.log |
| OS         | enable_lag_mlag_discovery | Boolean | true          | true/false       | Enable LAG/MLAG discovery using NOS |
| Collector  | clean_json                | String  | true          | true/false       | Clean JSON message and remove empty fields (can decrease performance) |
| Collector  | counters_chunk            | String  | false         | true/false       | Counter chunk to limit interface counters per message |
| Collector  | counters_chunk_size       | Int     | 64            | -                | Size of interfaces to send per message |
| Collector  | connection_timeout        | Int     | 3             | ≥1               | Connection timeout in seconds |
| Collector  | max_collector_messages    | Int     | 10            | 2-20             | Maximum collector messages in queue |
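The limits listed in the table can be checked before the agent is restarted with an edited file. The following Python sketch, using the standard configparser module, is one possible way to do so; it is not a tool shipped with the Telemetry Agent, and the function name check_tm_ini is hypothetical. Two assumptions are made: the WJH product limit is treated as inclusive (≤ 5000) because the documented defaults, 1250 × 4, equal exactly 5000, and the controller port is checked against the general TCP range 1-65535, which the manual does not state explicitly.

```python
import configparser

# Abbreviated sample containing only the keys checked below (manual defaults).
SAMPLE_TM_INI = """
[Controller]
controller_ip=127.0.0.1
controller_port=7654
enable_telemetry=false
max_packets_buffer=1250
max_messages_per_interval=4

[Logging]
log_level=INFO

[Collector]
connection_timeout=3
max_collector_messages=10
"""


def check_tm_ini(text: str) -> list:
    """Return a list of problem descriptions; an empty list means none found."""
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    problems = []

    controller = cfg["Controller"]
    port = controller.getint("controller_port")
    if not 1 <= port <= 65535:  # assumption: ordinary TCP port range
        problems.append(f"controller_port {port} is outside 1-65535")

    wjh_events = (controller.getint("max_packets_buffer")
                  * controller.getint("max_messages_per_interval"))
    if not 1 <= wjh_events <= 5000:  # assumption: upper bound is inclusive
        problems.append(f"WJH events per interval {wjh_events} is outside 1-5000")

    level = cfg["Logging"]["log_level"]
    if level not in {"INFO", "DEBUG", "ERROR"}:
        problems.append(f"log_level {level!r} is not INFO/DEBUG/ERROR")

    collector = cfg["Collector"]
    if collector.getint("connection_timeout") < 1:
        problems.append("connection_timeout must be at least 1 second")
    if not 2 <= collector.getint("max_collector_messages") <= 20:
        problems.append("max_collector_messages must be between 2 and 20")

    return problems


if __name__ == "__main__":
    print(check_tm_ini(SAMPLE_TM_INI))
```

In practice the text would be read from /opt/telemetry/conf/tm.ini inside the container rather than from the embedded sample.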

  1. For the Telemetry Agent to start connection attempts to the controller, controller_ip and controller_port must be set to the correct values for the controller, and the enable_telemetry parameter must be set to "true". This can be done with the telemetry configuration script located in the /opt/telemetry/utils directory of the container:

    /opt/telemetry/utils/telemetry_agent_init.sh <controller-ip> <controller-port>

  2. The Telemetry Agent then tries to establish a connection with the controller.
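The three settings named in step 1 could also be applied to the config file directly. The Python sketch below illustrates the equivalent edit with the standard configparser module. It is not the implementation of telemetry_agent_init.sh, whose internals are not described here; the function name set_controller and the address 192.0.2.10 are placeholders chosen for illustration.

```python
import configparser
import io


def set_controller(ini_text: str, controller_ip: str, controller_port: int) -> str:
    """Return ini_text with the controller address set and telemetry enabled."""
    cfg = configparser.ConfigParser()
    cfg.optionxform = str  # keep key names exactly as written in the file
    cfg.read_string(ini_text)

    controller = cfg["Controller"]
    controller["controller_ip"] = controller_ip
    controller["controller_port"] = str(controller_port)
    controller["enable_telemetry"] = "true"

    out = io.StringIO()
    # Write "key=value" without spaces, matching the style of tm.ini.
    cfg.write(out, space_around_delimiters=False)
    return out.getvalue()


if __name__ == "__main__":
    original = (
        "[Controller]\n"
        "controller_ip=127.0.0.1\n"
        "controller_port=7654\n"
        "enable_telemetry=false\n"
    )
    print(set_controller(original, "192.0.2.10", 7654))
```

Note that configparser discards comment lines when it writes a file back, so the explanatory comments in the default tm.ini would be lost with this approach; the provided script remains the documented way to make the change.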

To upgrade an existing Telemetry Agent installation, remove the old container and image, and then install the new version.

  1. Find the container name. Run the "show docker ps" command and note the name of the container whose image is the Telemetry Agent.

    switch (config) # show docker ps
    -------------------------------------------------------------------------------------------
    Container          Image:Version              Created             Status
    -------------------------------------------------------------------------------------------
    neo-agent          telemetry-agent:2.4.9-     27 minutes ago      Up 27 minutes

  2. Stop the docker container and remove the image:

    docker no start <container-name>
    docker remove image <image-name>

  3. Next, follow the steps in the "Deploying the Docker Image on Mellanox Onyx-Based Systems" section to install the new version of the Telemetry Agent.
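The first two upgrade steps can be sketched in Python as follows: locate the Telemetry Agent container in the "show docker ps" output and build the two removal commands. The function names are hypothetical, and the parsing assumes the column layout shown in the sample output of step 1 (container name first, Image:Version second); other Onyx releases may format the table differently.

```python
def find_agent_container(ps_output: str, image_hint: str = "telemetry-agent"):
    """Return (container_name, image_name) for the first matching row, or None."""
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip blank lines, dashed separators and the header row.
        if len(fields) < 2 or fields[0].startswith("-") or fields[0] == "Container":
            continue
        container, image_version = fields[0], fields[1]
        if image_hint in image_version:
            image_name = image_version.split(":", 1)[0]
            return container, image_name
    return None


def removal_commands(ps_output: str) -> list:
    """Build the CLI commands from step 2 for the container that was found."""
    found = find_agent_container(ps_output)
    if found is None:
        return []
    container, image_name = found
    return [
        f"docker no start {container}",
        f"docker remove image {image_name}",
    ]


if __name__ == "__main__":
    sample = (
        "Container   Image:Version            Created          Status\n"
        "neo-agent   telemetry-agent:2.4.9-   27 minutes ago   Up 27 minutes\n"
    )
    for command in removal_commands(sample):
        print(command)
```

The commands are only generated as strings here; review them before running them on the switch, since removing the wrong container or image cannot be undone from this script.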

© Copyright 2023, NVIDIA. Last updated on Nov 16, 2023.