NVIDIA MELLANOX TELEMETRY AGENT USER MANUAL V2.7.10

WJH Streaming and Integration with Telegraf, InfluxDB and Grafana (TIG) Stack

The Telegraf, InfluxDB and Grafana (TIG) stack is composed of three components used for viewing and analyzing data. The solution provided below is based on all TIG components installed in a single container.

TIG is constructed of the following components:

  • Telegraf: The tool that collects the data from the input in a specific format and forwards it to InfluxDB

  • InfluxDB: The database where the data is stored (e.g. the WJH events)

  • Grafana: The visualization dashboard that presents the data received from InfluxDB in a graphical manner

The diagram below presents a typical flow for Grafana for What Just Happened® (WJH): the Telemetry Agent collects the WJH data from the switch and sends it to the Grafana for WJH container, where the data is visualized by the Grafana instance bundled inside that container.

telemetry.png
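
For illustration only, a single WJH drop event delivered to Telegraf in InfluxDB Line Protocol (the format selected later in this procedure) might look similar to the line below. The measurement, tag and field names here are purely illustrative and do not represent the agent's actual schema:

    wjh_drop,switch=switch1,ingress_port=Eth1/1,severity=Error drop_reason="Ingress VLAN filter",count=1i 1668672000000000000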

Running the Grafana for WJH Container

Deployment of the Grafana for WJH container should be performed on a Linux host or a VM.

Prerequisites:

  • 32 GB RAM

  • 16 CPUs

To run the Grafana for WJH container:

  1. Install docker CE on CentOS 7.X:

    yum install -y yum-utils device-mapper-persistent-data lvm2
    yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
    yum -y install --setopt=obsoletes=0 docker-ce docker-ce-cli containerd.io

  2. Start docker.

    service docker start

  3. Set docker service to start when the system boots.

    systemctl enable docker
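
    Optionally, you can verify that the Docker service is active and enabled to start on boot (standard systemd commands, shown here only as a sanity check):

    systemctl is-enabled docker
    systemctl status docker --no-pager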

  4. Pull the WJHGraf docker image.

    • If an internet connection is available from the host and you are pulling the image from Docker Hub, run the following command:

      docker pull mellanox/wjhgraf

      Warning

      A direct download link for the image is also available on Docker Hub.

    • If you do not have internet access and cannot pull the image from Docker Hub, use the following procedure to load the image on your host (assuming you have already downloaded it locally).

      cp <location>/wjhgraph.img.gz /tmp; docker load -i /tmp/wjhgraph.img.gz
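
    Whichever method you used, you can optionally confirm that the image is now present on the host (the WJHGraf image should appear in the listing):

      docker images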

  5. Start your container, binding the external ports 3000 and 8093.

    docker run -dit --name wjhgraf --restart unless-stopped -p 3000:3000 -p 8093:8093 mellanox/wjhgraf
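
    As an optional sanity check (suggested commands only, not part of the official procedure), verify that the container is up and that Grafana answers on port 3000:

    docker ps --filter name=wjhgraf
    curl -sI http://localhost:3000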

The Mellanox Telemetry Agent runs in a container on the switch and streams a configurable list of counters, parameters and databases toward an external collector for processing, analysis and presentation.

Warning

Before following the procedure below, please make sure to set the date and time on the switch to be identical to the date and time on the TIG server in order for streaming to work properly. For more information on how to do that, please see HowTo Enable NTP on Mellanox Switches.
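
As a quick check, you can compare the clocks on both sides, for example (the exact switch CLI command may vary with the switch OS version):

    # On the TIG server:
    date
    # On the switch CLI:
    switch1 [standalone: master] (config) # show clock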

To deploy the Mellanox Telemetry Agent:

  1. Enable the Docker capability.

    switch1 [standalone: master] (config) # no docker shutdown

  2. If you have a previous version of Mellanox Telemetry Agent installed, make sure to remove it prior to enabling your new Mellanox Telemetry Agent:

    1. Stop telemetry agent container:

      switch1 [standalone: master] (config) # docker no start telemetry-agent

    2. Remove telemetry agent image:

      switch1 [standalone: master] (config) # docker remove image mellanox/telemetry-agent <version>

  3. Disable WJH on the switch.

    switch1 [standalone: master] (config) # no what-just-happened auto-export all enable
    switch1 [standalone: master] (config) # no what-just-happened all enable

  4. Pull the Mellanox Telemetry Agent docker image.

    • If an internet connection is available from the switch and the image can be pulled from Docker Hub:

      switch1 [standalone: master] (config) # ntp server [NTP_SERVER]
      switch1 [standalone: master] (config) # ntp enable
      switch1 [standalone: master] (config) # docker pull mellanox/telemetry-agent

    • If an internet connection is unavailable and the image cannot be pulled from Docker Hub:

      switch1 [standalone: master] (config) # image fetch scp://username:password@<IP address>/<remote path>
      (For example: image fetch scp://root:123456@10.20.10.105/tmp/telemetry-agent_2.4.0-9.img.gz)
      switch1 [standalone: master] (config) # docker load <telemetry_agent_image>
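
    Either way, you can optionally confirm that the image is now available on the switch before starting it (assuming the Onyx docker command set; output will vary):

      switch1 [standalone: master] (config) # show docker images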

  5. Start the container.

    • If the image was pulled from Docker Hub:

      switch1 [standalone: master] (config) # docker start mellanox/telemetry-agent latest telemetry-agent now-and-init cpus 0.5 memory 300 privileged network sdk

    • If the image was not pulled from Docker Hub:

      switch1 [standalone: master] (config) # docker start telemetry-agent <telemetry_agent_version> <container-name> now-and-init cpus 0.5 memory 300 privileged network sdk
      (Example of telemetry-agent version: 2.4.0-9)

      Warning

      To prevent interference with the switch operations, the CPU must be limited to half a core (0.5) and the memory consumption of the telemetry agent container must be limited to 300MB.
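
    In either case, you can optionally confirm that the telemetry agent container is running (assuming the Onyx docker command set):

      switch1 [standalone: master] (config) # show docker containers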

  6. The telemetry agent must create trust with the switch in order to allow telemetry on LAGs and MLAGs. Run:

    switch (config) # docker exec [docker-instance-name] "/opt/telemetry/utils/create_trust.sh"

    1. Copy the key generated and printed on your screen:

      switch (config) # docker exec neo-agent /opt/telemetry/utils/create_trust.sh
      Running exec_name: [/opt/telemetry/utils/create_trust.sh]

      Generating public/private rsa key pair.
      Created directory '/root/.ssh'.
      Your identification has been saved in /root/.ssh/id_rsa.
      Your public key has been saved in /root/.ssh/id_rsa.pub.
      The key fingerprint is:
      root@switch
      The key's randomart image is:
      +---[RSA 2048]----+
      |                 |
      |                 |
      |                 |
      |                 |
      |                 |
      |                 |
      |                 |
      |                 |
      |                 |
      +-----------------+
      ssh-rsa Some1Random2Genraced3Key4Wich5Random6Chars7 root@switch

    2. Run the following command:

      switch (config) # ssh client user admin authorized-key sshv2 "ssh-rsa Some1Random2Genraced3Key4Wich5Random6Chars7 root@switch"

  7. Copy the SDK package from the switch to the container.

    switch1 [standalone: master] (config) # docker copy-sdk <container-name> to /

  8. Run initial telemetry configuration:

    switch (config) # docker exec <container-name> "bash /opt/telemetry/utils/telemetry_agent_init.sh 127.0.0.1 7654"

  9. Save the configuration.

    switch1 [standalone: master] (config) # configuration write

Connect the telemetry agent to the Grafana collector and start it. To configure the agent to send data to WJHGraf, please refer to Controlling Telemetry Agent > "WJH sessions" and set the parameters for the collector as follows:

  • Destination IP: <your_WJHGraf_IP>

  • Destination port: <8093>

  • Format: <InfluxDB_Line_Protocol>

root@neo-switch:/# /opt/telemetry/session-controller
Connecting to telemetry agent...

Got new connection from: 10.209.37.251:59636

At any point you can type 'return' to go back to main menu or type 'quit' to exit

Telemetry Actions
1. Create session
2. Delete session
3. Delete all sessions
4. Status
Please select action: 1

Telemetry Sessions
1. WJH - Samples the dropped packets buffer
2. Interface counters - Samples interface counters
3. Threshold events - Events generated every time a defined threshold is crossed
4. Histograms - Samples the buffer histograms
Please select session type: 1
Subscribing session WJH...

Filter Settings
1. Default - All events except Forwarding with Notice severity
2. Custom - Custom events filtering for each WJH category
Please select option: 1
Reading collector params...
Please enter destination ip: <your_WJHGraf_IP>
Please enter destination port (default 5123): 8093

Formats
1. JSON
2. Influx DB Line Protocol
3. Protocol Buffers
4. gRPC
Please select format or enter to continue (default Influx DB Line Protocol):
Please enter protocol (TCP/UDP) (default TCP):
Enter "yes" to add new collector or press enter to continue:
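
Before creating the session, you may want to confirm that the collector port is reachable from the environment running the session controller. A minimal check, assuming the nc (netcat) utility is available there (replace <your_WJHGraf_IP> with your own value):

    root@neo-switch:/# nc -vz <your_WJHGraf_IP> 8093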

To view Grafana for WJH:

  1. Browse to:

    http://<WJHGraf_IP>:3000

  2. Log in using admin/admin as your username and password.

WJH visualization consists of the following dashboards:

  • Debug dashboard

  • Monitoring dashboard

  • Layer-1 dashboard

Debug Dashboard

The Debug dashboard allows users to receive detailed information about all the recorded "What Just Happened?" issues. It consists of 3 main components (numbered in the screenshot below):

  1. The total count of the WJH events for the selected time range and filters.

  2. An activity graph, showing the recorded WJH events over time.

  3. A table containing the information about WJH events, such as switch IP, port, drop category and reason, and various packet fields.

    Dashboard_Debug.png

The following actions are available to navigate the data:

  • Select or enter the data into the various filters on the top area of the dashboard.
    This will filter the data and redraw the visualizations accordingly.

  • Change the displayed time range using the time range selector on the upper right corner of the screen.
    This will filter the data for the selected time range and refresh the visualizations accordingly.

  • Zoom into a specific time range by selecting an area on the activity graph.
    This will filter the data for the selected time range and refresh the visualizations accordingly.

  • Sort the data in the table by clicking any of the table columns (clicking a column twice will toggle the ordering between ascending and descending)

  • Navigate the table pages by clicking the number of the page at the bottom of the table

Monitoring Dashboard

The Monitoring dashboard provides a real-time pane for monitoring WJH activity in the network. The dashboard consists of the following main components (numbered in the screenshot below):

  1. An activity graph, showing the recorded WJH events over time

  2. A pie chart showing Severity distribution for the selected time range and filters

  3. A pie chart showing Drop Category distribution for the selected time range and filters

  4. A counter of WJH events showing the total count of WJH events for the selected time range and filters

  5. A pie chart showing Drop Reason distribution for the selected time range and filters

  6. A pie chart showing switch distribution for the selected time range and filters (which switches have WJH events and what their relative share is)

  7. A pie chart showing Packet Type distribution for the selected time range and filters (e.g. Ethernet, IP, Transport, VXLAN)

    Dashboard_Monitoring.png

The following actions are available to navigate the data:

  • Select or enter the data into the various filters on the top area of the dashboard
    This will filter the data and redraw the visualizations accordingly

  • Change the displayed time range using the time range selector on the upper right corner of the screen
    This will filter the data for the selected time range and refresh the visualizations accordingly

  • Zoom into a specific time range by selecting an area on the activity graph
    This will filter the data for the selected time range and refresh the visualizations accordingly

Layer-1 Dashboard

The Layer-1 dashboard allows users to receive detailed information about all the recorded Layer-1 events. The dashboard consists of the following main components (numbered on the screenshot below):

  1. (Dropped Packets due to Layer 1 Errors) An activity graph, showing the recorded Layer-1 events over time

  2. A pie chart showing Layer-1 event distribution for the selected time range and filters

  3. A counter of Layer-1 events which shows the total count of events for the selected time range and filters

  4. A table containing the information about the Layer-1 events, such as switch IP, port, drop reason, state, recommended action, and various packet fields

    Dashboard_Layer-1.png

The following actions are available to navigate the data:

  • Select or enter the data into the various filters on the top area of the dashboard
    This will filter the data and redraw the visualizations accordingly

  • Change the displayed time range using the time range selector on the upper right corner of the screen
    This will filter the data for the selected time range and refresh the visualizations accordingly

  • Zoom into a specific time range by selecting an area on the activity graph
    This will filter the data for the selected time range and refresh the visualizations accordingly

  • Sort the data in the table by clicking any of the table columns (clicking a column twice will toggle the ordering between ascending and descending).

  • Navigate the table pages by clicking the number of the page at the bottom of the table.

© Copyright 2023, NVIDIA. Last updated on Nov 17, 2023.