WJH Streaming and Integration with Telegraf, InfluxDB and Grafana (TIG) Stack
The Telegraf, InfluxDB and Grafana (TIG) stack consists of three components and is used for viewing and analyzing data. The solution described below is based on TIG components installed in a single container.
TIG is constructed of the following components:
Telegraf: The tool that collects the data from the input with a specific format and forwards it to the InfluxDB
InfluxDB: The database where the data is stored (e.g. the WJH events)
Grafana: The visualization dashboard that presents the received data from the InfluxDB in a graphical manner
The diagram below presents a typical flow for Grafana for What Just Happened® (WJH): the Telemetry Agent collects the WJH data from the switch and sends it to the Grafana for WJH container, where the Grafana instance bundled inside the container presents the data visually.
Running the Grafana for WJH Container
Deployment of the Grafana for WJH container should be performed on a Linux host or a VM.
Prerequisites:
32 GB RAM
16 CPU
To run the Grafana for WJH container:
Install docker CE on CentOS 7.X:
yum install -y yum-utils device-mapper-persistent-data lvm2
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum -y install --setopt=obsoletes=0 docker-ce docker-ce-cli containerd.io
Start docker.
service docker start
Set docker service to start when the system boots.
systemctl enable docker
Pull the WJHGraf docker image.
If internet connection is available from the host, and you are pulling the image from the docker hub, run the following command:
docker pull mellanox/wjhgraf
Warning: You may find a direct download link from Docker Hub here.
If you do not have internet access and cannot pull the image from the docker hub, use the following procedure to load the image on your host (assuming you have already downloaded it locally).
cp <location>/wjhgraph.img.gz /tmp; docker load -i /tmp/wjhgraph.img.gz
Start your container binding the external ports 3000, 8093.
docker run -dit --name wjhgraf --restart unless-stopped -p 3000:3000 -p 8093:8093 mellanox/wjhgraf
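Once the container is running, it can be useful to confirm from another machine that the two published ports accept connections. The following is a minimal sketch, assuming the collector port is used over TCP (the default protocol offered later by the session controller); the function names are illustrative and not part of any Mellanox tool.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def check_wjhgraf_ports(host: str, ports=(3000, 8093)) -> dict:
    """Map each published container port to whether it accepts TCP connections."""
    return {port: port_is_open(host, port) for port in ports}
```

Calling `check_wjhgraf_ports("<docker_host_ip>")` returns a dictionary keyed by port number. A `False` value only shows that the port could not be reached from where the script runs, which may also be caused by a firewall between the two machines.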
The Mellanox Telemetry Agent runs in a container on the switch and streams a configurable list of counters, parameters and databases to an external collector for processing, analysis and presentation.
Before following the procedure below, please make sure to set the date and time on the switch to be identical to the date and time on the TIG server in order for streaming to work properly. For more information on how to do that, please see HowTo Enable NTP on Mellanox Switches.
To deploy the Mellanox Telemetry Agent:
Enable the Docker capability.
switch1 [standalone: master] (config) # no docker shutdown
If you have a previous version of Mellanox Telemetry Agent installed, make sure to remove it prior to enabling your new Mellanox Telemetry Agent:
Stop telemetry agent container:
switch1 [standalone: master] (config) # docker no start telemetry-agent
Remove telemetry agent image:
switch1 [standalone: master] (config) # docker remove image mellanox/telemetry-agent <version>
Disable WJH on the switch.
switch1 [standalone: master] (config) # no what-just-happened auto-export all enable
switch1 [standalone: master] (config) # no what-just-happened all enable
Pull the Mellanox Telemetry Agent docker image.
If an internet connection is available from the switch and the image can be pulled from Docker Hub:
switch1 [standalone: master] (config) # ntp server [NTP_SERVER]
switch1 [standalone: master] (config) # ntp enable
switch1 [standalone: master] (config) # docker pull mellanox/telemetry-agent
If an internet connection is unavailable and the image cannot be pulled from Docker Hub:
switch1 [standalone: master] (config) # image fetch scp://username:password@<IP address>/<remote path>
(For example: image fetch scp://root:123456@10.20.10.105/tmp/telemetry-agent_2.4.0-9.img.gz)
switch1 [standalone: master] (config) # docker load <telemetry_agent_image>
Start the container.
If the image was pulled from docker:
switch1 [standalone: master] (config) # docker start mellanox/telemetry-agent latest telemetry-agent now-and-init cpus 0.5 memory 300 privileged network sdk
If the image was not pulled from docker:
switch1 [standalone: master] (config) # docker start telemetry-agent <telemetry_agent_version> <container-name> now-and-init cpus 0.5 memory 300 privileged network sdk
(Example of telemetry-agent version: 2.4.0-9)
Warning: To prevent interference with the switch operations, the CPU must be limited to half a core (0.5) and the memory consumption of the telemetry agent container must be limited to 300MB.
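When the start command is generated by a script (for example, to roll the agent out to many switches), the resource limits can be enforced before the command is sent. The sketch below only assembles the CLI string shown above; the function name and its parameters are hypothetical, and delivering the command to the switch is left out.

```python
def build_agent_start_command(
    image: str = "mellanox/telemetry-agent",
    version: str = "latest",
    container_name: str = "telemetry-agent",
    cpus: float = 0.5,
    memory_mb: int = 300,
) -> str:
    """Compose the switch CLI line that starts the telemetry agent container."""
    # Limits taken from the warning above: at most half a core and 300 MB.
    if not 0 < cpus <= 0.5:
        raise ValueError("cpus must be greater than 0 and at most 0.5")
    if not 0 < memory_mb <= 300:
        raise ValueError("memory must be greater than 0 and at most 300 MB")
    return (
        f"docker start {image} {version} {container_name} "
        f"now-and-init cpus {cpus} memory {memory_mb} privileged network sdk"
    )
```

With the defaults, the function reproduces the command for an image pulled from Docker Hub. For a locally loaded image, pass `image="telemetry-agent"` together with the loaded version, for example `version="2.4.0-9"`.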
The telemetry agent must create trust with the switch in order to allow telemetry on LAGs and MLAGs. Run:
switch (config) # docker exec [docker-instance-name] "/opt/telemetry/utils/create_trust.sh"
Copy the key generated and printed on your screen:
switch (config) # docker exec neo-agent /opt/telemetry/utils/create_trust.sh
Running exec_name: [/opt/telemetry/utils/create_trust.sh]
Generating public/private rsa key pair.
Created directory '/root/.ssh'.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
root@switch
The key's randomart image is:
+---[RSA 2048]----+
|                 |
|                 |
|                 |
|                 |
|                 |
|                 |
|                 |
|                 |
|                 |
+-----------------+
ssh-rsa Some1Random2Generated3Key4With5Random6Chars7 root@switch
Run the following command:
switch (config) # ssh client user admin authorized-key sshv2 "ssh-rsa Some1Random2Generated3Key4With5Random6Chars7 root@switch"
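Copying the key by hand is error-prone, so the two steps above can be scripted. This is a minimal sketch, assuming the output of create_trust.sh has already been captured as text and that the public key is the only line beginning with `ssh-rsa `; both helper names are hypothetical.

```python
def extract_public_key(script_output: str) -> str:
    """Return the first line of the script output that looks like an RSA public key."""
    for line in script_output.splitlines():
        line = line.strip()
        if line.startswith("ssh-rsa "):
            return line
    raise ValueError("no ssh-rsa public key found in the script output")


def build_authorized_key_command(public_key: str, user: str = "admin") -> str:
    """Compose the switch CLI line that authorizes the key for the given user."""
    return f'ssh client user {user} authorized-key sshv2 "{public_key}"'
```

Passing the captured output through `extract_public_key` and then `build_authorized_key_command` yields the command in the same form as the one shown above.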
Copy the SDK package from the switch to the container.
switch1 [standalone: master] (config) # docker copy-sdk <container-name> to /
Run initial telemetry configuration:
switch (config) # docker exec <container-name> "bash /opt/telemetry/utils/telemetry_agent_init.sh 127.0.0.1 7654"
Save the configuration.
switch1 [standalone: master] (config) # configuration write
Connect the telemetry agent to the Grafana collector and start it. To configure the agent to send data to WJHGraf, refer to Controlling Telemetry Agent > "WJH sessions" and set the collector parameters as follows:
Destination IP: <your_WJHGraf_IP>
Destination port: 8093
Format: InfluxDB Line Protocol
root@neo-switch:/# /opt/telemetry/session-controller
Connecting to telemetry agent...
Got new connection from: 10.209.37.251:59636
At any point you can type 'return' to go back to main menu or type 'quit' to exit
Telemetry Actions
1. Create session
2. Delete session
3. Delete all sessions
4. Status
Please select action: 1
Telemetry Sessions
1. WJH - Samples the dropped packets buffer
2. Interface counters - Samples interface counters
3. Threshold events - Events generated every time a defined threshold is crossed
4. Histograms - Samples the buffer histograms
Please select session type: 1
Subscribing session WJH...
Filter Settings
1. Default - All events except Forwarding with Notice severity
2. Custom - Custom events filtering for each WJH category
Please select option: 1
Reading collector params...
Please enter destination ip: <your_WJHGraf_IP>
Please enter destination port (default 5123): 8093
Formats
1. JSON
2. Influx DB Line Protocol
3. Protocol Buffers
4. gRPC
Please select format or enter to continue (default Influx DB Line Protocol):
Please enter protocol (TCP/UDP) (default TCP):
Enter "yes" to add new collector or press enter to continue:
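For readers unfamiliar with the InfluxDB Line Protocol selected in the session above, the sketch below shows how a single record is laid out: a measurement name, comma-separated tags, a space, the fields, and an optional nanosecond timestamp. The measurement, tag, and field names in the usage note are invented for illustration; the actual names emitted by the telemetry agent are not documented here.

```python
def _escape(text: str, chars: str) -> str:
    """Backslash-escape each character in chars where it occurs in text."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value) -> str:
    """Render a field value using line protocol type conventions."""
    if isinstance(value, bool):  # checked before int, since bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted; backslashes and quotes are escaped.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns=None) -> str:
    """Build one line: measurement,tag=value field=value [timestamp]."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys are recommended, not required
        head += f",{_escape(key, ',= ')}={_escape(str(tags[key]), ',= ')}"
    body = ",".join(
        f"{_escape(key, ',= ')}={_format_field(value)}" for key, value in fields.items()
    )
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line
```

For example, `to_line_protocol("wjh_events", {"switch_ip": "10.0.0.1"}, {"reason": "VLAN mismatch", "count": 3}, 1700000000000000000)` produces `wjh_events,switch_ip=10.0.0.1 reason="VLAN mismatch",count=3i 1700000000000000000`, where every name and value is hypothetical.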
To view Grafana for WJH:
Browse to:
http://<WJHGraf_IP>:3000
Log in using admin/admin as your username and password.
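Before opening the browser, the Grafana instance can be probed from a script. The sketch below queries Grafana's `/api/health` endpoint and checks that the reported database status is `ok`; it assumes the Grafana version bundled in the container exposes this endpoint, which has not been verified for the WJHGraf image. The helper names are illustrative.

```python
import json
import urllib.request


def health_url(base_url: str) -> str:
    """Append the Grafana health path to a base URL such as http://host:3000."""
    return base_url.rstrip("/") + "/api/health"


def database_is_ok(payload: bytes) -> bool:
    """Return True if the health response body reports database status 'ok'."""
    try:
        data = json.loads(payload)
    except ValueError:
        # Not valid JSON (or not decodable text).
        return False
    return isinstance(data, dict) and data.get("database") == "ok"


def grafana_is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Fetch the health endpoint and report whether Grafana looks healthy."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200 and database_is_ok(resp.read())
    except OSError:
        # URLError and HTTPError are OSError subclasses; so are socket timeouts.
        return False
```

A call such as `grafana_is_healthy("http://<WJHGraf_IP>:3000")` returns `True` only when the endpoint answers with HTTP 200 and a healthy database status.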
WJH visualization consists of the following dashboards:
Debug dashboard
Monitoring dashboard
Layer-1 dashboard
Debug Dashboard
The Debug dashboard allows users to receive detailed information about all the recorded "What Just Happened?" issues. It consists of 3 main components (numbered in the screenshot below):
The total count of the WJH events for the selected time range and filters.
An activity graph, showing the recorded WJH events over time.
A table containing information about the WJH events, such as switch IP, port, drop category and reason, and various packet fields.
The following actions are available to navigate the data:
Select or enter the data into the various filters on the top area of the dashboard.
This will filter the data and redraw the visualizations accordingly.
Change the displayed time range using the time range selector on the upper right corner of the screen.
This will filter the data for the selected time range and refresh the visualizations accordingly.
Zoom into a specific time range by selecting an area on the activity graph.
This will filter the data for the selected time range and refresh the visualizations accordingly.
Sort the data in the table by clicking any of the table columns (clicking a column twice will toggle the ordering between ascending and descending)
Navigate the table pages by clicking the number of the page at the bottom of the table
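The table actions described above (time range filtering, column sorting, and paging) amount to a simple pipeline over the stored events. The following sketch illustrates that logic on in-memory records; it is not how Grafana implements its table panel, and the record keys used here (`time`, plus whichever column is sorted) are hypothetical.

```python
def query_events(events, start, end, sort_by, descending=False, page=1, page_size=10):
    """Filter events to [start, end], sort by one column, and return one page."""
    # 1. Time range selection: keep only events inside the chosen window.
    in_range = [event for event in events if start <= event["time"] <= end]
    # 2. Column sort: a second click on a column corresponds to descending=True.
    ordered = sorted(in_range, key=lambda event: event[sort_by], reverse=descending)
    # 3. Paging: pages are numbered from 1, as at the bottom of the table.
    first = (page - 1) * page_size
    return ordered[first:first + page_size]
```

For instance, `query_events(events, 100, 200, sort_by="port", descending=True, page=2, page_size=25)` would return rows 26 to 50 of the events recorded between times 100 and 200, ordered by port from highest to lowest.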
Monitoring Dashboard
The Monitoring dashboard provides a real-time pane for monitoring WJH activity in the network. The dashboard consists of the following main components (numbered in the screenshot below):
An activity graph, showing the recorded WJH events over time
A pie chart showing Severity distribution for the selected time range and filters
A pie chart showing Drop Category distribution for the selected time range and filters
A counter of WJH events showing the total count of WJH events for the selected time range and filters
A pie chart showing Drop Reason distribution for the selected time range and filters
A pie chart showing switch distribution for the selected time range and filters (which switches have WJH events and what their relative share is)
A pie chart showing Packet Type distribution for the selected time range and filters (e.g. Ethernet, IP, Transport, VXLAN)
The following actions are available to navigate the data:
Select or enter the data into the various filters on the top area of the dashboard
This will filter the data and redraw the visualizations accordingly
Change the displayed time range using the time range selector on the upper right corner of the screen
This will filter the data for the selected time range and refresh the visualizations accordingly
Zoom into a specific time range by selecting an area on the activity graph
This will filter the data for the selected time range and refresh the visualizations accordingly
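Each pie chart on this dashboard shows the relative share of events per value of one attribute (severity, drop category, drop reason, switch, or packet type). A minimal sketch of that calculation over in-memory records follows; the record keys are hypothetical, and Grafana performs the equivalent aggregation through its own queries.

```python
from collections import Counter


def distribution(events, key):
    """Return the percentage share of events for each value of the given key."""
    # Events that lack the key are ignored rather than counted as a category.
    counts = Counter(event[key] for event in events if key in event)
    total = sum(counts.values())
    # most_common() orders the result from the largest share to the smallest.
    return {value: round(100.0 * n / total, 1) for value, n in counts.most_common()}
```

Given four events of which three come from one switch, `distribution(events, "switch")` reports 75.0 for that switch and 25.0 for the other. An empty event list yields an empty dictionary.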
Layer-1 Dashboard
The Layer-1 dashboard allows users to receive detailed information about all the recorded Layer-1 events. The dashboard consists of the following main components (numbered on the screenshot below):
An activity graph (Dropped Packets due to Layer 1 Errors), showing the recorded Layer-1 events over time
A pie chart showing Layer-1 event distribution for the selected time range and filters
A counter of Layer-1 events which shows the total count of events for the selected time range and filters
A table containing information about the Layer-1 events, such as switch IP, port, drop reason, state, recommended action, and various packet fields
The following actions are available to navigate the data:
Select or enter the data into the various filters on the top area of the dashboard
This will filter the data and redraw the visualizations accordingly
Change the displayed time range using the time range selector on the upper right corner of the screen
This will filter the data for the selected time range and refresh the visualizations accordingly
Zoom into a specific time range by selecting an area on the activity graph
This will filter the data for the selected time range and refresh the visualizations accordingly
Sort the data in the table by clicking any of the table columns (clicking a column twice will toggle the ordering between ascending and descending).
Navigate the table pages by clicking the number of the page at the bottom of the table.