Monitoring Best Practices

The following monitoring processes are considered best practices for reviewing and troubleshooting potential issues with Cumulus Linux environments. In addition, several of the more common issues have been listed, with potential solutions included.

This document describes:

  • Metrics that you can poll from Cumulus Linux and use in trend analysis
  • Critical log messages that you can monitor for triggered alerts

Trend Analysis Using Metrics

A metric is a quantifiable measure that is used to track and assess the status of a specific infrastructure component. It is a check collected over time. Examples of metrics include bytes on an interface, CPU utilization, and total number of routes.

Metrics are more valuable when used for trend analysis.

Generate Alerts with Triggered Logging

Triggered issues are normally sent to syslog, but can go to another log file depending on the feature. In Cumulus Linux, rsyslog handles all logging, including local and remote logging. Logs are the best method to use for generating alerts when the system transitions from a stable steady state.

Sending logs to a centralized collector and then creating alerts based on critical logs is an optimal solution for alerting.

Log Formatting

Most log files in Cumulus Linux use a standard presentation format. For example, consider this syslog entry:

2017-03-08T06:26:43.569681+00:00 leaf01 sysmonitor: Critically high CPU use: 99%
  • 2017-03-08T06:26:43.569681+00:00 is the timestamp.
  • leaf01 is the hostname.
  • sysmonitor is the process that is the source of the message.
  • Critically high CPU use: 99% is the message.

For brevity and legibility, the timestamp and hostname have been omitted from the examples in this chapter.
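
The format described above can be split into its fields with a regular expression. The following is a minimal sketch, assuming every line follows the timestamp, hostname, process, message layout shown in the example; the function name parse_log_line and the optional [pid] handling are illustrative choices, not part of Cumulus Linux.

```python
import re
from datetime import datetime

# Timestamp, hostname, process (with an optional [pid]), then the message.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<hostname>\S+)\s+"
    r"(?P<process>[^\s:\[]+)(?:\[(?P<pid>\d+)\])?:\s*(?P<message>.*)$"
)


def parse_log_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The example timestamp is ISO 8601, which fromisoformat() understands.
    fields["timestamp"] = datetime.fromisoformat(fields["timestamp"])
    return fields


entry = parse_log_line(
    "2017-03-08T06:26:43.569681+00:00 leaf01 sysmonitor: Critically high CPU use: 99%"
)
print(entry["hostname"], entry["process"], entry["message"])
```

A collector can then filter on the process and message fields rather than on raw text.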

Hardware

The smond process provides monitoring functionality for various switch hardware elements. Minimum or maximum values are output depending on the flags applied to the basic command. The hardware elements and applicable commands and flags are listed in the table below.

| Hardware Element | Monitoring Commands | Interval Poll |
| Temperature | cumulus@switch:~$ smonctl -j | 10 seconds |
| | cumulus@switch:~$ smonctl -j -s TEMP[X] | |
| Fan | cumulus@switch:~$ smonctl -j | 10 seconds |
| | cumulus@switch:~$ smonctl -j -s FAN[X] | |
| PSU | cumulus@switch:~$ smonctl -j | 10 seconds |
| | cumulus@switch:~$ smonctl -j -s PSU[X] | |
| PSU Fan | cumulus@switch:~$ smonctl -j | 10 seconds |
| | cumulus@switch:~$ smonctl -j -s PSU[X]Fan[X] | |
| PSU Temperature | cumulus@switch:~$ smonctl -j | 10 seconds |
| | cumulus@switch:~$ smonctl -j -s PSU[X]Temp[X] | |
| Voltage | cumulus@switch:~$ smonctl -j | 10 seconds |
| | cumulus@switch:~$ smonctl -j -s Volt[X] | |
| Front Panel LED | cumulus@switch:~$ ledmgrd -d | 5 seconds |
| | cumulus@switch:~$ ledmgrd -j | |

You can also run net show system leds, which is the NCLU command equivalent of ledmgrd -d.

Not all switch models include a sensor for monitoring power consumption and voltage. See this note for details.

| Hardware Logs | Log Location | Log Entries |
| High temperature | /var/log/syslog | /usr/sbin/smond : : Temp1(Board Sensor near CPU): state changed from UNKNOWN to OK |
| | | /usr/sbin/smond : : Temp2(Board Sensor Near Virtual Switch): state changed from UNKNOWN to OK |
| | | /usr/sbin/smond : : Temp3(Board Sensor at Front Left Corner): state changed from UNKNOWN to OK |
| | | /usr/sbin/smond : : Temp4(Board Sensor at Front Right Corner): state changed from UNKNOWN to OK |
| | | /usr/sbin/smond : : Temp5(Board Sensor near Fan): state changed from UNKNOWN to OK |
| Fan speed issues | /var/log/syslog | /usr/sbin/smond : : Fan1(Fan Tray 1, Fan 1): state changed from UNKNOWN to OK |
| | | /usr/sbin/smond : : Fan2(Fan Tray 1, Fan 2): state changed from UNKNOWN to OK |
| | | /usr/sbin/smond : : Fan3(Fan Tray 2, Fan 1): state changed from UNKNOWN to OK |
| | | /usr/sbin/smond : : Fan4(Fan Tray 2, Fan 2): state changed from UNKNOWN to OK |
| | | /usr/sbin/smond : : Fan5(Fan Tray 3, Fan 1): state changed from UNKNOWN to OK |
| | | /usr/sbin/smond : : Fan6(Fan Tray 3, Fan 2): state changed from UNKNOWN to OK |
| PSU failure | /var/log/syslog | /usr/sbin/smond : : PSU1Fan1(PSU1 Fan): state changed from UNKNOWN to OK |
| | | /usr/sbin/smond : : PSU2Fan1(PSU2 Fan): state changed from UNKNOWN to BAD |
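
To turn these entries into alerts, a collector can watch for sensors that move to a state other than OK. The sketch below is based only on the message shape in the sample entries above. Treating every non-OK state as unhealthy is an assumption, and unhealthy_transitions is a hypothetical helper name.

```python
import re

# Matches messages such as "PSU2Fan1(PSU2 Fan): state changed from UNKNOWN to BAD".
STATE_CHANGE = re.compile(
    r"(?P<sensor>\w+)\((?P<description>[^)]*)\): "
    r"state changed from (?P<old>\w+) to (?P<new>\w+)"
)


def unhealthy_transitions(lines):
    """Return (sensor, description, old, new) for each transition to a non-OK state."""
    alerts = []
    for line in lines:
        match = STATE_CHANGE.search(line)
        # Assumption: any new state other than OK deserves attention.
        if match and match.group("new") != "OK":
            alerts.append(match.group("sensor", "description", "old", "new"))
    return alerts


sample = [
    "/usr/sbin/smond : : PSU1Fan1(PSU1 Fan): state changed from UNKNOWN to OK",
    "/usr/sbin/smond : : PSU2Fan1(PSU2 Fan): state changed from UNKNOWN to BAD",
]
print(unhealthy_transitions(sample))
# [('PSU2Fan1', 'PSU2 Fan', 'UNKNOWN', 'BAD')]
```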

System Data

Cumulus Linux includes a number of ways to monitor various aspects of system data. In addition, alerts are issued in high-risk situations.

CPU Idle Time

When a CPU reports five high CPU alerts within a span of five minutes, an alert is logged.

Short bursts of high CPU can occur during switchd churn or routing protocol startup. Do not set alerts for these short bursts.

| System Element | Monitoring Commands | Interval Poll |
| CPU utilization | cumulus@switch:~$ cat /proc/stat | 30 seconds |
| | cumulus@switch:~$ top -b -n 1 | |

| CPU Logs | Log Location | Log Entries |
| High CPU | /var/log/syslog | sysmonitor: Critically high CPU use: 99% |
| | | systemd[1]: Starting Monitor system resources (cpu, memory, disk)… |
| | | systemd[1]: Started Monitor system resources (cpu, memory, disk). |
| | | sysmonitor: High CPU use: 89% |
| | | systemd[1]: Starting Monitor system resources (cpu, memory, disk)… |
| | | systemd[1]: Started Monitor system resources (cpu, memory, disk). |
| | | sysmonitor: CPU use no longer high: 77% |

Cumulus Linux 3.0 and later monitors CPU, memory, and disk space via sysmonitor. The configurations for the thresholds are stored in /etc/cumulus/sysmonitor.conf. More information is available with man sysmonitor.

| CPU measure | Thresholds |
| Use | Alert: 90% Crit: 95% |
| Process Load | Alarm: 95% Crit: 125% |
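
When polling /proc/stat directly, CPU use has to be derived from the difference between two samples. The following sketch shows one common way to do that and then compares the result with the Use thresholds above. It is not how sysmonitor itself is implemented: the sample values are made up, and treating a value equal to a threshold as exceeding it is an assumption.

```python
def cpu_times(stat_text):
    """Return (idle, total) jiffies from the aggregate "cpu" line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Field order: user nice system idle iowait irq softirq steal
            values = [int(v) for v in fields[1:9]]
            idle = values[3] + values[4]  # count iowait as idle time
            return idle, sum(values)
    raise ValueError("no aggregate cpu line found")


def cpu_use_percent(before, after):
    """CPU use between two /proc/stat samples, as a percentage."""
    idle_0, total_0 = cpu_times(before)
    idle_1, total_1 = cpu_times(after)
    total_delta = total_1 - total_0
    if total_delta <= 0:
        raise ValueError("samples must be taken at different times")
    busy_delta = total_delta - (idle_1 - idle_0)
    return 100.0 * busy_delta / total_delta


def classify(percent, alert=90.0, crit=95.0):
    """Map a CPU use percentage onto the Use thresholds from the table."""
    if percent >= crit:
        return "critical"
    if percent >= alert:
        return "alert"
    return "ok"


# Hypothetical samples taken 30 seconds apart.
BEFORE = "cpu  1000 0 500 8000 500 0 0 0 0 0"
AFTER = "cpu  1870 0 550 8080 500 0 0 0 0 0"

use = cpu_use_percent(BEFORE, AFTER)
print(use, classify(use))
```

A single high reading is not necessarily a problem; as noted earlier, short bursts are expected, so a poller would typically look at several consecutive samples before alerting.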

Disk Usage

When monitoring disk utilization, you can exclude tmpfs from monitoring.

| System Element | Monitoring Commands | Interval Poll |
| Disk utilization | cumulus@switch:~$ /bin/df -x tmpfs | 300 seconds |
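
The df output can be reduced to the filesystems that are close to full. This is a minimal sketch that parses text in the default df column layout; the sample output and the 80% threshold are hypothetical, and mount points that contain spaces are not handled.

```python
def filesystems_over(df_output, threshold_percent=80):
    """Return (mount_point, use_percent) for filesystems at or above the threshold."""
    over = []
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Expect: Filesystem, 1K-blocks, Used, Available, Use%, Mounted on
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue
        use_percent = int(fields[4].rstrip("%"))
        if use_percent >= threshold_percent:
            over.append((fields[5], use_percent))
    return over


# Hypothetical output of "df -x tmpfs".
SAMPLE_DF = """\
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sda4        6062688 5153284    581100  90% /
/dev/sda2         122835    1550    112112   2% /mnt/persist
"""

print(filesystems_over(SAMPLE_DF))
# [('/', 90)]
```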

Process Restart

In Cumulus Linux, systemd is responsible for monitoring and restarting processes.

| Process Element | Monitoring Commands |
| View processes monitored by systemd | cumulus@switch:~$ systemctl status |

Layer 1 Protocols and Interfaces

Link and port state interface transitions are logged to /var/log/syslog and /var/log/switchd.log.

| Interface Element | Monitoring Commands |
| Link state | cumulus@switch:~$ cat /sys/class/net/[iface]/operstate |
| | cumulus@switch:~$ net show interface all json |
| Link speed | cumulus@switch:~$ cat /sys/class/net/[iface]/speed |
| | cumulus@switch:~$ net show interface all json |
| Port state | cumulus@switch:~$ ip link show |
| | cumulus@switch:~$ net show interface all json |
| Bond state | cumulus@switch:~$ cat /proc/net/bonding/[bond] |
| | cumulus@switch:~$ net show interface all json |

Interface counters are obtained by querying either the hardware or the Linux kernel. The two outputs should align, but the Linux kernel aggregates the output from the hardware.

| Interface Counter Element | Monitoring Commands | Interval Poll |
| Interface counters | cumulus@switch:~$ cat /sys/class/net/[iface]/statistics/[stat_name] | 10 seconds |
| | cumulus@switch:~$ net show counters json | |
| | cumulus@switch:~$ cl-netstat -j | |
| | cumulus@switch:~$ ethtool -S [iface] | |
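
Counters only become useful for trend analysis once consecutive polls are converted into rates. The sketch below assumes two readings of the same counter (for example, a value read from /sys/class/net/[iface]/statistics/[stat_name]) taken a known number of seconds apart. The 64-bit counter width is an assumption, and a counter reset would look the same as a wrap.

```python
def counter_rate(previous, current, interval_seconds, counter_bits=64):
    """Per-second rate between two counter readings, allowing for one wrap."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    delta = current - previous
    if delta < 0:
        # The counter wrapped past its maximum value between the two polls.
        delta += 2 ** counter_bits
    return delta / interval_seconds


# Hypothetical rx_bytes readings taken 10 seconds apart.
bytes_per_second = counter_rate(1_000_000, 1_250_000, 10)
print(bytes_per_second * 8, "bits per second")
```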

| Layer 1 Logs | Log Location | Log Entries |
| Link failure/Link flap | /var/log/switchd.log | switchd[5692]: nic.c:213 nic_set_carrier: swp17: setting kernel carrier: down |
| | | switchd[5692]: netlink.c:291 libnl: swp1, family 0, ifi 20, oper down |
| | | switchd[5692]: nic.c:213 nic_set_carrier: swp1: setting kernel carrier: up |
| | | switchd[5692]: netlink.c:291 libnl: swp17, family 0, ifi 20, oper up |
| Unidirectional link | /var/log/switchd.log, /var/log/ptm.log | ptmd[7146]: ptm_bfd.c:2471 Created new session 0x1 with peer 10.255.255.11 port swp1 |
| | | ptmd[7146]: ptm_bfd.c:2471 Created new session 0x2 with peer fe80::4638:39ff:fe00:5b port swp1 |
| | | ptmd[7146]: ptm_bfd.c:2471 Session 0x1 down to peer 10.255.255.11, Reason 8 |
| | | ptmd[7146]: ptm_bfd.c:2471 Detect timeout on session 0x1 with peer 10.255.255.11, in state 1 |
| Bond Negotiation Working | /var/log/syslog | kernel: [85412.763193] bonding: bond0 is being created… |
| | | kernel: [85412.770014] bond0: Enslaving swp2 as a backup interface with an up link |
| | | kernel: [85412.775216] bond0: Enslaving swp1 as a backup interface with an up link |
| | | kernel: [85412.797393] IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready |
| | | kernel: [85412.799425] IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready |
| Bond Negotiation Failing | /var/log/syslog | kernel: [85412.763193] bonding: bond0 is being created… |
| | | kernel: [85412.770014] bond0: Enslaving swp2 as a backup interface with an up link |
| | | kernel: [85412.775216] bond0: Enslaving swp1 as a backup interface with an up link |
| | | kernel: [85412.797393] IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready |
| MLAG peerlink negotiation Working | /var/log/syslog | lldpd[998]: error while receiving frame on swp50: Network is down |
| | | lldpd[998]: error while receiving frame on swp49: Network is down |
| | | kernel: [76174.262893] peerlink: Setting ad_actor_system to 44:38:39:00:00:11 |
| | | kernel: [76174.264205] 8021q: adding VLAN 0 to HW filter on device peerlink |
| | | mstpd: one_clag_cmd: setting (1) peer link: peerlink |
| | | mstpd: one_clag_cmd: setting (1) clag state: up |
| | | mstpd: one_clag_cmd: setting system-mac 44:38:39:ff:40:94 |
| | | mstpd: one_clag_cmd: setting clag-role secondary |
| | /var/log/clagd.log | clagd[14003]: Cleanup is executing. |
| | | clagd[14003]: Cannot open file "/tmp/pre-clagd.q7XiO |
| | | clagd[14003]: Cleanup is finished |
| | | clagd[14003]: Beginning execution of clagd version 1 |
| | | clagd[14003]: Invoked with: /usr/sbin/clagd --daemon |
| | | clagd[14003]: Role is now secondary |
| | | clagd[14003]: HealthCheck: role via backup is second |
| | | clagd[14003]: HealthCheck: backup active |
| | | clagd[14003]: Initial config loaded |
| | | clagd[14003]: The peer switch is active. |
| | | clagd[14003]: Initial data sync from peer done. |
| | | clagd[14003]: Initial handshake done. |
| | | clagd[14003]: Initial data sync to peer done. |
| MLAG peerlink negotiation Failing | /var/log/syslog | lldpd[998]: error while receiving frame on swp50: Network is down |
| | | lldpd[998]: error while receiving frame on swp49: Network is down |
| | | kernel: [76174.262893] peerlink: Setting ad_actor_system to 44:38:39:00:00:11 |
| | | kernel: [76174.264205] 8021q: adding VLAN 0 to HW filter on device peerlink |
| | | mstpd: one_clag_cmd: setting (1) peer link: peerlink |
| | | mstpd: one_clag_cmd: setting (1) clag state: down |
| | | mstpd: one_clag_cmd: setting system-mac 44:38:39:ff:40:94 |
| | | mstpd: one_clag_cmd: setting clag-role secondary |
| | /var/log/clagd.log | clagd[26916]: Cleanup is executing. |
| | | clagd[26916]: Cannot open file "/tmp/pre-clagd.6M527vvGX0/brbatch" for reading: No such file or directory |
| | | clagd[26916]: Cleanup is finished |
| | | clagd[26916]: Beginning execution of clagd version 1.3.0 |
| | | clagd[26916]: Invoked with: /usr/sbin/clagd --daemon 169.254.1.2 peerlink.4094 44:38:39:FF:01:01 --priority 1000 --backupIp 10.0.0.2 |
| | | clagd[26916]: Role is now secondary |
| | | clagd[26916]: Initial config loaded |
| MLAG port negotiation Working | /var/log/syslog | kernel: [77419.112195] bonding: server01 is being created… |
| | | lldpd[998]: error while receiving frame on swp1: Network is down |
| | | kernel: [77419.122707] 8021q: adding VLAN 0 to HW filter on device swp1 |
| | | kernel: [77419.126408] server01: Enslaving swp1 as a backup interface with a down link |
| | | kernel: [77419.177175] server01: Setting ad_actor_system to 44:38:39:ff:40:94 |
| | | kernel: [77419.190874] server01: Warning: No 802.3ad response from the link partner for any adapters in the bond |
| | | kernel: [77419.191448] IPv6: ADDRCONF(NETDEV_UP): server01: link is not ready |
| | | kernel: [77419.191452] 8021q: adding VLAN 0 to HW filter on device server01 |
| | | kernel: [77419.192060] server01: link status definitely up for interface swp1, 1000 Mbps full duplex |
| | | kernel: [77419.192065] server01: now running without any active interface! |
| | | kernel: [77421.491811] IPv6: ADDRCONF(NETDEV_CHANGE): server01: link becomes ready |
| | | mstpd: one_clag_cmd: setting (1) mac 44:38:39:00:00:17 <server01, None> |
| | /var/log/clagd.log | clagd[14003]: server01 is now dual connected. |
| MLAG port negotiation Failing | /var/log/syslog | kernel: [79290.290999] bonding: server01 is being created… |
| | | kernel: [79290.299645] 8021q: adding VLAN 0 to HW filter on device swp1 |
| | | kernel: [79290.301790] server01: Enslaving swp1 as a backup interface with a down link |
| | | kernel: [79290.358294] server01: Setting ad_actor_system to 44:38:39:ff:40:94 |
| | | kernel: [79290.373590] server01: Warning: No 802.3ad response from the link partner for any adapters in the bond |
| | | kernel: [79290.374024] IPv6: ADDRCONF(NETDEV_UP): server01: link is not ready |
| | | kernel: [79290.374028] 8021q: adding VLAN 0 to HW filter on device server01 |
| | | kernel: [79290.375033] server01: link status definitely up for interface swp1, 1000 Mbps full duplex |
| | | kernel: [79290.375037] server01: now running without any active interface! |
| | /var/log/clagd.log | clagd[14291]: Conflict (server01): matching clag-id (1) not configured on peer… |
| | | clagd[14291]: Conflict cleared (server01): matching clag-id (1) detected on peer |
| MLAG port negotiation Flapping | /var/log/syslog | mstpd: one_clag_cmd: setting (0) mac 00:00:00:00:00:00 <server01, None> |
| | | mstpd: one_clag_cmd: setting (1) mac 44:38:39:00:00:03 <server01, None> |
| | /var/log/clagd.log | clagd[14291]: server01 is no longer dual connected |
| | | clagd[14291]: server01 is now dual connected. |
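
A flapping MLAG port shows up as repeated "no longer dual connected" messages for the same bond. The following sketch counts those messages per bond. The pattern is derived only from the sample clagd entries above, and dual_connect_losses is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches "clagd[14291]: server01 is no longer dual connected" and the "now" variant.
DUAL_CONNECTED = re.compile(
    r"clagd\[\d+\]: (?P<bond>\S+) is (?P<state>now|no longer) dual connected"
)


def dual_connect_losses(lines):
    """Count how many times each bond lost its dual-connected state."""
    losses = Counter()
    for line in lines:
        match = DUAL_CONNECTED.search(line)
        if match and match.group("state") == "no longer":
            losses[match.group("bond")] += 1
    return losses


sample = [
    "clagd[14291]: server01 is no longer dual connected",
    "clagd[14291]: server01 is now dual connected.",
]
print(dual_connect_losses(sample))
# Counter({'server01': 1})
```

An alerting rule could fire when the count for a bond exceeds a chosen limit within a time window; that limit is left to the operator.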

Prescriptive Topology Manager (PTM) uses LLDP information to compare against a topology.dot file that describes the network. It has built-in alerting capabilities, so it is preferable to run PTM on the switch rather than polling LLDP information regularly. The PTM code is available in the Cumulus Networks GitHub repository. Additional PTM, BFD, and associated logs are documented in the code.

Tracking peering information through PTM is highly recommended. For more information, refer to the Prescriptive Topology Manager documentation.

| Neighbor Element | Monitoring Commands | Interval Poll |
| LLDP Neighbor | cumulus@switch:~$ lldpctl -f json | 300 seconds |
| Prescriptive Topology Manager | cumulus@switch:~$ ptmctl -j [-d] | Triggered |

Layer 2 Protocols

Spanning tree is a protocol that prevents loops in a layer 2 infrastructure. In a stable state, the spanning tree topology should remain converged. Monitoring the Topology Change Notifications (TCN) in STP helps identify when new BPDUs are received.

| Interface Counter Element | Monitoring Commands | Interval Poll |
| STP TCN Transitions | cumulus@switch:~$ mstpctl showbridge json | 60 seconds |
| | cumulus@switch:~$ mstpctl showport json | |
| MLAG peer state | cumulus@switch:~$ clagctl status | 60 seconds |
| | cumulus@switch:~$ clagd -j | |
| | cumulus@switch:~$ cat /var/log/clagd.log | |
| MLAG peer MACs | cumulus@switch:~$ clagctl dumppeermacs | 300 seconds |
| | cumulus@switch:~$ clagctl dumpourmacs | |

| Layer 2 Logs | Log Location | Log Entries |
| Spanning Tree Working | /var/log/syslog | kernel: [1653877.190724] device swp1 entered promiscuous mode |
| | | kernel: [1653877.190796] device swp2 entered promiscuous mode |
| | | mstpd: create_br: Add bridge bridge |
| | | mstpd: clag_set_sys_mac_br: set bridge mac 00:00:00:00:00:00 |
| | | mstpd: create_if: Add iface swp1 as port#2 to bridge bridge |
| | | mstpd: set_if_up: Port swp1 : up |
| | | mstpd: create_if: Add iface swp2 as port#1 to bridge bridge |
| | | mstpd: set_if_up: Port swp2 : up |
| | | mstpd: set_br_up: Set bridge bridge up |
| | | mstpd: MSTP_OUT_set_state: bridge:swp1:0 entering blocking state(Disabled) |
| | | mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering blocking state(Disabled) |
| | | mstpd: MSTP_OUT_flush_all_fids: bridge:swp1:0 Flushing forwarding database |
| | | mstpd: MSTP_OUT_flush_all_fids: bridge:swp2:0 Flushing forwarding database |
| | | mstpd: MSTP_OUT_set_state: bridge:swp1:0 entering learning state(Designated) |
| | | mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering learning state(Designated) |
| | | sudo: pam_unix(sudo:session): session closed for user root |
| | | mstpd: MSTP_OUT_set_state: bridge:swp1:0 entering forwarding state(Designated) |
| | | mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering forwarding state(Designated) |
| | | mstpd: MSTP_OUT_flush_all_fids: bridge:swp2:0 Flushing forwarding database |
| | | mstpd: MSTP_OUT_flush_all_fids: bridge:swp1:0 Flushing forwarding database |
| Spanning Tree Blocking | /var/log/syslog | mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering blocking state(Designated) |
| | | mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering learning state(Designated) |
| | | mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering forwarding state(Designated) |
| | | mstpd: MSTP_OUT_flush_all_fids: bridge:swp2:0 Flushing forwarding database |
| | | mstpd: MSTP_OUT_flush_all_fids: bridge:swp2:0 Flushing forwarding database |
| | | mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering blocking state(Alternate) |
| | | mstpd: MSTP_OUT_flush_all_fids: bridge:swp2:0 Flushing forwarding database |
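
The mstpd state messages can be reduced to the most recent state and role of each port, which makes unexpected transitions easier to spot. This sketch relies only on the message shape shown in the sample entries above. Reading the trailing number as a spanning tree instance is an assumption, and last_port_states is a hypothetical helper name.

```python
import re

# Matches "MSTP_OUT_set_state: bridge:swp2:0 entering blocking state(Alternate)".
SET_STATE = re.compile(
    r"MSTP_OUT_set_state: (?P<bridge>[^:\s]+):(?P<port>[^:\s]+):(?P<instance>\d+) "
    r"entering (?P<state>\w+) state\((?P<role>\w+)\)"
)


def last_port_states(lines):
    """Return {(bridge, port): (state, role)} using the latest message per port."""
    states = {}
    for line in lines:
        match = SET_STATE.search(line)
        if match:
            key = (match["bridge"], match["port"])
            states[key] = (match["state"], match["role"])
    return states


sample = [
    "mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering blocking state(Designated)",
    "mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering learning state(Designated)",
    "mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering forwarding state(Designated)",
    "mstpd: MSTP_OUT_flush_all_fids: bridge:swp2:0 Flushing forwarding database",
    "mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering blocking state(Alternate)",
]
print(last_port_states(sample))
# {('bridge', 'swp2'): ('blocking', 'Alternate')}
```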

Layer 3 Protocols

When FRRouting boots up for the first time, there is a different log file for each daemon that is activated. If the log file is ever edited (for example, through vtysh or frr.conf), the integrated configuration sends all logs to the same file.

To send FRRouting logs to syslog, apply the configuration log syslog in vtysh.

BGP

When monitoring BGP, check if BGP peers are operational. There is not much value in alerting on the current operational state of the peer; monitoring the transition is more valuable, which you can do by monitoring syslog.

Monitoring the routing table provides trending on the size of the infrastructure. This is especially useful when integrated with host-based solutions (such as Routing on the Host) when the routes track with the number of applications available.

| BGP Element | Monitoring Commands | Interval Poll |
| BGP peer failure | cumulus@switch:~$ sudo vtysh -c "show ip bgp summary json" | 60 seconds |
| | cumulus@switch:~$ net show bgp summary json | |
| BGP route table | cumulus@switch:~$ sudo vtysh -c "show ip bgp json" | 600 seconds |
| | cumulus@switch:~$ net show route bgp json | |

| BGP Logs | Log Location | Log Entries |
| BGP peer down | /var/log/syslog, /var/log/frr/*.log | bgpd[3000]: %NOTIFICATION: sent to neighbor swp1 4/0 (Hold Timer Expired) 0 bytes |
| | | bgpd[3000]: %ADJCHANGE: neighbor swp1 Down BGP Notification send |
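
Because the transition is more useful than the current state, an alerting rule can key on the %ADJCHANGE message. The sketch below extracts the peer, the new state, and the trailing reason text. The pattern is modeled on the sample entry above and may need adjusting for other FRRouting versions; peer_transitions is a hypothetical helper name.

```python
import re

# Matches "%ADJCHANGE: neighbor swp1 Down BGP Notification send".
ADJCHANGE = re.compile(
    r"%ADJCHANGE: neighbor (?P<peer>\S+) (?P<state>Up|Down)\b\s*(?P<reason>.*)"
)


def peer_transitions(lines):
    """Return (peer, state, reason) for every adjacency change found in the lines."""
    transitions = []
    for line in lines:
        match = ADJCHANGE.search(line)
        if match:
            transitions.append(match.group("peer", "state", "reason"))
    return transitions


sample = [
    "bgpd[3000]: %NOTIFICATION: sent to neighbor swp1 4/0 (Hold Timer Expired) 0 bytes",
    "bgpd[3000]: %ADJCHANGE: neighbor swp1 Down BGP Notification send",
]
print(peer_transitions(sample))
# [('swp1', 'Down', 'BGP Notification send')]
```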

OSPF

When monitoring OSPF, check if OSPF peers are operational. There is not much value in alerting on the current operational state of the peer; monitoring the transition is more valuable, which you can do by monitoring syslog.

Monitoring the routing table provides trending on the size of the infrastructure. This is especially useful when integrated with host-based solutions (such as Routing on the Host) when the routes track with the number of applications available.

| OSPF Element | Monitoring Commands | Interval Poll |
| OSPF protocol peer failure | cumulus@switch:~$ sudo vtysh -c "show ip ospf neighbor all json" | 60 seconds |
| | cumulus@switch:~$ cl-ospf summary show json | |
| OSPF link state database | cumulus@switch:~$ sudo vtysh -c "show ip ospf database" | 600 seconds |

Route and Host Entries

| Route Element | Monitoring Commands | Interval Poll |
| Host Entries | cumulus@switch:~$ cl-resource-query | 600 seconds |
| | cumulus@switch:~$ cl-resource-query -k | |
| Route Entries | cumulus@switch:~$ cl-resource-query | 600 seconds |
| | cumulus@switch:~$ cl-resource-query -k | |

You can also run the net show system asic command, which is the NCLU command equivalent of cl-resource-query.

Routing Logs

| Layer 3 Logs | Log Location | Log Entries |
| Routing protocol process crash | /var/log/syslog | frrouting[1824]: Starting FRRouting daemons (prio:10):. zebra. bgpd. |
| | | bgpd[1847]: BGPd 1.0.0+cl3u7 starting: vty@2605, bgp@:179 |
| | | zebra[1840]: client 12 says hello and bids fair to announce only bgp routes |
| | | watchfrr[1853]: watchfrr 1.0.0+cl3u7 watching [zebra bgpd], mode [phased zebra restart] |
| | | watchfrr[1853]: bgpd state -> up : connect succeeded |
| | | watchfrr[1853]: bgpd state -> down : read returned EOF |
| | | cumulus-core: Running cl-support for core files bgpd.3030.1470341944.core.core_helper |
| | | core_check.sh[4992]: Please send /var/support/cl_support__spine01_20160804_201905.tar.xz to Cumulus support |
| | | watchfrr[1853]: Forked background command [pid 6665]: /usr/sbin/service frr restart bgpd |
| | | watchfrr[1853]: watchfrr 0.99.24+cl3u2 watching [zebra bgpd ospfd], mode [phased zebra restart] |
| | | watchfrr[1853]: zebra state -> up : connect succeeded |
| | | watchfrr[1853]: bgpd state -> up : connect succeeded |
| | | watchfrr[1853]: watchfrr: Notifying Systemd we are up and running |
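
A routing daemon crash is visible in these entries as a watchfrr "state -> down" message. The following sketch lists the daemons that watchfrr reported as down, using a pattern taken from the sample entries above; daemons_down is a hypothetical helper name.

```python
import re

# Matches "watchfrr[1853]: bgpd state -> down : read returned EOF".
WATCHFRR_STATE = re.compile(
    r"watchfrr\[\d+\]: (?P<daemon>\w+) state -> (?P<state>up|down) : (?P<detail>.*)"
)


def daemons_down(lines):
    """Return (daemon, detail) for every "state -> down" message from watchfrr."""
    down = []
    for line in lines:
        match = WATCHFRR_STATE.search(line)
        if match and match.group("state") == "down":
            down.append(match.group("daemon", "detail"))
    return down


sample = [
    "watchfrr[1853]: watchfrr 1.0.0+cl3u7 watching [zebra bgpd], mode [phased zebra restart]",
    "watchfrr[1853]: bgpd state -> up : connect succeeded",
    "watchfrr[1853]: bgpd state -> down : read returned EOF",
    "watchfrr[1853]: watchfrr: Notifying Systemd we are up and running",
]
print(daemons_down(sample))
# [('bgpd', 'read returned EOF')]
```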

Logging

The table below describes the various log files.

| Logging Element | Description | Log Location |
| syslog | Catch-all log file. Identifies memory leaks and CPU spikes. | /var/log/syslog |
| switchd functionality | Hardware Abstraction Layer (HAL). | /var/log/switchd.log |
| Routing daemons | FRRouting zebra daemon details. | /var/log/daemon.log |
| Routing protocol | The log file is configurable in FRRouting; see the notes that follow this table. | /var/log/frr/zebra.log |
| | | /var/log/frr/{protocol}.log |
| | | /var/log/frr/frr.log |

When FRRouting first boots, it uses the non-integrated configuration, so each routing protocol has its own log file. After booting up, FRRouting switches over to using the integrated configuration, so that all logs go to a single place.

To edit the location of the log files, use the log file command. By default, FRRouting logs are not sent to syslog. Use the log syslog command to send logs through rsyslog and into /var/log/syslog.

Note: To write syslog debug messages to the log file, you must run the log syslog debug command to configure FRR with syslog severity 7 (debug); otherwise, when you issue a debug command such as debug bgp neighbor-events, no output is sent to /var/log/frr/frr.log. However, when you manually define a log target with the log file /var/log/frr/debug.log command, FRR automatically defaults to severity 7 (debug) logging and the output is logged to /var/log/frr/frr.log.

Protocols and Services

Run the following command to confirm that the NTP process is working correctly and that the switch clock is in sync with NTP:

cumulus@switch:~$ /usr/bin/ntpq -p
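
In ntpq -p output, the peer currently selected for synchronization is marked with a leading asterisk. The sketch below checks captured output for that marker. The sample output and host names are hypothetical, and synced_peer is an illustrative helper name.

```python
def synced_peer(ntpq_output):
    """Return the remote name of the selected system peer, or None if there is none."""
    for line in ntpq_output.splitlines():
        # ntpq prefixes the selected system peer with "*".
        if line.startswith("*"):
            fields = line[1:].split()
            if fields:
                return fields[0]
    return None


# Hypothetical output of "ntpq -p".
SAMPLE_NTPQ = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*time1.example.org .GPS.            1 u   33   64  377    1.234    0.123   0.045
+time2.example.org 10.0.0.1         2 u   40   64  377    2.345   -0.456   0.067
"""

print(synced_peer(SAMPLE_NTPQ))
# time1.example.org
```

If no line carries the asterisk, the switch has not selected a synchronization source, which is worth an alert once the switch has been up long enough for NTP to settle.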

Device Management

Device Access Logs

| Access Logs | Log Location | Log Entries |
| User Authentication and Remote Login | /var/log/syslog | sshd[31830]: Accepted publickey for cumulus from 192.168.0.254 port 45582 ssh2: RSA 38:e6:3b:cc:04:ac:41:5e:c9:e3:93:9d:cc:9e:48:25 |
| | | sshd[31830]: pam_unix(sshd:session): session opened for user cumulus by (uid=0) |

Device Super User Command Logs

| Super User Command Logs | Log Location | Log Entries |
| Executing commands using sudo | /var/log/syslog | sudo: cumulus: TTY=unknown ; PWD=/home/cumulus ; USER=root ; COMMAND=/tmp/script_9938.sh -v |
| | | sudo: pam_unix(sudo:session): session opened for user root by (uid=0) |
| | | sudo: pam_unix(sudo:session): session closed for user root |