Monitoring Best Practices

The following monitoring processes are best practices for reviewing and troubleshooting potential issues with Cumulus Linux environments.

This document describes:

Metrics that you can poll from Cumulus Linux and use in trend analysis
Critical log messages that you can monitor for triggered alerts

Trend Analysis Using Metrics

A metric is a quantifiable measure that tracks and assesses the status of a specific infrastructure component. Examples of metrics include bytes on an interface, CPU utilization, and total number of routes.

Metrics are more valuable when you use them for trend analysis.

Generate Alerts with Triggered Logging

Cumulus Linux typically sends triggered issues to syslog, but can send issues to another log file depending on the feature. rsyslog handles all logging, including local and remote logging. Logs are the best method to use for generating alerts when the system transitions from a stable steady state.

The optimal approach is to send logs to a centralized collector, then create alerts based on critical log messages.

Log Formatting

Most log files in Cumulus Linux use a standard presentation format. For example:

2017-03-08T06:26:43.569681+00:00 leaf01 sysmonitor: Critically high CPU use: 99%

2017-03-08T06:26:43.569681+00:00 is the timestamp.
leaf01 is the hostname.
sysmonitor is the process that is the source of the message.
Critically high CPU use: 99% is the message.

For brevity and legibility, this section omits the timestamp and hostname from examples.
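The four fields described above can be separated with a regular expression. The following is a minimal sketch, assuming every line follows the sample layout shown in this section; the names `LogLine`, `LOG_PATTERN`, and `parse_log_line` are hypothetical and are not part of Cumulus Linux.

```python
import re
from typing import NamedTuple, Optional


class LogLine(NamedTuple):
    timestamp: str
    hostname: str
    process: str
    message: str


# timestamp, hostname, process name (optionally followed by [pid]), then the message
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<hostname>\S+)\s+"
    r"(?P<process>[^\s:\[]+)(?:\[\d+\])?:\s*(?P<message>.*)$"
)


def parse_log_line(line: str) -> Optional[LogLine]:
    """Split one log line into its fields, or return None if the layout differs."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    return LogLine(**match.groupdict())


sample = "2017-03-08T06:26:43.569681+00:00 leaf01 sysmonitor: Critically high CPU use: 99%"
parsed = parse_log_line(sample)
print(parsed.process, "->", parsed.message)
```

Lines that do not match the layout return `None`, so a collector can route them to a fallback rule instead of failing.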

Hardware

NVUE Commands

NVUE provides commands to monitor various switch hardware elements.

Command	Description
`nv show platform environment fan`	Shows information about the fans on the switch, such as the minimum, maximum and current speed, the fan state, and the fan direction.
`nv show platform environment led`	Shows information about the LEDs on the switch, such as the LED name and color.
`nv show platform environment psu`	Shows information about the PSUs on the switch, such as the PSU name and state.
`nv show platform environment temperature`	Shows information about the sensors on the switch, such as the critical, maximum, minimum and current temperature and the current state of the sensor.
`nv show platform environment voltage`	Shows the list of voltage sensors on the switch.
`nv show platform inventory`	Shows the switch inventory, which includes fan and PSU hardware version, model, serial number, state, and type. For information about a specific fan or PSU, run the `nv show platform inventory <inventory-name>` command.

The following example shows the nv show platform environment fan command output. The airflow direction must be the same for all fans. If Cumulus Linux detects that the fan airflow direction is not uniform, it logs a message in the /var/log/syslog file.

cumulus@switch:~$ nv show platform environment fan
Name      Fan State  Current Speed (RPM)  Max Speed  Min Speed  Fan Direction
--------  ---------  -------------------  ---------  ---------  -------------
FAN1/1    ok         6000                 29000      2500       F2B         
FAN1/2    ok         6000                 29000      2500       F2B         
FAN2/1    ok         6000                 29000      2500       F2B         
FAN2/2    ok         6000                 29000      2500       F2B         
FAN3/1    ok         6000                 29000      2500       F2B         
FAN3/2    ok         6000                 29000      2500       F2B         
PSU1/FAN  ok         6000                 29000      2500       F2B         
PSU2/FAN  ok         6000                 29000      2500       F2B

If the airflow direction is not the same for all fans (either front to back or back to front), cooling is suboptimal for the switch, the rack, and even the entire data center.
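You can also check the airflow direction yourself from the command output. The following is a rough sketch that assumes the text layout of the sample output above (six whitespace-separated columns per fan, with the direction last); the layout can differ between releases, and the function names are hypothetical.

```python
def fan_directions(output: str) -> dict:
    """Map each fan name to its airflow direction from the table-style output."""
    directions = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows in the sample have exactly six fields; the header row has more,
        # and the separator row consists only of dashes.
        if len(fields) != 6 or set(fields[0]) == {"-"}:
            continue
        directions[fields[0]] = fields[-1]
    return directions


def airflow_is_uniform(directions: dict) -> bool:
    """True when every fan reports the same direction."""
    return len(set(directions.values())) <= 1


SAMPLE_OUTPUT = """\
Name      Fan State  Current Speed (RPM)  Max Speed  Min Speed  Fan Direction
--------  ---------  -------------------  ---------  ---------  -------------
FAN1/1    ok         6000                 29000      2500       F2B
FAN1/2    ok         6000                 29000      2500       F2B
PSU1/FAN  ok         6000                 29000      2500       F2B
"""

print(airflow_is_uniform(fan_directions(SAMPLE_OUTPUT)))
```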

smond

The smond process provides monitoring for various switch hardware elements. Minimum or maximum values depend on the flags you apply to the basic command. The table below lists the hardware elements and applicable commands and flags.

Hardware Element	Monitoring Commands	Interval Poll
Temperature	`smonctl -j` `smonctl -j -s TEMP[X]`	10 seconds
Fan	`smonctl -j` `smonctl -j -s FAN[X]`	10 seconds
PSU	`smonctl -j` `smonctl -j -s PSU[X]`	10 seconds
PSU Fan	`smonctl -j` `smonctl -j -s PSU[X]Fan[X]`	10 seconds
PSU Temperature	`smonctl -j` `smonctl -j -s PSU[X]Temp[X]`	10 seconds
Voltage	`smonctl -j` `smonctl -j -s Volt[X]`	10 seconds
Front Panel LED	`ledmgrd -d` `ledmgrd -j`	5 seconds

Not all switch models include a sensor for monitoring power consumption and voltage.

Hardware Logs	Log Location	Log Entries
High temperature	/var/log/syslog	/usr/sbin/smond : : Temp1(Board Sensor near CPU): state changed from UNKNOWN to OK /usr/sbin/smond : : Temp2(Board Sensor Near Virtual Switch): state changed from UNKNOWN to OK /usr/sbin/smond : : Temp3(Board Sensor at Front Left Corner): state changed from UNKNOWN to OK /usr/sbin/smond : : Temp4(Board Sensor at Front Right Corner): state changed from UNKNOWN to OK /usr/sbin/smond : : Temp5(Board Sensor near Fan): state changed from UNKNOWN to OK
Fan speed issues	/var/log/syslog	/usr/sbin/smond : : Fan1(Fan Tray 1, Fan 1): state changed from UNKNOWN to OK /usr/sbin/smond : : Fan2(Fan Tray 1, Fan 2): state changed from UNKNOWN to OK /usr/sbin/smond : : Fan3(Fan Tray 2, Fan 1): state changed from UNKNOWN to OK /usr/sbin/smond : : Fan4(Fan Tray 2, Fan 2): state changed from UNKNOWN to OK /usr/sbin/smond : : Fan5(Fan Tray 3, Fan 1): state changed from UNKNOWN to OK /usr/sbin/smond : : Fan6(Fan Tray 3, Fan 2): state changed from UNKNOWN to OK
Fan direction issue	/var/log/syslog	/usr/sbin/smond : : Fan direction mismatch: 12 fans B2F; 1 fans F2B!
PSU failure	/var/log/syslog	/usr/sbin/smond : : PSU1Fan1(PSU1 Fan): state changed from UNKNOWN to OK /usr/sbin/smond : : PSU2Fan1(PSU2 Fan): state changed from UNKNOWN to BAD

System Data

Cumulus Linux includes several ways to monitor system data. In addition, you can receive alerts in high-risk situations.

CPU Idle Time

When a CPU reports five high CPU alerts within a span of five minutes, the switch logs an alert.

Short bursts of high CPU can occur during switchd churn or routing protocol startup. Do not set alerts for these short bursts.

System Element	Monitoring Commands	Interval Poll
CPU utilization	NVUE: `nv show system cpu` Linux: `sudo cat /proc/stat` `top -b -n 1`	30 seconds
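When you poll /proc/stat, the values are cumulative counters, so utilization comes from the difference between two samples. The sketch below assumes the standard Linux field order on the aggregate cpu line (user, nice, system, idle, iowait, irq, softirq, steal); the sample values and function names are hypothetical.

```python
from typing import Tuple


def cpu_times(stat_text: str) -> Tuple[int, int]:
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            # idle time is idle + iowait; guest fields are already counted in user/nice
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            total = sum(values[:8])
            return idle, total
    raise ValueError("no aggregate cpu line found")


def cpu_utilization(before: str, after: str) -> float:
    """Percentage of non-idle time between two /proc/stat samples."""
    idle0, total0 = cpu_times(before)
    idle1, total1 = cpu_times(after)
    total_delta = total1 - total0
    if total_delta <= 0:
        return 0.0
    return 100.0 * (1 - (idle1 - idle0) / total_delta)


# Hypothetical samples taken one polling interval apart
SAMPLE_BEFORE = "cpu  1000 0 500 8000 500 0 0 0 0 0"
SAMPLE_AFTER = "cpu  1600 0 700 8200 500 0 0 0 0 0"
print(round(cpu_utilization(SAMPLE_BEFORE, SAMPLE_AFTER), 1))
```

Averaging over the full polling interval also smooths out the short bursts mentioned above.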

CPU Logs	Log Location	Log Entries
High CPU	/var/log/syslog	sysmonitor: Critically high CPU use: 99% systemd[1]: Starting Monitor system resources (cpu, memory, disk)… systemd[1]: Started Monitor system resources (cpu, memory, disk). sysmonitor: High CPU use: 89% systemd[1]: Starting Monitor system resources (cpu, memory, disk)… systemd[1]: Started Monitor system resources (cpu, memory, disk). sysmonitor: CPU use no longer high: 77%
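A collector can turn the sysmonitor messages in the table above into alert events. This is a minimal sketch that matches only the three message forms shown in the table; the level names `critical`, `high`, and `cleared` are labels chosen for illustration.

```python
import re

# One pattern per message form; matching is case-sensitive on purpose so that
# "Critically high CPU use" does not also match the "High CPU use" pattern.
CPU_PATTERNS = [
    ("critical", re.compile(r"sysmonitor: Critically high CPU use: (\d+)%")),
    ("high", re.compile(r"sysmonitor: High CPU use: (\d+)%")),
    ("cleared", re.compile(r"sysmonitor: CPU use no longer high: (\d+)%")),
]


def cpu_events(lines):
    """Return (level, percent) tuples for every sysmonitor CPU message found."""
    events = []
    for line in lines:
        for level, pattern in CPU_PATTERNS:
            match = pattern.search(line)
            if match:
                events.append((level, int(match.group(1))))
                break
    return events


SAMPLE_CPU_LOG = [
    "sysmonitor: Critically high CPU use: 99%",
    "systemd[1]: Started Monitor system resources (cpu, memory, disk).",
    "sysmonitor: High CPU use: 89%",
    "sysmonitor: CPU use no longer high: 77%",
]
print(cpu_events(SAMPLE_CPU_LOG))
```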

Cumulus Linux monitors CPU, memory, and disk space with sysmonitor. The configurations for the thresholds are in /etc/cumulus/sysmonitor.conf. For more information, see man sysmonitor.

CPU measure	Thresholds
Use	Alert: 90% Crit: 95%
Process Load	Alarm: 95% Crit: 125%

Spectrum 1 CPUs can become overloaded at moderate to high network scale. If your Spectrum 1 switch is not able to process CPU-destined traffic or is running continually at high CPU, either reduce the scale of the network where you deploy Spectrum 1 switches or replace the switch with a newer generation switch that offers stronger compute resources.

Disk Usage

When monitoring disk utilization, you can exclude tmpfs from monitoring.

System Element	Monitoring Commands	Interval Poll
Disk utilization	`/bin/df -x tmpfs`	300 seconds
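To alert on disk utilization, you can parse the df output and compare each file system against a threshold. The sketch below assumes the usual six-column df layout; the sample output and the 80 percent threshold are hypothetical values for illustration, not Cumulus Linux defaults.

```python
def parse_df(output: str):
    """Return (mount_point, use_percent) pairs from df output."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue
        rows.append((" ".join(fields[5:]), int(fields[4].rstrip("%"))))
    return rows


def over_threshold(rows, threshold_percent: int):
    """Mount points whose utilization is at or above the threshold."""
    return [mount for mount, used in rows if used >= threshold_percent]


# Hypothetical output of: /bin/df -x tmpfs
SAMPLE_DF = """\
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sda4        6000000 5400000    600000  90% /
/dev/sda2         500000   50000    450000  10% /boot
"""

print(over_threshold(parse_df(SAMPLE_DF), 80))
```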

Process Restart

In Cumulus Linux, systemd monitors and restarts processes.

Process Element	Monitoring Commands
View processes that `systemd` monitors	`systemctl status`

Layer 1 Protocols and Interfaces

Link and port state interface transitions log to /var/log/syslog and /var/log/switchd.log.

Interface Element	Monitoring Commands
Link state	NVUE: `nv show interface <interface>` Linux: `sudo cat /sys/class/net/<interface>/operstate`
Link speed	NVUE: `nv show interface <interface>` Linux: `sudo cat /sys/class/net/<interface>/speed`
Port state	NVUE: `nv show interface` Linux: `ip link show`
Bond state	NVUE: `nv show interface <bond>` Linux: `sudo cat /proc/net/bonding/<bond>`
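The Linux link state check in the table above reads a single file under /sys/class/net. A small polling helper could look like the following sketch; the function names and the `sysfs_root` parameter are hypothetical and exist only to make the helper easy to test.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional


def read_operstate(interface: str, sysfs_root="/sys/class/net") -> Optional[str]:
    """Return the operstate string (for example 'up' or 'down'), or None if unavailable."""
    state_file = Path(sysfs_root) / interface / "operstate"
    try:
        return state_file.read_text().strip()
    except OSError:
        return None


def link_states(interfaces: Iterable[str], sysfs_root="/sys/class/net") -> Dict[str, Optional[str]]:
    """Collect the operstate of several interfaces in one pass."""
    return {name: read_operstate(name, sysfs_root) for name in interfaces}
```

Comparing the result of two consecutive polls shows which interfaces changed state between them.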

You obtain interface counters by querying either the hardware or the Linux kernel. The Linux kernel aggregates the output from the hardware.

Interface Counter Element	Monitoring Commands	Interval Poll
Interface counters	NVUE: `nv show interface <interface> counters` Linux: `cat /sys/class/net/<interface>/statistics/<statistic-name>` `cl-netstat -j` `ethtool -S <interface>`	10 seconds
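Interface counters are cumulative, so trend analysis usually works on the rate between two polls. The following sketch converts two samples of one counter into a per-second rate; treating a decreasing value as a counter reset is an assumption made for illustration, and the function name is hypothetical.

```python
from typing import Optional


def rate_per_second(previous: int, current: int, interval_seconds: float) -> Optional[float]:
    """Per-second rate between two counter samples, or None when no rate can be derived."""
    if interval_seconds <= 0:
        return None
    if current < previous:
        # The counter went backwards, for example after a reboot or a counter clear
        return None
    return (current - previous) / interval_seconds


# Two hypothetical rx_bytes samples taken 10 seconds apart
print(rate_per_second(1_000, 6_000, 10))
```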

Layer 1 Logs	Log Location	Log Entries
Link failure/Link flap	/var/log/switchd.log	switchd[5692]: nic.c:213 nic_set_carrier: swp17: setting kernel carrier: down switchd[5692]: netlink.c:291 libnl: swp1, family 0, ifi 20, oper down switchd[5692]: nic.c:213 nic_set_carrier: swp1: setting kernel carrier: up switchd[5692]: netlink.c:291 libnl: swp17, family 0, ifi 20, oper up
Unidirectional link	/var/log/switchd.log /var/log/ptm.log	ptmd[7146]: ptm_bfd.c:2471 Created new session 0x1 with peer 10.255.255.11 port swp1 ptmd[7146]: ptm_bfd.c:2471 Created new session 0x2 with peer fe80::4638:39ff:fe00:5b port swp1 ptmd[7146]: ptm_bfd.c:2471 Session 0x1 down to peer 10.255.255.11, Reason 8 ptmd[7146]: ptm_bfd.c:2471 Detect timeout on session 0x1 with peer 10.255.255.11, in state 1
Bond Negotiation Working	/var/log/syslog	kernel: [85412.763193] bonding: bond0 is being created… kernel: [85412.770014] bond0: Enslaving swp2 as a backup interface with an up link kernel: [85412.775216] bond0: Enslaving swp1 as a backup interface with an up link kernel: [85412.797393] IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready kernel: [85412.799425] IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
Bond Negotiation Failing	/var/log/syslog	kernel: [85412.763193] bonding: bond0 is being created… kernel: [85412.770014] bond0: Enslaving swp2 as a backup interface with an up link kernel: [85412.775216] bond0: Enslaving swp1 as a backup interface with an up link kernel: [85412.797393] IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
MLAG peerlink negotiation Working	/var/log/syslog	lldpd[998]: error while receiving frame on swp50: Network is down lldpd[998]: error while receiving frame on swp49: Network is down kernel: [76174.262893] peerlink: Setting ad_actor_system to 44:38:39:00:00:11 kernel: [76174.264205] 8021q: adding VLAN 0 to HW filter on device peerlink mstpd: one_clag_cmd: setting (1) peer link: peerlink mstpd: one_clag_cmd: setting (1) clag state: up mstpd: one_clag_cmd: setting system-mac 44:38:39:ff:40:94 mstpd: one_clag_cmd: setting clag-role secondary
	/var/log/clagd.log	clagd[14003]: Cleanup is executing. clagd[14003]: Cannot open file "/tmp/pre-clagd.q7XiO clagd[14003]: Cleanup is finished clagd[14003]: Beginning execution of clagd version 1 clagd[14003]: Invoked with: /usr/sbin/clagd --daemon clagd[14003]: Role is now secondary clagd[14003]: HealthCheck: role via backup is second clagd[14003]: HealthCheck: backup active clagd[14003]: Initial config loaded clagd[14003]: The peer switch is active. clagd[14003]: Initial data sync from peer done. clagd[14003]: Initial handshake done. clagd[14003]: Initial data sync to peer done.
MLAG peerlink negotiation Failing	/var/log/syslog	lldpd[998]: error while receiving frame on swp50: Network is down lldpd[998]: error while receiving frame on swp49: Network is down kernel: [76174.262893] peerlink: Setting ad_actor_system to 44:38:39:00:00:11 kernel: [76174.264205] 8021q: adding VLAN 0 to HW filter on device peerlink mstpd: one_clag_cmd: setting (1) peer link: peerlink mstpd: one_clag_cmd: setting (1) clag state: down mstpd: one_clag_cmd: setting system-mac 44:38:39:ff:40:94 mstpd: one_clag_cmd: setting clag-role secondary
	/var/log/clagd.log	clagd[26916]: Cleanup is executing. clagd[26916]: Cannot open file "/tmp/pre-clagd.6M527vvGX0/brbatch" for reading: No such file or directory clagd[26916]: Cleanup is finished clagd[26916]: Beginning execution of clagd version 1.3.0 clagd[26916]: Invoked with: /usr/sbin/clagd --daemon 169.254.1.2 peerlink.4094 44:38:39:FF:01:01 --priority 1000 --backupIp 10.0.0.2 clagd[26916]: Role is now secondary clagd[26916]: Initial config loaded
MLAG port negotiation Working	/var/log/syslog	kernel: [77419.112195] bonding: server01 is being created… lldpd[998]: error while receiving frame on swp1: Network is down kernel: [77419.122707] 8021q: adding VLAN 0 to HW filter on device swp1 kernel: [77419.126408] server01: Enslaving swp1 as a backup interface with a down link kernel: [77419.177175] server01: Setting ad_actor_system to 44:38:39:ff:40:94 kernel: [77419.190874] server01: Warning: No 802.3ad response from the link partner for any adapters in the bond kernel: [77419.191448] IPv6: ADDRCONF(NETDEV_UP): server01: link is not ready kernel: [77419.191452] 8021q: adding VLAN 0 to HW filter on device server01 kernel: [77419.192060] server01: link status definitely up for interface swp1, 1000 Mbps full duplex kernel: [77419.192065] server01: now running without any active interface! kernel: [77421.491811] IPv6: ADDRCONF(NETDEV_CHANGE): server01: link becomes ready mstpd: one_clag_cmd: setting (1) mac 44:38:39:00:00:17 <server01, None>
	/var/log/clagd.log	clagd[14003]: server01 is now dual connected.
MLAG port negotiation Failing	/var/log/syslog	kernel: [79290.290999] bonding: server01 is being created… kernel: [79290.299645] 8021q: adding VLAN 0 to HW filter on device swp1 kernel: [79290.301790] server01: Enslaving swp1 as a backup interface with a down link kernel: [79290.358294] server01: Setting ad_actor_system to 44:38:39:ff:40:94 kernel: [79290.373590] server01: Warning: No 802.3ad response from the link partner for any adapters in the bond kernel: [79290.374024] IPv6: ADDRCONF(NETDEV_UP): server01: link is not ready kernel: [79290.374028] 8021q: adding VLAN 0 to HW filter on device server01 kernel: [79290.375033] server01: link status definitely up for interface swp1, 1000 Mbps full duplex kernel: [79290.375037] server01: now running without any active interface!
	/var/log/clagd.log	clagd[14291]: Conflict (server01): matching clag-id (1) not configured on peer… clagd[14291]: Conflict cleared (server01): matching clag-id (1) detected on peer
MLAG port negotiation Flapping	/var/log/syslog	mstpd: one_clag_cmd: setting (0) mac 00:00:00:00:00:00 <server01, None> mstpd: one_clag_cmd: setting (1) mac 44:38:39:00:00:03 <server01, None>
	/var/log/clagd.log	clagd[14291]: server01 is no longer dual connected clagd[14291]: server01 is now dual connected.
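The flapping entries in the table above alternate between "no longer dual connected" and "now dual connected". One way to spot a flapping MLAG port is to count those transitions per interface, as in this sketch; it matches only the two clagd message forms shown in the table, and the function name is hypothetical.

```python
import re
from collections import Counter

DUAL_CONNECTED = re.compile(r"clagd\[\d+\]: (\S+) is (now|no longer) dual connected")


def dual_connect_transitions(lines) -> Counter:
    """Count dual-connected state changes per interface in clagd log lines."""
    counts = Counter()
    for line in lines:
        match = DUAL_CONNECTED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


SAMPLE_CLAGD_LOG = [
    "clagd[14291]: server01 is no longer dual connected",
    "clagd[14291]: server01 is now dual connected.",
]
print(dual_connect_transitions(SAMPLE_CLAGD_LOG))
```

A high count for one interface within a short collection window suggests a flap rather than a single planned change.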

PTM uses LLDP information to compare against a topology.dot file that describes the network. It has built-in alerting capabilities. Use PTM on the switch instead of polling LLDP information regularly. You can install PTM from the Cumulus Linux GitHub repository.

Consider tracking peering information through PTM. For more information, refer to the Prescriptive Topology Manager documentation.

Neighbor Element	Monitoring Commands	Interval Poll
LLDP Neighbor	`sudo lldpctl -f json`	300 seconds
Prescriptive Topology Manager	`ptmctl -j`	Triggered

Layer 2 Protocols

Spanning tree is a protocol that prevents loops in a layer 2 infrastructure. In a stable state, the spanning tree protocol converges. Monitor the Topology Change Notifications (TCN) in STP to identify when new BPDUs arrive.

Interface Counter Element	Monitoring Commands	Interval Poll
STP TCN Transitions	NVUE: `nv show bridge domain <bridge> stp` Linux: `mstpctl showbridge json` `mstpctl showport`	60 seconds
MLAG peer state	NVUE: `nv show mlag` Linux: `clagctl status` `sudo clagd -j` `sudo cat /var/log/clagd.log`	60 seconds
MLAG peer MACs	NVUE: `nv show mlag` Linux: `clagctl dumppeermacs` `clagctl dumpourmacs`	300 seconds

Layer 2 Logs	Log Location	Log Entries
Spanning Tree Working	/var/log/syslog	kernel: [1653877.190724] device swp1 entered promiscuous mode kernel: [1653877.190796] device swp2 entered promiscuous mode mstpd: create_br: Add bridge bridge mstpd: clag_set_sys_mac_br: set bridge mac 00:00:00:00:00:00 mstpd: create_if: Add iface swp1 as port#2 to bridge bridge mstpd: set_if_up: Port swp1 : up mstpd: create_if: Add iface swp2 as port#1 to bridge bridge mstpd: set_if_up: Port swp2 : up mstpd: set_br_up: Set bridge bridge up mstpd: MSTP_OUT_set_state: bridge:swp1:0 entering blocking state(Disabled) mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering blocking state(Disabled) mstpd: MSTP_OUT_flush_all_fids: bridge:swp1:0 Flushing forwarding database mstpd: MSTP_OUT_flush_all_fids: bridge:swp2:0 Flushing forwarding database mstpd: MSTP_OUT_set_state: bridge:swp1:0 entering learning state(Designated) mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering learning state(Designated) sudo: pam_unix(sudo:session): session closed for user root mstpd: MSTP_OUT_set_state: bridge:swp1:0 entering forwarding state(Designated) mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering forwarding state(Designated) mstpd: MSTP_OUT_flush_all_fids: bridge:swp2:0 Flushing forwarding database mstpd: MSTP_OUT_flush_all_fids: bridge:swp1:0 Flushing forwarding database
Spanning Tree Blocking	/var/log/syslog	mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering blocking state(Designated) mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering learning state(Designated) mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering forwarding state(Designated) mstpd: MSTP_OUT_flush_all_fids: bridge:swp2:0 Flushing forwarding database mstpd: MSTP_OUT_flush_all_fids: bridge:swp2:0 Flushing forwarding database mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering blocking state(Alternate) mstpd: MSTP_OUT_flush_all_fids: bridge:swp2:0 Flushing forwarding database
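To alert on spanning tree changes, you can track the most recent state that mstpd reports for each port. The sketch below parses only the MSTP_OUT_set_state message form shown in the table above; the function name is hypothetical.

```python
import re

SET_STATE = re.compile(
    r"MSTP_OUT_set_state: ([^:\s]+):([^:\s]+):(\d+) entering (\w+) state\((\w+)\)"
)


def last_port_states(lines) -> dict:
    """Map each port to the (state, role) from its most recent set_state message."""
    states = {}
    for line in lines:
        match = SET_STATE.search(line)
        if match:
            _bridge, port, _instance, state, role = match.groups()
            states[port] = (state, role)
    return states


SAMPLE_STP_LOG = [
    "mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering forwarding state(Designated)",
    "mstpd: MSTP_OUT_flush_all_fids: bridge:swp2:0 Flushing forwarding database",
    "mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering blocking state(Alternate)",
]
print(last_port_states(SAMPLE_STP_LOG))
```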

Layer 3 Protocols

When FRR boots up for the first time, there is a different log file for each activated daemon. If you edit the log file (for example, through vtysh or frr.conf), the integrated configuration sends all logs to the same file.

To send FRR logs to syslog, apply the configuration log syslog in vtysh.

BGP

When monitoring BGP, check if BGP peers are operational. There is not much value in alerting on the current operational state of the peer; monitoring the transition is more valuable, which you can do by monitoring syslog.

Monitoring the routing table provides trending on the size of the infrastructure. This is useful when you integrate with host-based solutions (such as Routing on the Host), where the number of routes tracks the number of available applications.

BGP Element	Monitoring Commands	Interval Poll
BGP peer failure	`sudo vtysh -c "show ip bgp summary json"`	60 seconds
BGP route table	`sudo vtysh -c "show ip bgp json"`	600 seconds

BGP Logs	Log Location	Log Entries
BGP peer down	/var/log/syslog /var/log/frr/*.log	bgpd[3000]: %NOTIFICATION: sent to neighbor swp1 4/0 (Hold Timer Expired) 0 bytes bgpd[3000]: %ADJCHANGE: neighbor swp1 Down BGP Notification send
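Because the peer transition is what matters, an alert rule can key on the %ADJCHANGE message. The following sketch matches the message form shown in the table above; other FRR versions can format this message differently, so treat the pattern as an assumption to verify against your own logs.

```python
import re

ADJCHANGE = re.compile(r"%ADJCHANGE: neighbor (\S+) (Up|Down)\s*(.*)")


def bgp_adjacency_changes(lines):
    """Return (neighbor, direction, reason) for each adjacency change message."""
    changes = []
    for line in lines:
        match = ADJCHANGE.search(line)
        if match:
            changes.append(match.groups())
    return changes


SAMPLE_BGP_LOG = [
    "bgpd[3000]: %NOTIFICATION: sent to neighbor swp1 4/0 (Hold Timer Expired) 0 bytes",
    "bgpd[3000]: %ADJCHANGE: neighbor swp1 Down BGP Notification send",
]
print(bgp_adjacency_changes(SAMPLE_BGP_LOG))
```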

OSPF

When monitoring OSPF, check if OSPF peers are operational. There is not much value in alerting on the current operational state of the peer; monitoring the transition is more valuable, which you can do by monitoring syslog.

OSPF Element	Monitoring Commands	Interval Poll
OSPF protocol peer failure	`sudo vtysh -c "show ip ospf neighbor all json"` `cl-ospf summary show json`	60 seconds
OSPF link state database	`sudo vtysh -c "show ip ospf database"`	600 seconds

Route and Host Entries

Route Element	Monitoring Commands	Interval Poll
Host Entries	`cl-resource-query` `cl-resource-query -k`	600 seconds
Route Entries	`cl-resource-query` `cl-resource-query -k`	600 seconds

Routing Logs

Layer 3 Logs	Log Location	Log Entries
Routing protocol process crash	/var/log/syslog	frrouting[1824]: Starting FRRouting daemons (prio:10):. zebra. bgpd. bgpd[1847]: BGPd 1.0.0+cl3u7 starting: vty@2605, bgp@:179 zebra[1840]: client 12 says hello and bids fair to announce only bgp routes watchfrr[1853]: watchfrr 1.0.0+cl3u7 watching [zebra bgpd], mode [phased zebra restart] watchfrr[1853]: bgpd state -> up : connect succeeded watchfrr[1853]: bgpd state -> down : read returned EOF cumulus-core: Running cl-support for core files bgpd.3030.1470341944.core.core_helper core_check.sh[4992]: Please send /var/support/cl_support__spine01_20160804_201905.tar.xz to Cumulus support watchfrr[1853]: Forked background command [pid 6665]: /usr/sbin/service frr restart bgpd watchfrr[1853]: watchfrr 0.99.24+cl3u2 watching [zebra bgpd ospfd], mode [phased zebra restart] watchfrr[1853]: zebra state -> up : connect succeeded watchfrr[1853]: bgpd state -> up : connect succeeded watchfrr[1853]: watchfrr: Notifying Systemd we are up and running
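In the crash sequence above, watchfrr reports each daemon state change, so a "state -> down" message is a natural alert trigger. This sketch extracts those messages; it relies only on the watchfrr message form shown in the table, and the function name is hypothetical.

```python
import re

WATCHFRR_STATE = re.compile(r"watchfrr\[\d+\]: (\w+) state -> (\w+) : (.*)")


def daemon_down_events(lines):
    """Return (daemon, reason) for every watchfrr 'state -> down' message."""
    events = []
    for line in lines:
        match = WATCHFRR_STATE.search(line)
        if match and match.group(2) == "down":
            events.append((match.group(1), match.group(3)))
    return events


SAMPLE_WATCHFRR_LOG = [
    "watchfrr[1853]: bgpd state -> up : connect succeeded",
    "watchfrr[1853]: bgpd state -> down : read returned EOF",
    "watchfrr[1853]: zebra state -> up : connect succeeded",
    "watchfrr[1853]: bgpd state -> up : connect succeeded",
]
print(daemon_down_events(SAMPLE_WATCHFRR_LOG))
```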

Logging

The table below describes the various log files.

Logging Element	Description	Log Location
syslog	Catch all log file. Identifies memory leaks and CPU spikes.	/var/log/syslog
switchd functionality	Hardware Abstraction Layer (HAL).	/var/log/switchd.log
Routing daemons	FRR zebra daemon details.	/var/log/daemon.log
Routing protocol	The log file is configurable in FRR. When FRR first boots, it uses the non-integrated configuration so each routing protocol has its own log file. After booting up, FRR switches over to using the integrated configuration, so that all logs go to a single place. To edit the location of the log files, use the log file command. By default, Cumulus Linux does not send FRR logs to syslog. Use the log syslog command to send logs through `rsyslog` and into `/var/log/syslog`. Note: To write syslog debug messages to the log file, you must run the log syslog debug command to configure FRR with syslog severity 7 (debug); otherwise, when you issue a debug command such as `debug bgp neighbor-events`, no output logs to `/var/log/frr/frr.log`. However, when you manually define a log target with the log file `/var/log/frr/debug.log` command, FRR automatically defaults to severity 7 (debug) logging and the output logs to `/var/log/frr/frr.log`.	/var/log/frr/zebra.log /var/log/frr/.log /var/log/frr/frr.log

Device Management

Device Access Logs

Access Logs	Log Location	Log Entries
User Authentication and Remote Login	/var/log/syslog	sshd[31830]: Accepted publickey for cumulus from 192.168.0.254 port 45582 ssh2: RSA 38:e6:3b:cc:04:ac:41:5e:c9:e3:93:9d:cc:9e:48:25 sshd[31830]: pam_unix(sshd:session): session opened for user cumulus by (uid=0)
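To audit remote logins, you can extract the user, source address, and authentication method from the sshd message. The sketch below covers only the "Accepted" message form shown in the table above; the function name is hypothetical.

```python
import re

SSH_ACCEPTED = re.compile(
    r"sshd\[\d+\]: Accepted (\S+) for (\S+) from (\S+) port (\d+)"
)


def accepted_logins(lines):
    """Return dictionaries describing each accepted SSH login."""
    logins = []
    for line in lines:
        match = SSH_ACCEPTED.search(line)
        if match:
            method, user, source, port = match.groups()
            logins.append(
                {"method": method, "user": user, "source": source, "port": int(port)}
            )
    return logins


SAMPLE_SSHD_LOG = [
    "sshd[31830]: Accepted publickey for cumulus from 192.168.0.254 port 45582 ssh2: RSA 38:e6:3b:cc:04:ac:41:5e:c9:e3:93:9d:cc:9e:48:25",
    "sshd[31830]: pam_unix(sshd:session): session opened for user cumulus by (uid=0)",
]
print(accepted_logins(SAMPLE_SSHD_LOG))
```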

Device Super User Command Logs

Super User Command Logs	Log Location	Log Entries
Executing commands using sudo	/var/log/syslog	sudo: cumulus: TTY=unknown ; PWD=/home/cumulus ; USER=root ; COMMAND=/tmp/script_9938.sh -v sudo: pam_unix(sudo:session): session opened for user root by (uid=0) sudo: pam_unix(sudo:session): session closed for user root
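The sudo entry packs several fields into one line, which you can split for auditing. The following sketch assumes the field order shown in the table above (TTY, PWD, USER, COMMAND); the function name is hypothetical.

```python
import re
from typing import Optional

SUDO_COMMAND = re.compile(
    r"sudo:\s+(?P<invoker>\S+?)\s*:\s+TTY=(?P<tty>\S+) ; PWD=(?P<pwd>\S+) ; "
    r"USER=(?P<run_as>\S+) ; COMMAND=(?P<command>.*)"
)


def parse_sudo_entry(line: str) -> Optional[dict]:
    """Return the fields of a sudo command log entry, or None for other lines."""
    match = SUDO_COMMAND.search(line)
    return match.groupdict() if match else None


SAMPLE_SUDO_LINE = (
    "sudo: cumulus: TTY=unknown ; PWD=/home/cumulus ; USER=root ; "
    "COMMAND=/tmp/script_9938.sh -v"
)
print(parse_sudo_entry(SAMPLE_SUDO_LINE))
```

The session opened and session closed lines do not match the pattern and return `None`.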
