NVIDIA MELLANOX TELEMETRY AGENT USER MANUAL V2.7.10

Controlling Telemetry Agent

The telemetry agent and the controller communicate over a TCP socket using the JRPC protocol.

Once connectivity is established, it is restored automatically if the controller is restarted or if the network connection drops.

The telemetry agent currently supports two types of requests:

  • Configuration – configures the agent to start, stop, or restart the streaming session.

  • Keepalive – returns the telemetry agent capabilities to the controller.

The configuration request supports five message types: "query", "replace", "append", "remove", and "remove-all".

  • To initiate a query request, send "query"

  • To create a new telemetry session or to restart a session with new parameters, use "replace"

  • To stop the currently running telemetry session, send "remove"
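The message types listed above can be illustrated with a small request builder. This is a minimal sketch only: the JSON-RPC 2.0 envelope, the method name `configuration`, and the `type` parameter are assumptions made for illustration, because this section does not document the agent's actual request schema.

```python
import itertools
import json

# The five message types named in this section.
MESSAGE_TYPES = ("query", "replace", "append", "remove", "remove-all")

_request_ids = itertools.count(1)


def build_configuration_request(message_type, params=None):
    """Serialize a configuration request as a JSON string.

    The envelope fields follow JSON-RPC 2.0; "configuration" and "type"
    are hypothetical names, not taken from the agent's documentation.
    """
    if message_type not in MESSAGE_TYPES:
        raise ValueError(f"unsupported message type: {message_type!r}")
    body = {"type": message_type}
    body.update(params or {})
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": next(_request_ids),
            "method": "configuration",  # hypothetical method name
            "params": body,             # hypothetical parameter layout
        }
    )
```

Rejecting unknown message types before serializing keeps a typo such as "delete" from ever reaching the socket.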

To control the agent easily, use the controller script located inside the Docker container under /opt/telemetry/session_controller.

The controller script can also be run from the switch CLI. To do so:

  1. Deploy the agent and create trust with the switch as described in "Deploying Docker Image on Mellanox Onyx-Based Systems".

  2. Run the controller script from the switch:

    docker exec neo-agent session-controller

Usage:

/opt/telemetry/session_controller [-h] [-destination_ip <destination_ip>] [-destination_port <destination_port>] [-interval <interval>] [-protocol {TCP/UDP}] [-format {JSON/Influx DB Line Protocol/Protocol Buffers/gRPC}] [-controller_ip <controller_ip>]

Optional Arguments                      Description

--help, -h                              Show help message and exit
-destination_ip <destination_ip>        The default destination IP
-destination_port <destination_port>    The default destination port (default is 5123)
-interval <interval>                    The default collection interval in msec (default is 1000)
-protocol {TCP/UDP}                     The default streaming protocol (default is "TCP")
-format {JSON/Influx DB Line Protocol/Protocol Buffers/gRPC}
                                        The default streaming format (default is "Influx DB Line Protocol")
-controller_ip <controller_ip>          The IP of the telemetry controller (default is "0.0.0.0")
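The arguments and defaults listed above can be mirrored with Python's `argparse`, which is handy for wrapper tooling that wants to validate values before invoking the script. This is a sketch that reproduces the documented option names and defaults; it is not the controller script's actual source, and the integer types for the port and interval are assumptions.

```python
import argparse

FORMATS = ["JSON", "Influx DB Line Protocol", "Protocol Buffers", "gRPC"]


def build_parser():
    """Build a parser that mirrors the documented session_controller options."""
    parser = argparse.ArgumentParser(prog="session_controller")
    parser.add_argument("-destination_ip", help="default destination IP")
    parser.add_argument("-destination_port", type=int, default=5123,
                        help="default destination port")
    parser.add_argument("-interval", type=int, default=1000,
                        help="default collection interval (msec)")
    parser.add_argument("-protocol", choices=["TCP", "UDP"], default="TCP",
                        help="default streaming protocol")
    parser.add_argument("-format", choices=FORMATS,
                        default="Influx DB Line Protocol",
                        help="default streaming format")
    parser.add_argument("-controller_ip", default="0.0.0.0",
                        help="IP of the telemetry controller")
    return parser
```

For example, `build_parser().parse_args(["-format", "JSON", "-protocol", "UDP"])` keeps the remaining options at their documented defaults.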

When running the script, the following options are available: create session, delete session, delete all sessions, and status.

At any point while the script is running, type "return" to go back to the main menu or "quit" to exit.

The script has default values, as shown above, but these values can be changed on the first run of the script or whenever a session is opened.

Script options:

  • Create session – starts a session on the agent, with the default parameters or with parameters provided at run time

  • Delete session – deletes the selected session if it is running

  • Delete all sessions – deletes all running sessions on the agent

  • Status – provides the status of the agent and its running sessions or, if a specific session is selected, the status of that session. The status includes global system errors and the running sessions, reported per session.

The script also lets the user open a session that streams to several collectors. To do so, type "yes" at the "add new collector" prompt.

Reading collector params...
Please enter destination ip: 10.213.91.145
Please enter destination port (default 5123):

Formats
1. JSON
2. Influx DB Line Protocol
3. Protocol Buffers
4. gRPC
Please select format or enter to continue (default Influx DB Line Protocol): 1
Please enter protocol (TCP/UDP) (default TCP):
Enter "yes" to add new collector or press enter to continue: yes
Reading collector params...
Please enter destination ip: 10.213.91.146
Please enter destination port (default 5123):

Formats
1. JSON
2. Influx DB Line Protocol
3. Protocol Buffers
4. gRPC
Please select format or enter to continue (default Influx DB Line Protocol): 2
Please enter protocol (TCP/UDP) (default TCP):
Enter "yes" to add new collector or press enter to continue:
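On the receiving side, a collector listens on the destination IP and port entered at the prompts. The following is a minimal sketch for a session configured with the JSON format and the UDP protocol. It assumes that each datagram carries exactly one UTF-8 JSON document, which this section does not state, so the framing should be verified against real traffic. The function names are illustrative.

```python
import json
import socket

DEFAULT_PORT = 5123  # default destination port offered by the prompts


def decode_record(payload):
    """Decode one payload, assumed to hold a single UTF-8 JSON document."""
    return json.loads(payload.decode("utf-8"))


def serve(host="0.0.0.0", port=DEFAULT_PORT):
    """Print every record received over UDP; runs until interrupted."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        while True:
            payload, (sender_ip, _sender_port) = sock.recvfrom(65535)
            print(sender_ip, decode_record(payload))
```

Calling `serve()` starts the listener. Decoding lives in its own function so that it can be reused if the transport or framing turns out to differ.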

The interface counters session lets the user choose a dynamic list of counters for data sampling.

Telemetry Sessions
1. WJH - Samples the dropped packets buffer
2. Interface counters - Samples interface counters
3. Threshold events - Events generated every time a defined threshold is crossed
4. Histograms - Samples the buffer histograms
Please select session type: 2
Subscribing session Interface counters...

Filter Settings

Select port counters to be streamed
1. [X] ECN Packets
2. [X] In Broadcast Packets
3. [X] In Discards
4. [X] In Errors
5. [X] In FCS Errors
6. [X] In Multicast Packets
7. [X] In Octets
8. [X] In Oversize Packets
9. [X] In Packets
10. [X] In Packets Jumbo
11. [X] In Packets Of 1024-1518 Bytes
12. [X] In Packets Of 128-255 Bytes
13. [X] In Packets Of 256-511 Bytes
14. [X] In Packets Of 512-1023 Bytes
15. [X] In Packets Of 64 Bytes
16. [X] In Packets Of 65-127 Bytes
17. [X] In Pause Packets
18. [X] In Undersize Packets
19. [X] In Unicast Packets
20. [X] Out Broadcast Packets
21. [X] Out Discards
22. [X] Out Errors
23. [X] Out Multicast Packets
24. [X] Out Octets
25. [X] Out Packets
26. [X] Out Pause Packets
27. [X] Out Unicast Packets
28. [X] Symbol Error
29. [X] Unknown Control Opcode
------------------------------------
0. Select/Unselect all

Enter number to select/unselect or press enter to continue:

Select priority counters to be streamed
1. [X] Bytes
2. [X] No Buffer Discard
3. [X] Packets
4. [X] RX Pause Duration
5. [X] RX Pause Packets
6. [X] Shared Buffer Discard
7. [X] TX Bytes
8. [X] TX No Buffer Discard
9. [X] TX Packets
10. [X] TX Pause Duration
11. [X] TX Pause Packets
12. [X] TX Wred Discard
------------------------------------

Enter number to select/unselect or press enter to continue:

For WJH sessions, there are two options for filtering:

  • Default – all events except Forwarding with Notice severity

  • Custom – custom event filtering for each WJH category

Telemetry Sessions
1. WJH - Samples the dropped packets buffer
2. Interface counters - Samples interface counters
3. Threshold events - Events generated every time a defined threshold is crossed
4. Histograms - Samples the buffer histograms
Please select session type: 1
Subscribing session WJH...

Filter Settings
1. Default - All events except Forwarding with Notice severity
2. Custom - Custom events filtering for each WJH category
Please select option: 2

Select WJH Categories
1. [X] ACL
2. [X] L1
3. [X] Forwarding
4. [X] Buffer
------------------------------------
0. Select/Unselect all

Enter number to select/unselect or press enter to continue:

Select ACL Notice severity events to be streamed
1. [X] Ingress port ACL
2. [X] Ingress router ACL
------------------------------------
0. Select/Unselect all

Enter number to select/unselect or press enter to continue:

Select L1 aggregation Error severity events to be streamed
[X] Symbol error
[X] CRC error
------------------------------------
0. Select/Unselect all

Enter number to select/unselect or press enter to continue:

Select L1 aggregation Notice severity events to be streamed
[X] Port state change
------------------------------------
0. Select/Unselect all

Enter number to select/unselect or press enter to continue:

Select L2 Error severity events to be streamed
1. [X] Destination MAC is reserved (DMAC=01-80-C2-00-00-0x)
2. [X] VLAN tagging mismatch
3. [X] Ingress VLAN filtering
4. [X] Unicast MAC table action discard
5. [X] Port loopback filter
6. [X] Source MAC is multicast
7. [X] Source MAC equals destination MAC
------------------------------------
0. Select/Unselect all

Enter number to select/unselect or press enter to continue:

Select L2 Warning severity events to be streamed
1. [X] Multicast egress port list is empty
------------------------------------
0. Select/Unselect all

Enter number to select/unselect or press enter to continue:

Select L2 Notice severity events to be streamed
1. [ ] MLAG port isolation
2. [ ] Ingress spanning tree filter
------------------------------------
0. Select/Unselect all

Enter number to select/unselect or press enter to continue:

Select L3 Error severity events to be streamed
1. [X] Unicast destination IP but multicast destination MAC
2. [X] Destination IP is loopback address
3. [X] Source IP is multicast
4. [X] Source IP is in class E
5. [X] Source IP is loopback address
6. [X] Source IP is unspecified
7. [X] Checksum or IPver or IPv4 IHL too short
8. [X] Multicast MAC mismatch
9. [X] Source IP equals destination IP
10. [X] IPv4 source IP is limited broadcast
11. [X] IPv4 destination IP is local network (destination=0.0.0.0/8)
12. [X] IPv4 destination IP is link local
------------------------------------
0. Select/Unselect all

Enter number to select/unselect or press enter to continue:

Select L3 Warning severity events to be streamed
1. [X] Blackhole route
2. [X] Unresolved neighbor/next-hop
3. [X] Blackhole ARP/neighbor
4. [X] Ingress router interface is disabled
5. [X] Egress router interface is disabled
6. [X] IPv4 routing table (LPM) unicast miss
7. [X] IPv6 routing table (LPM) unicast miss
8. [X] Router interface loopback
9. [X] Packet size is larger than router interface MTU
10. [X] TTL value is too small
------------------------------------
0. Select/Unselect all

Enter number to select/unselect or press enter to continue:

Select L3 Notice severity events to be streamed
1. [ ] Non-routable packet
2. [ ] IPv6 destination in multicast scope FFx0:/16
3. [ ] IPv6 destination in multicast scope FFx1:/16
4. [ ] Non IP packet
------------------------------------
0. Select/Unselect all

Enter number to select/unselect or press enter to continue:

Select Tunnel Error severity events to be streamed
1. [X] Overlay switch - Source MAC is multicast
2. [X] Overlay switch - Source MAC equals destination MAC
3. [X] Decapsulation error
------------------------------------
0. Select/Unselect all

Enter number to select/unselect or press enter to continue:

Select Buffer aggregation Warning severity events to be streamed
[X] Tail drop
[X] WRED
------------------------------------
0. Select/Unselect all

Enter number to select/unselect or press enter to continue:
Reading collector params...
Please enter destination ip: 10.213.91.100
Please enter destination port (default 5123): 5003

Formats
1. JSON
2. Influx DB Line Protocol
3. Protocol Buffers
4. gRPC
Please select format or enter to continue (default Influx DB Line Protocol):
Please enter protocol (TCP/UDP) (default TCP):
Enter "yes" to add new collector or press enter to continue:

Warning

Unselecting all categories is not allowed.
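The select/unselect menus and the restriction above can be modeled in a few lines. This sketch is an illustration of the behavior, not the script's implementation. It assumes that entering 0 selects everything unless everything is already selected (in which case it clears the list), and that the empty selection is rejected when the user continues. Both function names are hypothetical.

```python
def apply_choice(selected, choice):
    """Return a new selection list after the user enters a menu number.

    selected -- list of booleans, one per menu entry
    choice   -- 0 toggles all entries, 1..len(selected) toggles one entry
    """
    if choice == 0:
        # Assumed semantics: clear only when everything is already selected.
        new_state = not all(selected)
        return [new_state] * len(selected)
    if not 1 <= choice <= len(selected):
        raise ValueError(f"no such entry: {choice}")
    updated = list(selected)
    updated[choice - 1] = not updated[choice - 1]
    return updated


def confirm_categories(selected):
    """Reject a selection with no categories left, as the warning requires."""
    if not any(selected):
        raise ValueError("at least one category must remain selected")
    return selected
```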

Data is exchanged between the controller and the telemetry agent over JRPC. JRPC, or JSON-RPC, is a remote procedure call protocol encoded in JSON. It is used to pass the OpenConfig telemetry data that configures the telemetry agent session.

Telemetry Data Example

  • Interface counters data example:

    ... interface counter for one port
    {
      "cli_counter": {
        "in_broadcast_pkts": 0,
        "in_fcs_errors": 0,
        "in_multicast_pkts": 260,
        "in_octets": 45024,
        "in_packets": 260,
        "in_packets_jumbo": 0,
        "in_ucast_pkts": 0,
        "out_broadcast_pkts": 6,
        "out_multicast_pkts": 4092,
        "out_octets": 293576,
        "out_packets": 4098,
        "out_ucast_pkts": 0
      },
      "port": "Eth1/17",
      "rfc_2819_counter": {
        "in_oversize_packets": 0,
        "in_packets_of1024to1518_bytes": 0,
        "in_packets_of128to255_bytes": 529,
        "in_packets_of256to511_bytes": 0,
        "in_packets_of512to1023_bytes": 0,
        "in_packets_of64_bytes": 0,
        "in_packets_of65to127_bytes": 0,
        "in_undersize_packets": 0
      },
      "rfc_2863_counter": {
        "in_discards": 0,
        "in_errors": 0,
        "out_discards": 0,
        "out_errors": 0
      },
      "rfc_3635_counter": {
        "in_pause_packets": 0,
        "out_pause_packets": 0,
        "symbol_error": 0,
        "unknown_control_opcode": 0
      },
      "speed": 12500000000,
      "pri_counters": [
        {
          "priority": "0",
          "rx_pause_duration": 0,
          "rx_pause_pkts": 0,
          "tx_pause_duration": 0,
          "tx_pause_pkts": 0
        },
        …
      ],
      "buffer_counters": [
        {
          "buffer_id": "0",
          "bytes": 45024,
          "no_buffer_discard": 0,
          "pkts": 260,
          "shared_buffer_discard": 0
        },
        …
      ],
      "tc_counters": [
        {
          "traffic_class": "0",
          "tx_bytes": 489216,
          "tx_no_buffer_discard": 0,
          "tx_pkts": 7644,
          "tx_wred_discard": 0
        },
        …
      ],
      "extended_counter": {
        "ecn_packets": 0
      },
      ...

  • Histogram data example:

    {
      "device_ip": "10.209.36.26",
      "hist_map": {
        "Eth1/31.0.0": 692024,
        "Eth1/31.0.1": 0,
        "Eth1/31.0.2": 0,
        "Eth1/31.0.3": 0,
        "Eth1/31.0.4": 0,
        "Eth1/31.0.5": 0,
        "Eth1/31.0.6": 0,
        "Eth1/31.0.7": 0,
        "Eth1/31.0.8": 0,
        "Eth1/31.0.9": 0
      },
      "ts_seconds": 1595160255,
      "ts_useconds": 106886
    }

  • Threshold events data example:

    {
      "deviceIp": "10.209.37.249",
      "tsUseconds": 989349,
      "highestOccupiedBin": "0",
      "thresholdCrossing": "Falling",
      "interface": "Eth1/8",
      "histogram": {
        "1": "0",
        "0": "399334",
        "3": "0",
        "2": "0",
        "5": "0",
        "4": "0",
        "7": "0",
        "6": "0",
        "9": "0",
        "8": "0"
      },
      "event": "BufferOccupancyOnyx",
      "tsSeconds": "1593683390"
    }

  • Sample of WJH Events in JSON format:

    • Forwarding WJH events example:

      {
        "device_ip": "10.209.37.251",
        "device_name": "ufm-switch18",
        "drop_info": [
          {
            "category": "Forwarding",
            "in_port": "Eth1/29",
            "packet": {
              "ethernet": {
                "dst_mac": "01:80:c2:00:00:01",
                "ether_type": 2048,
                "ether_type_name": "Internet Protocol version 4 (IPv4) (0x0800)",
                "src_mac": "e4:1d:2d:66:d8:6a"
              },
              "ip": {
                "dst_ip": "1.1.1.253",
                "length": 10240,
                "protocol": 6,
                "protocol_name": "TCP (0x06)",
                "src_ip": "1.1.1.1",
                "ttl": 64,
                "version": 4
              },
              "transport": {
                "dst_port": 4000,
                "dst_port_name": "4000",
                "src_port": 5001,
                "src_port_name": "5001"
              }
            },
            "packet_type": "TRANSPORT",
            "reason": {
              "description": "Destination MAC is reserved (DMAC=01-80-C2-00-00-0x)",
              "id": 202,
              "recommended_action": "Bad packet was received from the peer",
              "severity": "Error"
            },
            "subcategory": "L2",
            "timestamp": {
              "nano": "992616889",
              "seconds": "1595246720"
            }
          },
          {
            "category": "Forwarding",
            "in_port": "Eth1/29",
            "packet": {
              "ethernet": {
                "dst_mac": "7c:fe:90:e3:d4:88",
                "ether_type": 2048,
                "ether_type_name": "Internet Protocol version 4 (IPv4) (0x0800)",
                "src_mac": "e4:1d:2d:66:d8:6a",
                "vlan_id": 300
              },
              "ip": {
                "dst_ip": "1.1.1.253",
                "length": 10240,
                "protocol": 6,
                "protocol_name": "TCP (0x06)",
                "src_ip": "1.1.1.1",
                "ttl": 64,
                "version": 4
              },
              "transport": {
                "dst_port": 4000,
                "dst_port_name": "4000",
                "src_port": 5001,
                "src_port_name": "5001"
              }
            },
            "packet_type": "TRANSPORT",
            "reason": {
              "description": "Ingress VLAN filtering",
              "id": 204,
              "recommended_action": "Validate the VLAN membership configuration on both ends of the link",
              "severity": "Error"
            },
            "subcategory": "L2",
            "timestamp": {
              "nano": "997807206",
              "seconds": "1595246720"
            }
          }
        ],
        "ts_seconds": "1595246721",
        "ts_useconds": 27236
      }

    • ACL WJH events example:

      {
        "device_ip": "10.209.37.251",
        "device_name": "ufm-switch18",
        "drop_info": [
          {
            "acl": {
              "acl_name": "deny_mac_list",
              "acl_rule": "Priority[0];KEY[DMAC: 00:00:00:00:00:00/00:00:00:00:00:00];KEY[SMAC: 00:00:00:00:00:00/00:00:00:00:00:00];ACTION[FORWARD: FORWARD_ACTION = DISCARD];"
            },
            "category": "ACL",
            "in_port": "Eth1/29",
            "packet": {
              "ethernet": {
                "dst_mac": "7c:fe:90:f2:8c:50",
                "ether_type": 2048,
                "ether_type_name": "Internet Protocol version 4 (IPv4) (0x0800)",
                "src_mac": "50:6b:4b:cc:e3:e4"
              },
              "ip": {
                "dst_ip": "16.0.0.1",
                "length": 10240,
                "protocol": 6,
                "protocol_name": "TCP (0x06)",
                "src_ip": "16.0.0.1",
                "ttl": 64,
                "version": 4
              },
              "transport": {
                "dst_port": 80,
                "dst_port_name": "http (80)",
                "src_port": 20,
                "src_port_name": "ftp-data (20)"
              }
            },
            "packet_type": "TRANSPORT",
            "reason": {
              "description": "Ingress port ACL",
              "id": 601,
              "recommended_action": "Validate ACL configuration",
              "severity": "Notice"
            },
            "subcategory": "ACL",
            "timestamp": {
              "nano": "638527727",
              "seconds": "1595247112"
            }
          }
        ],
        "ts_seconds": "1595247113",
        "ts_useconds": 527251
      }

    • Buffer WJH events example:

      {
        "device_ip": "10.209.37.122",
        "device_name": "neo-switch02",
        "drop_info": [
          {
            "buffer": {
              "end_timestamp": {
                "nano": "643013147",
                "seconds": "1595402760"
              },
              "event_count": "154",
              "start_timestamp": {
                "nano": "176441547",
                "seconds": "1595402759"
              }
            },
            "category": "Buffer",
            "in_port": "Eth1/3",
            "packet": {
              "ip": {
                "dst_ip": "1.1.49.21",
                "protocol": 17,
                "protocol_name": "UDP (0x11)",
                "src_ip": "1.1.49.11"
              },
              "transport": {
                "dst_port": 20000,
                "dst_port_name": "20000",
                "src_port": 54726,
                "src_port_name": "54726"
              }
            },
            "reason": {
              "description": "WRED",
              "id": 504,
              "recommended_action": "Monitor network congestion",
              "severity": "Warning"
            },
            "subcategory": "Buffer",
            "timestamp": {
              "nano": "643013147",
              "seconds": "1595402760"
            }
          }
        ],
        "ts_seconds": "1595402761",
        "ts_useconds": 169816
      }
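A collector that receives the JSON records shown in the examples above usually needs to normalize them before storing them. The sketch below uses only field names that appear in those examples. The helper names are hypothetical, and the interpretation of `ts_useconds` as the microsecond part of `ts_seconds` is an assumption based on the field names.

```python
from datetime import datetime, timezone


def record_time(record):
    """Combine the seconds and microseconds fields into a UTC datetime.

    The examples use snake_case keys for histogram and WJH records and
    camelCase keys for threshold events, with values sent as int or string.
    """
    seconds = int(record.get("ts_seconds", record.get("tsSeconds")))
    micros = int(record.get("ts_useconds", record.get("tsUseconds")))
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return stamp.replace(microsecond=micros)


def ordered_bins(histogram):
    """Return threshold-event bin values as ints, ordered by bin number."""
    return [int(histogram[key]) for key in sorted(histogram, key=int)]


def summarize_drops(message):
    """Flatten each entry of a WJH message's drop_info into one row."""
    rows = []
    for drop in message.get("drop_info", []):
        reason = drop.get("reason", {})
        stamp = drop.get("timestamp", {})
        rows.append(
            {
                "device": message.get("device_name"),
                "port": drop.get("in_port"),
                "category": drop.get("category"),
                "subcategory": drop.get("subcategory"),
                "severity": reason.get("severity"),
                "description": reason.get("description"),
                "time_ns": int(stamp.get("seconds", 0)) * 1_000_000_000
                + int(stamp.get("nano", 0)),
            }
        )
    return rows
```

Sorting bin keys with `key=int` matters because the keys are strings and, as in the threshold example, do not arrive in numeric order.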

The telemetry data covers all supported telemetry counters for every active switch port.
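Because each port record groups its counters under several keys, it can be convenient to flatten a record into a single mapping before writing it to a time-series store. This is a minimal sketch that uses the group names from the interface counters example. The function name and the `group.counter` key format are illustrative choices, and the per-priority, per-buffer, and per-traffic-class lists are left out for brevity.

```python
# Counter groups that appear as nested objects in the interface counters example.
COUNTER_GROUPS = (
    "cli_counter",
    "rfc_2819_counter",
    "rfc_2863_counter",
    "rfc_3635_counter",
    "extended_counter",
)


def flatten_port_counters(record):
    """Return {"group.counter": value} for every scalar counter of one port."""
    flat = {}
    for group in COUNTER_GROUPS:
        for name, value in record.get(group, {}).items():
            flat[f"{group}.{name}"] = value
    return flat


# A shortened record built from values in the example above.
sample = {
    "port": "Eth1/17",
    "cli_counter": {"in_octets": 45024, "in_packets": 260},
    "rfc_2863_counter": {"in_discards": 0, "in_errors": 0},
    "extended_counter": {"ecn_packets": 0},
}
flat_sample = flatten_port_counters(sample)
```

Missing groups are skipped rather than treated as errors, since a session may stream only the counters selected in the filter settings.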

Warning

Before activating a histogram or an event threshold session, the required traffic class must be configured on the switch (via CLI).

Warning

The collector must be implemented so that a gRPC connection can be established between the telemetry agent and the collector. For gRPC, proto3 encoding is used. The protocol buffer files needed for decoding are located inside the telemetry agent container under /opt/telemetry/proto/.


© Copyright 2023, NVIDIA. Last updated on Nov 17, 2023.