Shared Buffers

NVIDIA MLNX-GW User Manual for NVIDIA Skyway Appliance v8.2.2200

All packets that a switch receives successfully are stored in internal memory from the time they are received until they are transmitted. The packet buffer is fully shared between all physical ports and is therefore called a shared buffer. Buffer configuration is applied to provide lossless services and to ensure fairness between ports and priorities.

The buffer mechanism allows defining reserved memory allocations and limiting memory usage based on the incoming/outgoing ports and the priority of the packet. In addition, the buffer can be divided into static pools, each for a specific set of priorities. The buffer configuration mechanism allows fair enforcement on both the ingress and egress sides.

The standard configuration mode offers a simple and concise way to configure buffers: it hides direct buffer access from the user and collects all required configuration settings into “traffic pools”. Users who want full control of the entire buffer set can enable advanced buffer configuration.

Dividing priorities into “traffic pools” applies the set of configurations that gives the optimal shared buffer behavior for the user's requirements. A traffic pool is a logical representation of a traffic profile instance. It handles all buffer-related allocation on the ingress and egress sides so that traffic flows smoothly.

Available traffic pool types are as follows:

  • Lossy – for standard lossy traffic. This is the default type for all traffic.

  • Lossless – for traffic which cannot suffer any loss. Using this type enables a flow control mechanism for the mapped priority as well as setting headroom and Xon/Xoff parameters for the relevant ingress PG buffer.

  • Lossy-MC – for layer 2 multicast traffic which requires special care due to stream duplication on the egress side over several ports.

There is no restriction on mapping priorities to traffic pools. The user can map all priorities to a single traffic pool or create a separate traffic pool for each priority. By default, all memory is divided equally between all active traffic pools. The user can set a memory percentage for a traffic pool out of the entire shared buffer. Over-subscription (where the sum of the percentages exceeds 100%) is permitted but not advised.
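As a rough illustration of the sizing rules above, the following Python sketch models two cases: the default equal split between active traffic pools, and a check for over-subscription when percentages are set explicitly. The function names and pool names are hypothetical, and the sketch does not model how the switch combines explicit and default sizes.

```python
def default_split(active_pools):
    """Equal share (in percent) for each active traffic pool (default behavior)."""
    if not active_pools:
        return {}
    share = 100.0 / len(active_pools)
    return {name: share for name in active_pools}


def is_oversubscribed(percentages):
    """True if the explicitly configured percentages add up to more than 100%."""
    return sum(percentages.values()) > 100


if __name__ == "__main__":
    # Hypothetical pool names, for illustration only.
    print(default_split(["pool_a", "pool_b"]))
    print(is_oversubscribed({"pool_a": 70, "pool_b": 50}))
```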

A traffic pool becomes functional once at least one priority is mapped to it. Each functional traffic pool is matched by an iPool, an ePool and an iPort.PG buffer on each interface. For further details, see section “Advanced Buffer Configuration”.

Priority-flow-control

Enabling lossless traffic flow requires the relevant switch-priority (see Packet Classification) to be mapped to a traffic pool of type “Lossless”. This can be done using one of the following methods:

  • Create a new custom lossless traffic pool, and map the switch-priority to the newly created traffic pool. In this case, PFC configuration is automatic. For example:

    switch (config) # traffic pool my_pool type lossless
    switch (config) # traffic pool my_pool map switch-priority 0

  • Enable DCB PFC on the switch-priority and enable DCB PFC globally. This maps the priority to the lossless-default traffic pool, which is reserved for this purpose. DCB PFC must also be enabled on the relevant interfaces.

When configuring lossless traffic, it is strongly recommended to use one of the methods above rather than a combination of them.

Flow Control (Global Pause)

Using the global pause mechanism requires “flowcontrol” to be enabled on the desired port. In addition, the port's default priority must be set to switch-priority 3, and lossless traffic must be configured for that priority. The lossless configuration steps are described in section “Priority-flow-control”.

To ensure all incoming packets are subjected to the global pause mechanism, the port's trust mode must be set to “port”.

Example:

    switch (config)# traffic pool my_pool type lossless
    switch (config)# traffic pool my_pool map switch-priority 3
    switch (config)# interface ethernet 1/1 flowcontrol send on force
    switch (config)# interface ethernet 1/1 flowcontrol receive on force
    switch (config)# interface ethernet 1/1 qos default switch-priority 3
    switch (config)# interface ethernet 1/1 qos trust port


Packet Buffering Classification

When a packet arrives at the switch, it is classified according to its ingress port, egress port, and layer 2 and layer 3 header fields. The following terms are used to handle packet classification within the switch:

  • Port

    • Ingress port (iPort) – the port which the packet is received on

    • Egress port (ePort) – the port which the packet is transmitted on

  • Pool

    • Ingress pool (iPool) – the memory pool on which the packet is counted on the ingress side

    • Egress pool (ePool) – the memory pool on which the packet is counted on the egress side

  • Priority

    • Switch priority (SP) – internal identifier of the packet priority which is used as a key for several internal switch functions and decisions, including buffering. The SP of the packet is assigned according to a port’s trust level configuration and packet QoS identifiers in the header (PCP, DEI, DSCP).

    • Priority group (PG) – a PG is composed of a group of SPs. It is used for grouping packets of several switch priorities into a single ingress buffer space. The PG range is 0-7, while PG 9 is reserved for control traffic.

    • Traffic class (TC) – a TC is composed of a group of SPs. It is used for grouping packets of several switch priorities into a single egress queue and buffer space. The TC range is 0-15, of which TC 8-15 are reserved for multicast traffic, while TC 16 is reserved for control traffic.

Buffer configuration mechanism provides a way to allocate buffer space for specific traffic types by configuring buffers of the following types.

  • iPort.PG – traffic which arrives on a specific port and is mapped to a specific PG

  • iPort (iPort.pool) – traffic which arrives on a specific port and is counted on a specific iPool. This sums all iPort.PG mapped to the said iPool.

  • ePort.TC – traffic which is transmitted on a specific port and mapped to a specific TC

  • ePort (ePort.pool) – traffic which is transmitted on a specific port and counted on a specific ePool. This sums all ePort.TC buffers mapped to that ePool.

Since multicast packets are duplicated among egress ports, the following buffer types are used to allow consistent packet counting on the ingress and egress sides:

  • MC.SP – multicast traffic which is classified per specific switch-priority. Counting occurs on egress side prior to packet duplication.

  • ePort.mc – multicast traffic which is going to be transmitted on a specific port

Buffer Allocation

For the classification parameters described above, a buffering region can be allocated. A buffering region is defined as one of the following: {iPort}, {iPort.PG}, {ePort}, {ePort.TC}, {MC} or {MC.SP}.

For buffer regions, reserved and shared buffering quotas are allocated based on the following configuration parameters:

  • Reserved allocation (size) – guaranteed buffering quota for the region which is not shared with other regions

  • Shared allocation (shared) – best-effort buffering quota for the region which can be shared with other regions and is allocated dynamically. Region usage cannot exceed this quota. The shared allocation can be set using a static or a dynamic threshold.

  • Shared pool – static bound from which the shared space is dynamically allocated

The iPort.PG buffer can be configured to work in one of two modes:

  • Lossy – for lossy traffic

  • Lossless – for lossless traffic. In this mode, the user must define the flow control thresholds (Xoff, Xon). When the PG buffer occupancy reaches the Xoff threshold, “pause” frames are sent to the sender. When the occupancy drops back to the Xon threshold, “pause” frame transmission stops. The reserved allocation for this buffer should be at least the value of Xoff to allow sufficient ingress packet buffering for applying the Xon/Xoff thresholds.
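The Xon/Xoff behavior of a lossless iPort.PG buffer can be pictured as a small state machine. The Python sketch below is a simplified model, not the switch implementation. The class name `PauseState` is hypothetical, and the sketch assumes that pausing starts when occupancy reaches Xoff and stops once occupancy has dropped back to Xon.

```python
class PauseState:
    """Tracks whether an ingress PG buffer is currently asking the sender to pause."""

    def __init__(self, xoff, xon):
        if xon > xoff:
            raise ValueError("xon must not exceed xoff")
        self.xoff = xoff
        self.xon = xon
        self.paused = False

    def update(self, occupancy):
        """Update the pause state for the given occupancy (bytes) and return it."""
        if not self.paused and occupancy >= self.xoff:
            self.paused = True   # Xoff reached: start sending pause frames
        elif self.paused and occupancy <= self.xon:
            self.paused = False  # dropped back to Xon: stop sending pause frames
        return self.paused


if __name__ == "__main__":
    state = PauseState(xoff=30_000, xon=10_000)
    for occupancy in (5_000, 30_000, 20_000, 10_000):
        print(occupancy, state.update(occupancy))
```

Between the two thresholds the state keeps its previous value, which is what prevents the port from toggling pause on every packet.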

After initial admittance to the headroom buffer, during which its egress port, TC, and ingress PG are determined, a packet is evaluated for eligibility to be stored in the buffer space until it is forwarded.

Buffer eligibility is defined based on the following conditions:

  1. If current usage is below the allocation thresholds of all four shared allocation regions:
    • iPort.PG && iPort && ePort.TC && ePort

  2. If there is available quota within at least one of the four reserved allocation regions:
    • For lossy traffic: iPort.PG || iPort || ePort.TC || ePort
    • For lossless traffic: ePort.TC || ePort. Ingress check is not performed since all the ingress reserved space is allocated for headroom.

If a packet is not eligible for buffering:

  • For lossy traffic: Packet is dropped

  • For lossless traffic: Packet stays in headroom on which Xon/Xoff thresholds are applied
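The eligibility rules above can be sketched in Python as follows. This is an illustrative model only: the `Region` type and the function names are hypothetical, the sketch assumes that satisfying either condition 1 or condition 2 is enough for admission, and it assumes the reserved check compares the packet size against the remaining reserved quota.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """One buffering region with its reserved and shared quotas, in bytes."""
    reserved_quota: int
    reserved_used: int
    shared_threshold: int
    shared_used: int

    def reserved_available(self, size):
        return self.reserved_used + size <= self.reserved_quota

    def shared_below_threshold(self):
        return self.shared_used < self.shared_threshold


def is_eligible(iport_pg, iport, eport_tc, eport, size, lossless):
    """Return True if a packet of `size` bytes may be stored in the buffer."""
    regions = (iport_pg, iport, eport_tc, eport)
    # Condition 1: shared usage is below the threshold in all four regions.
    if all(r.shared_below_threshold() for r in regions):
        return True
    # Condition 2: reserved quota is available in at least one region.
    # For lossless traffic only the egress regions are checked, because the
    # ingress reserved space is allocated for headroom.
    candidates = (eport_tc, eport) if lossless else regions
    return any(r.reserved_available(size) for r in candidates)


def on_not_eligible(lossless):
    """Outcome for a packet that fails the eligibility check."""
    return "stays in headroom" if lossless else "dropped"
```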

Pools

Shared buffer space can be statically divided among multiple pools on the ingress side (iPools) and the egress side (ePools). Each buffer is a region that is mapped to a specific pool.

Each pool has the following parameters:

  • Size – the total size which is shared among the regions allocated to that pool. The pool’s size bounds the cumulative shared usage of the regions that are mapped to the pool. The size can be set to infinite, in which case the occupancy of this pool is not taken into consideration when a packet is admitted.

    Warning

    The pool size does not include the reserved sizes of regions.

  • Mode – working mode

    • Static – each region has a static maximum threshold defined in bytes. The user sets the maximum shared quota for this buffer from a specific pool by providing a percentage of the bounded pool size. If the pool size is set to infinite, the shared quota for mapped buffers is set in bytes.

    • Dynamic – each region has a dynamic maximum threshold defined by alpha (α), the maximum allowed ratio between the current region usage and the pool’s free space (pool size minus pool usage):

      • α accepts the following values: 0, 1/128, 1/64, …, 1/2, 1, 2, …, 64, infinity

      • Buffer acceptance condition is: region_usage < α * free_pool_space

The port region is counted against the pool to which the PG/TC region of the packet is mapped.
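To make the two threshold modes concrete, the following Python sketch evaluates the dynamic acceptance condition and the static percentage threshold. The function names are hypothetical, only finite pool sizes are modeled, and treating α = infinity as "no limit from this threshold" is an assumption of the sketch.

```python
import math
from fractions import Fraction


def dynamic_accepts(region_usage, alpha, pool_size, pool_usage):
    """Dynamic mode: accept while region_usage < alpha * free pool space."""
    if alpha == math.inf:
        return True  # assumption: infinite alpha imposes no limit
    free_pool_space = pool_size - pool_usage
    return region_usage < alpha * free_pool_space


def static_threshold_bytes(pool_size, percent):
    """Static mode: maximum shared quota as a percentage of a bounded pool size."""
    return pool_size * percent // 100


if __name__ == "__main__":
    # A pool of 1000 bytes with 900 in use leaves 100 bytes free.
    print(dynamic_accepts(100, 8, 1000, 900))               # compared with 8 * 100
    print(dynamic_accepts(30, Fraction(1, 4), 1000, 900))   # compared with 100 / 4
    print(static_threshold_bytes(1000, 25))
```

Because the threshold scales with the free space, a region's allowance shrinks automatically as the pool fills up, whereas a static threshold stays fixed.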

Usage Counting

A packet is counted once on the ingress side and once on the egress side.

| Direction | Traffic Type | Counting Buffers |
|-----------|--------------|------------------|
| Ingress   | All          | iPort.PG, iPort  |
| Egress    | Unicast      | ePort.TC, ePort  |
| Egress    | Multicast    | MC.SP, ePort.mc  |
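The counting rules listed above amount to a small lookup. The Python sketch below expresses them as a function. The function name and the lowercase string keys are hypothetical, and the sketch assumes that ingress counting is the same for unicast and multicast traffic.

```python
def counting_buffers(direction, traffic_type="unicast"):
    """Return the buffers a packet is counted against for a direction and traffic type."""
    if direction == "ingress":
        return ("iPort.PG", "iPort")
    if direction == "egress":
        if traffic_type == "unicast":
            return ("ePort.TC", "ePort")
        if traffic_type == "multicast":
            return ("MC.SP", "ePort.mc")
    raise ValueError(f"unknown combination: {direction!r}, {traffic_type!r}")


if __name__ == "__main__":
    print(counting_buffers("ingress"))
    print(counting_buffers("egress", "multicast"))
```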


Control Traffic Buffering

Control packets are buffered in dedicated pools: iPoolCtrl, ePoolCtrl. Furthermore, each port has a set of buffers which are dedicated to control:

  • iPort: iPort.iPoolCtrl

  • iPort.PG: iPort.pg9

  • ePort: ePort.ePoolCtrl

  • ePort.TC: ePort.tc16

All control buffers are mapped to control pools and are not configurable.

Default Configuration

The default, out-of-box configuration provides the following settings:

Pools:

  • iPool0, ePool0 – default pools for all data traffic. Set to dynamic mode with size of the entire shared buffer each.

  • iPoolCtrl, ePoolCtrl – dynamic pools dedicated for control with size of 256KB each

  • ePool15 – multicast pool with static mode and infinite size

Buffers:

  • All buffer configuration (apart from MC.SP) is similar for all ports

  • All switch-priorities are mapped to PG0

  • Each switch-priority is mapped to a corresponding TC buffer (i-to-i)

| Buffer          | Reserved          | Shared [%/α/Byte] | Pool            | Comment                 |
|-----------------|-------------------|-------------------|-----------------|-------------------------|
| iPort.iPool0    | 10KB              | alpha 8           | iPool0 (fixed)  |                         |
| iPort.iPoolCtrl | 0                 | alpha 8           | iPoolCtrl       | iPort control buffer    |
| iPort.pg0       | 0 (20KB headroom) | alpha 8           | iPool0          |                         |
| iPort.pg9       | 10KB              | alpha 8           | iPoolCtrl       | iPort.pg control buffer |
| ePort.ePool0    | 10KB              | alpha 8           | ePool0 (fixed)  |                         |
| ePort.ePoolCtrl | 0                 | alpha 8           | ePoolCtrl       | ePort control buffer    |
| ePort.mc        | 10KB              | 90KB              | ePool15 (fixed) | Multicast               |
| ePort.tc0-7     | 1KB               | alpha 8           | ePool0          |                         |
| ePort.tc16      | 1KB               | alpha 8           | ePoolCtrl       | ePort.tc control buffer |
| MC.SP0-7        | 0                 | alpha ¼           | ePool0          | Global multicast        |


Configuration Example

The following example shows how to divide the buffer among traffic priorities in advanced buffer management mode. Starting from the out-of-box lossy default configuration, the user configures buffering for lossless traffic classified to switch-priority 1 on Ethernet interfaces 1/1 and 1/5.

The changes to the default configuration are as follows:

  • Advanced buffer management is enabled

  • Ingress:

    • iPool1 is assigned a size of 13MB

    • Switch-priority 1 is bound to PG1 to allow separate configuration settings

    • PG1 is mapped to the selected pool iPool1, classified as lossless, and given sufficient headroom (reserved size) of 85KB. The Xon and Xoff thresholds are both set to 20KB. The shared alpha coefficient is set to 1.

    • The iPort.pool1 buffer receives a reserved size of 10KB and a shared coefficient of alpha 1

  • Egress:

    • ePool1 is assigned an infinite size according to recommended lossless traffic settings

    • TC1 (to which switch-priority 1 is mapped by default) is mapped to the selected pool ePool1, and receives reserved size 0 and an infinite shared threshold

    • ePort.mc buffer receives reserved size 0 and an infinite shared threshold

    • ePort.pool1 buffer receives reserved size 0 and an infinite shared threshold

    • MC.SP1 buffer is mapped to egress pool ePool1, and gets reserved size 0 and an infinite shared threshold

  • Finally, priority-flow-control is enabled over switch-priority 1, and over the selected ports

Example:

    switch (config) # advanced buffer management force
    # Pool configuration
    switch (config) # pool iPool1 size 13680063 type dynamic
    switch (config) # pool ePool1 size inf type static
    # Ingress and egress buffer configuration for interface ethernet 1/1
    switch (config) # interface ethernet 1/1 ingress-buffer iPort pool iPool1 reserved 10k shared alpha 1
    switch (config) # interface ethernet 1/1 ingress-buffer iPort.pg1 bind switch-priority 1
    switch (config) # interface ethernet 1/1 ingress-buffer iPort.pg1 map pool iPool1 type lossless reserved 85k xoff 20k xon 20k shared alpha 1
    switch (config) # interface ethernet 1/1 egress-buffer ePort pool ePool1 reserved 0 shared size inf
    switch (config) # interface ethernet 1/1 egress-buffer ePort.tc1 map pool ePool1 reserved 0 shared size inf
    switch (config) # interface ethernet 1/1 egress-buffer ePort.mc reserved 0 shared size inf
    # Ingress and egress buffer configuration for interface ethernet 1/5
    switch (config) # interface ethernet 1/5 ingress-buffer iPort pool iPool1 reserved 10k shared alpha 1
    switch (config) # interface ethernet 1/5 ingress-buffer iPort.pg1 bind switch-priority 1
    switch (config) # interface ethernet 1/5 ingress-buffer iPort.pg1 map pool iPool1 type lossless reserved 85k xoff 20k xon 20k shared alpha 1
    switch (config) # interface ethernet 1/5 egress-buffer ePort pool ePool1 reserved 0 shared size inf
    switch (config) # interface ethernet 1/5 egress-buffer ePort.tc1 map pool ePool1 reserved 0 shared size inf
    switch (config) # interface ethernet 1/5 egress-buffer ePort.mc reserved 0 shared size inf
    # MC buffer configuration
    switch (config) # pool ePool1 mc-buffer mc.sp1 reserved 0 shared size inf
    # PFC configuration
    switch (config) # dcb priority-flow-control enable force
    switch (config) # dcb priority-flow-control priority 1 enable
    switch (config) # interface ethernet 1/1 dcb priority-flow-control mode on
    switch (config) # interface ethernet 1/5 dcb priority-flow-control mode on
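Because the per-interface commands in this example are identical apart from the interface name, they can be generated from a template. The Python sketch below is a hypothetical helper, not an NVIDIA tool. It only reproduces the command strings shown in the example for a given list of ports, and keeps the pool names, sizes, and switch-priority 1 fixed as in the example.

```python
GLOBAL_SETUP = [
    "advanced buffer management force",
    "pool iPool1 size 13680063 type dynamic",
    "pool ePool1 size inf type static",
]

PER_PORT_BUFFERS = [
    "interface ethernet {port} ingress-buffer iPort pool iPool1 reserved 10k shared alpha 1",
    "interface ethernet {port} ingress-buffer iPort.pg1 bind switch-priority 1",
    "interface ethernet {port} ingress-buffer iPort.pg1 map pool iPool1 type lossless "
    "reserved 85k xoff 20k xon 20k shared alpha 1",
    "interface ethernet {port} egress-buffer ePort pool ePool1 reserved 0 shared size inf",
    "interface ethernet {port} egress-buffer ePort.tc1 map pool ePool1 reserved 0 shared size inf",
    "interface ethernet {port} egress-buffer ePort.mc reserved 0 shared size inf",
]

MC_BUFFERS = ["pool ePool1 mc-buffer mc.sp1 reserved 0 shared size inf"]

GLOBAL_PFC = [
    "dcb priority-flow-control enable force",
    "dcb priority-flow-control priority 1 enable",
]

PER_PORT_PFC = "interface ethernet {port} dcb priority-flow-control mode on"


def example_commands(ports):
    """Return the example's command list, repeated for each port in `ports`."""
    commands = list(GLOBAL_SETUP)
    for port in ports:
        commands += [line.format(port=port) for line in PER_PORT_BUFFERS]
    commands += MC_BUFFERS
    commands += GLOBAL_PFC
    commands += [PER_PORT_PFC.format(port=port) for port in ports]
    return commands


if __name__ == "__main__":
    for command in example_commands(["1/1", "1/5"]):
        print(command)
```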


Exceptions to Legal Shared Buffer Configuration

The following configurations are permitted even though they are not logical, because they can be useful in specific advanced situations:

  • Global scenarios:

    • Traffic pool memory oversubscription (total X%) and Traffic pools with size ‘Auto’ are not allocated.
      In this scenario, two or more traffic pools are configured so that the sum of their sizes (specified as percentages) is more than 100%. In this case, under high utilization, traffic competes for resources (free pool memory) and can be lost.

    • Switch priority X is mapped to a non-lossless traffic pool, but PFC is enabled on it, or switch priorities X-1,X are mapped to a non-lossless traffic pool, but PFC is enabled on them
      In these scenarios, switch priority X is mapped to a lossy or lossy-MC traffic pool (the traffic is not important and traffic loss is allowed), but pause packet generation (PFC) is also enabled on this priority. These cases are allowed if the user expects traffic to be dropped but has enabled PFC to prevent it.

    • Switch priority X is mapped to a lossless traffic pool, but PFC is disabled on it, or Switch priorities X-1,X are mapped to a lossless traffic pool, but PFC is disabled on them
      As opposed to the previous scenarios, here the traffic pool is created as lossless, but pause packet generation is disabled. In these cases, the user expects no traffic drops, but traffic can still be dropped.

  • Per interface scenarios:

    • <if-id> TC X is mapped to more than one traffic pool, or TCs X,X+1 are mapped to more than one traffic pool.
      In these scenarios, traffic class buffers share the same switch priority and are mapped to two different traffic pools. In these cases, if the traffic pools are configured differently, traffic behavior is undefined.

    • <if-id> switch priority X is lossless but neither PFC nor FC is enabled on this interface, or Switch priorities X-1,X are lossless but neither PFC nor FC is enabled on this interface.
      In these scenarios, the user has created a lossless traffic pool and expects that traffic would not be dropped, but pause packet generation (PFC and FC) is disabled on the interface. In these cases, traffic can be dropped.

    • <if-id> has FC enabled, but default priority 0 is not mapped to lossless traffic pool and FC may not be functional.
      In this scenario, global pause packet (FC) generation is enabled on the interface, but the default switch priority (the priority assigned to traffic that arrives at the switch without priority tagging) is not in a lossless traffic pool. In this case, traffic can be dropped.

    • <if-id> has insufficient headroom allocation to fulfill configuration derived requirements (MTU, speed, cable-length).
      In this scenario, the combination of MTU, speed, cable length, and number of lossless traffic pools consumes all free headroom memory. In this case, not all required buffers are configured correctly and traffic can be dropped.
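Two of the global scenarios above, PFC enabled on a non-lossless pool and a lossless pool without PFC, can be detected with a simple cross-check. The Python sketch below is a hypothetical illustration: the function name, the input shapes, and the message strings are assumptions and do not reproduce the switch's own warnings.

```python
def pfc_mismatches(priority_pool_type, pfc_enabled):
    """Find switch priorities whose pool type and PFC setting disagree.

    priority_pool_type: dict mapping switch priority -> "lossy", "lossless" or "lossy-mc"
    pfc_enabled: set of switch priorities on which PFC is enabled
    """
    findings = []
    for sp, pool_type in sorted(priority_pool_type.items()):
        if pool_type != "lossless" and sp in pfc_enabled:
            findings.append((sp, "PFC enabled on non-lossless traffic pool"))
        elif pool_type == "lossless" and sp not in pfc_enabled:
            findings.append((sp, "lossless traffic pool without PFC"))
    return findings


if __name__ == "__main__":
    pools = {0: "lossy", 1: "lossless", 3: "lossless"}
    for sp, message in pfc_mismatches(pools, pfc_enabled={0, 3}):
        print(f"switch priority {sp}: {message}")
```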


© Copyright 2023, NVIDIA. Last updated on Nov 15, 2023.