Anycast Design Guide
Routing on the Host enables you to run OSPF or BGP directly on server hosts. This can enable a network architecture known as anycast, where servers provide the same service without needing layer 2 extensions or load balancer appliances.
Anycast is not a new protocol or protocol implementation and does not require any additional network configuration. Anycast leverages the equal cost multipath (ECMP) capabilities inherent in layer 3 networks to provide stateless load sharing services.
The following image depicts an example anycast network. Each server is advertising the 172.16.255.66/32 anycast IP address.
Anycast relies on layer 3 equal cost multipath functionality to provide load sharing throughout the network. Each server announces a route for a service. As the route propagates through the network, each network device sees the route as originating from multiple places. As an end user connects to the anycast IP, each network device performs a hardware hash of the layer 3 and layer 4 headers to determine which path to use.
Every packet in a flow from an end user has the same source and destination IP address as well as source and destination port numbers. The hash performed by the network devices results in the same answer for every packet, ensuring all packets in a flow go to the same destination.
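The per-flow behavior described above can be sketched in a few lines of Python. This is only an illustration: the switch hardware hash is not documented here, so SHA-256 stands in for it, and the function name pick_next_hop and the leaf names used as next hops are hypothetical.

```python
import hashlib

def pick_next_hop(proto, src_ip, src_port, dst_ip, dst_port, next_hops):
    """Map a flow's 5-tuple to one of the equal-cost next hops.

    SHA-256 is a stand-in for the hardware hash (an assumption).
    """
    key = f"{proto}|{src_ip}|{src_port}|{dst_ip}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

next_hops = ["leaf01", "leaf02", "leaf03", "leaf04"]  # hypothetical next hops
flow = ("udp", "10.2.0.100", 32700, "172.16.255.66", 53)

# Every packet in the flow carries the same 5-tuple, so hashing it
# repeatedly always selects the same next hop.
choices = {pick_next_hop(*flow, next_hops) for _ in range(1000)}
print(choices)  # a set holding exactly one next hop
```

No per-flow state is stored anywhere in this sketch; the choice is recomputed from the packet headers each time.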
In the following image, the client initiates two flows: the blue, dotted flow and the red dashed flow. Each flow has the same source IP address (the client’s IP address), destination IP address (172.16.255.66) and same destination port (depending on the service; for example, DNS is port 53). Each flow has a unique source port generated by the client.
In this example, each flow hashes to a different server based on this source port, which you can see when you run ip route show to the destination IP address:
cumulus@spine02$ ip route show 172.16.255.66
172.16.255.66 proto zebra metric 20
    nexthop via 169.254.64.0 dev swp1 weight 1
    nexthop via 169.254.64.2 dev swp2 weight 1
    nexthop via 169.254.64.2 dev swp3 weight 1
    nexthop via 169.254.64.0 dev swp4 weight 1
On a Cumulus Linux switch, you can see the hardware hash with the cl-ecmpcalc command. In the illustration above, two flows originate from a remote user destined to the anycast IP address. Each session has a different source port. Use the cl-ecmpcalc command to see that the sessions hash to different egress ports.
cumulus@spine02$ sudo cl-ecmpcalc -p udp -s 10.2.0.100 --sport 32700 -d 172.16.255.66 --dport 53 -i swp51
ecmpcalc: will query hardware
swp2
cumulus@spine02$ sudo cl-ecmpcalc -p udp -s 10.2.0.100 --sport 31884 -d 172.16.255.66 --dport 53 -i swp51
ecmpcalc: will query hardware
swp3
Anycast with TCP and UDP
A key component of the functionality and cost-effective nature of anycast is that the network does not maintain state for flows. Cumulus Linux handles every packet individually through the routing table, saving the memory and resources that a device such as a load balancing appliance needs to track individual flows.
Every packet in a flow hashes to the same next hop. However, if that next hop is no longer valid, the traffic flows to another anycast next hop instead. For example, in the image below, if leaf03 fails, traffic flows to a different anycast server; in this case, server04:
For stateless applications that rely on UDP, like DNS, this does not present a problem. However, for stateful applications that rely on TCP, like HTTP, this breaks any existing traffic flows, such as a file download. If the TCP three-way handshake occurs on server03, after the failure, server04 has no connection built and sends a TCP reset message back to the client, restarting the session.
TCP applications in an anycast environment should have short-lived flows (measured in seconds or less) to reduce the impact of network changes or failures.
Resilient hashing provides a method to prevent failures from impacting the hash result of unrelated flows. However, resilient hashing does not prevent rehashing when you add new next hops.
The hardware hashing function determines which path gets used for a given flow. The simplified version of that hash is the combination of protocol, source IP address, destination IP address, source layer 4 port and destination layer 4 port. The full hashing function includes these fields and, also, the list of possible layer 3 next hop addresses. The hash result passes through a modulo of the number of next hop addresses. If the number of next hop addresses changes, either through addition or subtraction of the next hops, this changes the hash result for all traffic, including already established flows.
In the example above, leaf03 is in a failed state; traffic hashes to server04. This is a result of the hash considering three possible next hop IPs (leaf01, leaf02, leaf04). When leaf03 comes back online, the number of possible next hop IPs grows to four. This changes the modulo value that is part of the hashing function, which results in sending traffic to a different server, even if unaffected by the change.
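The effect of the modulo can be illustrated with a simplified Python sketch. It assumes SHA-256 as a stand-in for the hardware hash, hashes only the 5-tuple (the full function described above also includes the next hop list), and uses a hypothetical range of client source ports.

```python
import hashlib

def flow_hash(src_port):
    """Hash a simplified 5-tuple; SHA-256 stands in for the hardware hash."""
    key = f"udp|10.2.0.100|{src_port}|172.16.255.66|53".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")

def pick(src_port, next_hops):
    """Select a next hop as hash modulo the number of next hops."""
    return next_hops[flow_hash(src_port) % len(next_hops)]

before = ["leaf01", "leaf02", "leaf04"]            # leaf03 failed
after = ["leaf01", "leaf02", "leaf03", "leaf04"]   # leaf03 restored

ports = range(30000, 40000)  # hypothetical client source ports
moved = [p for p in ports if pick(p, before) != pick(p, after)]
# Flows that changed next hop without landing on the restored leaf03.
moved_unrelated = [p for p in moved if pick(p, after) != "leaf03"]

print(f"{len(moved)} of {len(ports)} flows changed next hop")
print(f"{len(moved_unrelated)} of those moved between leaves other than leaf03")
```

In this sketch, most flows change next hop when the count goes from three to four, including many that never touch leaf03.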
As you can see below, leaf03 is in a failed state. The blue dotted flow uses leaf02 to reach server02.
As leaf03 comes back into service, the hashing function on spine02 changes, impacting the blue dotted flow:
Just as the addition of a device can impact unrelated traffic, the removal of a device can also impact unrelated traffic because the modulo of the hash function changes. You can see this below, where the blue dotted flow goes through leaf01 and the red dashed line goes through leaf04.
Now, leaf02 fails. As a result, the modulo on spine02 changes from four possible next hops to three next hops. In this example, the red dashed line rehashes to leaf03:
To solve this issue, resilient hashing can prevent traffic flows from shifting on unrelated failure scenarios. With resilient hashing enabled, the failure of leaf02 does not impact either existing flow because neither flows through leaf02:
Although resilient hashing can prevent rehashing on next hop failures, it cannot prevent rehashing on next hop addition.
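One common way to get this behavior is a fixed-size bucket table, sketched below in Python. The bucket count, the reassignment policy, and the use of SHA-256 are all assumptions made for illustration; this is not a description of the switch implementation.

```python
import hashlib

NUM_BUCKETS = 64  # hypothetical fixed table size; it never changes

def flow_bucket(src_port):
    """Hash a simplified 5-tuple to a bucket; SHA-256 is a stand-in hash."""
    key = f"udp|10.2.0.100|{src_port}|172.16.255.66|53".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % NUM_BUCKETS

def build_table(next_hops):
    """Spread the next hops evenly across the fixed set of buckets."""
    return [next_hops[i % len(next_hops)] for i in range(NUM_BUCKETS)]

def remove_next_hop(table, failed):
    """Reassign only the buckets that pointed at the failed next hop."""
    survivors = sorted(set(table) - {failed})
    new_table = list(table)
    j = 0
    for i, hop in enumerate(table):
        if hop == failed:
            new_table[i] = survivors[j % len(survivors)]
            j += 1
    return new_table

table = build_table(["leaf01", "leaf02", "leaf03", "leaf04"])
after_failure = remove_next_hop(table, "leaf02")

ports = range(30000, 40000)  # hypothetical client source ports
moved = [p for p in ports if table[flow_bucket(p)] != after_failure[flow_bucket(p)]]
moved_unrelated = [p for p in moved if table[flow_bucket(p)] != "leaf02"]
print(len(moved_unrelated))  # 0: only flows that used leaf02 were moved
```

Because the modulo is taken over the fixed bucket count rather than the number of next hops, a failure only touches the failed next hop's buckets. Adding a next hop is different: some buckets must be handed to the new next hop, so the flows in those buckets move.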
You can read more information on resilient hashing in the ECMP chapter.
Applications for Anycast
UDP-based applications, such as NTP or DNS, are great candidates for anycast architectures.
When considering applications to deploy in an anycast scenario, the first two questions to answer are:
- Does the application rely on TCP for proper sequencing of data?
- Does the application rely on more than one session as part of the application?
Applications with Multiple Connections
The network has no knowledge of any sessions or relationships between different sessions for the same application. This affects protocols that rely on more than one TCP or UDP connection to function properly - one example being FTP.
FTP data transfers require two connections: one for control and one for the file transfer. These two connections are independent, with their own TCP ports. If you deploy an FTP server in an anycast architecture, when the secondary data connection initiates, the destination of the traffic is initially the same FTP server IP address. However, the network hashes this traffic as a new, unique flow because the ports are different. This can place the new session on a different server. The new server has no history of the original request in the control session, so it accepts that data connection only if the FTP server application is capable of robust information sharing.
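A rough Python sketch of the problem follows. It assumes SHA-256 as a stand-in for the hardware hash, four hypothetical servers, a hypothetical data port of 50000, and made-up client source ports; it counts how often a client's control and data connections reach the same server.

```python
import hashlib

SERVERS = ["server01", "server02", "server03", "server04"]  # hypothetical

def pick_server(src_ip, src_port, dst_port):
    """Hash a TCP 5-tuple to a server; SHA-256 is a stand-in hash."""
    key = f"tcp|{src_ip}|{src_port}|172.16.255.66|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return SERVERS[int.from_bytes(digest[:4], "big") % len(SERVERS)]

CONTROL_PORT = 21     # FTP control connection
DATA_PORT = 50000     # hypothetical port for the data connection

clients = [f"10.2.{i // 250}.{i % 250 + 1}" for i in range(10000)]
same = 0
for n, ip in enumerate(clients):
    control = pick_server(ip, 40000 + n, CONTROL_PORT)
    data = pick_server(ip, 40001 + n, DATA_PORT)  # new source port, new flow
    if control == data:
        same += 1

print(f"{same} of {len(clients)} clients reached one server on both connections")
```

The two connections hash independently, so only a fraction of clients land on the same server for both, and that happens by chance rather than by design.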
Initiating Traffic and Receiving Traffic
Do not start an outbound TCP session over an anycast IP address; it is possible that traffic originating from an anycast IP address does not return to the same anycast server after the network hash. With inbound sessions, the network hash is the same for all packets in a flow and the inbound traffic hashes to the same anycast server.
TCP and Anycast
You can use TCP-based applications with anycast, with the following recommendations:
- TCP sessions are short lived.
- A failed session or TCP reset does not impact the application. For example, a web page refresh is acceptable.
- Application-level session management exists, which is independent of the TCP session.
- A redirection layer handles incorrectly hashed flows.
Do not use TCP applications that have longer-lived flows as anycast services. For example:
- FTP or other large file transfers.
- Transactions that you must complete and log. For example, financial transactions.
- Streaming media without application-level automated recovery.
Anycast can provide a low cost, highly scalable implementation for services. However, the limitations inherent in network-based ECMP make anycast challenging to integrate with some applications. An anycast architecture is best suited for stateless applications or applications that are able to share session state at the application layer.