As the title suggests, we are going to discuss the options available to us when hosting applications across multiple sites or even data centres.

This question is one often raised and can come in many forms, such as:

Can you support failover between data centres?

Sure, but how are they connected and what method shall we use?

Can I have the master in one site and the slave in another?

Sure, but you'll need a common IP space between sites so will need to look into stretching your network...

Do you support GSLB?

Yes, we can work with third-party GSLB solutions and can also help you provision solutions such as Route 53. But we don't offer a GSLB installed on the appliance itself.

Those are the top three; there are certainly others, but the above are the most common. So, what do we need to do? Well, there are several options but no magic-bullet, one-size-fits-all answer. So let us first look at the options and technologies at our disposal:

CDN

If you are running a public facing corporate website this is one of your best options! They offer a wealth of features such as DDoS protection, caching and geo redundancy so will not only help to provide high availability but also protect your web sites. However, it's pretty useless for any internal applications...

Mostly, these guys are using Anycast and DNS (very advanced GSLB solutions) on your behalf as part of their products so you'll often be getting the benefits I mention below for these technologies. However, I always suggest that you talk to the vendor about what you're actually going to get. For example, I don't think Akamai offer an Anycast solution directly to customers, but obviously use it as part of their services such as their DNS which will almost certainly be backed by an Anycast solution.

Check out the sites below for more information on Cloudflare CDN and Anycast:

DNS

We actually have a few options here. We could simply switch the DNS by hand and use a short TTL, but that sucks as it requires manual intervention. So then we automate it in some way, making a script to monitor our application, maybe utilising TSIG to update the entries much like those free dynamic DNS services (DynDNS for example). This sort of thing could be done from a real server or even the load balancer as part of a really cool health check! But this sucks too, just a lot less, and now it's a homebrew solution with all the issues that may come from that... not forgetting that in both scenarios you are relying on the client to perform another lookup, which it may not do straight away or even at all, instead caching the result it initially received...
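To make the homebrew idea a little more concrete, here is a minimal sketch of the monitoring half of such a script. The site list, addresses and function names are all made up for illustration, and the actual TSIG-signed update is deliberately left out; you would hand the result to whatever update tool or library you already trust.

```python
import socket


def tcp_check(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_record(sites, check=tcp_check):
    """Return the address of the first healthy site, in priority order.

    `sites` is a list of (name, address, port) tuples; all hypothetical.
    Returns None when nothing passes the check.
    """
    for _name, address, port in sites:
        if check(address, port):
            return address
    return None


def plan_update(current, desired):
    """Return the address to publish, or None if no change is needed.

    If every site is down (desired is None) we leave DNS alone rather
    than publish nothing.
    """
    if desired is None or desired == current:
        return None
    return desired
```

Run something like this on a timer and only push a DNS update when `plan_update` returns an address. Remember that clients will not see the change until their cached answer expires, which is exactly the weakness described above.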

So, we could adopt an approach of returning several A records, which in theory should allow the client to cache all answers and try them in turn until one works. However, this usually leads to longer connection times than usual as clients try the bad answer(s). It certainly works for HTTP(S) traffic, as a web browser shouldn't throw an error until it has attempted all the answers it received. At worst it might require a page refresh, but as the bad address could still be tried, mileage may vary, especially with non-web traffic.
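For non-web clients you control, the "try each answer in turn" behaviour can be written out explicitly. This is only a sketch: `connect_any` is a made-up name, and the `resolve` and `connect` parameters exist purely so the logic can be exercised without a network.

```python
import socket


def connect_any(host, port, timeout=3.0,
                resolve=socket.getaddrinfo,
                connect=socket.create_connection):
    """Try every address returned for `host` until one accepts a connection."""
    last_error = None
    for _family, _type, _proto, _canon, sockaddr in resolve(
            host, port, type=socket.SOCK_STREAM):
        try:
            # sockaddr[:2] is (address, port) for both IPv4 and IPv6 results.
            return connect(sockaddr[:2], timeout=timeout)
        except OSError as exc:
            # A dead address costs us up to `timeout` seconds before we move on.
            last_error = exc
    raise last_error or OSError("no addresses returned for %s" % host)
```

Each bad answer can cost up to the full timeout before the next one is tried, which is where the longer connection times come from.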

Don't get me wrong there are actually some pretty cool things you can do with your average DNS server, here are some cool websites:

GSLB, more advanced DNS

Well, this is one option that we've blogged about before, and it's not perfect! The key difference vs other DNS based solutions is proper health checking, and it will usually offer additional benefits/gimmicks such as routing traffic preferentially to local servers, or geo load balancing to the closest geo location or shortest round trip.
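Stripped of the marketing, the decision a GSLB makes for each query looks roughly like the sketch below. The site names, addresses and the `gslb_answers` function are invented for illustration and do not represent any particular product.

```python
def gslb_answers(pools, client_site, healthy):
    """Pick the A records to return for one query.

    pools:       dict mapping site name -> list of server addresses
    client_site: the site we believe the client is closest to
    healthy:     set of addresses currently passing health checks
    """
    # Prefer healthy servers at the client's own site.
    local = [ip for ip in pools.get(client_site, []) if ip in healthy]
    if local:
        return local
    # Otherwise fall back to any healthy server at the other sites.
    return [ip
            for site, ips in sorted(pools.items())
            if site != client_site
            for ip in ips
            if ip in healthy]
```

The health-checked `healthy` set is what separates this from plain round-robin DNS: a failed server simply stops appearing in answers.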

Like any DNS based solution, it's limited by either requiring the app to perform another DNS lookup or if you decide to serve multiple answers these could be cached by the application before they subsequently fail a health check.

However, it is automated and the best DNS based solution you can get, so if nothing else it should reduce potential support calls compared with other DNS based solutions, especially if you also return multiple answers.

You can't get much better than the offerings supplied by Dyn and AWS (Route 53) if providing a public internet facing application. However, for internal multi-site solutions, you may need to consider a simple GSLB appliance or even a homebrew solution.

Simple homebrew example:
https://www.loadbalancer.org/blog/gslb-why-global-server-load-balancers-dont-always-suck-polaris-gslb/

Route53 and DynDNS:
http://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dns-failover.html
https://dyn.com/blog/what-is-dns-load-balancing-why-is-it-important/

LAN Extension

Some customers may choose to stretch a subnet/VLAN over multiple locations, as I mentioned above. You'll need to have a VERY good link between sites if you intend to run a pair of load balancers in HA across it. You'll also need some high end switch or router equipment, such as that offered by Cisco and other top end providers, unless you have access to dark fibre or some kind of L2 WAN for a physical link. This option allows you to have a master in one location with a slave in another, failing over the VIP address between sites using our built in HA, assuming link speed and reliability allow. Once you have the same subnet available on both sides it's simple to set up, as you just pair the appliances as normal but with one at each location.

To achieve this, you can use various solutions, from a long cable between two buildings to virtual network overlay solutions such as Cisco's OTV technology, which routes L2 networks over L3 connectivity. Open source solutions exist as well, such as Open vSwitch.

Cisco OTV on the Nexus 7000 Series
https://www.cisco.com/c/en/us/products/collateral/switches/nexus-7000-series-switches/solution_overview_c22-574939.html

Connecting sites at L2 with Open vSwitch
https://blog.remibergsma.com/2015/03/26/connecting-two-open-vswitches-to-create-a-l2-connection/

Anycast / BGP

Anycast is basically having the same IP address reachable on multiple nodes: different routers will advertise different paths that terminate in different locations. You can also perform health checks as with GSLB, but it is a far more complex solution to manage yourself, requiring an understanding of BGP to achieve.

It's potentially the ultimate solution, certainly better than any GSLB as it offers a much faster switchover on failure, hopefully invisible to users if the application isn't stateful. Many CDNs are already using this technology to help provide their services, as I mentioned above, which is one of the main reasons they're actually so good. It was originally designed to make DNS servers highly available, but people are starting to employ it more and more for websites and other TCP traffic.
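The health check side of Anycast usually boils down to "announce the service prefix while healthy, withdraw it when not". A minimal sketch of that decision follows. The command strings are modelled on the text commands an ExaBGP-style helper process writes to standard output, as I understand them; treat the exact syntax as an assumption and check it against your BGP daemon. The function name and prefix are hypothetical.

```python
def bgp_command(prefix, was_healthy, is_healthy):
    """Return the command to emit on a health state change, else None."""
    if is_healthy and not was_healthy:
        # Service came up: start attracting traffic to this site.
        return "announce route %s next-hop self" % prefix
    if was_healthy and not is_healthy:
        # Service went down: let routing send clients elsewhere.
        return "withdraw route %s next-hop self" % prefix
    # No change in state, nothing to tell the BGP speaker.
    return None
```

Only emitting a command on a state change avoids flooding the BGP speaker with redundant updates on every check interval.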

The best explanation I could find online:
http://www.slashroot.in/what-anycast-and-how-it-works

Nice implementation example:
https://vincent.bernat.im/en/blog/2013-exabgp-highavailability

Cloudflare's take:
https://blog.cloudflare.com/cloudflares-architecture-eliminating-single-p/

Great aggressive slideshow on Anycast and TCP:
https://www.nanog.org/meetings/nanog37/presentations/matt.levine.pdf

Additional load balancer in the cloud

One pretty cool solution for public facing applications is using a load balancer hosted in a big resilient cloud like AWS. There are costs involved in terms of hosting and data usage, but otherwise it offers a near perfect solution for most scenarios, as you can generally rely on Amazon's resilience and/or use two load balancers in separate availability zones to maximise HA. Or go nuts! Use two separate load balancers in two separate regions (or even cloud providers!) and combine GSLB in the form of Route 53 in front...

Great! But which option is really the best? That depends - so let's focus on the following three scenarios:

Example 1 - Layer 2 linked offices with clients in both locations and HA across sites.
This example can be achieved by providing a Layer 2 link, so extending the LAN across sites. This is the only option if you want to establish an active/passive pair between sites, whether the sites are linked with cables or a more complex solution. You'll need to make sure that you have the same subnets at Layer 2 available on both sides; failover is performed by a GARP (Gratuitous ARP), so MAC address resolution needs to work across sites.
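To see why the Layer 2 requirement matters, it helps to look at what a GARP actually is: a broadcast Ethernet frame in which the sender and target IP are both the VIP. The sketch below builds one such frame in its ARP request form; the MAC, the VIP and the `build_garp` name are made up, and sending it (which needs a raw socket and root privileges) is left out.

```python
import socket
import struct


def build_garp(mac, ip):
    """Build a gratuitous ARP request frame announcing `ip` at `mac`."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)

    # Ethernet header: broadcast destination, our source MAC, EtherType ARP.
    ethernet = b"\xff" * 6 + mac_bytes + struct.pack("!H", 0x0806)

    # ARP header: Ethernet hardware, IPv4 protocol, 6 and 4 byte address
    # lengths, operation 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender MAC and IP, zeroed target MAC, target IP equal to the sender IP
    # (that equality is what makes the ARP "gratuitous").
    arp += mac_bytes + ip_bytes + b"\x00" * 6 + ip_bytes

    return ethernet + arp
```

Because the frame is a Layer 2 broadcast, it only reaches switches and hosts in the same broadcast domain, which is why the subnet has to be stretched for the remote site to learn the VIP's new location.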

Example 2 - Multiple sites, Layer 3 linked offices with fallback to other site.
Here we can use a pair of load balancers at each site. We then utilise the fallback server to point each site to the other, so that if all real servers fail health checks on one side, the load balancer will use the other side's VIP as a real server until one of the local servers returns.
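The fallback behaviour itself is simple enough to sketch in a few lines. The addresses and the `pick_backends` function are purely illustrative and are not how the appliance is implemented internally.

```python
def pick_backends(real_servers, healthy, fallback_vip):
    """Return the servers a VIP should send traffic to right now.

    real_servers: local real server addresses for this site
    healthy:      set of addresses currently passing health checks
    fallback_vip: the other site's VIP, used only when nothing local is up
    """
    up = [server for server in real_servers if server in healthy]
    # Local servers win whenever at least one of them is healthy.
    return up if up else [fallback_vip]
```

Note that each site points at the other site's VIP rather than at its real servers directly, so the remote load balancer still makes its own health-checked decisions.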

But this has the drawback of needing one of the pair of load balancers to still be available during such an outage to serve the fallback server... so it doesn't necessarily work during a complete server room failure unless you also host the servers in multiple rooms, or at the very least, I'd suggest, separate racks. However, most commonly a total server room failure means the site has also lost connectivity, so this is usually enough for most. If, however, you do have the requirement to survive a local server room failure entirely, you could also use a DNS solution in tandem so clients could then still look up DNS on the other side and get the remote answer.

Example 3 - Multiple data centres with external clients.
This would benefit from a cloud based load balancer or an Anycast solution if you're really looking for maximum availability and the fastest failover times. The cloud based load balancer offers advanced persistence types and potentially much better health checks.

If considering Anycast then the easy option would be via a managed provider like Cloudflare, although the braver amongst us who can tackle their own BGP routing might consider managing it themselves. For those with far simpler availability and failover requirements, a robust externally hosted GSLB solution like Route 53 may suffice.

As I said at the beginning, there is no magic bullet... You'll need to choose which of these options will work best for you and which compromises you are willing to make.

I mean, you could decide that a manual DNS change is acceptable for your less critical applications; many people do! Or do you already have the equipment and expertise internally needed to tackle, say, a LAN extension or BGP solution yourselves? It's a tough choice (!), so it will certainly depend on both the application and the organisation itself.

If making such architectural decisions, I always suggest talking to the vendors involved, both the load balancer provider and the providers of your other network infrastructure, to see what options they can help you with.

Any questions though, feel free to leave me a message below!