Yeah right :-) Maybe after we sort out problems in our own back yard...
Our web server crashed again the other day (it last happened about 2 years ago). I was on holiday at the time and got an automated message saying "www.loadbalancer.org is toast!". I thought OK, that's annoying but not the end of the world. But it was a Sunday afternoon, and about an hour later I got a message from one of our support guys saying that they could not get through to the 24*7 support engineers to look into the server failure. That's when I remembered that last time this happened I thought about setting up a mirror dedicated server to save downtime in the event of a rebuild being required... oops, didn't do that, did I?
Anyway we didn't get the web site back up until 11am on Monday morning (how sloppy is that for a load balancer vendor?). While the site was down one of the support engineers ordered a new dedicated server from another hosting provider and almost had the new one ready by the time the original was back up.
So to cut a long story short, we now have two dedicated servers, and the data on the master is replicated to the slave with rsync. We toyed with the idea of having the servers in a DNS round robin configuration (i.e. load balanced), but then we thought: why not just replicate once a day and, if we have a hardware failure, manually change the DNS? Why not full DNS round robin? And for that matter, why not a full cluster with some Loadbalancer.org appliances in front? Um, how about:
- Increased downtime
Hey, hang on, "Increased downtime"? Surely a high availability load balanced system would increase my availability, not decrease it?
Well, see points 3 & 4: the complexity of maintaining the cluster can easily make your actual availability less than that of a much cheaper single server. Remember that our server hadn't been down in 2 years until now; technically that's already better than 99.999% availability (by luck, I know).
Andrew Hiles has a much better description of all this here: Five nines: chasing the dream?
So am I saying Loadbalancer.org appliances don't provide 99.999% availability? No, I'm saying that they probably won't, but definitely can :-).
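To put some numbers on those nines, here is a rough back-of-the-envelope sketch in Python. The function names are made up for illustration, and the 20 hour outage is only an assumed figure (Sunday afternoon to 11am Monday is somewhere in that region, I didn't time it):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime allowed in the period at the given availability."""
    return period_minutes * (1 - availability_percent / 100)


def measured_availability_percent(downtime_minutes, period_minutes):
    """Availability actually achieved, given total downtime in the period."""
    return 100 * (1 - downtime_minutes / period_minutes)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% allows about {budget:.1f} minutes of downtime per year")

    # ASSUMPTION: one outage of roughly 20 hours in a two-year window.
    assumed_outage_minutes = 20 * 60
    achieved = measured_availability_percent(
        assumed_outage_minutes, 2 * MINUTES_PER_YEAR
    )
    print(f"One ~20 hour outage in two years is about {achieved:.3f}% availability")
```

The point is that five nines leaves you only a few minutes a year, so one bad Sunday blows the budget for a very long time.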
Some vendors make a fair bit of hype about their products enabling 99.999% availability out of the box (Kemp for example):
"Say a hosting provider advertises 99.999 percent network availability. Good, the customer needs that. However, network is half the requirement of the customer. The customer needs 99.999 availability of the application that generates his revenue stream. To get the application's availability the hosting provider's customer must purchase his own load balancer or purchase the high availability service - through to the server and application - from the hosting provider. KEMP's load balancers are priced so that the hosting provider can pay less for the value add tool." - Kevin Mahon, Kemp Technologies.
OK Kevin, your load balancers are cheap but you can't just buy 99.999% availability off the shelf, you need to work at it, document it, build it, maintain it, test it, test it again...you get what I mean.
99.999% costs money, lots of money... and does the customer really need it?
But when it comes to marketing load balancing hardware "99.999% uptime guaranteed" does sound better than my alternative: "Loadbalancer.org appliances probably make 99.9% uptime pretty easy and 99.999% uptime theoretically possible?".
OK I won't give up the day job.
Update: I've decided to extend this blog entry with links to various downtime stories to serve as reminders of how not to do it:
I was browsing Spiceworks the other day and came across an interesting post about a Cisco CSS induced Spiceworks outage. This is a classic example of 'health check hell': if your health check is too fast, too strict, etc., then the likely result is that your whole cluster will go down...
Obviously you need to make sure that your health check strikes a balance between the time taken to recognise a failure and false positives, i.e. the server was just a bit slow...
However, one of our recommended options is to use a single fallback server with no health check, i.e. it's always up!
Or, even better, use a pool of fallback servers with far less strict health checking (just in case).
As always you need to make your own decision on this kind of thing and think about what will happen in each failure case (in advance!)
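To make the trade-off concrete, here is a minimal Python sketch of the idea: a server is only marked down after several consecutive failed checks, and if every primary server is marked down, traffic goes to a fallback pool that is never health checked. The class, the function and the `fall`/`rise` thresholds are hypothetical names for illustration (similar in spirit to the rise/fall counters HAProxy uses), not how any particular appliance implements it:

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    """Tracks one server; the state flips only after a run of contrary results."""
    name: str
    fall: int = 3    # consecutive failed checks before marking the server down
    rise: int = 2    # consecutive passed checks before marking it up again
    up: bool = True
    streak: int = 0  # consecutive results that disagree with the current state

    def record(self, check_passed: bool) -> None:
        if check_passed == self.up:
            # Result agrees with the current state: forget any partial streak.
            self.streak = 0
            return
        self.streak += 1
        threshold = self.rise if check_passed else self.fall
        if self.streak >= threshold:
            self.up = check_passed
            self.streak = 0


def pick_pool(primary, fallback_names):
    """Return healthy primary servers, or the unchecked fallback pool if none."""
    healthy = [server.name for server in primary if server.up]
    return healthy if healthy else list(fallback_names)


if __name__ == "__main__":
    web1, web2 = ServerState("web1"), ServerState("web2")
    web1.record(False)  # one slow response: not enough to mark web1 down
    print(pick_pool([web1, web2], ["sorry-server"]))
```

A higher `fall` value means fewer false positives but a slower reaction to a real failure, which is exactly the balance you have to pick for yourself.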
I came across an interesting post on RightScale about the Amazon ELB/Netflix outage. Interestingly, it recommends a loose architecture with something like HAProxy (or Loadbalancer.org, obviously!) as the solution. However, I'm not sure that I agree; on one level I'm inclined to trust large providers like Amazon to get it right 99.9999% of the time...
But yet again you need to make your own decisions in advance about your response to disaster. Start at the DNS level and work your way down to the server level. One of the nice things about AWS is that it kind of forces you to make some high availability decisions right at the start of the process. We have a lot of customers starting to use our AWS Load Balancer in multiple regions with ELB/Route 53 in front and the Loadbalancer.org appliance handling failover between availability zones (within the regions). Again these customers need to go to a lot of effort to ensure all this complexity doesn't come back to bite them in the arse.