Loadbalancer.org has always been about high availability; that is the fundamental reason for our products' existence. Performance has always been a nice side effect, while maintainability of your application cluster is generally a key subset of the primary high-availability objective.
However, it's time for a confession: up until v7.5, the default settings for Loadbalancer.org appliances in a cluster configuration were chosen for ease of use and the certainty of a valid configuration. The default recommendation when setting up the high-availability layer of the cluster (Heartbeat), or recovering it after a disaster, has been to force a full sync and therefore inflict a small amount of downtime in a maintenance window.
Whilst we've always had documentation showing how to handle cluster maintenance and configuration with zero downtime, it was definitely time for a change.
So in theory this was a simple change to our default configuration, and yet, as always, the development team found a few thorny little issues that needed resolving. Previously the default heartbeat configuration used autofailback=on. This was handy in that you always knew the master node would be active if it was healthy. However, in a disaster recovery scenario or during a cluster change it becomes problematic, as any failback to the master during the process causes downtime. On the positive side, the old method does have the advantage that it quickly shows you if you have a heartbeat configuration problem, i.e. the failback to master doesn't work.
So we changed the default to autofailback=off, and we also made sure that when you do a full restore on a node using the XML backup, heartbeat stays stopped until you are ready to join the cluster.
Once the XML file is restored you can double-check all the settings, make any required changes, possibly relocate the unit physically or logically, and then choose to restart the heartbeat.
In this case we have restored a master node from scratch, and when the heartbeat is restarted the new master node will stay passive but join the cluster in a clean fashion. The slave node will keep handling the network traffic to ensure that there is no downtime incurred.
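To make the difference between the two settings concrete, here is a toy model of the decision. This is a sketch of our own for illustration, not Heartbeat's actual code; the function name `next_active` and its parameters are hypothetical.

```python
def next_active(active, master_up, slave_up, auto_failback):
    """Toy model: which node should hold the traffic after a status change.

    active        -- node currently handling traffic ("master", "slave" or None)
    master_up     -- True if the master node is healthy
    slave_up      -- True if the slave node is healthy
    auto_failback -- True models autofailback=on, False models autofailback=off
    """
    if not master_up and not slave_up:
        return None          # nothing left to fail over to
    if not master_up:
        return "slave"       # only the slave can serve
    if not slave_up:
        return "master"      # only the master can serve
    # Both nodes are healthy from here on.
    if auto_failback or active is None:
        return "master"      # the master always takes the resources back
    return active            # whoever is serving keeps serving


# A restored master rejoins while the slave is handling traffic:
print(next_active("slave", True, True, auto_failback=True))   # master
print(next_active("slave", True, True, auto_failback=False))  # slave
```

With the old default, the first call models the extra failback (and the downtime that goes with it); with the new default, the second call models the restored master joining passively.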
All seems pretty simple, so why didn't we do this before? Well, because there were a couple of little gotchas that needed to be dealt with by new code in the background. One of them was the built-in replay attack protection in Linux-HA:
heartbeat: 2008/07/02_15:27:44 ERROR: should_drop_message: attempted replay attack [lbmaster]? [gen = 18, curgen = 1207732349]
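If you want to spot this error in your own logs, the short sketch below pulls the node name and the two generation numbers out of the line above. The drop rule it applies (a message is rejected when its generation is lower than the one already remembered for that node) is our simplified reading of the check, and the helper names are hypothetical.

```python
import re

LOG_LINE = ("heartbeat: 2008/07/02_15:27:44 ERROR: should_drop_message: "
            "attempted replay attack [lbmaster]? [gen = 18, curgen = 1207732349]")

# Matches the tail of the error message and captures its three fields.
PATTERN = re.compile(
    r"attempted replay attack \[(?P<node>[^\]]+)\]\? "
    r"\[gen = (?P<gen>\d+), curgen = (?P<curgen>\d+)\]"
)


def parse_replay_error(line):
    """Return (node, gen, curgen) from a replay-attack log line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group("node"), int(match.group("gen")), int(match.group("curgen"))


def looks_like_replay(gen, curgen):
    """Simplified assumption: an older generation than expected is rejected."""
    return gen < curgen


node, gen, curgen = parse_replay_error(LOG_LINE)
print(node, gen, curgen, looks_like_replay(gen, curgen))
# prints: lbmaster 18 1207732349 True
```

A freshly restored node starts again with a low generation number, while its peer still remembers the much higher one, which is why the restored node's messages get rejected.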
We had a manual workaround for this before, but now the software will transparently deal with this issue when restoring from XML (easier said than done, as curgen is stored within the process).
The other little gotcha only affected customers using network-based heartbeat in combination with ping nodes. We fixed this part of heartbeat ages ago so that it would bring both nodes live in the case of a network failure (split-brain) or a ping node failure. However, when the network connectivity was restored, who would go live? This has now been sorted in the case of an XML restore to give the desired behaviour of no downtime.
BTW: The split-brain heartbeat problem is why we still ship a serial heartbeat cable with all of our hardware appliances.
On the subject of no downtime, yet another new feature seems to be causing some uncertainty with our users. Whenever you make a change to a layer 7 HAProxy configuration you now get the following prompt:
Just to re-assure you, the HAProxy restart (Ed. Why don't we rename it to reload then?) is seamless and does not cause downtime! HAProxy allows all existing connections to stay bound to the existing process and only hands new connections to the new process and configuration. The reason we stopped automatically restarting the layer 7 engine after every change is that it can cause confusion, with some users connected to the new configuration and some still connected to the old one. With the new method you can make all the changes you need, i.e. multiple changes, before doing the smooth restart/activation of the new settings. Also, if you have a lot of connections with long timeouts, e.g. Terminal Services, you can choose to do a full restart of HAProxy rather than the smooth reload, forcing all the connections to re-establish on the new configuration. The future progression of this feature is to allow full commit/rollback style functionality in the configuration interface.
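For the curious, the two restart styles map onto two standard HAProxy command-line options: -sf asks the old processes to finish serving their existing connections and then exit, while -st terminates them immediately. The sketch below only builds the command line. The helper name and the file paths are assumptions for illustration, and this is not necessarily how the appliance invokes HAProxy internally.

```python
def haproxy_restart_command(config_path, pid_path, old_pids, graceful=True):
    """Build an haproxy command line that replaces the running processes.

    graceful=True  -> "-sf": old processes finish existing connections first
    graceful=False -> "-st": old processes are terminated straight away
    """
    cmd = ["haproxy", "-f", config_path, "-p", pid_path]
    if old_pids:
        cmd.append("-sf" if graceful else "-st")
        cmd.extend(str(pid) for pid in old_pids)
    return cmd


# Hypothetical paths and PID, purely for illustration.
CONFIG = "/etc/haproxy/haproxy.cfg"
PIDFILE = "/var/run/haproxy.pid"

print(" ".join(haproxy_restart_command(CONFIG, PIDFILE, [1234])))
# prints: haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf 1234
print(" ".join(haproxy_restart_command(CONFIG, PIDFILE, [1234], graceful=False)))
# prints: haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -st 1234
```

The first form corresponds to the smooth reload, the second to the full restart you might choose for long-lived Terminal Services sessions.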
One last point to note is that this reload functionality takes effect on both nodes in the cluster. This ensures that if you have full layer 7 session table replication activated in your configuration, the passive node stays up to date with all of the changes.
As always, high availability is only possible if you test, document and re-test your disaster recovery procedures. We hope these changes to our product make your life easier, and the testing and validation procedure much quicker.