On 4 October 2021, the social media giant Facebook experienced a global outage affecting not only Facebook itself but also its other products, including Instagram and WhatsApp. The outage lasted around six hours.
This of course affected many people all across the world. But now that the issue has been resolved and Facebook has released more details on the cause of the outage, is there anything other companies can take away from it?
Well, the seemingly obvious answer would be some form of high availability (HA) or redundancy, but it's not quite that simple (although that is definitely a good start if you don't already have any!).
While Facebook (and its other platforms) already uses a wide range of HA and redundancy in its design, the issue was caused by incorrect BGP (Border Gateway Protocol) rules being applied, which essentially caused Facebook to "disappear" from the internet.
In this case, any HA solution in place would simply replicate the problem, so it's not going to stop it. Think of a highly available database clustered over several servers: if the underlying data is damaged or corrupted, that damage or corruption is replicated across every server in the cluster. HA alone is of no use here, because redundancy or high availability is not the same as a backup.
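To make that point concrete, here is a toy sketch (not a real database, just a dictionary per "node") of why replication faithfully copies a bad change to every replica, while only a point-in-time backup preserves the data as it was before the mistake:

```python
# Toy illustration: replication applies every write - good or bad -
# to all nodes, so a mistake propagates cluster-wide instantly.

class Cluster:
    def __init__(self, nodes=3):
        self.replicas = [{} for _ in range(nodes)]

    def write(self, key, value):
        # Replication faithfully applies the write everywhere,
        # including a corrupt or mistaken one.
        for replica in self.replicas:
            replica[key] = value

cluster = Cluster()
cluster.write("routes", "correct config")

backup = dict(cluster.replicas[0])   # point-in-time backup taken here

cluster.write("routes", "corrupt config")  # the mistake replicates

# Every replica now holds the bad data...
assert all(r["routes"] == "corrupt config" for r in cluster.replicas)
# ...and only the backup still holds the good data.
assert backup["routes"] == "correct config"
```

The same logic applies whether the "bad write" is corrupted data or, as in Facebook's case, a faulty configuration change: the redundancy machinery does exactly what it is designed to do, which is why you also need a way back to a known-good state.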
As we all know, mistakes happen, and that, to some extent, is to be expected. I have often found that how we recover from a mistake is more important than its cause.
Communication with your customers during any incident is vital. Leaving customers in the dark will only anger them further.
One of the more interesting points from their follow-up was that the issue also impacted a number of their internal systems, slowing down the response to the incident and delaying physical access to the data centers.
This is a great example of why it's sometimes best to keep your systems logically separated. While I can understand why a company the size of Facebook would want a central way of managing access to its data centers, this certainly seems like a potential point of failure. It's like keeping the spare key to your car in the glove box - if you lock yourself out of the car, that key is of no use to you.
Additionally, this highlights the need for good disaster recovery planning. Having Disaster Recovery (DR) plans in place is one thing, but it's just as important to make sure they are accessible at all times (even if you lose access to your internal systems), kept up to date, and regularly tested.
Also important is ensuring that your teams know what to do when those DR plans need to be enacted! There is a reason why, in the aviation industry, pilots are taught not just how to fly and land an aircraft, but also how to handle a variety of incidents (from an on-board fire to landing in the sea) - in the same way most workplaces carry out regular fire drills. In both cases, if that situation ever occurs, you know what to do and what to expect; there is some familiarity to it.
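Keeping plans "regularly tested" is easy to automate a reminder for. The sketch below is purely illustrative - the `stale_plans` helper, plan names, and 90-day threshold are all assumptions, not any particular tool - but it shows the idea of flagging DR plans whose last test is too old:

```python
from datetime import date, timedelta

# Hypothetical sketch: flag DR plans that haven't been tested recently.
# The 90-day threshold is an assumption - choose a cadence that fits
# your own risk appetite.
STALE_AFTER = timedelta(days=90)

def stale_plans(plans, today):
    """Return the names of plans whose last test is older than STALE_AFTER."""
    return [name for name, last_tested in plans.items()
            if today - last_tested > STALE_AFTER]

plans = {
    "datacenter-access": date(2020, 3, 1),   # written early in the pandemic
    "dns-failover": date(2021, 9, 1),
}

print(stale_plans(plans, today=date(2021, 10, 4)))  # ['datacenter-access']
```

A check like this belongs somewhere that survives an outage of your internal systems - which is rather the point of the exercise.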
On a separate note, this incident also gave other services some free real-world load testing on their applications, with the likes of Reddit and Twitter picking up far more traffic than normal. So if you run a web application and one of your competitors has an issue, can your current solution scale to handle the additional volume of traffic to your site?
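Even before running a proper load test, a back-of-the-envelope headroom check answers the first-order version of that question. The function and numbers below are made up for illustration:

```python
# Rough sketch: if a competitor outage sends you an unexpected multiple
# of your normal traffic, do you have the capacity to absorb it?
# All figures here are hypothetical.

def can_absorb(normal_rps, spike_multiplier, capacity_rps):
    """True if peak capacity covers normal traffic times the spike."""
    return normal_rps * spike_multiplier <= capacity_rps

# A 2x spike fits within capacity; a 4x spike does not.
assert can_absorb(normal_rps=500, spike_multiplier=2, capacity_rps=1200)
assert not can_absorb(normal_rps=500, spike_multiplier=4, capacity_rps=1200)
```

If the answer is "no" at a plausible multiplier, that's your cue to look at autoscaling, caching, or graceful degradation before a real spike arrives.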
As part of the Covid-19 safety measures that were introduced, a large number of employees were working from home rather than in the data center. This, combined with the problems physically accessing the data centers, delayed resolution even further. Back in March or April 2020 it would have been reasonable to forgive a company for not immediately adapting its DR plans to the ever-changing situation at the time. But 18 months on, with working from home the "new norm", it's a good example of why all plans, policies, and procedures need regular reviews to ensure they are up to date and still effective.
One final area to note is change control. Most medium to large companies will have some form of change control in place, designed to manage changes to scope or any other part of the baseline plan. While change control is far from perfect, it can certainly help to catch potential issues like this one, whether through peer review or other appropriate change-planning measures.
Good change control measures will typically include a change proposal (what you want to change and why), at least one stage of review (typically by someone who wasn't originally involved but has the same technical knowledge of the subject matter), detailed instructions for making the change, and - most importantly in this case - a rollback plan (i.e. a way to revert the change if it does not go as expected).
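Those elements can be sketched as a minimal change record. The class, field names, and approval rule below are illustrative assumptions, not any particular ticketing system's schema - the point is simply that a change shouldn't proceed without an independent reviewer and a non-empty rollback plan:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a minimal change record, mirroring the elements
# described above: proposal, review, instructions, and a rollback plan.

@dataclass
class ChangeRequest:
    proposal: str                          # what you want to change, and why
    steps: list                            # detailed instructions for the change
    rollback_plan: list                    # how to revert if it goes wrong
    reviewers: list = field(default_factory=list)

    def approved(self):
        # Require at least one independent reviewer and a non-empty
        # rollback plan before the change can proceed.
        return bool(self.reviewers) and bool(self.rollback_plan)

change = ChangeRequest(
    proposal="Update backbone routing policy",
    steps=["Apply new policy to one site", "Verify reachability", "Roll out"],
    rollback_plan=["Re-apply previous policy from version control"],
)

assert not change.approved()               # no reviewer yet - blocked

change.reviewers.append("independent-network-engineer")
assert change.approved()                   # reviewed, with a rollback plan
```

Encoding the rule in tooling, rather than relying on people remembering it, is exactly the kind of safeguard that helps when someone is making a routine change on a Monday afternoon.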
With all that said, it's very easy to sit here and write this with the benefit of hindsight, knowing the outage had little to no impact on us. Millions of people were affected (like a few others, I got a panicked message from a family member that "the internet was DOWN!" - thankfully it was just Facebook), but I feel it's an excellent opportunity for all of us to look at what we currently do, learn lessons from the Facebook outage, and see if there's an opportunity to improve our own plans and procedures.