Facebook and its other platforms, including Instagram, WhatsApp and Messenger, went down globally for close to six hours on Monday or Tuesday, depending on your time zone. As services are being restored, questions are being asked about what caused the outage, and why it took so long to fix.
Why did Facebook go down?
Just before 5 pm UTC, people began noticing they could not access Facebook, Instagram, WhatsApp or Messenger. It was more than five hours before services began to be restored.
Facebook issued a statement on Tuesday confirming that the outage was caused by a configuration change to the backbone routers that coordinate network traffic between the company’s data centres. The change had a cascading effect, bringing all Facebook services to a halt.
It meant not only was Facebook gone, but everything Facebook runs disappeared too.
Cloudflare – which had its own recent internet outage issues – has provided a detailed explanation about what happened.
It involves two systems that underpin how the internet works: the Domain Name System (DNS) and the Border Gateway Protocol (BGP).
The internet is a network of connected networks. A lot of them. To keep order, you need something like BGP to tell traffic where it needs to go. DNS is essentially the address book that translates each website’s name into its location, its IP address, while BGP is the roadmap that finds the most efficient way to get to that IP address.
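As a rough illustration of that division of labour, here is a toy sketch in Python. The website name, the IP addresses (taken from ranges reserved for documentation) and the network labels are all invented for the example, and real DNS and BGP are far more involved than a pair of lookup tables.

```python
import ipaddress

# Toy "DNS": maps a website name to an IP address (hypothetical values).
DNS_RECORDS = {"example-social.test": "203.0.113.10"}

# Toy "BGP" table: maps an advertised address block to the chain of
# networks traffic passes through to reach it (hypothetical labels).
ROUTES = {
    ipaddress.ip_network("203.0.113.0/24"): ["your-isp", "transit", "social-network"],
    ipaddress.ip_network("198.51.100.0/24"): ["your-isp", "other-network"],
}


def resolve(name):
    """Step 1 (DNS): look up the IP address for a name, or None if unknown."""
    return DNS_RECORDS.get(name)


def find_path(ip):
    """Step 2 (BGP): return the path for the most specific block containing ip."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in ROUTES if addr in net]
    if not matches:
        return None  # no advertised route covers this address
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]


ip = resolve("example-social.test")
print(ip, find_path(ip))
```

The point of the sketch is that both steps have to succeed: a name has to resolve to an address, and some network has to be advertising a route that covers that address.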
Cloudflare said that on Monday Facebook, through a series of BGP updates, essentially told the rest of the internet that the paths to Facebook no longer existed. That applied not just to Facebook, but to everything Facebook runs. It meant people trying to reach Facebook couldn’t find a path to access it.
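A minimal sketch of what withdrawing those paths means, again in Python. The `RoutingTable` class and the network labels are hypothetical and are not a description of Facebook’s or Cloudflare’s actual systems; the sketch only assumes that a destination with no advertised path cannot be reached.

```python
class RoutingTable:
    """A toy stand-in for the routes that networks advertise to each other."""

    def __init__(self):
        self.routes = {}  # destination label -> list of networks on the path

    def announce(self, destination, path):
        """Advertise a path to a destination."""
        self.routes[destination] = path

    def withdraw(self, destination):
        """Tell everyone the path to a destination no longer exists."""
        self.routes.pop(destination, None)

    def path_to(self, destination):
        """Return the known path, or None if nobody advertises one."""
        return self.routes.get(destination)


SERVICES = ["facebook", "instagram", "whatsapp", "messenger"]

table = RoutingTable()
for service in SERVICES:
    # Every service sits behind the same (hypothetical) network.
    table.announce(service, ["your-isp", "transit", "facebook-network"])

before = {s: table.path_to(s) is not None for s in SERVICES}

# The faulty update: every path into that network is withdrawn at once.
for service in SERVICES:
    table.withdraw(service)

after = {s: table.path_to(s) is not None for s in SERVICES}

print("reachable before:", before)
print("reachable after: ", after)
```

In this model the servers themselves are never touched. They simply stop being findable, which is why every service behind the same network vanished together.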
Why were Instagram, Messenger and WhatsApp down?
All of Facebook’s services were affected, not just Facebook. That included Facebook’s own internal systems, with reports that staff were locked out of offices and could not access their own internal communications platform.
Why did it take so long to fix?
Facebook’s own internal systems are run from the same place as its public services, so it was hard for employees to diagnose and resolve the problem.
As the Guardian’s UK technology editor, Alex Hern, put it on Twitter, “Facebook runs EVERYTHING through Facebook”, so the usual way you would fix a problem like this was also not working.
Facebook staff were reportedly unable to access their own communications platform, Workplace, and could not get into their offices because the security pass system was caught up in the outage.
Facebook indicated the duration and severity of the outage meant the systems were being brought back to full capacity slowly.
How did they eventually fix it?
Facebook has so far not gone into much detail about what went wrong and how it was fixed, but there were multiple reports that the social media giant sent a technical team to California to manually reset the servers where the problem originated.