Facebook published a blog post detailing Monday's service outage. According to Santosh Janardhan, the company's vice-president of infrastructure, the outage began with routine maintenance: a command was issued to assess the availability of the backbone network connecting all of Facebook's computing facilities, and it unintentionally took those connections down. Janardhan says the company's systems are designed to audit commands like this one and catch such mistakes, but a bug in the internal audit tool failed to stop the command from being executed.
This triggered a secondary issue that turned the outage into a global incident. When Facebook's DNS servers could no longer reach the company's data centers, they stopped advertising the Border Gateway Protocol (BGP) routing information that the rest of the internet needs to find Facebook's network.
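The behaviour Janardhan describes is deliberate: a DNS node that cannot reach any data center withdraws its own routes rather than keep answering with stale information. In rough terms, that health-check pattern looks like the following Python sketch; the endpoint names and the withdraw_bgp_routes() helper are illustrative placeholders, not Facebook's actual tooling.

```python
import socket

# Illustrative internal endpoints; not Facebook's real infrastructure.
DATA_CENTER_ENDPOINTS = [("dc1.internal.example", 443), ("dc2.internal.example", 443)]

def data_center_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the data center endpoint succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def withdraw_bgp_routes() -> None:
    # Placeholder for the route-withdrawal step on a real DNS node.
    print("withdrawing BGP prefixes for this DNS node")

def health_check() -> None:
    # If no data center can be reached, the DNS node assumes its answers are
    # unusable and pulls its BGP advertisements, removing itself from the
    # internet's routing tables.
    if not any(data_center_reachable(h, p) for h, p in DATA_CENTER_ENDPOINTS):
        withdraw_bgp_routes()

health_check()
```

With the entire backbone down, every DNS node reached the same conclusion at once, which is why the withdrawal was total rather than partial.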
Janardhan stated that the end result was that Facebook's DNS servers became unreachable even though they were still operational, making it impossible for the rest of the internet to locate the company's servers.
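From the outside, the effect was simply that Facebook's domain names stopped resolving. The short Python illustration below, using the standard library's resolver interface, shows roughly what clients would have seen while the authoritative DNS servers were unroutable.

```python
import socket

def resolve(hostname: str) -> list[str]:
    """Resolve a hostname to its IP addresses, or report the failure."""
    try:
        results = socket.getaddrinfo(hostname, None)
        return sorted({entry[4][0] for entry in results})
    except socket.gaierror as exc:
        # During the outage, lookups for facebook.com failed roughly like this.
        return [f"resolution failed: {exc}"]

print(resolve("facebook.com"))
```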
The outage also left Facebook's engineers unable to connect to the affected servers remotely, which made the situation even more difficult: the loss of DNS broke many of the internal tools they would normally use to diagnose and resolve network issues. The company was forced to send employees to its data centers in person, a task complicated by the physical safeguards protecting those facilities.
Janardhan says the data centers are difficult to get into, and that the routers and hardware inside can be hard to modify even with physical access. Once the backbone network was restored, Facebook decided not to turn everything back on at once, because the sudden surge in traffic and power usage could have triggered a new round of crashes.
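The staged restart Janardhan describes is a common recovery pattern: bring services back in small waves and check that load has stabilized before enabling the next group. The sketch below is a generic illustration of that idea; the service names and the load check are invented for the example and do not reflect Facebook's actual restoration plan.

```python
import time

# Invented service groups, for illustration only.
RESTORATION_WAVES = [
    ["internal-dns", "backbone-monitoring"],
    ["core-api", "messaging"],
    ["news-feed", "media-serving"],
]

def load_is_safe() -> bool:
    """Placeholder for real traffic and power checks between waves."""
    return True

def restore_in_waves(pause_seconds: float = 300) -> None:
    for wave in RESTORATION_WAVES:
        for service in wave:
            print(f"re-enabling {service}")
        # Wait for traffic and power draw to stabilize before the next wave.
        time.sleep(pause_seconds)
        if not load_is_safe():
            print("load unsafe; pausing restoration")
            break

restore_in_waves(pause_seconds=1)
```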
Janardhan said that every failure is an opportunity to learn and improve, and that the company reviews every issue, large or small, to find ways to make its systems more resilient. That review process, he added, is already under way.