Facebook went offline Monday, taking Instagram, WhatsApp, and a few other websites with it. Many people have claimed that a problem with BGP (Border Gateway Protocol) brought Facebook down, citing traffic analysis and sources inside Facebook. This raises the question:
What is BGP?
At its most basic, BGP is the system that the internet uses to get your traffic where it needs to be as fast as possible. Because there are so many internet service providers and backbone routers, there are many routes that your packets can take. BGP is responsible for guiding them and making sure they take the best route.
BGP has been described as a network of post offices, as air traffic controllers, and as other things. My favorite description, though, compares it to a map. Imagine BGP as a group of people who create and update maps that show you how to get to YouTube or Facebook.
BGP is like a map that tells your computer which bridges it must cross in order to reach Facebook
As far as BGP is concerned, the internet is divided up into large networks known as autonomous systems. They can be viewed as islands. Each of these networks is managed by one entity, which could be an ISP like Comcast or a company like Facebook. It would be very difficult to build bridges linking every island to every other island, so BGP tells you which islands (or autonomous systems) you need to cross to reach your destination.
The internet is constantly changing so the maps must be updated. You don't want your ISP leading you down an old route that doesn't go to Google. Autonomous systems share their maps because it would be difficult to map all of the internet at once. They will occasionally communicate with their island neighbors to share any changes to their maps.
Turn left and follow the river.
It's easy to see how things could go wrong when you use maps as a guide. There were jokes back when GPS was first made available to consumers that it could lead you off a cliff, or even into the middle of the desert. BGP can also lead traffic to places it is not supposed to go. If a mistake isn't caught, it ends up on everyone's map. This can also go wrong in other ways, but we'll get to those later.
Okay, maps. Can you give me an example?
Yes! This is a simplified example, but imagine that you want to connect to Convergence, an imaginary tech news site. You use DecadeConnect as your ISP, and Convergence uses NetSend as its ISP. DecadeConnect can't talk to NetSend directly, but your ISP can talk to Border Communications, which can talk to Form, and Form can talk to NetSend. If that is the only route, BGP would make sure that you and Convergence could communicate over it. If, however, DecadeConnect and NetSend were both connected to ThirdLevel, BGP would most likely route your traffic through ThirdLevel, since that path has fewer hops.
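To make the example concrete, here is a minimal Python sketch of the idea. It is not how BGP is actually implemented: it simply treats each network as a dot on the map and uses a breadth-first search to find the path that crosses the fewest networks. The `LINKS` table and the `shortest_path` and `add_link` helpers are made up for illustration; the network names are the imaginary ones from the example.

```python
from collections import deque

# Hypothetical links between the imaginary networks in the example.
LINKS = {
    "DecadeConnect": ["Border Communications"],
    "Border Communications": ["DecadeConnect", "Form"],
    "Form": ["Border Communications", "NetSend"],
    "NetSend": ["Form"],
}

def shortest_path(links, start, goal):
    """Return the path crossing the fewest networks, or None if there is none."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in links.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

def add_link(links, a, b):
    """Connect two networks in both directions."""
    links.setdefault(a, []).append(b)
    links.setdefault(b, []).append(a)

# With only the original links, traffic has to take the long way around.
long_way = shortest_path(LINKS, "DecadeConnect", "NetSend")

# Once ThirdLevel connects to both ISPs, a shorter path appears.
add_link(LINKS, "DecadeConnect", "ThirdLevel")
add_link(LINKS, "ThirdLevel", "NetSend")
short_way = shortest_path(LINKS, "DecadeConnect", "NetSend")

print(" -> ".join(long_way))
print(" -> ".join(short_way))
```

The first path goes through Border Communications and Form; the second goes through ThirdLevel alone.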
So BGP is like a set of maps that detail the fastest routes from you to a site?
Yes! It can get more complicated, however, because the shortest route doesn't always mean the best. There are many reasons why routing algorithms might choose one route over another. Cost can also be a factor, with some networks charging other networks if they include them in their routes.
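As a rough illustration of why the shortest route doesn't always win, the sketch below ranks routes by an operator-assigned preference first and only uses path length to break ties. Real BGP does compare a "local preference" value before path length, but its full decision process has more steps than this; the `Route` class, the `best_route` helper, and the preference numbers are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Route:
    as_path: list    # networks crossed, in order
    local_pref: int  # operator-assigned preference; higher is better

def best_route(routes):
    """Pick the highest preference; break ties with the shortest path."""
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path)))

# Hypothetical numbers: suppose going through ThirdLevel costs money,
# so the operator gives the longer route a higher preference.
via_thirdlevel = Route(["ThirdLevel", "NetSend"], local_pref=100)
via_border = Route(["Border Communications", "Form", "NetSend"], local_pref=200)

chosen = best_route([via_thirdlevel, via_border])
print(" -> ".join(chosen.as_path))
```

Here the longer route through Border Communications is chosen, because preference is compared before length.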
Mapping roads that rarely change is hard enough; how about mapping the internet?
Maps can be very tricky. I found this out when trying to plan a trip that included roads on different maps: one road had three different names on three different maps. Now imagine trying to stitch roads like that together across an entire town. And while real roads don't change very often, websites can change their service providers or move between countries. The internet must deal with this.
This is giving me flashbacks to my algorithms and data structures class, where we were trying to write algorithms that would find the shortest route.
I'll take your word for it. As soon as graphs were mentioned, I quit.
But Facebook did not! According to a paper published earlier this year, Facebook actually built its own BGP system that allows it to do incremental updates quickly. The company says the system is for communication within its data centers, and I don't know enough to tell whether those data center communications were to blame, so it's difficult to determine what caused Monday's problems. Brian Krebs, a cybersecurity reporter, says the outage was caused by a routine BGP update.
What does DNS have to do with all of this?
Cloudflare has a great explanation: DNS tells you where you're going, while BGP tells you how to get there. DNS is the way computers find out what IP address a website is located at. However, knowing an address is not enough to get you there, just as knowing your friend's address doesn't tell you how to get to their house.
Cloudflare has also provided a detailed technical explanation of how BGP errors can affect DNS requests. The article is about Monday's Facebook incident. It's worth reading if you want to understand what it looked like from the perspective of another autonomous system.
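The two-step relationship can be sketched in a few lines of Python. This is a toy model, not a real resolver or router: the `connect` helper, the site name, and the addresses (taken from ranges reserved for documentation) are all placeholders. The point it illustrates is that the DNS step itself depends on having a route to the name server, so a BGP mistake can break lookups too.

```python
# Hypothetical tables; every name and address here is a placeholder.
DNS_RECORDS = {"convergence.example": "192.0.2.10"}  # name -> IP address
NAMESERVER_IP = "198.51.100.53"                      # where the DNS answers live
ROUTES = {                                           # IP address -> path to it
    "192.0.2.10": ["ThirdLevel", "NetSend"],
    "198.51.100.53": ["ThirdLevel", "NetSend"],
}

def connect(name, routes):
    """Look up a name (DNS), then look up a path to its address (BGP)."""
    # Step 1: we can only ask for the address if the name server is reachable.
    if NAMESERVER_IP not in routes:
        return "DNS lookup failed: no route to name server"
    ip = DNS_RECORDS.get(name)
    if ip is None:
        return "DNS lookup failed: unknown name"
    # Step 2: knowing the address is not enough; we still need a path to it.
    path = routes.get(ip)
    if path is None:
        return f"no route to {ip}"
    return f"{name} is at {ip}, via " + " -> ".join(path)

print(connect("convergence.example", ROUTES))

# Remove the route to the name server and the lookup never gets started.
without_nameserver = {ip: p for ip, p in ROUTES.items() if ip != NAMESERVER_IP}
print(connect("convergence.example", without_nameserver))
```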
What could go wrong with BGP?
There are many things. Cloudflare cites two incidents: a Turkish ISP inadvertently telling the internet in 2004 that it should route all traffic to its service, and a Pakistani ISP accidentally blocking YouTube for the whole world when it only meant to block the site for its own users. BGP's ability to spread from one autonomous system to the next (which is why it is so useful) means that one mistake can ripple outward.
BGP is often referred to as the "duct tape" of the internet.
In 2018, hackers were able to hijack requests to Amazon's DNS to steal thousands of dollars' worth of Ethereum. They did it by compromising a separate ISP's BGP server. Although Amazon was not the one that got hacked, traffic meant for it ended up somewhere else.
You can also mess it up yourself and remove your entire service from the internet by pushing a bad BGP update. BGP is affectionately known as the duct tape of the internet. However, no adhesive is perfect.
What happened to Facebook?
It appears that Facebook's servers told everyone to remove them from their maps. We will likely need to wait for Facebook to report on what happened to the BGP configuration of its servers and why it was changed. Cloudflare's CTO reported that the service saw a lot of BGP updates from Facebook right before it went dark, most of which were route withdrawals, which erase lines on the map leading to Facebook. Fastly's tech lead tweeted that Facebook stopped offering routes to Fastly when it went offline. KrebsOnSecurity supports the theory that it was an update to Facebook's BGP that shut down its services.
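A toy version of what a route withdrawal does to a neighbor's map might look like the following. It is a sketch under stated assumptions, not real router behavior: the `RouteTable` class and the peer names are invented, and the prefix is a documentation-only placeholder rather than one of Facebook's.

```python
class RouteTable:
    """A toy map: which neighbors currently offer a path to each prefix."""

    def __init__(self):
        self.offers = {}  # prefix -> set of neighbors announcing it

    def announce(self, prefix, neighbor):
        """A neighbor says: 'you can reach this prefix through me.'"""
        self.offers.setdefault(prefix, set()).add(neighbor)

    def withdraw(self, prefix, neighbor):
        """A neighbor says: 'erase my line to this prefix.'"""
        neighbors = self.offers.get(prefix)
        if neighbors is None:
            return
        neighbors.discard(neighbor)
        if not neighbors:
            del self.offers[prefix]  # nobody offers a path anymore

    def reachable(self, prefix):
        return prefix in self.offers

PREFIX = "192.0.2.0/24"  # placeholder prefix, not a real Facebook one

table = RouteTable()
table.announce(PREFIX, "peer-a")
table.announce(PREFIX, "peer-b")
before = table.reachable(PREFIX)

# Every route to the prefix is withdrawn, one after another.
table.withdraw(PREFIX, "peer-a")
table.withdraw(PREFIX, "peer-b")
after = table.reachable(PREFIX)

print(before, after)
```

Once the last line is erased, the destination is still there, but nothing on the map leads to it.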
Cloudflare's explanation is a good choice if you are looking for the most technical details.
How does Facebook address the BGP problem?
The outage lasted for several hours, so it seems that this was not an easy task. Facebook had to ensure that it was advertising the correct records and that they were being picked up by the wider internet. It also needed to ensure that its maps were accurate and accessible by everyone.
That is harder than it sounds. There were widespread reports of Facebook employees being locked out of badge-protected doors and having difficulty communicating with one another. You need to figure out not only who is able to solve the problem and who has the permissions to do so, but also how to connect them, which is difficult when your entire company is down. The Verge has received reports that Facebook sent engineers to California to resolve the problem.
Could Web3 solve this problem?
Stop it. I will weep.
To answer your question quickly: no, not even if Facebook got on board the decentralized train. There would still need to be some protocol that tells you where to find its resources. It's possible to make mistakes with or mess up blockchain contracts, so I would be suspicious of anyone who said that a contract- and blockchain-based internet would be immune to this type of problem.
The timing of the outage was certainly suspicious, given all the negative Facebook news.
Obviously, the fact that this happened at the same time a whistleblower was on TV makes it easy to come up with other explanations. It is just as possible, however, that this was an innocent error made by a member of Facebook's IT staff.