Yesterday, internet service providers around the world and the services that used their networks started going offline and experiencing abnormal levels of packet loss and latency. The issue was widespread and affected a great many services. What happened?
The internet ran out of routes.
Here is a more thorough explanation of the issues of a couple of days ago: http://www.bgpmon.net/what-caused-todays-internet-hiccup/
The IPv4 public address space (the pool of publicly-routable IP addresses) has become increasingly fragmented as companies have broken the address blocks they own into smaller and smaller pieces and sold the chunks on. Each distinct IP block connected to the internet needs to be reachable by routers all over the world, and so requires an entry in the global BGP routing table: data stored by the routers that ISPs operate.
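To see how fragmentation inflates the table, here is a quick sketch using Python's `ipaddress` module (the prefix itself is just an illustrative example from the benchmarking range):

```python
import ipaddress

# One company announcing its whole /16 contributes a single
# route to the global BGP table.
block = ipaddress.ip_network("198.18.0.0/16")
print(block.num_addresses)  # 65536 addresses behind 1 route

# But if the space is sold off in /24 chunks to different owners,
# each owner announces their own prefix, and every router in the
# default-free zone must now carry all of them.
chunks = list(block.subnets(new_prefix=24))
print(len(chunks))  # 256 routes for exactly the same address space
```

Same addresses, 256 times the routing-table entries: multiply that pattern across the whole IPv4 space and the table creeps towards any fixed ceiling.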
Most routers, especially older hardware, have a limit of 512,000-odd routes that they can hold in their global routing table. That limit was mostly defined by the memory available to the router, and when this hardware was designed, 512,000 routes seemed like a ludicrously high number.
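Where the 512k figure comes from is platform-specific. A commonly cited case is Cisco's Catalyst 6500/7600 family, whose forwarding TCAM is carved up between IPv4 and IPv6 by default roughly as follows (a back-of-the-envelope sketch of one popular platform's defaults, not a universal rule):

```python
# Rough arithmetic for the "512k" ceiling, assuming the commonly
# cited Catalyst 6500/7600 default TCAM carve-up.
tcam_entries = 1024 * 1024      # ~1M forwarding (FIB) TCAM entries total
ipv6_share = 256 * 1024 * 2     # 256k IPv6 routes, two entries each
ipv4_limit = tcam_entries - ipv6_share
print(ipv4_limit)               # 524288 -- the 512k limit
```

The memory is there; it's just pre-allocated. Which is why the fix below is a reconfiguration rather than a hardware upgrade.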
As each new address block is carved out of the IPv4 address space, another route is advertised for that netblock. Yesterday, we exceeded the hard limit of 512k routes that most of these routers could hold.
“Upon further investigation it appears that the IPV4 public address space exceeded 512k routes, and some older routers have that as their limit due to memory constraints, consequently a whole load of routers became expensive doorstops”
The fix is simple enough: reconfigure the routing-table limits on your router. But it requires a reboot of the device, and rebooting a core router is not a task undertaken lightly.
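The exact commands depend on the platform. On the Catalyst 6500/7600 family mentioned above, for example, the reallocation is reported to look something like this (a sketch, not verified against any particular IOS release; the value is in units of 1,000 routes):

```
! Raise the IPv4 FIB TCAM allocation above the 512k default
! (Catalyst 6500/7600 syntax; other platforms differ)
Router(config)# mls cef maximum-routes ip 1000
! The new allocation only takes effect after a reload
Router# reload
```

One line of configuration, but the reload is the painful part, as the diagrams below show.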
Many ISPs have been scrambling to reconfigure their hardware to make sure they aren’t stung by this again, but the effects have been far-reaching as a result.
Update: I’ve drawn some quick & dirty nodegraphs to illustrate what happens when routers reboot.
In this (very simplistic) illustration of the Internet, Node 1 is trying to connect to Node 7. The bold path is the path its network traffic takes across the ‘net.
Everything is normal: traffic is routed according to the hop distance (fewest nodes to target). This isn’t always how it works in reality, but for the purposes of this example, it’ll do.

Node 4’s administrator notices the problem, applies the fix and reboots the router, causing all routes that use Node 4 to fail and have to be re-calculated.
While Node 4 is rebooting, Node 8, which is operated by someone else, also starts to reboot to apply the change to the maximum size of the routing table. N1 > N8 > N7 is no longer valid, so the route is recalculated to N1 > N2 > N6 > N7.

Nodes 4 and 8 are offline pending a reboot, so the path from N1 to N7 is routed through N2 and N6. *Any addresses behind N4 and N8 are offline and become un-routable. It’s as though they no longer exist.*
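The recalculation in the diagrams can be sketched as a breadth-first search over the node graph (an illustrative toy, not real BGP best-path selection; the adjacency list is my reading of the diagrams, with the other nodes omitted for brevity):

```python
from collections import deque

def shortest_path(graph, src, dst):
    """Fewest-hops path from src to dst, or None if unreachable."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the predecessor chain back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nbr in graph.get(node, []):
            if nbr not in prev:
                prev[nbr] = node
                queue.append(nbr)
    return None

def without(graph, down):
    """The same graph with the rebooting routers (and their links) removed."""
    return {n: [m for m in nbrs if m not in down]
            for n, nbrs in graph.items() if n not in down}

graph = {
    1: [4, 8, 2], 2: [1, 6], 4: [1, 7],
    6: [2, 7], 7: [4, 6, 8], 8: [1, 7],
}

print(shortest_path(graph, 1, 7))                   # [1, 4, 7]
print(shortest_path(without(graph, {4}), 1, 7))     # [1, 8, 7] -- re-routed
print(shortest_path(without(graph, {4, 8}), 1, 7))  # [1, 2, 6, 7]
print(shortest_path(without(graph, {4, 8}), 1, 4))  # None -- N4 has vanished
```

The last call mirrors the italicised point above: once N4 is down, addresses behind it aren’t merely slow to reach; there is simply no route to them at all.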
This is a very simplistic representation of what happens when you reboot a core router attached to the Internet, such as those operated by the likes of L3, AboveNet, TiNET and NTT. I haven’t included link costs in this diagram, either.
Last night, when many ISPs were doing this, entire blocks of addresses simply became un-routable. You didn’t get timeouts or dropped packets or lag. They just didn’t exist, as far as the Internet was concerned.