
Oops. The internet ran out of routes

Yesterday, internet service providers around the world and the services that used their networks started going offline and experiencing abnormal levels of packet loss and latency. The issue was widespread and affected a great many services. What happened?

The internet ran out of routes.

**Update:**

Here is a more thorough explanation of the issues of a couple of days ago: http://www.bgpmon.net/what-caused-todays-internet-hiccup/

The IPv4 public address space (the available publicly routable IP addresses) has become increasingly fragmented as companies have split the address blocks they own into ever-smaller chunks and sold them on. Each distinct IP block that is connected to the internet must be reachable by routers all over the world, so it requires its own entry in the global BGP routing table. That table is held in the routers that ISPs operate.
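To make the effect of fragmentation concrete, here is a small sketch using Python’s standard `ipaddress` module. The block `198.18.0.0/16` is only an illustrative choice (it sits in a range reserved for benchmarking), and the idea that every second /24 gets sold off is an assumption made up for the example, not something taken from real routing data.

```python
import ipaddress

# Hypothetical example block; not taken from any real routing table.
block = ipaddress.ip_network("198.18.0.0/16")

# Announced as one aggregate, the block costs every router a single entry.
aggregate_routes = [block]

# Carved into /24s that are each announced separately, it costs 256 entries.
fragments = list(block.subnets(new_prefix=24))

# While all the fragments are contiguous they can still be summarised
# back into one covering prefix.
collapsed = list(ipaddress.collapse_addresses(fragments))

# Assumption for illustration: every second /24 is sold to another network.
# The remaining /24s are no longer adjacent, so none of them can be merged.
kept = fragments[::2]
kept_collapsed = list(ipaddress.collapse_addresses(kept))

print(len(aggregate_routes), len(fragments), len(collapsed), len(kept_collapsed))
```

The gap between the raw count and the collapsed count is the same idea as the difference between a raw prefix count and a CIDR-aggregated one: aggregation only helps while neighbouring blocks still belong together.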

Many routers, especially older pieces of hardware, can hold only about 512,000 routes in their global routing table. This limit was mostly defined by the memory available in the router, and when that hardware was designed, 512,000 routes seemed like a ludicrously high number.

Each time a new address block is carved out of the IPv4 address space, another route is advertised for that netblock. Yesterday, the number of routes exceeded the hard limit of 512k that those routers could hold:

| Date (DD-MM-YY) | Prefixes | CIDR Aggregated |
|---|---|---|
| 06-08-14 | 511103 | 280424 |
| 07-08-14 | 511297 | 280432 |
| 08-08-14 | 511736 | 280442 |
| 09-08-14 | 511719 | 280722 |
| 10-08-14 | 511762 | 280563 |
| 11-08-14 | 511719 | 280860 |
| 12-08-14 | 511648 | 280869 |
| 13-08-14 | **512521** | 280918 |

Modern routers didn’t all have this limitation. In fact, [Cisco](https://supportforums.cisco.com/document/12202206/size-internet-global-routing-table-and-its-potential-side-effects) and [other people](http://v4escrow.net/ipv4-routing-table-expansion/) posted advisories about the impending issues that the growing number of IPv4 routes was going to cause. But routers have typically been deploy-and-forget devices that are set up and then run with minimal interaction, and older devices were mostly configured with a default limit of 512k. When the number of advertised routes exceeded 512k, in the words of redditor [DiscoDave86](http://www.reddit.com/u/DiscoDave86),

“Upon further investigation it appears that the IPV4 public address space exceeded 512k routes, and some older routers have that as their limit due to memory constraints, consequently a whole load of routers became expensive doorstops”
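As a rough illustration of how little headroom was left, the sketch below replays the prefix counts from the table above against a fixed limit. The value `ROUTE_LIMIT = 512_000` is an assumption based on the round figure used in this post; real limits depend on the router model and its configuration, and the helper names are made up for the example.

```python
# Prefix counts copied from the table in this post (dates are DD-MM-YY).
PREFIX_COUNTS = [
    ("06-08-14", 511103),
    ("07-08-14", 511297),
    ("08-08-14", 511736),
    ("09-08-14", 511719),
    ("10-08-14", 511762),
    ("11-08-14", 511719),
    ("12-08-14", 511648),
    ("13-08-14", 512521),
]

# Assumption: the post's round figure. Actual hardware limits vary.
ROUTE_LIMIT = 512_000


def headroom(counts, limit):
    """Return (date, free_slots) pairs; a negative value means overflow."""
    return [(date, limit - count) for date, count in counts]


def first_overflow(counts, limit):
    """Return the first (date, count) that exceeds the limit, or None."""
    for date, count in counts:
        if count > limit:
            return date, count
    return None


for date, free in headroom(PREFIX_COUNTS, ROUTE_LIMIT):
    print(date, free)

print(first_overflow(PREFIX_COUNTS, ROUTE_LIMIT))
```

With these numbers the table had only a few hundred free slots for the whole preceding week, so a jump of under a thousand prefixes was enough to tip it over.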

The fix is simple enough: reconfigure the routing table limits on your router. But the change requires a reboot of the device, and rebooting a core router is not a task undertaken lightly.

Many ISPs have been scrambling to reconfigure their hardware to make sure they aren’t stung by this again, but the effects have been far-reaching as a result.


Update: I’ve drawn some quick & dirty nodegraphs to illustrate what happens when routers reboot.

In this (very simplistic) illustration of the Internet, Node 1 is trying to connect to Node 7. The bold path is the path its network traffic takes across the ‘net.


Everything is normal: traffic is routed according to hop distance (fewest nodes to target). This isn’t always how it works in reality, but for the purposes of this example, it’ll do.

[Image: Selection_046]

Node 4’s administrator notices the problem, applies the fix and reboots the router, causing all routes that use Node 4 to fail and be recalculated.

[Image: Selection_047]

While Node 4 is rebooting, Node 8, which is operated by someone else, also starts to reboot to apply the change to the maximum size of the routing table. N1 > N8 > N7 is no longer valid, so the route is recalculated to N1 > N2 > N6 > N7.

[Image: Selection_048]

Nodes 4 and 8 are offline pending a reboot, so the path from N1 to N7 is routed through N2 and N6. *Any addresses behind N4 and N8 are offline and become un-routable. It’s as though they no longer exist.*

[Image: Selection_049]

N4 is back online but now has to re-create its routing table and only adds N1 and N7, so it can no longer route to N3 and N5.

[Image: Selection_050]

N8 is back online and starts to re-create its routing table, adding N1 and N4 as its available nodes.

[Image: Selection_051]

After the nodes reboot, this is the final state of the network. As you can see, N4 and N8 have not necessarily got their original routes back.

This is a very simplistic representation of what happens when you reboot a core router attached to the Internet, such as those operated by the likes of L3, AboveNet, TiNET and NTT. I haven’t included link costs in this diagram, either.
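For anyone who would rather poke at this than look at pictures, here is a minimal sketch of the rerouting in the walkthrough, using a breadth-first search for the fewest-hops path. The `LINKS` list is my guess at a topology that is consistent with the paths named in the text; it is not a transcription of the diagrams. Like the diagrams, it ignores link costs, and it does not model a rebooted router coming back with only part of its routing table.

```python
from collections import deque

# Hypothetical topology, reconstructed only from the paths named in the text.
LINKS = [(1, 2), (1, 4), (1, 8), (2, 6), (3, 4), (4, 5), (4, 7), (6, 7), (7, 8)]


def build_graph(links):
    """Turn a list of undirected links into an adjacency map."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, source, target, offline=frozenset()):
    """Breadth-first search for a fewest-hops path that avoids offline nodes."""
    if source in offline or target in offline:
        return None
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        # Sorted so that ties between equal-length paths break the same way.
        for neighbour in sorted(graph[node]):
            if neighbour not in offline and neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None  # target is unreachable: as though it no longer exists


graph = build_graph(LINKS)
print(shortest_path(graph, 1, 7))                  # everything is normal
print(shortest_path(graph, 1, 7, offline={4}))     # N4 rebooting
print(shortest_path(graph, 1, 7, offline={4, 8}))  # N4 and N8 rebooting
print(shortest_path(graph, 1, 3, offline={4}))     # N3 sits behind N4
```

The last call returns `None`: with N4 down there is simply no path to N3, which is the un-routable case described in the walkthrough.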

Last night, when many ISPs were doing this, entire blocks of addresses simply became un-routable. You didn’t get timeouts or dropped packets or lag. They just didn’t exist, as far as the Internet was concerned.