The self-described “intelligent global network” provided by Internet security and caching service CloudFlare took a coffee break this morning, knocking a number of the Web’s top sites offline for roughly an hour.
Or, as CloudFlare CEO Matthew Prince put it in today’s post-mortem blog post, “CloudFlare effectively dropped off the Internet.” And he means that literally: the outage took CloudFlare’s own site offline, along with sites like 4chan, Wikileaks, and the other 785,000 or so websites making use of CloudFlare’s services.
So, what happened?
First up, it’s important to understand what CloudFlare actually does. It serves as an intermediary of sorts between visitors and the sites that use the service. It caches static pages to speed up load times, and it uses its anycast DNS capabilities to filter out malicious traffic, such as distributed denial-of-service attacks, so that its members’ sites stay online and unbothered. Or, as CloudFlare describes:
“The nature of CloudFlare’s Anycasted network is that we inherently increase the surface area to absorb such an attack. A distributed botnet will have a portion of its denial of service traffic absorbed by each of our data centers.”
According to CloudFlare, the company noticed a DDoS attack against one of its member sites early this morning. A member of CloudFlare’s operations team pushed a rule to CloudFlare’s routers that was designed to make them drop any packets that appeared to be part of the attack, which had been identified as packets ranging from 99,971 to 99,985 bytes in length.
“Flowspec accepted the rule and relayed it to our edge network. What should have happened is that no packet should have matched that rule because no packet was actually that large. What happened instead is that the routers encountered the rule and then proceeded to consume all their RAM until they crashed,” Prince wrote.
“In all cases, we run a monitoring process that reboots the routers automatically when they crash. That worked in a few cases. Unfortunately, many of the routers crashed in such a way that they did not reboot automatically and we were not able to access the routers’ management ports. Even though some data centers came back online initially, they fell back over again because all the traffic across our entire network hit them and overloaded their resources.”
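Prince’s post says no packet was actually as large as the rule specified. One plausible reason, assuming the traffic was IPv4, is that the IPv4 header stores a packet’s total length in a 16-bit field, which caps a packet at 65,535 bytes. The short Python sketch below is purely illustrative and is not CloudFlare’s tooling or anything Flowspec does itself; the function name `rule_can_match` and the constants are hypothetical. It shows the kind of sanity check that would flag a length range no IPv4 packet could ever fall into.

```python
# Illustrative sketch only: names and structure are hypothetical,
# not part of CloudFlare's systems or of Flowspec.

# IPv4 stores total packet length in a 16-bit header field.
MAX_IPV4_PACKET_BYTES = 2**16 - 1  # 65,535
# The smallest valid IPv4 packet is a bare 20-byte header.
MIN_IPV4_PACKET_BYTES = 20


def rule_can_match(min_len: int, max_len: int) -> bool:
    """Return True if any IPv4 packet length falls within [min_len, max_len]."""
    lowest = max(min_len, MIN_IPV4_PACKET_BYTES)
    highest = min(max_len, MAX_IPV4_PACKET_BYTES)
    return lowest <= highest


# The length range reported for this morning's rule.
print(rule_can_match(99_971, 99_985))  # False
# A range around a typical Ethernet-sized packet, for comparison.
print(rule_can_match(1_400, 1_500))    # True
```

A check like this only says whether a rule can ever match. It says nothing about why the routers ran out of memory when they received it.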
CloudFlare’s network and operations teams ultimately had to remove the filter rule from the company’s routers and have data center employees manually reboot the affected machines. For CloudFlare customers protected by service-level agreements, the company plans to issue credits for today’s hour or so of downtime.