12:20 - 11th Jan (UK) - the network issues at Clouvider Telehouse North and parts of Enfield are known; a significant and sustained DDoS attack is impacting the network and work is in progress to mitigate it.
13:20 - 11th Jan (UK) - the network issues now seem to be isolated to Telehouse North; I have reached out to Clouvider for further updates.
13:50 - 11th Jan (UK) - the network issues have been resolved; awaiting the official RFO from Clouvider, which will be published here.
REASON FOR OUTAGE
11th January 2018
Impact: Network Services
Between 0 and 87 minutes, depending on the service and the location of the interconnect.
Average impact depending on location:
Enfield: majority of L3 interconnects: 0 minutes; a small number of routes were unreachable for up to 5 minutes as BGP re-converged. No impact to optical services.
Equinix LD8 / Harbour Exchange: no impact noted
Telehouse North 2: majority of L3 interconnects: 43 minutes. Small number of L2 interconnects: 87 minutes. No impact to optical services.
At 11:17 Clouvider NOC observed a large-scale DDoS attack against a number of our Customers. A number of subnets were attacked. The pattern of the attack was identical. DDoS mitigation started automatically.
At 11:32 we noticed that the volume of the attack was growing significantly, with a strong upward trend. We re-routed the attack through our emergency mitigation facility with NTT, a global Tier 1 carrier with a significant backbone and, as such, a large mitigation capacity. There was no impact on our network at this stage and this action was precautionary.
At 12:01 the attack peaked at 100 Gbit/s for the first time.
At 12:07 the attack started shifting towards our link IPs, the IP addresses supplied by the carriers for the purpose of interconnection between networks. These subnets are very small, well below the minimum routable subnet size of /24 (that is, their prefixes are longer than /24). As such, they cannot be re-routed by us. We started requesting temporary ACLs from our carriers, as a courtesy, to protect the link IPs.
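The /24 constraint described above can be illustrated with a minimal sketch. It assumes IPv4 and uses documentation address ranges (RFC 5737) as stand-ins; the function name `can_be_rerouted` and the example subnets are hypothetical and are not taken from the carrier allocations involved in this incident.

```python
import ipaddress

# Longest IPv4 prefix generally accepted in the global routing table.
MAX_ROUTABLE_PREFIXLEN = 24


def can_be_rerouted(subnet: str) -> bool:
    """Return True if the subnet is a /24 or larger and so could be announced on its own."""
    network = ipaddress.ip_network(subnet, strict=True)
    return network.prefixlen <= MAX_ROUTABLE_PREFIXLEN


# Hypothetical examples: a customer /24 and a small carrier-supplied link subnet.
customer_subnet = "192.0.2.0/24"
link_subnet = "198.51.100.0/30"

print(can_be_rerouted(customer_subnet))  # True
print(can_be_rerouted(link_subnet))      # False
```

A /30 link subnet fails the check, which is why traffic aimed at it cannot be diverted by announcing the prefix through a mitigation provider and has to be filtered by the carrier instead.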
Up to this point we had been successfully mitigating the attack, and our external monitoring systems showed no impact on our network.
At 12:15 we observed that a number of BGP sessions with our carriers had flapped across the network, with the heaviest load on the routing platforms visible at the Equinix LD8 (Harbour Exchange) MX Core and the Telehouse North 2 MX Core.
At 12:22:25, due to starvation of control plane resources caused by BGP sessions flapping with our carriers as well as Customers, the Telehouse North 2 core lost its first internal BGP (iBGP) sessions with the Equinix LD8 Core, and 39 seconds later the remaining iBGP sessions. This caused severe impact to all Customers connected at this Datacentre. All connectivity resiliency at that location had been exhausted. The Equinix LD8 and Virtus Enfield routers re-calculated the routes and used local links to continue carrying traffic largely unaffected.
At 12:27 we received an email demanding a ransom to stop the attack. By this time the attack was averaging in excess of 200 Gbit/s, with peaks of 300 Gbit/s.
At around 12:30 we observed high internal traffic, later classified as an attack, within our network, aimed at a number of Customers connected to the THN2 Core. However, our DDoS mitigation system did not receive processed flow records, because the earlier dropped internal BGP sessions with the THN2 Core had affected its connectivity. We tracked the attacker and manually disabled the servers involved.
At 12:55 the first ACLs were confirmed and we started restoring the sessions.
At 13:05:29 connectivity to the large majority of our Customers connected to the THN2 Core was restored.
At this point we informed Customers with open tickets that connectivity had been restored and asked them for feedback in case it had not.
At 13:37 we received the first reports that a small number of Customers using Layer 2 connectivity still had no connectivity. We started investigating.
At around 13:45 we established that all Customers reporting issues were traversing the distribution switches at THN2 (Juniper QFX5100 platforms). This was the result of a software bug in the JunOS operating system that had earlier manifested itself in December on the same pair of redundant switches. The bug relates to low-level misprogramming of the PFE (Packet Forwarding Engine) by JunOS, and was triggered when we adjusted the configuration of the QFX switches during the outage. We run brand new equipment and have a support contract with Juniper; as such, Juniper JTAC recently provided us with a service software release fixing this problem. This is currently being tested successfully in our lab, and we had provisionally intended to announce a window to deploy it into production on the weekend night of 20/21 January.
At 13:49 we committed a configuration change forcing the QFX switches to re-program the PFE. This immediately resolved the issues affecting some Customers utilising Layer 2 connectivity traversing these switches.
At this point all connectivity had been fully restored.
From 15:00 we noticed the attack reducing significantly.
At 20:47 the attack ceased.
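The 43-minute and 87-minute impact figures quoted for Telehouse North 2 can be cross-checked against the timestamps in the timeline. The sketch below assumes that impact is measured from the first iBGP session loss at 12:22:25, and that the 13:49 configuration commit happened at 13:49:00 (the report gives no seconds); both are assumptions made for illustration, not statements from the RFO.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> float:
    """Return the elapsed minutes between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


OUTAGE_START = "12:22:25"  # first iBGP sessions lost at the THN2 core
L3_RESTORED = "13:05:29"   # large majority of THN2 Customers restored
L2_RESTORED = "13:49:00"   # PFE re-program committed; seconds assumed to be :00

l3_minutes = minutes_between(OUTAGE_START, L3_RESTORED)
l2_minutes = minutes_between(OUTAGE_START, L2_RESTORED)

print(f"L3 interconnects: {l3_minutes:.1f} min")  # 43.1 min
print(f"L2 interconnects: {l2_minutes:.1f} min")  # 86.6 min
```

Under these assumptions the rounded values, 43 and 87 minutes, agree with the figures in the impact summary.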
- We need to protect link IPs to prevent this from happening in the future, or to minimise the impact. Carriers refuse to implement permanent ACLs for the IPs they provide; local scrubbers won’t help when links are wholly flooded with the attack; mitigation tunnels cannot be used as the subnets are too small to be routable.
- The DDoS protection system needs independent connectivity so it can continue working when site-to-site connectivity is affected.
- The PFE misprogramming bug on QFX has to be resolved; it increased the downtime for some L2 Customers.
Actions that will be taken to prevent this incident from re-occurring:
- We have partnered with one of our global Tier 1 carriers to increase our capacity with them by another 10 Gbit/s, and they will construct a commercial solution to protect the link IPs on their end. This will allow us to carry all our traffic through them in this particular attack scenario, leaving connectivity largely unaffected for all of our Customers.
- We will provision a dedicated wavelength ring on our DWDM infrastructure for the DDoS protection system tonight so that it is separated from the core.
- We’ll schedule a maintenance window to implement the JunOS service release fixing the PFE misprogramming issue on the QFX platform/distribution switches.
Thursday, January 11, 2018