On April 11th we received alerts of a short outage of less than 20 minutes for all services in our Denver location. The issue was resolved quickly by the data center staff and the following RFO (Reason For Outage) was provided:
Below, please find our reason for outage and root cause analysis reports related to the network disruption we suffered on April 11, 2019.
Timeline (All times are MT)
- 10:00 AM – A member of our provisioning team begins preparing normal router configuration changes to support a customer requirement.
- 10:46 AM – The changes are committed to our routers using a “commit confirmed” statement, which would automatically roll back the changes after 5 minutes.
- 10:48 AM – We begin receiving a large volume of alerts indicating a major network disruption.
- 10:49 AM – Provisioning team engages management and senior network technical resources and begins the process of expediting the rollback of the configuration.
- 10:51 AM – It is evident that, in spite of rolling back the change, our network is still facing a disruption. Further troubleshooting efforts are underway.
- 11:00 AM – After reviewing logs and other technical data, we determine that our BGP peering to all of our external providers is not in a proper state. Specifically, all of our providers were forcing the BGP sessions offline because we exceeded the maximum number of prefixes allowed. We begin the process of clearing our BGP sessions, which we hope will force them back online.
- 11:02 AM – The sessions will still not come back online, and we begin engaging all of our providers via phone to get them to clear the BGP status on their side.
- 11:11 AM – Hurricane Electric circuit comes back online, and all alerts begin to clear and network access is restored.
- 11:40 AM – Cogent circuits come back online.
- 11:49 AM – Zayo circuit comes back online.
Root Cause Analysis
Human error was the primary cause of this issue. A review of the logs and configurations on our border routers indicates that a typo in a routing “policy statement” resulted in our routers inadvertently advertising full BGP routes to our transit providers. This caused all of our providers to automatically shut down our BGP sessions as soon as we exceeded the maximum number of prefixes they allow us to announce. We had to contact each of our 4 providers and get them to manually reset our BGP state to restore service.
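The failure mode described above can be illustrated with a small Python model of the provider side of a session. This is only a sketch under stated assumptions: the class name, the prefix limit, and the example prefixes are hypothetical, and it models the behavior seen in this incident (the session stays down until someone clears it by hand) rather than any particular vendor's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class PeerSession:
    """Hypothetical provider-side view of one BGP session with a prefix limit."""
    name: str
    max_prefixes: int              # assumed limit agreed with the customer
    established: bool = True
    received: set = field(default_factory=set)

    def receive(self, prefixes):
        """Accept announcements; drop the session if the limit is exceeded."""
        if not self.established:
            return False           # a downed session ignores new announcements
        self.received |= set(prefixes)
        if len(self.received) > self.max_prefixes:
            self.established = False
            self.received.clear()
        return self.established

    def manual_clear(self):
        """Model the provider resetting the session by hand."""
        self.established = True


session = PeerSession("transit-a", max_prefixes=50)
session.receive(["198.51.100.0/24", "203.0.113.0/24"])   # normal announcements
leak = [f"10.{i // 256}.{i % 256}.0/24" for i in range(1000)]
session.receive(leak)              # simulated leak trips the limit
```

After the simulated leak, `session.established` is `False` and remains so even if the bad policy is rolled back, which matches the timeline: the rollback alone did not restore service, and each provider had to clear the session on their side.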
After Action Plan
We will be conducting additional internal training so that everyone on our team is better equipped to make such routing changes. Additionally, we are investigating the use of commit scripts to verify that such errors are not present; however, the research and development of such scripts will take some time.
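As a rough idea of what such a validation could check, here is a minimal Python sketch. It is not the data center's actual script: the function name, the address blocks, and the threshold are all hypothetical, and a real commit script would run on the router against the candidate configuration rather than on a plain list of prefixes.

```python
import ipaddress


def validate_export(advertised, own_blocks, max_expected):
    """Return a list of problems; an empty list means the change looks safe.

    advertised   -- prefixes the export policy would announce (assumed input)
    own_blocks   -- address blocks we are entitled to announce (assumed input)
    max_expected -- sanity ceiling, kept well below the providers' limits
    """
    own = [ipaddress.ip_network(block) for block in own_blocks]
    problems = []

    if len(advertised) > max_expected:
        problems.append(
            f"{len(advertised)} prefixes exceed expected maximum {max_expected}"
        )

    foreign = []
    for prefix in advertised:
        net = ipaddress.ip_network(prefix)
        # subnet_of() raises TypeError on mixed IPv4/IPv6, so compare versions first
        if not any(net.version == o.version and net.subnet_of(o) for o in own):
            foreign.append(prefix)
    if foreign:
        problems.append(f"{len(foreign)} prefixes outside our own blocks")

    return problems


# Hypothetical example: one legitimate prefix and one leaked transit route.
issues = validate_export(
    ["198.51.100.0/24", "8.8.8.0/24"],
    own_blocks=["198.51.100.0/22"],
    max_expected=10,
)
```

A check like this would refuse the commit whenever `issues` is non-empty. Testing both the count and the ownership of each prefix is a deliberate choice: a leak of full routes fails both tests, while a small typo that exposes only a few foreign prefixes would still be caught by the second.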
We apologize for any inconvenience this outage may have caused and the delay in getting this RFO posted.
-The Secure Dragon Staff-
Tuesday, April 16, 2019