Tampa
Earlier on Monday (5/27) we received alerts that remote management of our primary router was inaccessible. Normally this is a non-critical issue, and since there was full connectivity we planned to schedule time next week to resolve it if the problem did not clear up on its own. Later in the evening we noticed that our network monitoring was not reporting properly for the router, which is a much more severe issue, so we decided to head to the data center and resolve it immediately to prevent any unplanned outages. Luckily, network monitoring continued to function properly on our switches, so we were still able to confirm network connectivity and health.
Steve arrived at E Solutions at 10PM and had the backup router online shortly after. We repaired the primary router and configured the new backup router with VRRP. All testing was successful. We then failed back to the primary router, which was now working properly, and configured VRRP on it as well. Again, all tests were successful. We introduced both routers to the network and they were able to see each other, both of our switches, and both of E Solutions' edge switches. Traffic was routing normally and only a few packets were lost while VRRP settled (a forced reset of the VRRP status on both routers was required). Once both routers were at 100%, we started unplugging cables (OH NO!!!!). As expected, we lost 1 ping packet during most of the cable unplugs, with some unplugs resulting only in increased latency on 1 packet. In short, we were able to unplug each cable to both routers, both switches, and even each server without losing network connectivity. Farewell Vyatta. :)
Portland
On Monday (5/27) there was a 5-minute outage at our Portland location at around 3PM EST. It was caused by misconfigured hardware in the data center, which resulted in a core router being rebooted. Below is the RFO we received:
Dear customers,
We had a router crash while our technician attempted to apply an ACL to filter a SYN attack for a client, the router took 5 minutes to boot up again. We sincerely apologize for the incident.
Additionally we received this e-mail a few hours later:
Dear customers,
We will be performing firewall OS upgrade for multiple firewall units around midnight today to address an overly aggressive filtering bug as well as optimizing performance.
Firewalls will be upgraded one by one. We don't expect any downtime for customers during the upgrades.
It's been an interesting Memorial Day for us. Luckily, all of the unexpected issues resulted in a better, more resilient network in 2 of our 3 data centers.
Thanks for your understanding during these issues today. We appreciate your continued support and we will continue working hard to provide you the level of service you expect from us.
-The Secure Dragon Staff
Tuesday, May 28, 2013
