According to Microsoft, a change to the company's Wide Area Network (WAN) left Microsoft services inaccessible to users around the globe. The networking outage took down the Azure cloud platform together with business services including Teams and Outlook.
This recent Microsoft outage caused significant concern among business users. It also leaves network operations teams with an ongoing challenge: preventing future network outages and being prepared to minimize downtime when they do occur.
To understand the significance, Digital Journal caught up with Josh Stephens, who has been involved with network engineering for over 30 years — first as an engineer with the U.S. Air Force, and more recently as CTO for BackBox.
Stephens begins by expressing incredulity over the incident: “It’s incredible that even the simplest configuration change or even a typo can sometimes cause a ripple effect and bring down a network and/or disrupt a supposedly fault-tolerant business service. Even tech giants like Microsoft aren’t immune.”
Stephens continues: “In many cases, the outage may not occur immediately after the configuration change was made and so it can be difficult to correlate the change during root cause analysis.”
Looking deeper into the central issue, Stephens notes: “While many news reports have keyed upon the fact that a configuration change caused such a widespread outage, the real headline is that it took them four hours to restore service.”
Stephens adds: “While this sounds exorbitant, without more technical details about the cause of the outage and, more specifically, the extenuating circumstances that extended the time it took to restore service, rather than pass judgment I will just honestly say, I’ve been there.”
In terms of lessons to be learned, Stephens has thoughts on how network teams at other organizations can be proactive now to avoid a similar disaster.
His first recommendation is: “Accelerate the speed of solving difficult technical problems to ensure there is solid documentation and up-to-date network maps.”
Secondly, Stephens advises: “Continuous, automated configuration auditing and remediation to ensure that all network devices are up-to-date and compliant with operational policies and industry standards.”
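To illustrate what continuous configuration auditing can look like in practice, the following is a minimal, hypothetical sketch: it checks a device configuration against a small set of policy rules. The rules, config lines, and function names here are illustrative assumptions, not a description of any particular vendor's tooling.

```python
# Minimal sketch of automated configuration auditing (illustrative only).
# The policy rules and config lines below are hypothetical examples.

REQUIRED_LINES = {"service password-encryption", "no ip http server"}
FORBIDDEN_PATTERNS = {"telnet"}

def audit_config(config_text: str) -> list[str]:
    """Return a list of policy violations found in a device config."""
    lines = {line.strip() for line in config_text.splitlines() if line.strip()}
    violations = []
    # Flag any required hardening line that is absent.
    for required in sorted(REQUIRED_LINES):
        if required not in lines:
            violations.append(f"missing required line: {required}")
    # Flag any line containing a forbidden pattern.
    for line in sorted(lines):
        for pattern in sorted(FORBIDDEN_PATTERNS):
            if pattern in line:
                violations.append(f"forbidden pattern '{pattern}' in: {line}")
    return violations

sample_config = """\
service password-encryption
transport input telnet ssh
"""
print(audit_config(sample_config))
```

In a real deployment this kind of check would run continuously across every device, with remediation triggered automatically when a violation is found, rather than as a one-off script.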
As his third consideration, Stephens adds that automated network configuration backups “allow you to instantly restore backups and have automated weekly or frequent OS updates and patches”.
Stephens’ fourth and final recommendation is: “At a minimum, your automation platform should create backups daily, before and after changes, and store a long history of backups within an autoscaling, fault-tolerant data store. Furthermore, it should be able to reliably conduct upgrades at scale and while employing at least mildly complex workflows.”

While these approaches are useful, Stephens warns: “No single tool or approach can guarantee business continuity, but there are ways to be better prepared if the worst does happen.”
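The before-and-after backup pattern Stephens describes can be sketched in a few lines. This is an illustrative assumption of how such a workflow might be structured, not a real automation platform's API; the device name, change, and function names are hypothetical, and a production system would write to a fault-tolerant store rather than an in-memory dict.

```python
# Illustrative sketch of pre- and post-change configuration backups
# with a retained history. All names here are hypothetical.
import datetime

# device name -> list of (timestamp, config) snapshots, oldest first
backup_store: dict[str, list[tuple[str, str]]] = {}

def take_backup(device: str, config: str) -> None:
    """Append a timestamped config snapshot to the device's history."""
    ts = datetime.datetime.now(datetime.timezone.utc).isoformat()
    backup_store.setdefault(device, []).append((ts, config))

def apply_change(device: str, current_config: str, change) -> str:
    """Back up before and after a change, so either state can be restored."""
    take_backup(device, current_config)   # pre-change snapshot
    new_config = change(current_config)   # apply the change
    take_backup(device, new_config)       # post-change snapshot
    return new_config

def pre_change_backup(device: str) -> str:
    """Restore point: the snapshot taken just before the latest change."""
    return backup_store[device][-2][1]

cfg = apply_change(
    "edge-router-1",
    "hostname edge-router-1\n",
    lambda c: c + "ip route 0.0.0.0 0.0.0.0 192.0.2.1\n",
)
print(pre_change_backup("edge-router-1"))
```

If the change turns out to be the one that “brings down a network,” the pre-change snapshot gives the team an immediate, known-good state to roll back to, rather than reconstructing the configuration during root cause analysis.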