Microsoft's Global Glitch: Expert Takes on the Outage Fallout

Microsoft's Global Glitch recently led to a massive global outage, disrupting Microsoft Azure services and impacting users worldwide. Experts discuss the causes, recovery challenges, and implications for cybersecurity and system resilience.

Manisha Sharma

On July 19, 2024, the tech world was rocked by one of the most significant cyber disruptions of the year, caused by a faulty software update to CrowdStrike's endpoint detection and response (EDR) product. This glitch triggered a massive global outage, severely impacting business processes and raising urgent questions about cybersecurity and system resilience.

The Outage Unfolds

In the early hours (IST), reports surfaced of an outage in Microsoft's Azure cloud service, initially affecting users in the Central U.S. region. The outage quickly spread to several countries, including India, causing widespread disruptions. Flight operations and air traffic were severely impacted, forcing airports to revert to manual operations. Brokerages and stock exchanges also experienced major disruptions, throwing the digital lives of many into disarray.

Explore expert insights on this major disruption. Dive into detailed analyses from industry leaders regarding the incident and its implications:

Analyzing the Root Causes and Recovery Challenges:

Omer Grossman, Chief Information Officer (CIO) at CyberArk:

Even in July, the current event appears set to be one of the most significant cyber issues of 2024. The damage to business processes at the global level is dramatic. The glitch is due to a software update of CrowdStrike's EDR product. This is a product that runs with high privileges and protects endpoints. A malfunction in it can, as we are seeing in the current incident, cause the operating system to crash.

There are two main issues on the agenda. The first is how customers get back online and regain continuity of business processes. It turns out that because the endpoints have crashed (the Blue Screen of Death), they cannot be updated remotely, and thus the problem must be solved manually, endpoint by endpoint. This is expected to be a process that will take days.

The second is what caused the malfunction. The possibilities range from human error, for instance a developer who downloaded an update without sufficient quality control, to the complex and intriguing scenario of a deep cyberattack, prepared ahead of time and involving an attacker activating a "doomsday command" or "kill switch". CrowdStrike's analysis and updates in the coming days will be of the utmost interest.

The Growing Dependence on Big Tech and Cyber-Resilience

Jake Moore, Global Security Advisor at ESET:

These outages are increasing in volume due to the sheer increase in the number of online users and traffic. After witnessing the blue screen of death (BSOD), many people are quick to suspect a cyberattack or find similarities to Netflix’s Leave The World Behind but this can often add to the confusion. It highlights the importance of these services and the millions of people they serve.

Businesses must test their infrastructure and have multiple fail-safes in place, however large the company is; this is typically referred to as a cyber-resilience plan. But as is often the case, it is simply impossible to simulate the size and magnitude of the issue in a safe environment without testing the actual network.

The inconvenience caused by the loss of access to services for thousands of people serves as a reminder of our dependence on Big Tech such as Microsoft in running our daily lives and businesses. Upgrades and maintenance to systems and networks can unintentionally include small errors, which can have wide-reaching consequences, as experienced today by CrowdStrike’s customers.

Another aspect of this incident relates to “diversity” in the use of large-scale IT infrastructure. This applies to critical systems like operating systems (OSes), cybersecurity products, and other globally deployed (scaled) applications. Where diversity is low, a single technical incident, not to mention a security issue, can lead to global-scale outages with subsequent knock-on effects.

The Rising Costs and Future Implications

Maxine Holt, Senior Director, Cybersecurity at Omdia:

The global IT outage crisis is escalating, and organizations everywhere are in full scramble mode, desperately implementing workarounds to keep their businesses afloat. Microsoft has pointed fingers at a third-party software update, while CrowdStrike admits to a "defect found in a single content update for Windows hosts" and is working feverishly with affected customers. Omdia analysts connect the dots: this isn't a cyberattack, but it’s unquestionably a cybersecurity disaster.

Cybersecurity's role is to protect and ensure uninterrupted business operations. Today, on 19 July 2024, many organizations are failing to operate, proving that even non-malicious cybersecurity failures can bring businesses to their knees. The workaround, involving booting into safe mode, is a nightmare for cloud customers. Cloud-dependent businesses are facing severe disruptions.

Omdia’s Cloud and Data Center analysts have long warned about over-reliance on cloud services. Today’s outages will make enterprises rethink moving mission-critical applications off-premises. The ripple effect is massive, hitting CrowdStrike, Microsoft, AWS, Azure, Google, and beyond. CrowdStrike's shares have plummeted by more than 20% in unofficial pre-market trading in the US, translating to a staggering $16 billion loss in value.

Looking forward, there’s a shift towards consolidating security tools into integrated platforms. However, as one CISO starkly put it, "Consolidating with fewer vendors means that any issue has a huge operational impact. Businesses must demand rigorous testing and transparency from their vendors."

CrowdStrike's testing procedures will undoubtedly be scrutinized in the aftermath. For now, the outages continue to rise, and the tech world watches as the fallout unfolds.

The Fragility of Interconnected Systems

Srirang Srikantha, Founder & CEO, Yethi Consulting:

"The outages represent how fragile and interconnected our systems are. Companies like MSFT have great practices, and the fact that a bug passes through its process is unfortunate. It reiterates the need for good practices of testing before releasing new software to production systems."

Designing Systems with Failure in Mind

Ashish Tandon, Founder and CEO, Indusface:

"The recent CrowdStrike Falcon disruption underscores the need for designing systems with failure in mind. This incident highlights the importance of robust contingency planning and shared responsibility between software vendors and businesses. By incorporating fail-safes and redundancy into system architectures and developing comprehensive backup plans, we can enhance the resilience of our digital infrastructure and ensure the continuity of critical services. At Indusface, our AppTrana WAAP solution exemplifies this approach, ensuring that even during failures, our clients' applications remain secure and operational."

Conclusion:

The July 19, 2024, outage is a stark reminder of our reliance on cloud services and the potential for widespread disruption when they fail. It underscores the need for robust cybersecurity measures, rigorous testing, and comprehensive contingency planning. As the tech world grapples with the fallout, the focus will inevitably shift to preventing such incidents in the future and ensuring that the digital backbone of our global infrastructure remains resilient and secure.