On July 19, 2024, an unprecedented IT outage was reportedly triggered by a software update from CrowdStrike, a cybersecurity firm. This was one of the largest IT outages, and it caused disruptions across businesses, airports, broadcasters, healthcare providers, and retailers, to name a few. In this blog post, I highlight what caused the outage, what was affected, and what it reveals about the threats facing global IT infrastructure.
What caused the mass IT outage?
This global IT outage began early on 19 July 2024, when CrowdStrike, an American cybersecurity firm known for its endpoint security software, released a faulty update for its software running on Microsoft Windows systems. Windows machines that received the update crashed, and the services and applications that depended on them suffered widespread disruption. The issue was not a cyberattack; it was identified as a defect in a software update.
According to CrowdStrike, millions of devices were affected by the faulty software update, and many of them displayed the infamous “Blue Screen of Death”. Companies had to resort to manually rebooting machines in safe mode, a slow and labor-intensive process. As reported by BBC News, the impact was immediate and extensive: business operations were interrupted, flights were canceled, broadcasts went off the air, and hospital visits were stopped. The outage highlighted the dangers of relying on digital infrastructure and on a few major technology service providers.
What were the effects of the mass IT outage?
1. Airports and Airlines
According to Sky News, the IT outage caused massive disruptions at airports, with over 1,800 flights canceled in the US alone and more than 42,000 flights canceled or delayed globally. Big airlines such as Delta Air Lines, American Airlines, and United Airlines saw long queues and frustrated passengers as a result of the operational disruptions. Gatwick Airport also experienced significant operational challenges due to the outage.
2. Media and Broadcasting
Regular programming and live television broadcasts were interrupted, and some TV channels were taken off the air. As The New York Times highlighted, entertainment and news dissemination were severely impacted in most areas that had outages.
3. Businesses
Various business sectors were affected, as many companies were unable to manage or process operations during the outage. For example, FedEx customers experienced package delivery delays, while Starbucks and Walmart struggled with mobile ordering and order management. Some debit and credit payment systems were also unable to process transactions.
4. Health Sector
Facilities such as Memorial Sloan Kettering Cancer Center, NHS facilities, and Brigham and Women’s Hospital faced serious disruptions and had to postpone or cancel non-urgent procedures because they were unable to access electronic medical records. They had to fall back on paper-based records, which brought its own challenges and inefficiencies.
How Can Such Outages Be Prevented: Possible Solutions and Recommendations
1. Incident Response Planning
- Predefined steps and clear communication strategies are needed to manage and mitigate the impact of future IT outages.
- Developing and regularly updating incident response plans should be mandatory and prioritized.
2. Improved Testing Protocols
- It is important to test updates rigorously before deploying them on a larger scale, in order to identify potential issues and avoid widespread disruptions.
- Companies should invest regularly in enhanced testing protocols to prevent similar outages in the future.
3. Diversification and Redundancy
- Technological solutions should be diversified, and redundancy should be built into critical systems to reduce the impact of a failure.
- Companies must reconsider policies of relying on a single technology service provider, to reduce the risk of widespread outages.
4. Invest in Enhanced Backup Systems
- Organizations need to have backup procedures for critical systems and data.
- They need to invest in advanced backup systems so that they can continue to operate if their principal systems fail.
5. Communication and Transparency
- Companies must maintain transparent communication with stakeholders and customers during outages or other system challenges.
- Providing clear information about the problem, regular updates, and expected resolution times can help in managing expectations.
6. Continuous Monitoring and Evaluation
- Companies should prioritize monitoring and evaluation practices that detect and address potential issues before they escalate.
- Taking a proactive rather than reactive approach helps to identify and prevent problems before they cause significant system challenges.
7. Investment in Cybersecurity
- It is highly recommended for organizations to invest in advanced cybersecurity measures.
- They must ensure that all systems, software, and security protocols are regularly updated to mitigate potential attacks.
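To make recommendation 6 (continuous monitoring) a little more concrete, here is a minimal sketch in Python of a monitor that watches recent health-check results and raises a flag when too many of them fail. The class name `FailureRateMonitor`, the window size, and the threshold are all assumptions chosen for illustration; they do not come from CrowdStrike or from any particular monitoring product.

```python
from collections import deque


class FailureRateMonitor:
    """Tracks recent health-check results and flags an unusual failure rate."""

    def __init__(self, window_size=100, threshold=0.05):
        # window_size and threshold are illustrative values, not recommendations.
        self.results = deque(maxlen=window_size)  # keeps only the newest results
        self.threshold = threshold

    def record(self, healthy):
        """Store the outcome of one health check (True means healthy)."""
        self.results.append(bool(healthy))

    def failure_rate(self):
        """Return the fraction of failed checks in the current window."""
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_alert(self):
        """Return True once the failure rate exceeds the threshold."""
        return self.failure_rate() > self.threshold
```

In practice, each device or service would report in periodically, and a real system would connect `should_alert()` to paging or to an automatic rollback rather than leaving it as a simple check.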
Conclusion
The global IT outage of 19 July 2024 has served as a reminder of our dependence on digital infrastructure and the problems that come with it. The defective software update from CrowdStrike triggered massive outages that deeply affected sectors such as broadcasting, travel, business, healthcare, and retail, to name a few. Thousands of flights were canceled, businesses faced operational instability, and healthcare services struggled to maintain routine care that depends on electronic records. The underlying problem is how the update was rolled out: companies must not force an update onto a large number of devices all at once.
When you deploy an update or patch to a single machine, it matters little whether the update works, because one machine is easy to fix. When you deploy to three or so devices in a specified region, you can still push the update to all of them at once; if something goes wrong, the impact is small because you can easily roll back all three. But spare a thought for the poor IT person who is told in the middle of the night that all 5,000 servers are stuck in a boot loop, and who now has to fix each of those 5,000 servers manually.
This is the very reason canary updates and canary analysis became a thing. When doing things manually no longer scales, the rollout is automated and monitored: the update goes to a small group of devices first, and if any issue shows up, the plug is pulled before every device is affected. That is why not every iPhone or Android device gets an update at the very same time.
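The canary idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `staged_rollout`, the wave sizes (1%, 10%, 100%), the 2% failure limit, and the `apply_update` and `is_healthy` callbacks are all hypothetical names and values, not how CrowdStrike or any specific vendor deploys updates.

```python
def staged_rollout(devices, apply_update, is_healthy,
                   wave_fractions=(0.01, 0.10, 1.0), max_failure_rate=0.02):
    """Deploy an update in growing waves, stopping at the first unhealthy wave.

    devices: list of device identifiers.
    apply_update, is_healthy: hypothetical callables supplied by the caller.
    Returns (updated_devices, halted).
    """
    updated = []
    for fraction in wave_fractions:
        # Cumulative number of devices that should be updated after this wave.
        target = max(1, round(len(devices) * fraction))
        wave = devices[len(updated):target]
        if not wave:
            continue
        for device in wave:
            apply_update(device)
        updated.extend(wave)
        # Canary analysis: check the devices that just received the update.
        failures = sum(1 for device in wave if not is_healthy(device))
        if failures / len(wave) > max_failure_rate:
            # Pull the plug: the remaining devices never receive the update.
            return updated, True
    return updated, False
```

With the 5,000 servers from the example above and these illustrative wave sizes, a bad update would reach only the first 50 machines before the rollout halts, which leaves 50 servers to repair instead of 5,000.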
Frequently Asked Questions (FAQ)
1. What caused the global IT outage on July 19, 2024?
This global IT outage was not a cyberattack. It was caused by a defect in a software update that CrowdStrike, a cybersecurity company, released for Microsoft Windows systems. The update was supposed to enhance cybersecurity for Windows devices but instead introduced a critical error that caused widespread system crashes around the globe.
2. How did the outage affect airports and airlines?
Thousands of flights were canceled, and customers experienced delays at airports. Airlines like Delta Air Lines and American Airlines faced severe system crashes, which led to long queues at airports and frustrated customers.
3. What impact did the outage have on businesses?
Several companies experienced order management and payment processing issues because the outage interrupted their operations. Service providers and retailers such as FedEx and Starbucks were affected and struggled with transaction processing and delivery delays.
4. How did the IT outage affect healthcare services?
Healthcare facilities, such as Memorial Sloan Kettering Cancer Center and Brigham and Women’s Hospital, faced challenges accessing electronic medical records and had to revert to manual operations. They experienced routine care disruptions and had to cancel non-urgent procedures.
5. What steps are being taken to fix the issue caused by the outage?
CrowdStrike has deployed a fix for the software defect. Recovery is an ongoing process because affected systems require manual reboots. The company has expanded its efforts to restore systems to normal as quickly as possible.
6. How can organizations prevent similar outages in the future?
Organizations can guard themselves against similar outages by investing in robust backup systems, improving their software testing protocols, and implementing incident response plans. They should consider spreading their technological solutions across several service providers and implementing advanced cybersecurity measures to mitigate risks such as these. During the disaster recovery process, companies must also maintain transparent communication so that affected stakeholders stay informed about the problem and the expected resolution time.