Post-Mortem: Navigating Today's CrowdStrike Outage with Effective Disaster Recovery Strategies

This morning, the IT world experienced a significant disruption due to a widespread outage involving CrowdStrike and Windows environments. Many businesses found themselves grappling with unexpected downtime, and the effects rippled across various industries. At Barricade Networks, we are proud to report that our operations, and those of the businesses we support, were swiftly restored, minimizing downtime and impact.

It is worth noting that CrowdStrike is a fantastic product that we love using as endpoint protection for larger clients. Their Falcon Complete system gives unparalleled protection against many of the technology "big bads" that exist in our world today. It is only one tool in the belt, though, and today is a good reminder of that.


Understanding the Outage

The CrowdStrike outage posed a significant challenge for IT infrastructures worldwide. Going by CrowdStrike's own workaround guidance (reproduced later in this post), the trigger was a faulty channel file that caused affected Windows hosts to crash, in some cases again after a reboot. With those endpoints inoperative, businesses faced immediate threats to their operational continuity. Such disruptions can lead to severe consequences, including data loss, financial setbacks, and reputational damage.

Our Resilience Strategy

At Barricade Networks, we prioritize preparedness and resilience. Our rapid recovery from this outage was not a stroke of luck but a testament to our comprehensive disaster recovery strategies. Here's how we managed to bounce back effectively:

1. Proper Backups

We have always emphasized the importance of regular, secure backups. Our robust backup protocols ensure that we can restore critical data and systems swiftly in case of any disruption. These backups are stored in multiple secure locations, both on-premises and in the cloud, ensuring that we always have access to the latest data snapshots.

2. Multiple Secure Access Methods

Redundancy is a cornerstone of our IT strategy. By implementing multiple secure ways of accessing our environments, we ensure that our teams can continue their work even when primary systems fail. This includes secure VPNs, remote desktops, and alternative cloud-based solutions. These measures enabled us to maintain operational continuity while we addressed the outage.

3. Chain of Command and Disaster Recovery Plan

Our success in navigating this crisis can also be attributed to our well-defined chain of command and disaster recovery plan. Every team member understands their role in a crisis, ensuring swift decision-making and coordinated efforts. Our disaster recovery plan is regularly updated and tested, allowing us to respond to emergencies with confidence and efficiency.


Purported Fix

CrowdStrike has published a fix on their blog. Here are the workaround steps from that article as of this posting:


Workaround Steps for individual hosts:

  • Reboot the host to give it an opportunity to download the reverted channel file. If the host crashes again, then:
      • Boot Windows into Safe Mode or the Windows Recovery Environment
        • NOTE: Putting the host on a wired network (as opposed to WiFi) and using Safe Mode with Networking can help remediation.
      • Navigate to the %WINDIR%\System32\drivers\CrowdStrike directory
      • Locate the file matching “C-00000291*.sys”, and delete it.
      • Boot the host normally.

    Note: BitLocker-encrypted hosts may require a recovery key.
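
For teams that would rather script the "locate and delete" step than do it by hand, the following is a minimal Python sketch, not CrowdStrike's tooling. It only uses the directory and file pattern quoted in the steps above. The function names and the dry-run default are our own illustrative choices, and the sketch assumes Python is available wherever the drivers directory is reachable, which is more likely on a helper machine with the disk mounted than in Safe Mode on the affected host.

```python
from pathlib import Path

# File pattern quoted from the workaround steps above.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def find_channel_files(drivers_dir):
    """Return the files directly inside drivers_dir whose names match the pattern."""
    directory = Path(drivers_dir)
    return sorted(p for p in directory.glob(CHANNEL_FILE_PATTERN) if p.is_file())


def remove_channel_files(drivers_dir, dry_run=True):
    """List matching channel files and delete them unless dry_run is True.

    Hypothetical helper for illustration. It returns the list of matching
    paths so the caller can log exactly what was (or would be) removed.
    """
    matches = find_channel_files(drivers_dir)
    for path in matches:
        if not dry_run:
            path.unlink()  # delete only the files that matched the pattern
    return matches
```

Calling remove_channel_files with the expanded path of %WINDIR%\System32\drivers\CrowdStrike (or the equivalent path on a mounted volume) first reports the matches without touching anything. Passing dry_run=False performs the deletion. Defaulting to a dry run is a deliberate choice here, since deleting files from a drivers directory is not something to do by accident.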

Workaround Steps for public cloud or similar environment including virtual:

Option 1:

    • Detach the operating system disk volume from the impacted virtual server
    • Create a snapshot or backup of the disk volume before proceeding further as a precaution against unintended changes
    • Attach/mount the volume to a new virtual server
    • Navigate to the %WINDIR%\System32\drivers\CrowdStrike directory
    • Locate the file matching “C-00000291*.sys”, and delete it.
    • Detach the volume from the new virtual server
    • Reattach the fixed volume to the impacted virtual server

Option 2:

  • Roll back to a snapshot before 0409 UTC.
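
When many snapshots exist, picking the right one for Option 2 is a simple filter: keep the snapshots taken strictly before the cutoff and choose the newest. The sketch below shows that logic in Python under stated assumptions. The (snapshot_id, created_at) tuples are a hypothetical stand-in for whatever your cloud provider's API returns, and the guidance above gives only the time of day (0409 UTC), so the calendar date must be supplied by the caller.

```python
from datetime import date, datetime, time, timezone


def cutoff_on(day):
    """Build the 04:09 UTC cutoff for a caller-supplied calendar date."""
    return datetime.combine(day, time(4, 9), tzinfo=timezone.utc)


def latest_snapshot_before(snapshots, cutoff):
    """Return the newest (snapshot_id, created_at) pair taken strictly before cutoff.

    snapshots is an iterable of (snapshot_id, created_at) tuples with
    timezone-aware datetimes. This is a hypothetical shape for illustration,
    not any provider's real API. Returns None if nothing qualifies.
    """
    candidates = [(sid, ts) for sid, ts in snapshots if ts < cutoff]
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[1])


# Illustrative data only: the IDs, timestamps, and date are assumptions.
example_cutoff = cutoff_on(date(2024, 7, 19))
example_snapshots = [
    ("snap-a", datetime(2024, 7, 18, 23, 0, tzinfo=timezone.utc)),
    ("snap-b", datetime(2024, 7, 19, 4, 0, tzinfo=timezone.utc)),
    ("snap-c", datetime(2024, 7, 19, 6, 0, tzinfo=timezone.utc)),
]
chosen = latest_snapshot_before(example_snapshots, example_cutoff)
```

Keeping the comparison in timezone-aware UTC avoids the easy mistake of comparing a local-time snapshot timestamp against a UTC cutoff. Any data written after the chosen snapshot is lost on rollback, so this option trades recent changes for speed.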

If you need further assistance or encounter any issues while applying this fix, please do not hesitate to reach out to our support team at Barricade Networks.

Lessons Learned

While we are pleased with our rapid recovery, we also recognize the importance of continuously improving our strategies. Here are some key takeaways from today's incident:

  • Regular Testing: Continuously test and update disaster recovery plans to adapt to new threats and vulnerabilities.
  • Communication: Maintain clear and open communication channels with all stakeholders during a crisis to ensure transparency and coordinated efforts.
  • Investment in Security: Regularly invest in advanced security measures and backup solutions to safeguard against future disruptions.

Supporting Our Clients

Our commitment to our clients extends beyond our own operations. We have ensured that all businesses under our support umbrella experienced minimal downtime and quick restoration of services. By applying the same resilience strategies, we helped our clients navigate this outage with confidence and minimal disruption.


Conclusion

Today's CrowdStrike outage was a stark reminder of the importance of robust disaster recovery strategies. At Barricade Networks, our proactive approach and commitment to resilience ensured that we, and those we support, remained operational in the face of adversity. We will continue to refine our strategies, ensuring we are always prepared for the unexpected and can provide our clients with unwavering support.

If your business is still recovering from this outage or if you need assistance in bolstering your disaster recovery plan, Barricade Networks is here to help. Contact us to learn how we can support your IT needs and ensure your business's resilience in an ever-evolving digital landscape.