When the Unthinkable Strikes: Key Strategies for SaaS Providers To Navigate Cybersecurity Risks and DevOps Failures

What happens when the impossible becomes reality? In the world of cybersecurity and software as a service (SaaS), “impossible” is not something organizations can leave to chance. The July 2024 CrowdStrike incident, which rendered approximately 8.5 million computers running Windows inoperable through an endless cycle of rebooting, has placed a spotlight on large SaaS providers that organizations around the world rely upon.

In light of that event, the ever-growing list of cyberthreats, and myriad technical issues that can compromise businesses’ ability to function, it is critical that SaaS providers enact and refine modern DevOps processes to deploy changes effectively while reducing their impact if something goes wrong. There are a few key process changes and enhancements a company should confirm a SaaS provider has in place:

  • Staggered deployment: Rather than pushing software and definition file updates out to all customers at the same time, this process releases updates over a set time period, limiting the number of users affected by any issues present in the new files.
  • Enhanced monitoring and logging: By actively monitoring processes and adding notifications when there is activity that is not expected, SaaS providers can detect future failures more easily and work to correct them.
  • Adding update controls: Offering end-users control over how and when to apply updates helps to reduce the risk of critical systems failing without backups to rely upon, and it enables users to segment their devices so they can update in phases, potentially avoiding a complete system failure.
  • Resiliency: Having the proper plan(s) implemented and tested to allow recovery from an issue with little to no impact to customers helps reduce the risk of system downtime. The plan may include backup and recovery, systems redundancy, and processes to provide high system availability.

These core DevOps practices can help SaaS providers mitigate some of the inherent risk when making changes to their environment, updating systems, or rolling out changes to the customer’s systems. 
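As an illustration of staggered deployment, the sketch below assigns each customer to a stable bucket and widens eligibility wave by wave. It is a minimal example under stated assumptions, not a description of any particular vendor's process: the wave percentages, the function names, and the customer and update identifiers are all hypothetical.

```python
import hashlib

# Hypothetical schedule: cumulative percentage of customers eligible in each wave.
WAVES = [1, 10, 50, 100]


def bucket(customer_id: str, update_id: str) -> int:
    """Map a customer to a stable bucket in [0, 100) for a given update."""
    digest = hashlib.sha256(f"{update_id}:{customer_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_eligible(customer_id: str, update_id: str, wave: int) -> bool:
    """Return True if the customer should receive the update in this wave (0-based)."""
    return bucket(customer_id, update_id) < WAVES[wave]


def wave_members(customers, update_id, wave):
    """List the customers who are eligible for the update in the given wave."""
    return [c for c in customers if is_eligible(c, update_id, wave)]
```

Because the bucket depends only on the identifiers, a customer who received the update in an early wave remains eligible in every later wave, and the provider can stop widening the rollout at any wave if problems appear.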


Small Steps Instead of Giant Leaps

SaaS companies may provide larger, but less frequent, updates to their platforms as a means of delivering comprehensive changes in a cost-effective manner. When those updates work as planned, that method can reduce disruption while giving the SaaS provider the time needed to implement new, large-scale options, solutions, or features and functionality for their user base. This mindset is central to a high-performing DevOps strategy.

But when a large update goes wrong, the impact on the user base can be more significant than when changes are released through smaller, more frequent updates. Although more frequent updates require more action from users, smaller updates tend to be quicker and less disruptive than larger ones. Each frequent update usually contains fewer changes, so if something within the update process does not work as intended, the change is less likely to render an entire platform, operating system, application, or production environment unusable. When combined with staggered deployment, smaller updates also help SaaS providers make adjustments before continuing their rollout. In this way, small, frequent releases allow companies to remain agile.

Other techniques can also reduce the risk to SaaS providers and their clients during the update process. Blue-green releases (running the new version in a parallel environment and switching traffic to it only once it is verified) and canary deployments (exposing the new version to a small subset of users first) can reduce the risk of downtime while allowing the provider to monitor the rollout and adjust the update process if further changes are required.
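A canary deployment generally needs an explicit rule for when to promote or roll back. The sketch below shows one possible gate that compares the canary's error rate with the baseline's. The class name, function name, and thresholds are hypothetical and would need tuning for a real service.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    """Request and error counts observed for one version over the same time window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline: HealthSample, canary: HealthSample,
                    min_requests: int = 500,
                    max_abs_increase: float = 0.01) -> str:
    """Return 'wait', 'rollback', or 'promote' for the canary version.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    if canary.requests < min_requests:
        return "wait"  # not enough traffic yet to judge the canary
    if canary.error_rate - baseline.error_rate > max_abs_increase:
        return "rollback"  # canary is clearly worse than the current version
    return "promote"
```

Requiring a minimum amount of canary traffic before deciding avoids promoting or rolling back on the strength of a handful of requests.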


Telemetry Tells the Story

Staggered deployments and agility are only helpful if an organization has data to act upon. Without telemetry providing transparent insights, SaaS providers can be left without critical information that can help the team managing the update mitigate issues before they turn into full-scale problems. To aid support staff in fixing those issues, telemetry should be clear, concise, and actionable. This helps emphasize a proactive approach by addressing issues as soon as possible rather than reacting after damage has already occurred.
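To make "clear, concise, and actionable" concrete, the sketch below condenses raw per-device update results into one line per rollout wave, each with a suggested action. The event fields (`wave`, `status`), the threshold, and the action labels are assumptions made for illustration only.

```python
from collections import Counter


def summarize(events, failure_threshold=0.05):
    """Group update results by wave and flag waves whose failure rate is too high.

    Each event is assumed to be a dict with a 'wave' number and a 'status'
    string, where any status other than 'ok' counts as a failure.
    """
    totals, failures = Counter(), Counter()
    for event in events:
        totals[event["wave"]] += 1
        if event["status"] != "ok":
            failures[event["wave"]] += 1

    report = []
    for wave in sorted(totals):
        rate = failures[wave] / totals[wave]
        report.append({
            "wave": wave,
            "devices": totals[wave],
            "failure_rate": round(rate, 4),
            "action": "pause_rollout" if rate > failure_threshold else "continue",
        })
    return report
```

A summary of this kind gives the team managing the update a single signal per wave, rather than a stream of individual device logs to interpret under time pressure.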


Staying on the Front Foot

Proactivity should not be limited to telemetry. SaaS providers should encourage a mature culture of risk mitigation, quality assurance, and product enhancement on an ongoing basis. That way, even when an unexpected, adverse event occurs, the organization has already established a culture of vigilance that lets it respond quickly, effectively, and decisively.


How BDO Can Help

The only question worse than asking, “What happens if the unthinkable happens?” is asking, “What happens if the unthinkable happens again?”

Getting an external, independent perspective can assist your organization in identifying possible areas of improvement that may have been overlooked. BDO’s team of experienced, knowledgeable professionals can provide guidance and offer options to consider to proactively enhance DevOps, site reliability engineering, and other holistic operating processes and procedures.