Is Business Continuity Management Still Relevant?

Author: Spiros Alexiou, Ph.D., CISA, CSX-F, CIA
Date Published: 1 May 2022

Business continuity refers to the ability of organizations to minimize losses and keep functioning and fulfilling their missions, even under extremely adverse circumstances. But this is costly and, in current times, where cost cutting is a key concern, there is pressure to cut down on things that do not contribute directly to the bottom line. However, business continuity plays an essential role in today’s landscape.

The International Organization for Standardization (ISO) defines “business continuity” as the “capability of an organization to continue the delivery of products and services within acceptable time frames at predefined capacity during a disruption.”1

As shown in figure 1, normal operations correspond with little stress (i.e., those under the green line). Higher-stress situations gradually build to the emergency phase and, if not contained, can evolve into a crisis and peak at that stage. Business continuity management is the phase between the peaking of the crisis and back-to-normal operations.

In a crisis—that is, when a disruptive event with serious adverse effects occurs—the main priorities are:

  1. Protection of human life
  2. Protection of property, if possible (e.g., removing high-value assets in case of fire)
  3. Containment and end of the event (e.g., extinguishing the fire)

Business continuity then becomes the focus, including resuming an adequate level of business activities after, or even possibly during, the disruptive event (e.g., in the containment phase).

Risk Addressed by Business Continuity Management

Business continuity management (BCM) concerns existential risk to an organization and, ideally, the rare materialization of such risk. Natural disasters and cyberattacks are not the only existential threats; creative accounting and other manipulations have proven to be just as damaging.2

Nevertheless, risk that is knowingly assumed is rarely touched on by audit in practice because management (hopefully and wishfully) views consequences as rare, and few organizations have achieved sufficient audit independence for auditors to act freely. However, business continuity is an audit concern.

In principle, there are four main ways to address risk:

  1. Avoidance: Although some business risk can be avoided, avoiding the risk of a disruptive, business-threatening event altogether is not an option for any organization. There are simply too many variables outside an organization’s control that can create a disruption.
  2. Transfer: Insurance-type solutions are certainly an option, and various forms of insurance are now available, such as cyberinsurance.3 One problem with this option is that insurance does not cover everything, and it is by no means certain that if a disruptive event occurs, it will be the type of event the insurance contract covers. Insurance offers some financial protection against specific risk, such as fires or floods. However, some insurance solutions (e.g., cyberinsurance) are not well tested in practice, since a disaster could potentially require rebuilding the entire organization. Entrusting the organization’s future to litigation is always risky. Insurance solutions may be part of a business continuity strategy, but they are rarely the sole or even main component. In some cases, certifications based on industry, notably the shipping industry, or in relation to certain types of risk, such as fire, may warrant premium reductions. However, it is still unclear whether spending the resources required to obtain a business continuity certification is more effective than investing the same resources in prevention or recovery readiness efforts.
  3. Mitigation and Acceptance: Risk reduction or mitigation is a valid and important approach and, ideally, it should be applied in a cost-effective way. However, contingency planning is not obsolete; it is a last line of defense after controls, no matter how strong they were thought to be, have failed. For instance, many organizations employ failover clusters and various types of synchronous or near-synchronous replication. (The difference is whether what is confirmed is the reception of the data or the actual replication of the data at the backup site.) That means that when a natural disaster wipes out a data center, for example, IT operations can continue seamlessly from the other nodes. This is a valuable mitigation strategy in many cases, but it does not equal an effective disaster recovery (DR) plan. For instance, if ransomware infects the main site and is replicated to the failover sites, this solution is no longer viable. Checking everything that is being replicated for possible malware may be beyond the capabilities of a replication scheme. A solution might then be dated, ransomware-free backups, which, ideally, have undergone testing.

Another extreme possibility is that a disruptive event, such as a major fire, incapacitates a utilities network, and the failover fails. Few organizations are likely to have bothered with backups when they just implemented a failover, yet this is the type of situation where DR and BCM come into play: rare, once-in-a-blue-moon situations when everything else fails in such a way that the organization’s survival is at stake.
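The ransomware scenario above, in which replication faithfully copies the infection and only dated backups remain, can be sketched in a few lines. This is a minimal illustration, not a recovery tool: the `Backup` record, the `restore_tested` flag, the dates and the estimated infection time are all hypothetical, and in practice establishing when the infection began is itself a hard problem.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Backup:
    taken_at: datetime     # when the backup was created
    restore_tested: bool   # True if a test restore of this backup succeeded


def latest_clean_backup(backups: List[Backup], infected_at: datetime) -> Optional[Backup]:
    """Return the newest restore-tested backup taken before the estimated infection time."""
    candidates = [b for b in backups if b.taken_at < infected_at and b.restore_tested]
    return max(candidates, key=lambda b: b.taken_at, default=None)


# Hypothetical weekly backups; the second one was never restore-tested.
backups = [
    Backup(datetime(2022, 3, 1), restore_tested=True),
    Backup(datetime(2022, 3, 8), restore_tested=False),
    Backup(datetime(2022, 3, 15), restore_tested=True),
    Backup(datetime(2022, 3, 22), restore_tested=True),
]

# Assumed (hypothetical) estimate of when the ransomware entered the environment.
estimated_infection = datetime(2022, 3, 18)

chosen = latest_clean_backup(backups, estimated_infection)
print(chosen)
```

The backup of 22 March is newer but postdates the assumed infection, so it is skipped. If no tested backup predates the infection, the function returns `None`, which is the situation the text warns about when a failover is implemented without backups.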

This is also why risk acceptance is problematic. Probabilistic cost-benefit reasoning does not apply here. For example, say it will take 30 days and cost X to get an old, abandoned warehouse to function as a new data center (assuming backups are available), while it will cost 200X to create a DR solution. If the probability of an adverse event is one in a million and risk is cost (impact) × probability, it appears much cheaper to pick the first option and not spend time and effort on having a DR plan in place. However, this reasoning is faulty: first, because statistics reveal nothing about a single entity (an individual enterprise only cares about protecting itself) and second, because after 30 days there may not be many customers left, and regaining them will very likely take much longer than 30 days. While risk management considers the likelihood of scenarios (e.g., the risk of a flood or fire damaging IT operations), business continuity considers the effects of a disruptive event (e.g., loss of the IT site), regardless of the cause (e.g., flood, fire) or the likelihood of the cause.
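The arithmetic behind this argument can be made explicit. The sketch below uses the figures from the example (a one-in-a-million probability, X for the improvised warehouse, 200X for a DR solution) and expresses all costs in units of X. The daily loss during downtime is a purely hypothetical number added for illustration; the text gives no such figure.

```python
# All costs are expressed in units of X, the cost of converting the warehouse.
P_EVENT = 1e-6          # probability of the adverse event (from the example)
COST_IMPROVISE = 1.0    # X: rebuild in an abandoned warehouse after the event
COST_DR_PLAN = 200.0    # 200X: build a DR solution in advance
DOWNTIME_DAYS = 30      # downtime of the improvised option (from the example)

# HYPOTHETICAL assumption, not from the text: business lost per day of downtime.
DAILY_LOSS = 10.0


def expected_cost_without_dr(p_event: float, rebuild_cost: float) -> float:
    """Classic risk formula: cost (impact) x probability."""
    return p_event * rebuild_cost


def impact_if_event_occurs(rebuild_cost: float, downtime_days: int, daily_loss: float) -> float:
    """Loss borne by the one organization actually hit, with no probability weighting."""
    return rebuild_cost + downtime_days * daily_loss


expected = expected_cost_without_dr(P_EVENT, COST_IMPROVISE)
conditional = impact_if_event_occurs(COST_IMPROVISE, DOWNTIME_DAYS, DAILY_LOSS)

print("Probability-weighted cost of improvising:", expected, "X")
print("Up-front cost of a DR plan:", COST_DR_PLAN, "X")
print("Loss if the event actually happens:", conditional, "X")
```

The probability-weighted figure makes the DR plan look like a very poor investment. The conditional figure, which is the one business continuity is concerned with, grows with every day of downtime and can exceed the cost of the DR plan under the assumed daily loss, before even counting customers who never return.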

Cloud solutions offer a number of advantages in terms of risk mitigation. They can be used to improve resilience and possibly aid business continuity, but they are not problem free either. The recent SolarWinds incident4 demonstrated how complex the interdependencies are and how a breach in one trusted enterprise can affect many others. It is no secret that public cloud services are an attractive target.5, 6, 7 In addition, cloud contracts often provide either no compensation8 or compensation that is extremely low9 to cover adverse events, even those due to the cloud provider’s fault—something that would be very hard to prove in any case. Also, for a number of technical and legal/regulatory reasons, cloud premises may be geographically close, hence, possibly affected by the same adverse event that disrupted operations running on premises.

Business Continuity Is More Than IT

The threats an enterprise faces are not limited to loss or compromise of its IT assets, although IT is definitely a significant component. It may, perhaps, be the dominant component, such as for an organization selling IT services. For other organizations, the dominant component may involve production facilities, storage facilities or even partner facilities. For organizations that adopt the just-in-time (JIT) model10—that is, they do not keep stock, but instead rely on their partners to deliver as demand dictates—the consequences of a partner suffering a catastrophic event can be devastating.

Today, a great deal of production takes place in countries where production costs are lower. There are often concerns over political stability in those countries, and their civil infrastructure is often weaker. A catastrophic event in the country where production is taking place can have a direct, devastating effect on production and storage of goods. It can also indirectly affect the entire supply chain, leading to devastating consequences because the products directly affected may be components of other products.

Because insurance-type solutions and service level agreements (SLAs) typically do not cover force majeure and many have not been thoroughly tested, they will generally be unable to promptly restore production. BCM in these circumstances might involve having agreements with alternate providers—preferably located in a different region or country (and having tested their products)—or keeping extra stock, even if the organization embraces the JIT model.

Current and Emerging Threats

Man-made disasters are a very real possibility. They can result both from internal organizational activities, as in the creative accounting example, and from external factors, such as political pressure and legal or regulatory issues. Organizations know that external disruptions are a possibility; therefore, the best way to prepare is a non-IT-related BCM plan.

Pandemics and COVID-19
The COVID-19 pandemic is a painful reminder that a tightly interconnected world comes with interconnected problems, and it is difficult to contain a true disaster within regional or national borders. While there are many positives to having such an interconnected world, in this case it also means that containing the pandemic is much harder.

Another example of a trend that has both pros and cons is centralization, which has many advantages, but carries the risk of single point of failure (SPOF).

Pandemics, such as the ongoing COVID-19 crisis, are a category of disruptive events that BCM must take into account because they have specific features that affect business differently, such as:

  • In contrast to other disruptive events, such as floods or earthquakes, pandemics can last much longer.
  • Due to their global nature, the assumption that some sites and some personnel will be available is not necessarily true.
  • Because they are pervasive, their effects are generally more serious and less predictable than those of other disruptive events. For example, economic activity throughout the world may slow down, which may affect the delivery of goods and services on which an organization relies. Reduced availability of personnel could also have a negative effect.

The severe acute respiratory syndrome (SARS) outbreak of 2003, which spread to more than two dozen countries, was the first serious transmissible new disease of the 21st century.11 Compared with the time of the SARS outbreak, some conditions were already in place that aided some organizations’ COVID-19 responses, including:

  • Significantly better infrastructure and readiness for remote work
  • Higher penetration of cloud services, including backup solutions
  • More widespread JIT and lean production models
  • More government readiness to support businesses

However, there were also conditions in place that made the effects of the COVID-19 pandemic worse, including:

  • Many work environments used open space/open office12 workspace concepts, resulting in employees working in closer proximity to each other.
  • There was greater geographical interconnectedness among customers, partners and workers.
  • The cyberenvironment often affected the security of remote work. Phishing scams—which are always a threat, especially with the evolution of deepfake technologies—multiplied as remote work provided more opportunities for attackers. Although there was no shortage of meeting and videoconferencing tools, not all of them were built with enterprise security specifications.13 Naturally, as threats evolve, so do defenses. For instance, immutable storage was used to protect against ransomware.14 However, defenses add cost and may not always be practical.

The main response to the COVID-19 crisis was, at least initially, to shift to remote work, when possible, and to institute protective measures when not. Some organizations started charting the skills of their personnel to find the best replacements in their internal workforce if certain staff members should become unavailable.

Remote work was not an a priori given. Many organizations had set their virtual private network (VPN) dimensions to support a much smaller remote workforce (i.e., 10–20 percent of the total workforce). With the possibility that home PCs might be unpatched, endpoint security was another concern.
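The VPN sizing problem is simple arithmetic, sketched below. The workforce size and the share of staff suddenly working remotely are hypothetical; only the 10–20 percent provisioning range comes from the text.

```python
def vpn_capacity_shortfall(workforce: int, provisioned_pct: int, remote_pct: int) -> int:
    """Concurrent VPN sessions missing when remote_pct percent of staff connect at once.

    Percentages are whole numbers (e.g., 15 for 15 percent) to keep the arithmetic exact.
    """
    provisioned = workforce * provisioned_pct // 100
    needed = workforce * remote_pct // 100
    return max(0, needed - provisioned)


# Hypothetical organization: 1,000 staff, VPN sized for 15 percent of them,
# and 90 percent of staff sent home at short notice.
shortfall = vpn_capacity_shortfall(workforce=1000, provisioned_pct=15, remote_pct=90)
print("Missing concurrent VPN sessions:", shortfall)
```

A check of this kind only covers session counts. Bandwidth, licensing and the patch level of home PCs, which the text also raises, would need separate assessment.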

Climate Change
Another important risk is climate change. In many cases, organizations find that previously robust support for critical functions such as IT is no longer adequate because of climate change.

For instance, IT depends on air conditioning (AC), and prolonged extremely hot temperatures in the summer can rapidly strain AC performance, not to mention drastically increase power consumption. These problems can be exacerbated by prolonged power failures that auxiliary power systems cannot alleviate. Summer fires and winter floods can result in prolonged power failures, even if an organization’s critical infrastructure is not directly affected.

Meeting the Challenges of Ensuring Business Continuity

International Organization for Standardization (ISO) standard ISO 22301 Security and Resilience—Business Continuity Management Systems—Requirements proposes a circular plan-do-check-act approach in a full BCM system to combine improved resilience with the capacity to promptly and adequately respond to crises in a way that ensures organization survival.15 This remains the best approach. It involves planning for the necessary measures (plan), implementing them (do), verifying the extent to which they work (check), and correcting and improving any areas where resilience or response are found—or in the future may be found—lacking (act).

RTOs and RPOs
When IT continuity is the issue, business continuity is, at least initially, about meeting properly set recovery time objectives (RTOs) and recovery point objectives (RPOs)—that is, getting back to the operation of all essential functions in an acceptable time frame. An RTO specifies the time in which essential functions must be back up and running—perhaps not as efficiently or securely as before the incident but running nonetheless—while limiting important data loss to an acceptable amount. An RPO specifies the amount of critical data, such as billing records, an organization can afford to lose due to the incident. If a failover solution has been implemented and works as intended during the event in question, there is basically no failure and no data loss.
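A minimal sketch of how an RTO and an RPO might be checked against an actual recovery follows. It assumes the RPO is expressed as a time window of lost data rather than a volume, and every name, timestamp and threshold in it is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Objectives:
    rto: timedelta  # maximum tolerable time until essential functions run again
    rpo: timedelta  # maximum tolerable window of lost data


def evaluate_recovery(obj: Objectives, failure_at: datetime,
                      last_good_copy_at: datetime, restored_at: datetime) -> dict:
    """Compare observed downtime and data loss with the stated objectives."""
    downtime = restored_at - failure_at
    data_loss_window = failure_at - last_good_copy_at
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= obj.rto,
        "rpo_met": data_loss_window <= obj.rpo,
    }


# Hypothetical billing system: back within 4 hours, at most 1 hour of records lost.
billing = Objectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))

result = evaluate_recovery(
    billing,
    failure_at=datetime(2022, 5, 1, 10, 0),
    last_good_copy_at=datetime(2022, 5, 1, 9, 30),
    restored_at=datetime(2022, 5, 1, 15, 0),
)
print(result)
```

In this made-up incident the last good copy is 30 minutes old, so the RPO holds, but service returns after five hours, so the RTO is missed. With a working failover both values would be close to zero, which matches the observation that there is then basically no failure and no data loss.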

An incident such as this must be analyzed, and conclusions may be drawn from it, but it is neither a disaster nor a BCM failure. It is essentially the same as a foiled cyberattack. The defenses designed for everyday operations held. It would become a disaster or business continuity disruption if the everyday operational defenses failed. An example of a disaster would be ransomware infiltrating an organization’s main sites and successfully replicating in its failover sites.

An organization’s RTO and RPO are determined by a business impact analysis (BIA) of all enterprise processes, including supporting systems and dependencies. Given the existing alternative solutions (such as working from spreadsheets), two questions must be answered:

  1. How long can the organization live without system X?
  2. How many hours or days of data in system X can the organization lose and still survive with tolerable losses?

The answers are important because backing up or duplicating everything generally adds to the cost of a BCM or DR plan, and the frequency and quantity of tolerable data loss are critical factors affecting cost. Furthermore, some data may not be necessary for the organization’s survival. For instance, the enterprise Intranet is typically less critical than billing.

It is true not only that all enterprise functions are not of equal value, but also that they have unequal backup needs. For example, legal documents typically do not change every minute and generally do not need to be available at all times. On the other hand, billing records should be available at the time billing is to be done, and they typically change as fast as sales occur.
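The outcome of a BIA can be pictured as a small table of systems with their tolerances. The sketch below is an illustration under assumed values: the three systems echo the examples in the text, but every RTO, RPO and backup interval shown is hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Dict, List


@dataclass(frozen=True)
class BiaEntry:
    system: str
    rto: timedelta  # how long the organization can live without the system
    rpo: timedelta  # how much recent data it can afford to lose


def recovery_order(entries: List[BiaEntry]) -> List[BiaEntry]:
    """Restore the systems with the tightest tolerances first."""
    return sorted(entries, key=lambda e: (e.rto, e.rpo))


def backup_gaps(entries: List[BiaEntry], backup_interval: Dict[str, timedelta]) -> List[str]:
    """Systems whose current backup interval is longer than their RPO."""
    return [e.system for e in entries if backup_interval[e.system] > e.rpo]


# Hypothetical BIA results.
entries = [
    BiaEntry("intranet", rto=timedelta(days=7), rpo=timedelta(days=7)),
    BiaEntry("billing", rto=timedelta(hours=4), rpo=timedelta(minutes=15)),
    BiaEntry("legal documents", rto=timedelta(days=2), rpo=timedelta(days=1)),
]

# Hypothetical current backup schedule.
current_intervals = {
    "intranet": timedelta(days=1),
    "billing": timedelta(hours=1),
    "legal documents": timedelta(days=1),
}

print([e.system for e in recovery_order(entries)])
print(backup_gaps(entries, current_intervals))
```

With these assumed values, billing is restored first and is also the only system whose hourly backups fall short of its RPO, while the intranet is backed up more often than its tolerance requires. Both findings bear on cost: one points to underprotection, the other to possible overspending.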

Testing Remains an Inconvenient Necessity

When planning for cases in which everyday operational defenses do not hold, it is imperative to consider the entire process. It is not enough, for instance, that data and machines be available if the necessary personnel are not—or if they are unable to get to the alternative site set up to respond to such an event.

The COVID-19 pandemic illustrated the need for a remote work infrastructure that can be leveraged to provide additional benefits, such as reduced operating costs due to a decreased need for electricity, and the ability of staff to cover key posts remotely.

Planning for cases in which the defenses do not hold brings up what is possibly the most challenging part of BCM: realistic testing. It is hard enough to get an organization to spend resources on what is regarded (hopefully) as a very unlikely disastrous event that likely would also affect competitors. But it is at least as hard to get people to practice for such an improbable event under conditions that might be realistic only if the unlikely event should materialize. In the case of the COVID-19 pandemic, many organizations used remote work tools (including security) that they evaluated and tested for the first time during the emergency.

When auditing business continuity, the auditor has a chance to pinpoint shortcuts and deficiencies. Deficiencies in testing are a prime place to look for important problems. Because testing is difficult and, to a certain extent, not as well recognized as other activities, assumptions tend to be made.

Sometimes organizations take for granted the utilities providing them with power and telecommunications, for example, or the services that ensure the functioning of various IT infrastructure components. The result is that partial testing is often performed. Finding ways to circumvent activities that are difficult and boring is a natural human tendency, but inadequate testing is a general red flag for audit.

Lack of a clear chain of responsibility is another. During a crisis, customers, partners and employees need to receive timely, authoritative and clear information about the situation, and they need to know what they are expected to do and how to do it.

Conclusion

Business continuity will always be relevant and important because it helps address existential risk to an organization. As such, it will remain an important audit topic. The risk is not going away; it is more likely that new risk will be added. And it is not only about IT, as risk to the existence of an organization is not limited to IT failures. Although measures such as failovers help mitigate the risk of a disruptive event happening, business continuity is essential so that, when a disruptive event does happen, a viable and tested plan exists to minimize the losses, resume critical operations in an acceptable time and ultimately return to normal operation. Audit in particular needs to retain a full picture of the risk and be skeptical of miracle solutions. Often new risk is introduced that can cause more damage because it has not been taken into account.

Endnotes

1 International Organization for Standardization (ISO), ISO 22301:2019 Security and Resilience—Business Continuity Management Systems—Requirements, Switzerland, October 2019, www.iso.org/standard/75106.html
2 Gupta, R.; “Creative Accounting Practices: A Case Study of Enron and Satyam Scandals,” International Journal of Research and Analytical Reviews, October–December 2018, http://ijrar.com/upload_issue/ijrar_issue_20542176.pdf
3 Embroker, “Cyber Liability Insurance,” www.embroker.com/coverage/cyber-insurance/
4 New York State Department of Financial Services, Report on the SolarWinds Cyber Espionage Attack and Institutions’ Response, USA, April 2021, www.dfs.ny.gov/system/files/documents/2021/04/solarwinds_report_2021.pdf
5 Kass, D.; “Public Cloud Cybersecurity and Cyberattacks: Research Findings,” MSSP Alert, 9 July 2020, www.msspalert.com/cybersecurity-research/public-cloud-cyberattack-trends/
6 Nichols, S.; “New RAT Campaign Abusing AWS, Azure Cloud Services,” SearchSecurity, 12 January 2022, www.techtarget.com/searchsecurity/news/252511913/New-RAT-campaign-abusing-AWS-Azure-cloud-services
7 Seals, T.; “Amazon, Azure Clouds Host RAT-ty Trio in Infostealing Campaign,” Threatpost, 12 January 2022, https://threatpost.com/amazon-azure-clouds-rat-infostealing/177606/
8 Amazon Web Services (AWS), “AWS Customer Agreement,” https://aws.amazon.com/agreement/
9 Microsoft, “Microsoft Cloud Agreement,” 2017, https://download.microsoft.com/download/2/C/8/2C8CAC17-FCE7-4F51-9556-4D77C7022DF5/MCA2017Agr_EMEA_EU-EFTA_ENG_Sep20172_CR.pdf
10 Banton, C.; “Just-in-Time,” Investopedia, 1 December 2021, www.investopedia.com/terms/j/jit.asp
11 World Health Organization, “Severe Acute Respiratory Syndrome (SARS),” www.who.int/health-topics/severe-acute-respiratory-syndrome#tab=tab_1
12 Bernstein, E.; B. Waber; “The Truth About Open Offices,” Harvard Business Review, November–December 2019, https://hbr.org/2019/11/the-truth-about-open-offices
13 Scroxton, A.; “Zoom Rapped Over Historic Security Practices,” Computer Weekly, 10 November 2020, www.computerweekly.com/news/252491827/Zoom-rapped-over-historic-security-practices
14 Weiss, D.; “What Is Immutable Cloud Storage?” Datto, 11 June 2021, www.datto.com/blog/what-is-immutable-cloud-storage
15 Op cit International Organization for Standardization

SPIROS ALEXIOU | PH.D., CISA, CSX-F, CIA

Has been an IT auditor at a large company for the last 14 years. He has more than 25 years of experience in IT systems and has written a number of sophisticated computer programs. He can be reached at spiralexiou@gmail.com.