The First Steps of Quantitative Risk Management

Author: Ignacio Marambio Catán, CISA, CRISC, CEH, CISSP, Security+
Date Published: 1 May 2019

Things have been changing in the risk management arena over the last few years. Qualitative risk management methods such as those based on risk matrices (grid-based representations of the risk scenarios to which an enterprise is exposed) are slowly giving way to quantitative methods of evaluating risk such as Applied Information Economics (AIE)1 and Factor Analysis of Information Risk (FAIR).2 The reason is simple: Risk matrices do not really work. Worse, they lead to a false sense of security. Quantitative methods, by contrast, rest on sound statistical and theoretical grounds and have been used successfully by actuarial professionals for years.

As to why risk matrices do not work, the reasons are numerous. One is that most risk matrices have only two dimensions, impact and probability, and both rely on data binning (grouping values in a range by representing them with a single representative value within that range) into a scale along the lines of low, medium and high. Herein lies the problem: The meaning of these words is usually not clearly defined and, as a result, what is high impact to one stakeholder might not be to another. A related challenge is that, even when an explicit probability is given, it is usually not placed in a proper time frame and, when it is, it is not clear how the matrix deals with events happening more than once in that period.

Another problem is that humans are poor at predicting the consequences of their actions or the risk those actions entail due to certain cognitive biases, such as the gambler’s fallacy, confirmation bias, availability heuristic and many others.3


Examples of Cognitive Bias

Gambler’s Fallacy: When tossing a fair coin, tails is expected to show up half the time. Intuition suggests that if tails has come up several times in a row, heads must come up more often than tails in the following tosses to even the odds. However, because each toss is independent of the ones before it, this is not the case; the next toss is still 50/50, as the sketch after these examples shows.

Confirmation Bias: Belief is a powerful thing, so much so that people more easily remember evidence that supports an existing belief or reinterpret existing evidence to that same end. People also tend to seek out (and believe) information that comes from sources with which they share beliefs.

Availability Heuristic: People tend to recall recent events more easily, which is why they also tend to give more weight to evidence that can be recalled more easily.
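To make the gambler’s fallacy concrete, the following is a minimal R sketch (the number of tosses is arbitrary) that simulates many sequences of four fair coin tosses and checks how often heads follows three tails in a row; the proportion stays near 0.5 rather than rising.

  # Gambler's fallacy check: after three tails in a row, the next toss is still 50/50
  set.seed(42)
  tosses <- matrix(sample(c("H", "T"), 4 * 100000, replace = TRUE), ncol = 4)

  # Keep only the sequences that start with three tails and look at the fourth toss
  fourth_after_three_tails <- tosses[tosses[, 1] == "T" &
                                     tosses[, 2] == "T" &
                                     tosses[, 3] == "T", 4]

  mean(fourth_after_three_tails == "H")  # close to 0.5, not higher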


A First Example in R, Why Risk Matrices Do Not Work

The best way to show that a certain method does not work is to simply use it and check whether its results match reality. This approach works because any method used to measure risk is just a model. R will be used to test this method not only because it is free, but also because of its extensive array of libraries for statistical work and for exploratory analysis through plots. The final reason for using R is simply that it is an interesting programming language.

Models

A model is an approximation of reality. Models rely on boundaries, variables and relations among those variables. Risk matrices are such a model. A model is a good model when it has predictive power, that is, when the user plugs values that lie within the boundaries into the variables and the model’s output matches reality.

As an example, Newtonian gravity is a good model for predicting the orbits of most of the planets (except Mercury),4 but it is not useful for predicting what happens to matter falling into black holes at near the speed of light. An extreme example of an overly narrow model is the spherical cow, a humorous metaphor for a model so simplified that the boundary within which it predicts reality is too small to be useful for anything. It comes from a joke in which a physicist proposes a method to improve dairy production, with one constraint: It only works for spherical cows in a vacuum. The model might be perfect, but those conditions are not what reality dictates.

Black Swans

There are certain events that are unlikely enough that most statistically based models fail to predict their outcome (e.g., the 2008 financial crisis, Hurricane Katrina, the Fukushima Daiichi accident), but their impact is so massive that organizations should take them into account regardless of their likelihood. These are called black swans,5 and they have three properties:

  • They are rated as having a likelihood so low as to be meaningless to the observer.
  • Their impact is huge.
  • They are rationalized after the fact.

To illustrate these properties, the Tokyo Electric Power Company (TEPCO) engineers who modeled the possibility of major earthquakes in the area to assess the risk to the Fukushima power plant decided to use evidence from 1896 onward, ignoring evidence such as the 869 AD magnitude 8.4 Sanriku earthquake. This omission produced a model that was accurate most of the time, but that failed badly on 11 March 2011. The model might have seemed perfectly fine to those TEPCO engineers, but historians with knowledge of the Sanriku earthquake would have disagreed.

A Risk Matrix Model

This particular example assumes the use of a risk matrix such as the one in figure 1 that has two dimensions, impact and probability, and that both are measured as low, medium and high.

The first step in building this risk model involves a small transformation: clarifying what low, medium and high mean for both probability and impact. Assume that low probability means the probability of a risk-related event happening in a given period is between 0 and 0.3 (the event will happen between never and approximately once every three years), medium probability is between 0.3 and 0.7 (the event will happen between approximately once and twice every three years), and high probability is between 0.7 and 1 (the event will happen between more than twice every three years and roughly once a year). An event is of low impact if it costs between US $0 and $10,000, of medium impact if it costs between US $10,000 and $20,000, and of high impact if it costs between US $20,000 and $30,000. This is just an example, so these numbers need not be realistic; a real organization would derive them from its risk appetite, i.e., the level of risk the organization is willing to accept. The example also presumes that values within each bin are uniformly distributed.

This risk matrix is now used as an urn from which to draw events (think of the risk matrix as a die whose faces are the cells of the matrix; the die is rolled over and over again to simulate a risk that materializes), and the yearly cost of each event is calculated as Yc = P × I, where Yc is the yearly cost and P and I are the likelihood and the impact. This is akin to how annualized loss expectancy is used in financial risk management. The histogram in figure 2 is the result of repeating this process thousands of times.

Result
As can be seen, events that were labeled as high risk end up binned in the same place as some that were classified as low risk, and low- and medium-risk events overlap heavily, which means that risk matrices have very little predictive power, at least when it comes to risk measurements. The code to reproduce the plot is in figure 3.
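For readers without access to the figures, the following is a minimal sketch of the simulation just described, using the bin definitions above; the way a cell is mapped to an overall low, medium or high rating is an illustrative assumption, and this is not the code from figure 3.

  # Risk-matrix Monte Carlo sketch: draw events from the matrix and compute Yc = P * I
  set.seed(42)
  n <- 10000

  # Roll the "die": pick a probability level and an impact level for each event
  prob_level   <- sample(c("low", "medium", "high"), n, replace = TRUE)
  impact_level <- sample(c("low", "medium", "high"), n, replace = TRUE)

  # Translate each label into a value drawn uniformly from its bin
  prob <- ifelse(prob_level == "low",    runif(n, 0.0, 0.3),
          ifelse(prob_level == "medium", runif(n, 0.3, 0.7),
                                         runif(n, 0.7, 1.0)))

  impact <- ifelse(impact_level == "low",    runif(n, 0, 10000),
            ifelse(impact_level == "medium", runif(n, 10000, 20000),
                                             runif(n, 20000, 30000)))

  yc <- prob * impact  # yearly cost of each simulated event

  # Assumed mapping of matrix cells to an overall rating (one common convention)
  rating <- ifelse(prob_level == "high" | impact_level == "high", "high",
            ifelse(prob_level == "low" & impact_level == "low", "low", "medium"))

  # Overlaid histograms of yearly cost by rating show how much the bins overlap
  library(ggplot2)
  ggplot(data.frame(yc, rating), aes(x = yc, fill = rating)) +
    geom_histogram(alpha = 0.5, position = "identity", bins = 60) +
    labs(x = "Yearly cost (US $)", y = "Count", fill = "Matrix rating")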

Data Center Downtime, a Small Quantitative Risk Analysis Example

This second model concerns a small data center that has no uninterruptible power supply (UPS), a device used to temporarily power equipment in a data center during an outage. In the risk matrix days, this would probably be mapped, depending on the criticality of the services provided by the affected servers, as a high-impact, low- or medium-likelihood scenario, and would result in new hardware expenditure. But, as has been seen, risk matrices do not mean much. Another model is needed, one that uses probability distributions.

Probability Distributions

Simply speaking, given a set of events that accounts for the totality of possible outcomes of a given situation, also called the sample space, a probability distribution assigns a nonnegative value to each outcome in such a way that, when added together, all those values equal 1.

There are many such distributions, some of which can be seen in figure 4, and they can be used to model error rates, heights, weights, population counts and so on. For example, the height of a person can be modeled using a normal or Gaussian distribution (a bell-shaped probability distribution; an example can be seen in figure 4), along with some parameters such as the average height and the standard deviation.
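As a small illustration of the idea (the average and standard deviation below are assumptions made for the sake of the example, not figures from the article), heights can be simulated by drawing from a normal distribution:

  # Simulate adult heights from a normal distribution (illustrative parameters)
  set.seed(42)
  heights <- rnorm(10000, mean = 170, sd = 8)  # heights in centimeters
  hist(heights, breaks = 50, main = "Simulated heights", xlab = "Height (cm)")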

This model uses several of these probability distributions. To model how many outages occur in a year, a Poisson distribution is used.6 To model the duration of each outage, a log-normal distribution is used, and to model the cost of each outage hour, a normal distribution is used. The model is, of course, not perfect; it ignores the hour of the day and the time of year, which, for a retailer, for example, are critical.

A normal distribution used to predict the length of an outage might do well most of the time, but it would fail to predict long outages because it assigns insignificant probabilities to events five or six standard deviations from the mean, which is why distributions with heavier tails, such as the log-normal used here, might be preferred.

To illustrate another point, this model, plotted in figure 5, goes a step further and tries to analyze what will happen if the organization finally decides to buy and install a UPS. The cost of maintaining the UPS will be ignored but could be modeled as a constant.

Result
A few other things were assumed, such as how many outages happen yearly on average, the average outage duration in hours and the cost per hour of outage, all of which could be roughly estimated either by asking the right person or by consulting historical outage data, if such data exist. The plot shows how much money the organization might lose to outages in an average year and in a worst-case scenario. It also shows how much of an improvement an investment in a UPS for the data center might bring; an easy calculation would show how long it would take for the UPS to pay for itself in avoided losses. In any case, the results of this model are given in cost per year, which is much more meaningful to stakeholders than high, medium or low. Once again, the code to repeat the analysis is shown in figure 6.
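For readers without the figures, the following is a minimal sketch of this kind of analysis; the distribution parameters and the assumed effect of the UPS are illustrative, and this is not the code from figure 6.

  # Outage-cost Monte Carlo sketch: Poisson outage counts, log-normal durations,
  # normal cost per outage hour (all parameter values are illustrative assumptions)
  set.seed(42)
  n_years <- 10000

  simulate_year <- function(mean_outages, meanlog_hours, sdlog_hours,
                            mean_cost_hr, sd_cost_hr) {
    outages <- rpois(1, mean_outages)                      # outages this year
    if (outages == 0) return(0)
    hours   <- rlnorm(outages, meanlog_hours, sdlog_hours) # duration of each outage
    cost_hr <- rnorm(outages, mean_cost_hr, sd_cost_hr)    # cost of each outage hour
    sum(hours * cost_hr)                                   # total yearly outage cost
  }

  # Without a UPS: every utility outage reaches the servers
  no_ups <- replicate(n_years, simulate_year(4, log(2), 0.5, 1500, 300))

  # With a UPS: assume it rides through most short outages, so fewer and shorter
  # effective outages reach the servers
  with_ups <- replicate(n_years, simulate_year(1, log(1), 0.5, 1500, 300))

  summary(no_ups)            # average and spread of yearly losses without a UPS
  summary(with_ups)          # the same with a UPS
  quantile(no_ups, 0.95)     # a rough "bad year" figure without a UPS
  quantile(with_ups, 0.95)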

Decomposition

The model used for this problem treats the yearly cost of outages as a function of how many outages happen in a given year, how long each one lasts and the cost of each outage hour; that is, the outage model depends on three variables, each with its own associated probability distribution. Breaking a problem into several simpler problems in this way is called decomposition.

Accuracy and Precision

When measurements are being performed, care must be taken to also measure the accuracy and precision of a given instrument.

Consider someone who is 1 meter and 69 centimeters tall. If that person measured themselves with a meter stick and obtained a result of 1 meter, 73 centimeters and 2 millimeters, the result would be very precise, but not very accurate. If, instead, the result were between 1 meter 60 centimeters and 1 meter 80 centimeters, it would be accurate, but not very precise. A lack of precision or accuracy in a measurement leads to uncertainty and, in this case, that uncertainty would be propagated.

Uncertainty Propagation

When modeling a scenario in which decomposition is used, care must be taken with regard to uncertainty propagation, i.e., the variables used in the decomposition have to carry an uncertainty smaller than the uncertainty expected of the result, often much smaller.

The following formula can be used to estimate the resulting uncertainty Δy when y is a function of X = (X1, X2, … Xn) and each variable Xi carries an uncertainty ΔXi (assuming the Xi are independent):

Δy = √( Σi (∂y/∂Xi)² (ΔXi)² )
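As a minimal worked example of the formula, assuming the yearly cost function Yc = P × I from the earlier example and hypothetical uncertainties for P and I:

  # First-order uncertainty propagation for Yc = P * I (values are illustrative)
  propagate_yc <- function(P, I, dP, dI) {
    dYc_dP <- I  # partial derivative of Yc with respect to P
    dYc_dI <- P  # partial derivative of Yc with respect to I
    sqrt((dYc_dP * dP)^2 + (dYc_dI * dI)^2)
  }

  # Example: P = 0.5 +/- 0.1 and I = US $15,000 +/- US $2,000
  propagate_yc(P = 0.5, I = 15000, dP = 0.1, dI = 2000)  # approximately US $1,803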

The Rule of Five

One of the usual reasons proponents of qualitative risk analysis dismiss quantitative risk analysis is that there is often not enough information to create a proper model. What those proponents usually do not explain is how qualitative analysis does a better job with less information. It does not, as can be seen using the rule of five, which states that “There is a 93.75% chance that the median of a population is between the smallest and largest values in any random sample of five from that population.”7

The validity of this rule can be shown statistically but, since the Monte Carlo method has been used throughout to perform the analysis and draw conclusions, it will also be used to show the validity of the rule. As before, an experiment is repeated many, many times and the results are tallied. This experiment seeks to determine whether the true median of a population can be estimated using just five members of that population.

To simulate populations, a probability distribution is used. From every population, a random sample of five members is drawn and the population’s median is calculated. If the median falls within the range of the sample (that is, it is smaller than the largest of the five values and larger than the smallest), that instance of the experiment is a success; otherwise it is a failure. The percentage of successful instances is calculated and plotted, and this is repeated many, many times with populations of different sizes. The resulting plot can be seen in figure 7.

Slightly modified code from the Data Driven Security blog8 was used for this demonstration. The only changes were to parallelize the calculation and to use fewer iterations, both to save time; the results are the same.
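The following minimal sketch conveys the same idea (it is not the modified Data Driven Security code mentioned above; the population distribution and sizes are arbitrary):

  # Rule-of-five check: how often does the population median fall between the
  # smallest and largest values of a random sample of five?
  set.seed(42)
  trials <- 10000

  rule_of_five_hit <- function(pop_size) {
    population <- rlnorm(pop_size, meanlog = 3, sdlog = 1)  # an arbitrary skewed population
    s <- sample(population, 5)                              # random sample of five
    med <- median(population)
    med >= min(s) && med <= max(s)                          # did the sample range capture the median?
  }

  # The proportion of successful trials should hover around the theoretical 93.75%
  mean(replicate(trials, rule_of_five_hit(pop_size = 1000)))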

Unanswered Questions

This is just a small part of what can be done using probabilistic methods to measure risk, but there is much left to investigate. As examples, the following should be considered:

  • Experts provide important, often invaluable information. How do they enter into the quantitative risk management picture?
  • Enterprises are moving targets. How are quantitative risk management models updated over time?
  • Where would a risk analyst start assessing the risk of an enterprise?
  • How would an analyst decide what to measure first to decrease uncertainty?
  • How can an analyst use this information to build a risk portfolio?
  • How are human behavior and errors modeled?
  • How are software failures modeled?

All these questions have answers to a varying degree and should be investigated further.

Conclusion

There are better alternatives to risk matrices and, with a little time and effort, it is possible to manage risk using terminology and methods that everyone can, at least intuitively, understand. These methods are neither new nor poorly understood; they have been used for decades in the aerospace, nuclear, financial and insurance industries with relative success. IT is no more complex than these fields, yet the risk management frameworks IT relies on to this day are sorely lacking. It is time to grow.

Endnotes

1 Hubbard Decision Research, “Applied Information Economics (AIE),” https://www.hubbardresearch.com/about/applied-information-economics/
2 The Open Group, “Open FAIR Certification,” www.opengroup.org/certifications/openfair
3 Kahneman, D.; Thinking, Fast and Slow, MacMillan Publishers, USA, 2013
4 Lemmon, T.; A. Mondragon; “Kepler’s Orbits and Special Relativity in Introductory Classical Mechanics,” 21 April 2016, https://arxiv.org/pdf/1012.5438.pdf
5 Taleb, N.; The Black Swan, Penguin, UK, 2008
6 Li, H.; L. Treinish; J. R. M. Hosking; “A Statistical Model for Risk Management of Electric Outage Forecasts,” IBM Journal of Research and Development, vol. 53, iss. 3, July 2010, https://www.researchgate.net/publication/224138151_A_statistical_model_for_risk_management_of_electric_outage_forecasts
7 Hubbard, D.; How to Measure Anything: Finding the Value of Intangibles in Business, Wiley, USA, 2007
8 Jacobs, J.; “Simulating the Rule of Five,” Data Driven Security, 16 November 2014, http://datadrivensecurity.info/blog/posts/2014/Nov/hubbard/

Ignacio Marambio Catán, CISA, CRISC, CEH, CISSP, Security+
Has been an IT consultant for half of his professional career, after which he moved into security operations, where the main objective was compliance. He has also held audit-related positions with a focus on risk measurement. It was during his time in security operations that he became a believer in automation, an approach he later applied while auditing systems. Lately, Catán has been reading about machine learning and has completed a Microsoft Professional Program in Data Science. In 2016, Catán was awarded the Certified Information Systems Auditor (CISA) Geographic Excellence Award.