
Availability Risk Assessment—A Quantitative Approach

Date Published: 1 January 2010

Increased corporate governance requirements are causing enterprises to examine their internal control structures closely to ensure that controls are in place and operating effectively.1 Enterprises are becoming sensitized toward business risks and are actively engaging information systems (IS) auditors and IT governance professionals to fulfill governance, risk and compliance requirements. Enterprise risk management (ERM) frameworks in general, and ISACA’s Risk IT: Based on COBIT® framework in particular, are valuable contributions that are assisting practicing auditors and IT governance professionals in delivering standards-based audit programs and providing assurance on internal controls.

IS auditors and governance professionals are required to assess availability risk as part of the audit and review process, because system availability is an important parameter found in most ERM frameworks and IS security standards. The Risk IT framework considers “availability” risk as part of IT service delivery-related risks,2 and ISO 27001 considers it as part of overall security risk, which consists of “confidentiality,” “integrity” and “availability” (CIA) risks.3

Currently, an availability risk assessment is done by conducting failure mode and effects analysis (FMEA)4 on the inventoried information assets. The FMEA exercise provides a risk value and risk priority number (RPN) for every item listed in the inventory. Risk is a function of likelihood and impact, where “likelihood” is the frequency or probability of occurrence of an incident and “impact” is the effect on the business. By assigning a cardinal value for likelihood and impact, the risk value is determined using the equation: risk value = likelihood x impact. For example, if the likelihood that a system will be unavailable is scored at 3 on a scale of 0-6, and the corresponding impact is scored at 2 on a scale of 0-5, then the risk is valued at 6, which may be interpreted as medium risk. The risk value (i.e., 6 in this case) means nothing on its own; it serves only to rank the availability risk of the inventoried systems and services relative to one another. Because the scores for likelihood and impact are assigned intuitively, the risk value of the same asset is likely to vary across audits if the scores are assigned by different persons. A high-level survey of the literature shows that system availability has attracted only marginal attention from researchers.5 The proposed methodology is an attempt to bridge this gap by providing a quantitative approach for performing an availability risk assessment.
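
As a point of reference, the conventional FMEA-style scoring described above can be sketched in a few lines of Python. The likelihood and impact scales and the example scores come from the text; the qualitative bands used to interpret the result are assumptions added only for illustration.

    # Conventional FMEA-style scoring: risk value = likelihood x impact.
    # Likelihood is scored on a 0-6 scale and impact on a 0-5 scale, as in the text.
    def risk_value(likelihood: int, impact: int) -> int:
        """Return the risk value for one inventoried asset."""
        if not (0 <= likelihood <= 6 and 0 <= impact <= 5):
            raise ValueError("likelihood must be 0-6 and impact must be 0-5")
        return likelihood * impact

    value = risk_value(3, 2)  # example from the text: 3 x 2 = 6

    # Interpretation bands are assumed cut-offs, useful only for relative ranking.
    band = "low" if value <= 5 else "medium" if value <= 15 else "high"
    print(value, band)  # 6 medium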

It can be argued empirically that the availability percentage of a system or service is a good measure for quantifying availability risk. For example, if a system or service is rated for 99.5 percent availability, the risk is reflected directly in that figure. The value can be used to calculate and commit uptime. The IS auditor can audit the current availability percentage against the committed percentage and report accordingly.

The availability of a service depends on how often the service fails and how much time it takes to restore the service. The frequency of failure reflects the quality of the system, which is an offshoot of the system’s architectural capability, and the restoration time depends on the support capability. Mean time between failures (MTBF) measures the average time between failures, and mean time to repair (MTTR) measures the average restoration time. Using MTBF and MTTR, the availability percentage can be calculated as follows:6 availability percentage = MTBF / (MTBF + MTTR) x 100.
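
A minimal sketch of this formula in Python follows; the MTBF and MTTR figures used in the example are illustrative assumptions, chosen only to show how a rating of roughly 99.5 percent might arise.

    def availability_pct(mtbf_hours: float, mttr_hours: float) -> float:
        """Availability percentage = MTBF / (MTBF + MTTR) x 100."""
        return mtbf_hours / (mtbf_hours + mttr_hours) * 100

    # Assumed figures: a failure roughly every 2,000 hours with a
    # 10-hour restoration time yields about 99.5 percent availability.
    print(round(availability_pct(2000, 10), 3))  # 99.502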

This article puts forth a method for deriving MTBF and MTTR by assessing the system and support architectural capabilities, and then using these values to calculate the availability percentage.

Based on the principles established previously, the following structured approach is suggested for performing the risk assessment of IT systems using a quantitative method:
  1. Create a service catalog.
  2. Assess the system and service support capabilities.
  3. Calculate availability percentage.

Create a Service Catalog

The business views IT as a service provider. Taking a service-oriented approach to risk assessment7 enables the business process owner to relate the IT systems directly to the business areas they serve. This approach helps provide a business view of risk rather than a technology view. The proposed methodology uses a service catalog instead of an information asset inventory (Figure 1), which is the traditional approach followed in an IT risk assessment exercise. The service catalog is prepared by listing the services offered to users by the various IT systems. For example, the e-mail system might offer e-mail access using an Outlook client, a web client or a BlackBerry. Similarly, all the IT systems are scrutinized to create a comprehensive service catalog.
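
A service catalog of this kind can be captured in a very simple structure. The sketch below uses the e-mail example from the text; the field names and the way rows are organized are assumptions rather than a prescribed schema.

    # Minimal service catalog: one row per service offered to users,
    # identified by the IT system that delivers it (field names are assumed).
    service_catalog = [
        {"system": "E-mail", "service": "E-mail access via Outlook client"},
        {"system": "E-mail", "service": "E-mail access via web client"},
        {"system": "E-mail", "service": "E-mail access via BlackBerry"},
        # ...remaining IT systems are scrutinized and listed in the same way
    ]

    for item in service_catalog:
        print(item["system"], "-", item["service"])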

Assess the System and Service Support Capabilities

IT systems are essentially an outcome of the software engineering process. Research in the field of software engineering has established that software architecture has a decisive role in meeting various quality attributes, e.g., system availability. Research also prescribes the use of software architecture in evaluating quality attributes,8 such as availability, performance and modifiability.

The architectural approach to risk assessment provides a platform for deriving risk indicators for both existing and new systems. The Capability Maturity Model (CMM) developed by the Software Engineering Institute (SEI) of Carnegie Mellon University9 can be used as a reference to evaluate the system and support capabilities of a service by assessing their architectural maturity. The maturity levels shown in Figure 2 are proposed for use in this methodology.

Using the maturity levels shown in Figure 2, the identified service catalog items are evaluated. This is done by understanding the system landscape and the support services available for each service. For example, suppose an audit of an e-mail system finds that the BlackBerry service runs on a single-system architecture, the administrator demonstrates that a standby server can be made available to install the BlackBerry application in case of server failure and the auditor is satisfied that the administrator has the skills to restore the application. In such a scenario, the BlackBerry service can be assessed at level 2 (standby can be arranged) for system architecture maturity and level 2 (skill set available) for support architecture maturity. Repeating this exercise across the entire service catalog produces output such as that shown in Figure 3.
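
The output of this step can be recorded against each catalog item as in the sketch below. Only the BlackBerry entry reflects the example in the text; the other entries are placeholder assumptions, and the full level definitions remain those of Figure 2.

    # Assessed maturity levels per service catalog item.
    # BlackBerry: level 2 system architecture ("standby can be arranged") and
    # level 2 support architecture ("skill set available"), as in the text.
    assessment = {
        "E-mail access via BlackBerry":     {"system_maturity": 2, "support_maturity": 2},
        # Placeholder entries, not values from the article:
        "E-mail access via Outlook client": {"system_maturity": 3, "support_maturity": 3},
        "E-mail access via web client":     {"system_maturity": 3, "support_maturity": 2},
    }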

Calculate Availability Percentage

In the proposed availability risk assessment methodology, an MTBF and MTTR matrix is created. Figure 4 shows the template that is used for creating the matrix. This matrix is created empirically by assigning acceptable uptime hours against each of the system architecture maturity levels under the MTBF column. A corresponding acceptable repair time value (MTTR) is assigned for every support capability maturity level. The MTBF value for the first three levels of system architecture maturity will be the same, as effectively the service is operating on a single system. The difference in maturity level is an indicator of the capability that exists in the environment to arrange standby or alternate systems.

The repair time (MTTR) is dependent not only on the support architectural maturity but also on system architectural maturity. This implies that, given a particular level of support maturity, the time taken to restore a service would decrease with an increase in the system maturity level.
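
One way to hold such a matrix is sketched below. Only two values are taken from the article: the 4,380-hour MTBF for system maturity level 2 (shared by levels 1-3, as noted above) and the 16-hour MTTR at the intersection of support level 2 and system level 2, both quoted in the worked example that follows. The number of levels and every other figure are placeholder assumptions standing in for Figure 4.

    # MTBF (hours) per system architecture maturity level.  Levels 1-3 share a
    # value because the service still effectively runs on a single system.
    MTBF_HOURS = {1: 4380, 2: 4380, 3: 4380, 4: 8760, 5: 17520}  # levels 4-5 assumed

    # MTTR (hours) indexed by support maturity, then system maturity: for a given
    # support level, repair time shortens as system maturity rises.
    MTTR_HOURS = {
        1: {1: 48, 2: 36, 3: 24, 4: 12, 5: 8},  # assumed
        2: {1: 24, 2: 16, 3: 12, 4: 8, 5: 4},   # the 16 at (2, 2) is from the text
        3: {1: 12, 2: 8, 3: 6, 4: 4, 5: 2},     # assumed; higher levels omitted
    }

    print(MTBF_HOURS[2], MTTR_HOURS[2][2])  # 4380 16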

While assigning values, care should be taken that the values are not biased by experience with existing systems or specific vendors; rather, the values should indicate what the organization considers “enterprise grade” for system uptime and acceptable resolution time from its support service. A pragmatic approach to creating this matrix keeps the resulting availability percentages close to reality.

Using the MTBF and MTTR matrix, the respective MTBF and MTTR values for each service catalog item are derived. Continuing with the earlier example, the BlackBerry service has system architecture maturity level 2; hence, the corresponding MTBF value of 4,380 hours is taken. As the support architecture maturity level is also 2, the MTTR value is read from the intersection of the two maturity levels, which in this case is 16 hours. Using the availability percentage formula, MTBF / (MTBF + MTTR) x 100, the availability of the BlackBerry service is rated at 99.636 percent.
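
The worked example can be reproduced end to end with the two figures given above; nothing else from Figure 4 is needed.

    # BlackBerry service: system maturity 2 -> MTBF 4,380 hours;
    # support maturity 2 at system maturity 2 -> MTTR 16 hours.
    mtbf, mttr = 4380.0, 16.0

    availability = mtbf / (mtbf + mttr) * 100
    print(f"BlackBerry service availability: {availability:.3f}%")  # 99.636%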

Applying the aforementioned approach to the entire service catalog, an availability risk assessment sheet is prepared, quantifying the availability percentage against each service as shown in Figure 5.

Conclusion

The availability risk assessment methodology provides a quantitative approach for conducting availability risk assessment of IT services. This methodology helps in engaging with management to derive an acceptable level of service and gives prescriptive input for achieving the desired service levels. Using this methodology, the desired availability percentage can be achieved by appropriately focusing on improving system or support maturity. The baseline provided by the availability risk assessment exercise can also be used for benchmarking and reporting the performance of IT operations. In addition, this methodology can assist the IS auditor in performing availability risk assessment of new systems that are in the design stage, thereby providing valuable input to management at an early stage of system development.

Endnotes

1 Tipton, Harold F.; Micki Krause; Information Security Management Handbook, 6th Edition, Auerbach Publications, 2007
2 Fischer, Urs; “Risk IT: Based on COBIT Objectives and Principles,” ISACA, vol. 4, 2009
3 Singleton, Tommie W.; “What Every IT Auditor Should Know About Auditing Information Security,” Information Systems Control Journal, vol. 2, 2007
4 International Electrotechnical Commission (IEC), “Analysis techniques for system reliability—Procedure for failure mode and effects analysis (FMEA),” IEC 60812, 2006
5 Tryfonas, T.; D. Gritzalis; S. Kokolakis; “A Qualitative Approach to Information Availability,” Proceedings of the IFIP Tc11 15th Annual Working Conference on Information Security for Global Information Infrastructures, S. Qing and J. H. Eloff, Eds., IFIP Conference Proceedings, vol. 175, Kluwer B.V., The Netherlands, 2000, p. 37-48
6 Bass, Len; et al.; Software Architecture in Practice, 2nd Edition, Pearson Education, 2003, p. 79
7 Miler, Jakub; “A Service-oriented Approach to Identification of IT Risk,” Proceedings of the TEHOSS’ 2005 First IEEE International Conference on Technologies for Homeland Security and Safety, 2005
8 Op cit, Bass 2003
9 Carnegie Mellon University, Systems Security Engineering—Capability Maturity Model: Model Description Document, Version 2.0, 1 April 1999

Hariharan, CISA
is head of IT infrastructure and security for a leading media company in India. He has more than 18 years of experience in setting up IT departments and introducing IT governance practices within the organization. He has worked in diverse environments covering remote sensing, geographic information systems, automobile manufacturing, heavy engineering and media. He is a guest lecturer to management institutes and a member of curriculum review committees of academic institutions.