Why Cryptographic Infrastructures Require High Availability

Author: Adam Cason
Date Published: 13 April 2020

In today’s 24/7, hyperconnected world, system failures are simply not an option. Modern society depends on unbroken connectivity, and one of the most critical services requiring high availability is cryptographic infrastructure.

It is not a stretch to say that encryption makes the world go around. Every phone call, email, credit card transaction and browser-based application depends on encryption to ensure the confidentiality, integrity and authenticity of information. In any business or organization—whether government agencies or private corporations—encryption is what stands between sensitive information and those who try to steal it.

The vital role of encryption requires that cryptographic infrastructures be built on a high availability (HA) architecture. HA architectures prevent downtime due to failures of any kind, such as hardware or software failures or damaging environmental conditions such as power outages, flooding or extreme storms. And, as we have seen over the past weeks and months of 2020, events such as the spread of the COVID-19 coronavirus resulting in work-from-home arrangements or even mass absenteeism can reduce an organization’s ability to maintain its IT ecosystems. True HA architectures account for all these types of scenarios, and even system updates and maintenance can be accomplished without taking core systems offline.

Reliability engineering uses 3 principles of systems design to achieve high availability:

  • Eliminate single points of failure
  • Ensure that crossover or failover points are reliable for redundant systems
  • Detect and react to failures in real time

To meet regulatory requirements and provide the highest level of security for encryption keys, the vast majority of cryptographic infrastructures are built around a FIPS 140-2 Level 3-validated hardware security module (HSM). While these devices are extremely robust and reliable, they nonetheless have a finite mean time between failures (MTBF). Over time, a certain percentage will fail, affecting every device and application that relies on them for key management.
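To make the MTBF point concrete, the following is a minimal sketch of the standard steady-state availability formula, MTBF / (MTBF + MTTR), and of how redundancy changes the picture. The MTBF and repair-time figures are hypothetical and chosen only for illustration; they are not vendor data. The sketch also assumes that devices fail independently and that failover between them is perfect, which real deployments only approximate.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one device: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def cluster_availability(single: float, n: int) -> float:
    """Availability of n redundant devices when any one of them is enough.

    Assumes independent failures and perfect failover (an idealization).
    """
    return 1.0 - (1.0 - single) ** n


# Hypothetical figures for illustration only.
single = availability(mtbf_hours=100_000, mttr_hours=24)

for n in (1, 2, 3):
    a = cluster_availability(single, n)
    downtime_minutes = (1.0 - a) * HOURS_PER_YEAR * 60
    print(f"{n} HSM(s): availability={a:.10f}, "
          f"expected downtime={downtime_minutes:.4f} min/year")
```

With these assumed inputs, a single device works out to roughly two hours of expected downtime per year, while a redundant pair drops to a few seconds. The exact numbers matter less than the shape of the result: each additional independent device multiplies the unavailability by a very small factor.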

A cryptographic infrastructure based on a single HSM falls short of what is required for an HA system. Adding a second HSM for redundancy improves the situation, but a redundant pair does not inherently have the ability to detect and immediately react to failures. To achieve true fault tolerance in a cryptographic infrastructure, 2 or more HSMs must work together not only to prevent system downtime, but also to absorb transaction load and increase scalability.
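The difference between plain redundancy and true fault tolerance can be illustrated with a short sketch. The `HsmPool` class, the `HsmUnavailable` exception and the two stand-in backends below are all hypothetical names invented for this example; they do not correspond to any vendor's API. The pool spreads requests across backends in round-robin order, detects a failure at the moment a request fails and immediately retries on the next healthy backend.

```python
from typing import Callable, List


class HsmUnavailable(Exception):
    """Raised by a (hypothetical) HSM client when the device cannot serve a request."""


class HsmPool:
    """Round-robin over redundant HSM backends, skipping ones that have failed."""

    def __init__(self, backends: List[Callable[[bytes], bytes]]):
        if len(backends) < 2:
            raise ValueError("need at least 2 backends to avoid a single point of failure")
        self._backends = backends
        self._healthy = [True] * len(backends)
        self._next = 0  # index of the backend to try first on the next call

    def call(self, payload: bytes) -> bytes:
        n = len(self._backends)
        for offset in range(n):
            i = (self._next + offset) % n
            if not self._healthy[i]:
                continue
            try:
                result = self._backends[i](payload)
            except HsmUnavailable:
                # Detect the failure and react at once: mark it down, try the next one.
                self._healthy[i] = False
                continue
            self._next = (i + 1) % n  # spread load across the pool
            return result
        raise HsmUnavailable("no healthy HSM backends left")

    def mark_recovered(self, index: int) -> None:
        """Return a restored backend to service."""
        self._healthy[index] = True


# Stand-in backends for demonstration; a real deployment would call actual devices.
def healthy_hsm(payload: bytes) -> bytes:
    return b"signed:" + payload


def failed_hsm(payload: bytes) -> bytes:
    raise HsmUnavailable("device is in a tamper state")


pool = HsmPool([failed_hsm, healthy_hsm])
print(pool.call(b"txn-1"))  # prints b'signed:txn-1'
```

The caller never sees the failure of the first backend. A production design would typically add periodic health checks and alerting rather than relying only on failed requests to discover a problem.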

If organizations deploy a well-architected cryptographic management platform, the risk of production downtime can be avoided. Such a platform may use load balancers to distribute network traffic, hardware devices such as the Guardian Series 3 to set up clusters of encryption devices,1 or emerging technologies such as cloud-based HSMs. How does this help? In the event of a natural disaster, Internet outage or attempted data attack, automatic fault detection and seamless failover keep infrastructure online while supporting the cryptographic needs of customers and applications. Typically, when such an event affects an HSM, the device goes into a tamper state and erases all sensitive information. This is intended behavior that protects the keys stored within that HSM, but restoring the device is time-intensive. By deploying multiple HSMs that share a common master key, organizations add redundancy and can remain in production while the affected device is restored.

According to Statista research, the average cost of an hour of IT downtime ranges from US$200,000 to upward of US$5 million, depending on the type of enterprise.2 Beyond the direct cost of downtime, consider the effect of an outage on your organization’s reputation, along with the other benefits of an HA architecture: increased application performance, reduced customer impact and reduced risk of data loss. With all this considered, high availability is a winning choice for most enterprises.
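As a rough illustration of what the hourly figures cited above imply, the sketch below scales them to an outage of a given length. The 4-hour outage is a hypothetical example, and the function name is invented for this sketch. The upper figure is reported as "upward of" US$5 million, so the high end of the result is a floor on the worst case rather than a cap.

```python
# Hourly cost range cited in the article (US dollars).
COST_PER_HOUR_LOW = 200_000
COST_PER_HOUR_HIGH = 5_000_000


def downtime_cost_range(outage_hours: float) -> tuple:
    """Return the (low, high) cost of an outage, scaling the hourly figures linearly."""
    return outage_hours * COST_PER_HOUR_LOW, outage_hours * COST_PER_HOUR_HIGH


# Hypothetical 4-hour outage, for illustration only.
low, high = downtime_cost_range(4)
print(f"US${low:,.0f} to US${high:,.0f}")  # prints US$800,000 to US$20,000,000
```

Linear scaling is a simplification; actual losses depend on when the outage occurs and which services are affected.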

Adam Cason
Is director of product marketing at Futurex. He is responsible for the company’s global go-to-market strategy, technical documentation portfolio, and engagement for customer and partner relationships. He has a strong technical background and deep knowledge of enterprise-class cryptographic ecosystems and is a subject matter expert in hardware security modules and key management. Cason started his career at Futurex as a solutions architect, working closely with customers on product deployments, infrastructure analysis and system architecture.

Endnotes

1 Futurex, Guardian Series 3
2 Statista Research Department, “Average Cost Per Hour of Enterprise Server Downtime Worldwide in 2019,” 2 March 2020