High availability software is software used to ensure that systems are running and available most of the time. Availability is the percentage of time that the system is functioning, and it can be formally defined as (1 - (down time / total time)) × 100%. Although the minimum required availability varies by task, systems typically attempt to achieve 99.999% ("five nines") availability. This goal is weaker than fault tolerance, which typically seeks to provide 100% availability, albeit with significant price and performance penalties.
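The formula above can be worked through numerically. The following Python sketch is illustrative only: the function names and the 365-day year are assumptions, not part of any standard.

```python
def availability_percent(down_time, total_time):
    """Availability as (1 - down_time / total_time) * 100, in percent."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return (1 - down_time / total_time) * 100


def allowed_downtime_minutes_per_year(availability_pct, days_per_year=365):
    """Down time budget, in minutes per year, implied by an availability target."""
    minutes_per_year = days_per_year * 24 * 60  # 525,600 for a 365-day year
    return (1 - availability_pct / 100) * minutes_per_year


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes_per_year(target)
        print(f"{target}% allows about {budget:.2f} minutes of down time per year")
```

Under these assumptions, a five-nines target leaves a budget of roughly 5.26 minutes of down time per year, which is why restarts that take minutes are usually too slow to meet it.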
High availability software is measured by its performance when a subsystem fails, its ability to resume service in a state close to the state of the system at the time of the original failure, and its ability to perform other service-affecting tasks (such as software upgrades or configuration changes) in a manner that eliminates or minimizes down time. All faults that affect availability (hardware, software, and configuration) need to be addressed by high availability software to maximize availability.
Typical high availability software provides features that:
Enable hardware and software redundancy: These features include:
A service is not available if it cannot service all the requests being placed on it. The “scale-out” property of a system refers to the ability to create multiple copies of a subsystem to address increasing demand, and to efficiently distribute incoming work to these copies (load balancing), preferably without shutting down the system. High availability software should enable scale-out without interrupting service.
Enable active/standby communication (notably checkpointing): Active subsystems need to communicate with standby subsystems to ensure that the standby is ready to take over where the active left off. High availability software can provide communication abstractions such as redundant message and event queues to help active subsystems in this task. Additionally, an important concept called “checkpointing” is exclusive to highly available software. In a checkpointed system, the active subsystem identifies all of its critical state and periodically updates the standby with any changes to this state. This idea is commonly abstracted as a distributed hash table: the active writes key/value records into the table, and both the active and standby subsystems read from it. Unlike a “cloud” distributed hash table (Chord, Kademlia, etc.), a checkpoint is fully replicated. That is, all records in the “checkpoint” hash table are readable so long as one copy is running.[1] Another technique, called application checkpointing, periodically saves the entire state of a program.[2]
Enable in-service upgrades: In-service software upgrade is the ability to upgrade software without degrading service. It is typically implemented in redundant systems by executing what is called a “rolling” upgrade: upgrading the standby while the active provides service, failing over, and then upgrading the old active. Another important feature is the ability to rapidly fall back to an older version of the software and configuration if the new version fails.[3][4]
Minimize standby latency and ensure standby correctness: Standby latency is defined as the time between when a standby is told to become active and when it is actually providing service. “Hot” standby systems are those that actively update internal state in response to active system checkpoints, resulting in millisecond down times. “Cold” standby systems are offline until the active fails and typically restart from a “baseline” state. For example, many cloud solutions will restart a virtual machine on another physical machine if the underlying physical machine fails. “Cold” failover standby latency can range from 30 seconds to several minutes. Finally, “warm” standby is an informal term encompassing all systems that are running yet must do some internal processing before becoming active. For example, a warm standby system might be handling low-priority jobs; when the active fails, it aborts these jobs and reads the active's checkpointed state before resuming service. Warm standby latencies depend on how much data is checkpointed but are typically a few seconds.
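The checkpointing, failover, and rolling-upgrade features described above can be illustrated with a small in-memory Python sketch of one active and one standby subsystem. The names Node and Pair, the version strings, and the use of plain dictionaries as the fully replicated checkpoint table are all assumptions made for illustration; real products replicate state over a network and must handle partial failures that this sketch ignores.

```python
class Node:
    """One subsystem holding its own full replica of the checkpoint table."""

    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.checkpoint = {}  # key/value records of critical state


class Pair:
    """One active and one standby subsystem (hypothetical model)."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby

    def write(self, key, value):
        # The active records critical state; the record is copied to the
        # standby so every replica holds every record (full replication).
        self.active.checkpoint[key] = value
        self.standby.checkpoint[key] = value

    def fail_over(self):
        # The standby becomes active and resumes from its checkpoint replica.
        self.active, self.standby = self.standby, self.active

    def rolling_upgrade(self, new_version):
        # Upgrade the standby while the active keeps serving, fail over,
        # then upgrade the old active (which is now the standby).
        self.standby.version = new_version
        self.fail_over()
        self.standby.version = new_version


if __name__ == "__main__":
    pair = Pair(Node("a", "1.0"), Node("b", "1.0"))
    pair.write("session:42", "established")
    pair.fail_over()
    print(pair.active.name, pair.active.checkpoint)
```

Because the standby already holds every record, this models a hot standby; a warm standby would instead read the checkpointed state at failover time, which is where its extra latency comes from.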
High availability software can help engineers create complex system architectures that are designed to minimize the scope of failures and to handle specific failure modes. A “normal” failure is defined as one that can be handled by the software architecture, while a “catastrophic” failure is defined as one that is not handled. A catastrophic failure therefore causes a service outage. However, the software can still greatly increase availability by automatically returning to an in-service state as soon as the catastrophic failure is remedied.
The simplest configuration (or “redundancy model”) is 1 active, 1 standby, or 1+1. Another common configuration is N+1 (N active, 1 standby), which reduces total system cost by having fewer standby subsystems. Some systems use an all-active model, which has the advantage that “standby” subsystems are being constantly validated.
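As a minimal sketch of the N+1 model, the following Python class tracks which subsystems are active and promotes the single spare when an active fails. The class name, method name, and subsystem labels are hypothetical, and the sketch ignores state transfer entirely.

```python
class RedundancyGroup:
    """N active subsystems backed by a pool of standbys (N+1 when the pool has one)."""

    def __init__(self, actives, standbys):
        self.actives = list(actives)
        self.standbys = list(standbys)

    def fail(self, name):
        """Handle the failure of an active; return its replacement, or None."""
        self.actives.remove(name)
        if not self.standbys:
            return None  # no spare left: capacity is lost until repair
        replacement = self.standbys.pop(0)
        self.actives.append(replacement)
        return replacement


if __name__ == "__main__":
    group = RedundancyGroup(["a1", "a2", "a3"], ["s1"])
    print(group.fail("a2"))  # the spare takes over
    print(group.fail("a1"))  # no spare remains
```

The second failure shows the trade-off of N+1: it is cheaper than giving every active its own standby, but it only covers one failure at a time until the failed subsystem is repaired and returned to the standby pool.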
Configurations can also be defined with active, hot standby, and cold standby (or idle) subsystems, extending the traditional “active+standby” nomenclature to “active+standby+idle” (e.g. 5+1+1). Typically, “cold standby” or “idle” subsystems are active for lower priority work. Sometimes these systems are located far away from their redundant pair in a strategy called geographic redundancy.[5] This architecture seeks to avoid loss of service from physically-local events (fire, flood, earthquake) by separating redundant machines.
Sophisticated policies can be specified by high availability software to differentiate software from hardware faults, and to attempt time-delayed restarts of individual software processes, entire software stacks, or entire systems.
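One way such a policy might look is sketched below in Python. The escalation order (process, then software stack, then whole system), the doubling restart delay, and the choice to fail over immediately on a hardware fault are assumptions made for illustration, not behavior defined by any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestartPolicy:
    """Hypothetical escalation policy for repeated faults in one process."""

    process_restarts: int = 3   # attempts at restarting just the process
    stack_restarts: int = 1     # attempts at restarting the software stack
    base_delay_s: float = 1.0   # delay before the first process restart

    def action(self, fault_kind, fault_count):
        """Return (scope, delay_seconds) for the nth consecutive fault (1-based)."""
        if fault_count < 1:
            raise ValueError("fault_count is 1-based")
        if fault_kind == "hardware":
            # Assumption: restarting software cannot repair hardware.
            return ("failover", 0.0)
        if fault_count <= self.process_restarts:
            # Time-delayed restart; the delay doubles on each attempt.
            return ("process", self.base_delay_s * 2 ** (fault_count - 1))
        if fault_count <= self.process_restarts + self.stack_restarts:
            return ("stack", 0.0)
        return ("system", 0.0)


if __name__ == "__main__":
    policy = RestartPolicy()
    for n in range(1, 6):
        print(n, policy.action("software", n))
```

Delaying successive restarts gives a transient condition time to clear, while escalating the scope avoids looping forever on a fault that a process restart cannot fix.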
In the past 20 years telecommunication networks and other complex software systems have become essential parts of business and recreational activities.
“At the same time [as the economy is in a downturn], 60% almost -- that's six out of 10 businesses -- require 99.999. That's four nines or five nines of availability and uptime for their mission-critical line-of-business applications. And 9% of the respondents, so that's almost one out of 10 companies, say that they need greater than five nines of uptime. So what that means is, no downtime. In other words, you have got to really have bulletproof, bombproof applications and hardware systems. So you know, what do you use? Well one thing you have high-availability clusters or you have the more expensive and more complex fault-tolerance servers.”[6]
Telecommunications: High Availability Software is an essential component of telecommunications equipment since a network outage can result in significant loss in revenue for telecom providers and telephone access to emergency services is an important public safety issue.
Defense/Military: Recently, High Availability Software has found its way into defense projects as an inexpensive way to provide availability for crewed and uncrewed vehicles.[7]
Space: High Availability Software has been proposed as a way to use non-radiation-hardened equipment in space environments. Radiation-hardened electronics are significantly more expensive and offer lower performance than off-the-shelf equipment. However, High Availability Software running on a single rad-hardened controller, or a pair of them, can manage many redundant high-performance non-rad-hard computers, potentially failing over and resetting them in the event of a fault.[8]
Typical cloud services provide a set of networked computers (typically virtual machines) running a standard server OS like Linux. These computers can often communicate with other instances within the same data center for free (over the tenant network) and with outside computers for a fee. The cloud infrastructure may provide simple fault detection and restart at the virtual machine level. However, restarts can take several minutes, resulting in lower availability. Additionally, cloud services cannot detect software failures within the virtual machines. High Availability Software running inside the cloud virtual machines can detect software (and virtual machine) failures in seconds and can use checkpointing to ensure that standby virtual machines are ready to take over service.
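Detecting failures within seconds is commonly done with heartbeats: each monitored process reports in periodically, and a monitor declares it failed when no report arrives within a timeout. The Python sketch below illustrates the idea; the class name, the three-second timeout, and the instance names are assumptions. Timestamps are passed in explicitly so the logic is easy to follow, whereas a real monitor would read a monotonic clock such as time.monotonic().

```python
class HeartbeatMonitor:
    """Declares a member failed when its last heartbeat is older than a timeout."""

    def __init__(self, timeout_s=3.0):
        self.timeout_s = timeout_s
        self.last_seen = {}  # member name -> time of most recent heartbeat

    def beat(self, name, now):
        """Record a heartbeat from a member at time `now` (seconds)."""
        self.last_seen[name] = now

    def failed(self, now):
        """Return the sorted names of members whose heartbeat has timed out."""
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=3.0)
    monitor.beat("vm-a", 0.0)
    monitor.beat("vm-b", 0.0)
    monitor.beat("vm-a", 4.0)   # vm-b has gone silent
    print(monitor.failed(5.0))
```

Because the heartbeat is sent by the application process itself, a hung or crashed process is noticed even when the virtual machine hosting it is still running, which is the gap in infrastructure-level monitoring described above.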
The Service Availability Forum defines standards for application-aware High Availability.[9]