Reliability, availability and serviceability

Reliability, availability and serviceability (RAS), also known as reliability, availability, and maintainability (RAM), is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The phrase was originally used by IBM to describe the robustness of their mainframe computers.[1][2]

Computers designed with higher levels of RAS have many features that protect data integrity and help them stay available for long periods without failure.[3] Such data integrity and uptime are particular selling points for mainframes and fault-tolerant systems.

Definitions

While RAS originated as a hardware-oriented term, systems thinking has extended the concept of reliability-availability-serviceability to systems in general, including software:[4]

* Reliability can be defined as the probability that a system will produce correct outputs up to some given time t. Reliability is improved by features that help to avoid, detect and repair hardware faults.
* Availability means the probability that a system is operational at a given point in time, i.e. the proportion of time a device actually operates relative to the time it is expected to operate.
* Serviceability or maintainability is the simplicity and speed with which a system can be repaired or maintained; the longer a failed system takes to repair, the lower its availability.

Note the distinction between reliability and availability: reliability measures the ability of a system to function correctly, including avoiding data corruption, whereas availability measures how often the system is available for use, even though it may not be functioning correctly. For example, a server may run forever and so have ideal availability, but may be unreliable, with frequent data corruption.[6]
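
These two properties are commonly quantified with standard reliability-engineering formulas (stated here as an aside, not taken from the article itself), assuming a constant failure rate λ, with MTBF the mean time between failures and MTTR the mean time to repair:

  % Standard reliability-engineering definitions; lambda, MTBF and MTTR are
  % assumptions of this sketch rather than quantities defined in the article.
  \[
    R(t) = e^{-\lambda t}, \qquad \lambda = \frac{1}{\mathrm{MTBF}},
    \qquad A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}
  \]
  % Example: MTBF = 10000 h and MTTR = 1 h give A = 10000/10001 \approx 0.9999
  % ("four nines" availability), yet R(10000\,\mathrm{h}) = e^{-1} \approx 0.37.

In other words, a system can score very well on availability while still failing, or corrupting data, often enough to be considered unreliable over a long mission time.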

Failure types

Physical faults can be temporary or permanent:[5]

* Permanent faults cause a continuing error and typically stem from a physical defect or wear-out mechanism such as electromigration or dielectric breakdown.
* Transient faults cause a one-time error and leave no lasting hardware damage; a common example is a soft error induced by a cosmic-ray or alpha-particle strike.
* Intermittent faults recur over time, usually because of marginal hardware that misbehaves under particular operating conditions such as temperature or voltage.

Failure responses

Transient and intermittent faults can typically be detected and corrected, e.g., by ECC codes or instruction replay (see below). Permanent faults lead to uncorrectable errors, which can be handled either by switching to duplicate hardware, e.g., processor sparing, or by passing the uncorrectable error up to higher-level recovery mechanisms. A successfully corrected intermittent fault can also be reported to the operating system (OS) to provide information for predictive failure analysis.
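
As a rough software-level illustration of this policy (a minimal sketch with hypothetical names and thresholds, not any vendor's actual RAS firmware), the following Python fragment counts corrected errors per component for predictive failure analysis and escalates to sparing or higher-level recovery as described above:

  from collections import Counter

  # Hypothetical threshold and component names; illustrative only, not taken
  # from any real RAS implementation.
  CORRECTED_ERROR_THRESHOLD = 24    # corrected errors tolerated before sparing

  corrected_errors = Counter()      # per-component corrected-error counts

  def report_to_os(component, count):
      # Stand-in for a real error-reporting channel (e.g. a machine-check log).
      print(f"corrected error #{count} on {component}")

  def handle_fault(component, corrected):
      """Decide how to respond to a reported hardware fault.

      corrected=True  -> a transient or intermittent fault already fixed (e.g. by
                         ECC or instruction replay); log it for predictive failure
                         analysis and retire the component if it keeps recurring.
      corrected=False -> an uncorrectable (likely permanent) fault; escalate to
                         higher-level recovery.
      """
      if not corrected:
          return "escalate"                            # pass to OS/application-level recovery
      corrected_errors[component] += 1
      report_to_os(component, corrected_errors[component])
      if corrected_errors[component] >= CORRECTED_ERROR_THRESHOLD:
          return "spare"                               # switch to duplicate hardware
      return "continue"

  # Example: repeated corrected errors on the same DIMM eventually trigger sparing.
  for _ in range(CORRECTED_ERROR_THRESHOLD):
      action = handle_fault("DIMM-3", corrected=True)
  print(action)                                        # -> 'spare'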

Hardware features

Example hardware features for improving RAS include the following, listed by subsystem:

* Processor: lockstep execution of duplicated units, self-checking logic in functional units such as the floating-point unit,[7] and instruction retry/replay, which re-executes an instruction after a detected transient error.[8][9]
* Memory: error-correcting code (ECC) memory, memory scrubbing and memory sparing.[10]
* I/O: end-to-end error detection and advanced error reporting on PCI Express,[11] and redundant I/O paths.[12]
* Power and cooling: redundant, hot-swappable power supplies and fans.[13]
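
To make the ECC idea concrete, here is a minimal Python sketch of a single-error-correct, double-error-detect (SEC-DED) Hamming code over a 4-bit value. Real ECC memory applies a much wider code (typically 8 check bits per 64 data bits) in hardware; this toy version only illustrates the principle of correcting a single flipped bit and detecting a double flip:

  # Minimal SEC-DED (single-error-correct, double-error-detect) Hamming demo
  # for a 4-bit value. Illustrative only; real ECC memory uses wider codes
  # implemented in hardware.

  def encode(nibble):
      """Encode 4 data bits as a 7-bit Hamming codeword plus an overall parity bit."""
      d = [(nibble >> i) & 1 for i in range(4)]        # d[0]..d[3]
      p1 = d[0] ^ d[1] ^ d[3]                          # covers codeword positions 1,3,5,7
      p2 = d[0] ^ d[2] ^ d[3]                          # covers positions 2,3,6,7
      p3 = d[1] ^ d[2] ^ d[3]                          # covers positions 4,5,6,7
      code = [p1, p2, d[0], p3, d[1], d[2], d[3]]      # positions 1..7
      overall = 0
      for bit in code:
          overall ^= bit                               # extra parity bit enables double-error detection
      return code + [overall]                          # 8 bits total

  def decode(bits):
      """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
      code, overall = list(bits[:7]), bits[7]
      syndrome = 0
      for pos in range(1, 8):                          # XOR of 1-based positions of set bits
          if code[pos - 1]:
              syndrome ^= pos
      parity_ok = (sum(code) + overall) % 2 == 0
      if syndrome == 0 and parity_ok:
          status = "ok"
      elif syndrome == 0:
          status = "corrected"                         # the error hit the overall parity bit itself
      elif not parity_ok:
          code[syndrome - 1] ^= 1                      # single-bit error: flip it back
          status = "corrected"
      else:
          return None, "uncorrectable"                 # double-bit error detected, not correctable
      data = code[2] | (code[4] << 1) | (code[5] << 2) | (code[6] << 3)
      return data, status

  word = encode(0b1011)
  word[5] ^= 1                                         # inject a single-bit fault
  print(decode(word))                                  # -> (11, 'corrected')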

Fault-tolerant designs extended the idea by making RAS the defining feature of their computers for applications like stock exchanges or air traffic control, where system crashes would be catastrophic. Fault-tolerant computers (e.g., those from Tandem Computers and Stratus Technologies), which tend to run duplicate components in lockstep for reliability, have become less popular because of their high cost. High-availability systems, which use distributed computing techniques such as computer clusters, are often used as cheaper alternatives.
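
The lockstep idea itself can be illustrated in a few lines of Python (a conceptual sketch only; real lockstep machines compare replicated hardware state cycle by cycle rather than function results):

  def run_lockstep(replicas, inputs):
      """Run the same computation on every replica and compare the results.

      A mismatch means at least one replica has failed; with two replicas the
      fault can only be detected, while three or more allow majority voting.
      """
      results = [replica(inputs) for replica in replicas]
      if all(r == results[0] for r in results):
          return results[0]                        # replicas agree
      if len(results) >= 3:
          # Majority vote (triple modular redundancy) masks a single bad replica.
          return max(set(results), key=results.count)
      raise RuntimeError("lockstep mismatch: halt and fail over")

  # Example: two healthy replicas outvote one that silently corrupts its result.
  healthy = lambda x: x * x
  faulty = lambda x: x * x + 1                     # simulated silent corruption
  print(run_lockstep([healthy, healthy, faulty], 7))   # -> 49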

Notes and References

  1. Siewiorek, Daniel P.; Swarz, Robert S. (1998). Reliable Computer Systems: Design and Evaluation. Taylor & Francis. p. 508. ISBN 9781568810928. "The acronym RAS (reliability, accessibility and serviceability) came into widespread acceptance at IBM as the replacement for the subset notion of recovery management."
  2. Data Processor, Issues 13-17. Data Processing Division, International Business Machines Corp., 1970. "The dependability [...] experienced by other System/370 users is the result of a strategy based on RAS (Reliability-Availability-Serviceability)"
  3. Siewert, Sam (March 2005). "Big iron lessons, Part 2: Reliability and availability: What's the difference?".
  4. For example: Laros III, James H.; et al. (2012). Energy-Efficient High Performance Computing: Measurement and Tuning. SpringerBriefs in Computer Science. Springer Science & Business Media. p. 8. ISBN 9781447144922. Retrieved 2014-07-08. "Historically, Reliability Availability and Serviceability (RAS) systems were commonly provided by vendors on mainframe class systems. [...] The RAS system shall be a systematic union of software and hardware for the purpose of managing and monitoring all hardware and software components of the system to their individual potential."
  5. McCluskey, E. J.; Mitra, S. (2004). "Fault Tolerance". In A. B. Tucker (ed.), Computer Science Handbook, 2nd ed. CRC Press.
  6. Spencer, Richard H.; Floyd, Raymond E. (2011). Perspectives on Engineering. Bloomington, Indiana: AuthorHouse. p. 33. ISBN 9781463410919. Retrieved 2014-05-05. "[...] a system server may have excellent availability (runs forever), but continues to have frequent data corruption (not very reliable)."
  7. Lipetz, Daniel; Schwarz, Eric (2011). "Self Checking in Current Floating-Point Units". Proceedings of the 2011 20th IEEE Symposium on Computer Arithmetic. Archived from the original on 2012-01-24 (https://wayback.archive-it.org/all/20120124194631/http://www.acsel-lab.com/arithmetic/papers/ARITH20/ARITH20_Lipetz.pdf). Retrieved 2012-05-06.
  8. Spainhower, L.; Gregg, T. A. (September 1999). "IBM S/390 parallel enterprise server G5 fault tolerance: a historical perspective". IBM Journal of Research and Development. 43 (5). CiteSeerX 10.1.1.85.5994.
  9. "Intel Instruction Replay Technology Detects and Corrects Errors". Retrieved 2012-12-07.
  10. "Memory technology evolution: an overview of system memory technologies". Technology brief, 9th edition, p. 8. HP. Archived from the original on 2011-07-24 (https://web.archive.org/web/20110724013507/http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00256987/c00256987.pdf).
  11. "PCI Express Provides Enterprise Reliability, Availability, and Serviceability". Intel Corp. 2003.
  12. "Best Practices for Data Reliability with Oracle VM Server for SPARC". Retrieved 2013-07-02.
  13. "IBM Power Redundancy considerations". Retrieved 2013-07-02.