Availability (system) explained

Availability is the probability that a system will work as required when required during the period of a mission. The mission could be the 18-hour span of an aircraft flight. The mission period could also be the 3 to 15-month span of a military deployment. Availability includes non-operational periods associated with reliability, maintenance, and logistics.

This is measured in terms of nines. Five-9's (99.999%) means less than 5 minutes when the system is not operating correctly over the span of one year.

Availability is only meaningful for supportable systems. As an example, availability of 99.9% means nothing after the only known source stops manufacturing a critical replacement part.

Definition

There are two kinds of availability.

Operational
Predicted

Operational availability is presumed to be the same as predicted availability until after operational metrics become available.

Availability

Operational availability is based on observations after at least one system has been built. This usually begins with the brassboard system that is used to complete system development, and continues with the first of kind used for live fire test and evaluation (LFTE). Organizations responsible for maintenance use this to evaluate the effectiveness of the maintenance philosophy.

A_o=

	T_m
	T_m+T_d

\begin{cases}A_o=Operational Availability\ T_m=Mission Duration\ T_d=Observed Down Time\end{cases}

Predicted availability is based on a model of the system before it is built.

A_p=

	T_m
	T_m+T_d

\begin{cases}A_p=Predicted Availability\ T_m=Mission Duration\ T_d=Model Down Time\end{cases}

Downtime is the total of all of the different contributions that compromise operation. For modeling, these are different aspects of the model, such as human-system interface for MTTR and reliability modeling for MTBF. For observation, these reflect the different areas of the organization, such as maintenance personnel and documentation for MTTR, and manufacturers and shippers for MLDT.

T_d=T_m x

	MTTR+MLDT+MAMDT
	MTBF

\begin{cases}T_d=Down Time\ T_m=Mission Duration\ MTTR=Mean Time To Recover\ MLDT=Mean Logistics Delay Time\ MAMDT=Mean Active Maintenance Down Time\ MTBF=Mean Time Between Failure\end{cases}

MTBF

Mean Time Between Failure (MTBF) depends upon the maintenance philosophy.

If a system is designed with both redundancy and automatic fault bypass, then MTBF is the anticipated lifespan of the system if these features cover all possible failure modes (infinity for all practical purposes). Such systems will continue without noticeable interruption when these conditions are satisfied unless there are secondary failures. This is called active redundancy, which requires no maintenance to prevent mission failure. Active redundancy is required for systems that cannot be maintained, such as satellites.

If a system has no redundancy, then MTBF is the inverse of failure rate,

λ=

	1
	MTBF

Systems with spare parts that are energized but that lack automatic fault bypass are not actually redundant because human action is required to restore operation after every failure. This depends upon Condition-based maintenance and Planned Maintenance System support.

MTTR

Mean Time To Recover (MTTR) is the length of time required to restore operation to specification.

This includes three values.

Mean Time To Discover
Mean Time To Isolate
Mean Time To Repair

Mean Time To Discover is the length of time that transpires between when a failure occurs and the system users become aware of the failure. There are two maintenance philosophies associated with Mean Time To Discover.

Condition-based maintenance (CBM)
Planned Maintenance System (PMS)

CBM works like your car where an oil indicator tells you when oil pressure is too low and a temperature indicator tells you when engine temperature is too high. There is zero time to discover a failure where an indicator is placed in front of a system operator.

PMS is required for silent failures that lack CBM. PMS works is periodic maintenance, like when you perform diagnostic tests on your car every 90 days (or 3,000 miles). A failure may occur any time during the 90 days, such as a broken light, but you will not become aware until you perform diagnostic test.

Mean Time To Discover is statistical when PMS is the dominant maintenance philosophy. For example, if a fault is discovered during PMS diagnostic procedure that is run every 10 days, the average fault duration will be 5 days. This creates a dependency between availability performance and labor costs. There is no such dependency associated with CBM.

Mean Time To Isolate is the average length of time required to identify a setting that needs to be adjusted or a component that needs to be replaced. This is dependent on documentation, training, and technical support. This tends to be less on systems that have CBM because users can begin with the list of items connected to the indicator used to notify users about the fault. This also tends to be less on fully documented systems.

Mean Time To Repair is the average length of time to restore operation. For mission critical systems, this is generally estimated by dividing time required to replace all parts by the number of replaceable parts.

MLDT

Mean Logistics Delay Time is the average time required to obtain replacement parts from the manufacturer and transport those parts to the work site.

MAMDT

Mean Active Maintenance Down Time is associated with Planned Maintenance System support philosophy. This is average amount of time while the system is not 100% operational because of diagnostic testing that requires down time.

For example, an automobile that requires 1 day of maintenance every 90 days has a Mean Active Maintenance Down Time of just over 1%.

This is separate from the type of down time associated with repair activities.

Nines

Availability expectations are described in terms of nines.

The following table shows the anticipated down-time for different availabilities for a mission time of one year. This is the typical time-span used with commercial systems.

Supportability

Systems that require maintenance are said to be supportable if they satisfy the following criteria.

Adequate labor for PMS diagnostics or fault indicators for failure modes that lack PMS
Access to manufacturing sources for replacement parts
Sufficient resources to sustain access to replacement parts and maintenance labor (money, lodging, ...)
Training and maintenance documentation suitable to restore operation
Transportation services to deliver replacement parts

Systems that lack any of these requirements are said to be unsupportable.

Mission Failure

Mission failure is the result of trying to use a system in its normal mode when it is not working.

P_F=1-A_o\begin{cases}P_F=Probability of Mission Failure\ A_o=Operational Availability\end{cases}

Apart from human error, mission failure results from the following causes.

Lack of instrumentation for the fault in the form of an indicator suitable for CBM
Failure after PMS diagnostic testing has been completed but before operation is attempted
Lack of diagnostic PMS testing for any failure mode that lacks CBM