In software engineering, software aging is the tendency for software to fail or cause a system failure after running continuously for a certain time, or because of ongoing changes in the systems surrounding the software. Software aging has several causes, including the inability of old software to adapt to changing needs or changing technology platforms, and the tendency of software patches to introduce further errors. As software gets older it becomes less well suited to its purpose and will eventually stop functioning as it should. Rebooting or reinstalling the software can act as a short-term fix.[1] A proactive fault management method for dealing with software aging is software rejuvenation. This method can be classified as an environment diversity technique and is usually implemented through software rejuvenation agents (SRA).
The phenomenon was first identified by David Parnas, in an essay that explored what to do about it:[2]
"Programs, like people, get old. We can't prevent aging, but we can understand its causes, take steps to limit its effects, temporarily reverse some of the damage it has caused, and prepare for the day when the software is no longer viable."[3]
From both an academic and an industrial point of view, interest in the software aging phenomenon has increased, and recent research has focused on clarifying its causes and effects.[4] Memory bloat and memory leaks, along with data corruption and unreleased file locks, are typical causes of software aging.
Software failures are a more likely cause of unplanned system outages than hardware failures.[5][6] This is because software exhibits an increasing failure rate over time due to data corruption, numerical error accumulation and unbounded resource consumption. Whether the software is widely used or specialized, rebooting is a common way to clear such problems, because aging arises from the complexity of software, which is never free of errors: it is almost impossible to fully verify that a piece of software is bug-free. Even high-profile software such as Windows and macOS must receive continual updates to improve performance and fix bugs. Software development tends to be driven by the need to meet release deadlines rather than to ensure long-term reliability.[7] Designing software that is immune to aging is difficult, and not all software ages at the same rate, since some users use the system more intensively than others.[8]
To prevent crashes or performance degradation, software rejuvenation can be employed proactively, since inevitable aging otherwise leads to failures in software systems. This proactive technique was identified as a cost-effective solution during research on fault-tolerant software at AT&T Bell Laboratories in the 1990s.[9] Software rejuvenation works by removing accumulated error conditions and freeing up system resources, for example by flushing operating system kernel tables, performing garbage collection and reinitializing internal data structures; perhaps the best-known rejuvenation method is simply to reboot the system.
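The basic idea can be sketched in a few lines of Python. The example below is a minimal illustration, not any particular rejuvenation agent: the service command and the once-a-day interval are hypothetical, and a real agent would typically also monitor resource consumption before deciding to restart.

```python
# Minimal sketch of a time-based rejuvenation agent: run a service, stop it
# after a fixed interval, and start a fresh copy so that any leaked memory or
# corrupted internal state is discarded along with the old process.
import subprocess
import time

SERVICE_CMD = ["./long_running_service"]   # hypothetical service binary
REJUVENATION_INTERVAL = 24 * 60 * 60       # restart once a day (in seconds)

def run_with_rejuvenation():
    while True:
        process = subprocess.Popen(SERVICE_CMD)
        try:
            # If the service is still running after the interval, stop it
            # anyway and let the loop start a clean replacement.
            process.wait(timeout=REJUVENATION_INTERVAL)
            # Reaching this point means the service exited (or crashed) on
            # its own; the loop restarts it immediately.
        except subprocess.TimeoutExpired:
            process.terminate()
            process.wait()
        time.sleep(1)  # brief pause between generations

if __name__ == "__main__":
    run_with_rejuvenation()
```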
Rejuvenation techniques range from the simple to the complex. The method most individuals are familiar with is the hardware or software reboot. A more technical example is the rejuvenation method of the Apache web server: Apache kills and recreates its worker processes after they have served a certain number of requests.[10] Another technique is to restart virtual machines running in a cloud computing environment.[11]
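The counter-based recycling that Apache uses can be sketched as follows. This is a simplified illustration, not Apache's actual implementation (in Apache the limit is configured with the MaxConnectionsPerChild directive); the worker, supervisor and request handler below are hypothetical.

```python
# Sketch of counter-based process recycling: a worker serves a bounded number
# of requests and then exits; the supervisor replaces it with a fresh process,
# discarding whatever state (or leaks) the old worker accumulated.
import multiprocessing
import os

MAX_REQUESTS_PER_WORKER = 5   # recycle after this many requests (small for the demo)

def handle_request(i):
    # Hypothetical request handler; a real server would parse and answer a request.
    print(f"worker {os.getpid()} handled request {i}")

def worker():
    for i in range(MAX_REQUESTS_PER_WORKER):
        handle_request(i)
    # Returning here ends the process: this exit is the rejuvenation step.

def supervisor(generations=3):
    for _ in range(generations):
        p = multiprocessing.Process(target=worker)
        p.start()
        p.join()   # when a worker recycles itself, start a new one

if __name__ == "__main__":
    supervisor()
```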
The multinational telecommunications corporation AT&T has implemented software rejuvenation in the real-time system that collects billing data for most telephone exchanges in the United States.[12]
Some systems which have employed software rejuvenation methods include:[13]
The IEEE International Symposium on Software Reliability Engineering (ISSRE) hosted the 5th annual International Workshop on Software Aging and Rejuvenation (woSAR) in 2013. Topics included:
See main article: Memory leak. Some programming languages, such as C and C++, allow the programmer to allocate heap memory directly, and the programmer may be required to free that memory once it is no longer needed; explicit freeing matters because some operating systems (OS) do not perform garbage collection when a process finishes. If allocations are never freed, the program consumes more and more memory over time, eventually causing the computer to run out of memory.[14] In low-memory conditions the computer usually runs slower because of intense swapping and thrashing, and applications become sluggish or even unresponsive. If the computer runs out of both memory and swap space, the OS might automatically reboot or, even worse, hang.[15]
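As a runnable illustration of the same pattern, the following sketch calls the C allocator directly through Python's ctypes module and never frees the blocks. It is deliberately contrived and assumes a Linux system where the C library is loadable as "libc.so.6".

```python
# Illustration of a manual-allocation leak: memory is obtained from the C heap
# but free() is never called, so it stays allocated for the lifetime of the
# process. Assumes Linux, where the C library loads as "libc.so.6".
import ctypes

libc = ctypes.CDLL("libc.so.6")
libc.malloc.argtypes = [ctypes.c_size_t]
libc.malloc.restype = ctypes.c_void_p
libc.memset.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_size_t]
libc.memset.restype = ctypes.c_void_p

BLOCK_SIZE = 1024 * 1024   # 1 MiB per allocation

for _ in range(100):
    block = libc.malloc(BLOCK_SIZE)
    libc.memset(block, 0, BLOCK_SIZE)   # touch the memory so it is really committed
    # The pointer is discarded and libc.free(block) is never called, so this
    # memory cannot be reclaimed while the process keeps running.
```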
Programs written in programming languages that use a garbage collector (e.g. Java) are less prone to memory leaks, since memory that is no longer referenced is freed automatically by the garbage collector. This does not, however, make it impossible to write code that leaks memory in such languages: any object that remains reachable cannot be collected, even if the program never uses it again.
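A typical example is a cache that is only ever added to. The sketch below is written in Python, but the same pattern occurs in Java and other garbage-collected languages; the cache and request handler are hypothetical.

```python
# A "leak" in a garbage-collected language: every cached result stays
# reachable through the module-level dictionary, so the collector is never
# allowed to reclaim it, even though the program never reads it again.
_response_cache = {}   # hypothetical cache with no eviction policy

def handle_request(request_id, payload):
    result = payload.upper()                # stand-in for real processing
    _response_cache[request_id] = result    # cached "just in case", never evicted
    return result

# Over a long run, every request adds an entry and memory use grows without
# bound; bounding the cache, or holding entries via weak references, avoids this.
for request_id in range(10_000):
    handle_request(request_id, "example payload")
```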
Sometimes critical components of the OS itself can be a source of memory leaks. In Microsoft Windows, for example, a leaking Windows Explorer plug-in can drain the available memory to the point of making the entire computer unusable, and a reboot may be needed.[16]
Two methods for implementing rejuvenation are:
See main article: Garbage collection (computer science). Garbage collection is a form of automatic memory management whereby the system automatically recovers unused memory. For example, the .NET Framework manages the allocation and release of memory for software running under it. But automatically tracking these objects takes time and is not perfect.
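The mechanism can be observed directly. The following sketch is only a Python illustration of the general principle (the paragraph's example is the .NET Framework): once the last strong reference to an object is dropped, the runtime is free to reclaim its memory, which a weak reference makes visible.

```python
# Observing automatic memory management: after the last strong reference to an
# object is dropped, the runtime reclaims it; a weak reference lets us confirm
# this without keeping the object alive.
import gc
import weakref

class Session:
    """Stand-in for an application object that owns some memory."""
    def __init__(self):
        self.buffer = bytearray(1024 * 1024)

session = Session()
probe = weakref.ref(session)    # a weak reference does not keep the object alive

print(probe() is None)   # False: `session` still references the object

session = None           # drop the last strong reference
gc.collect()             # in CPython the object is already freed by reference
                         # counting; collect() additionally handles reference cycles
print(probe() is None)   # True: the unused memory has been recovered
```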
.NET-based web services manage several logical types of memory, such as the stack, the unmanaged heap and the managed heap (free space). As physical memory fills up, the OS writes rarely used parts of it to disk so that the freed pages can be reallocated to another application, a process known as paging or swapping. If that memory is needed again, it must be reloaded from disk. If several applications all make large demands, the OS can spend much of its time merely moving data between main memory and disk, a process known as disk thrashing.[17] Since the garbage collector has to examine all of the allocations to decide which are in use, it can exacerbate this thrashing, and extensive swapping can stretch garbage collection cycles from milliseconds to tens of seconds, which causes usability problems.