Memory scrubbing explained

Memory scrubbing consists of reading from each computer memory location, correcting bit errors (if any) with an error-correcting code (ECC), and writing the corrected data back to the same location.[1]

Due to the high integration density of modern computer memory chips, the individual memory cell structures became small enough to be vulnerable to cosmic rays and/or alpha particle emission. The errors caused by these phenomena are called soft errors. Over 8% of DIMM modules experience at least one correctable error per year.[2] This can be a problem for DRAM and SRAM based memories. The probability of a soft error at any individual memory bit is very small. However, together with the large amount of memory modern computersespecially serversare equipped with, and together with extended periods of uptime, the probability of soft errors in the total memory installed is significant.

The information in an ECC memory is stored redundantly enough to correct single bit error per memory word. Hence, an ECC memory can support the scrubbing of the memory content. Namely, if the memory controller scans systematically through the memory, the single bit errors can be detected, the erroneous bit can be determined using the ECC checksum, and the corrected data can be written back to the memory.

Overview

It is important to check each memory location periodically, frequently enough, before multiple bit errors within the same word are too likely to occur, because the one bit errors can be corrected, but the multiple bit errors are not correctable, in the case of usual (as of 2008) ECC memory modules.

In order to not disturb regular memory requests from the CPU and thus prevent decreasing performance, scrubbing is usually only done during idle periods. As the scrubbing consists of normal read and write operations, it may increase power consumption for the memory compared to non-scrubbing operation. Therefore, scrubbing is not performed continuously but periodically. For many servers, the scrub period can be configured in the BIOS setup program.

The normal memory reads issued by the CPU or DMA devices are checked for ECC errors, but due to data locality reasons they can be confined to a small range of addresses and keeping other memory locations untouched for a very long time. These locations can become vulnerable to more than one soft error, while scrubbing ensures the checking of the whole memory within a guaranteed time.

On some systems, not only the main memory (DRAM-based) is capable of scrubbing but also the CPU caches (SRAM-based). On most systems the scrubbing rates for both can be set independently. Because cache is much smaller than the main memory, the scrubbing for caches does not need to happen as frequently.

Memory scrubbing increases reliability, therefore it can be classified as a RAS feature.

Variants

There are usually two variants, known as patrol scrubbing and demand scrubbing. While they both essentially perform memory scrubbing and associated error correction (if it is doable), the main difference is how these two variants are initiated and executed. Patrol scrubbing runs in an automated manner when the system is idle, while demand scrubbing performs the error correction when the data is actually requested from main memory.[3]

See also

Notes and References

  1. Ronald K. Burek."The NEAR Solid-State Data Recorders".Johns Hopkins APL Technical Digest.1998.
  2. https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf DRAM Errors in the Wild: A Large-Scale Field Study
  3. Web site: Supermicro X9SRA motherboard manual . March 5, 2014 . February 22, 2015 . . 4–10.