A machine check exception (MCE) is a type of computer error that occurs when a problem involving the computer's hardware is detected. With most mass-market personal computers, an MCE indicates faulty or misconfigured hardware.
The nature and causes of MCEs vary by architecture and system generation. In some designs, an MCE is always an unrecoverable error that halts the machine and requires a reboot. In other architectures, some MCEs may be non-fatal, for example single-bit errors corrected by ECC memory. On some architectures, such as PowerPC, certain software bugs, such as an invalid memory access, can cause MCEs. On other architectures, such as x86, MCEs typically originate from hardware only.
IBM System/360 Operating System (OS/360) records input/output errors in a dataset called SYS1.LOGREC. Since then, IBM has coined the term error recording data set (ERDS) for successor versions that allow the installation to choose the name, and for operating systems not derived from OS/360.[1]
In OS/360, the installation can choose several levels of support for handling machine checks. The most sophisticated, Machine Check Handler (MCH), records failure data on SYS1.LOGREC and attempts recovery. The installation can print those data using the Environmental Record Editing and Printing Program (EREP) service aid or the stand-alone version SEREP. The MCH can handle memory failures in refreshable nucleus control sections by reading a fresh copy from SYS1.ASRLIB and can handle memory errors in SVC transient areas by reading a fresh copy of the SVC module from SYS1.SVCLIB.
In z/OS, the installation can either use an ERDS or define a z/OS System Logger log stream[2] to hold the error data. As with OS/360, the installation uses EREP to print those data; SEREP is no longer available. The MCH is no longer optional, and handles many more failure modes than the OS/360 MCH.
On Microsoft Windows platforms, in the event of an unrecoverable MCE, the system generates a BugCheck, also called a STOP error or a Blue Screen of Death.
More recent versions of Windows use the Windows Hardware Error Architecture (WHEA), and generate STOP code 0x124, WHEA_UNCORRECTABLE_ERROR. The four parameters (in parentheses) will vary, but the first is always 0x0 for an MCE.[3] Example:
STOP: 0x00000124 (0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000)
Older versions of Windows use the Machine Check Architecture, with STOP code 0x9C, MACHINE_CHECK_EXCEPTION.[4] Example:
STOP: 0x0000009C (0x00000030, 0x00000002, 0x00000001, 0x80003CBA)
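The two STOP formats above can be told apart mechanically. The following is a minimal Python sketch that parses a STOP line and reports whether it matches either machine-check bug check described here. The function name `classify_stop` and the regular expression are illustrative assumptions and not part of any Windows tooling; the only facts relied on are the stop codes and the first-parameter rule stated in the text.

```python
import re

# Stop codes named in the text above.
WHEA_UNCORRECTABLE_ERROR = 0x124
MACHINE_CHECK_EXCEPTION = 0x9C

# Hypothetical pattern for lines shaped like "STOP: 0x... (p1, p2, p3, p4)".
_STOP_RE = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def classify_stop(line):
    """Describe a STOP line, or return None if it does not match the pattern."""
    match = _STOP_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    # Split the parenthesised parameters and parse each one as hexadecimal.
    fields = (p.strip() for p in match.group(2).split(","))
    params = [int(p, 16) for p in fields if p]
    if code == WHEA_UNCORRECTABLE_ERROR:
        # Per the text, the first parameter is 0x0 when the error is an MCE.
        if params and params[0] == 0x0:
            return "WHEA_UNCORRECTABLE_ERROR (machine check exception)"
        return "WHEA_UNCORRECTABLE_ERROR (other hardware error source)"
    if code == MACHINE_CHECK_EXCEPTION:
        return "MACHINE_CHECK_EXCEPTION"
    return "not a machine-check STOP code"
```

Applied to the two example lines above, the sketch labels the first as a WHEA machine check and the second as the older MACHINE_CHECK_EXCEPTION. It does not interpret the remaining parameters, whose meaning depends on the Windows version.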
On Linux, the kernel writes messages about MCEs to the kernel message log and the system console. When the MCEs are not fatal, they will also typically be copied to the system log and/or systemd journal. For some systems, ECC and other correctable errors may be reported through MCE facilities.[5]
Example:
CPU 0: Machine Check Exception: 0000000000000004
Bank 2: f200200000000863
Kernel panic: CPU context corrupt
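The hexadecimal values in a message like the one above can be unpacked into named flags. The sketch below is a minimal Python illustration that assumes the first value is the x86 global machine-check status (IA32_MCG_STATUS) and the per-bank value is that bank's status register (IA32_MCi_STATUS), and that the bit positions match the layout described in the machine-check chapter of the Intel manual. The function names and the regular expressions are hypothetical, and the exact message format differs between kernel versions.

```python
import re

# Assumed bit positions in IA32_MCi_STATUS (check the Intel SDM for your CPU).
STATUS_FLAGS = {
    63: "VAL",    # register holds valid error information
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting for this error was enabled
    59: "MISCV",  # the bank's MISC register is valid
    58: "ADDRV",  # the bank's ADDR register is valid
    57: "PCC",    # processor context corrupt
}

# Assumed bit positions in IA32_MCG_STATUS.
MCG_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}

_CPU_RE = re.compile(r"CPU (\d+): Machine Check Exception:\s*([0-9a-fA-F]+)")
_BANK_RE = re.compile(r"Bank (\d+):\s*([0-9a-fA-F]+)")


def decode_flags(value, table):
    """Return the names of the set bits, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def parse_mce(text):
    """Extract and decode the CPU and bank fields, or return None."""
    cpu = _CPU_RE.search(text)
    bank = _BANK_RE.search(text)
    if cpu is None or bank is None:
        return None
    mcg_status = int(cpu.group(2), 16)
    bank_status = int(bank.group(2), 16)
    return {
        "cpu": int(cpu.group(1)),
        "bank": int(bank.group(1)),
        "mcg_flags": decode_flags(mcg_status, MCG_FLAGS),
        "status_flags": decode_flags(bank_status, STATUS_FLAGS),
        # The low 16 bits carry the architectural MCA error code.
        "mca_error_code": bank_status & 0xFFFF,
    }
```

For the sample message, this decoding reports the PCC flag in the bank status, which is consistent with the "CPU context corrupt" panic text, and an MCA error code of 0x0863. Interpreting that error code further requires the tables in the processor documentation.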
Some of the main hardware problems that cause MCEs include:
Machine checks usually indicate a hardware problem rather than a software problem. They are often the result of overclocking or overheating; in some cases, the CPU shuts itself off after exceeding a thermal limit to avoid permanent damage. They can also be caused by bus errors introduced by other failing components, such as memory or I/O devices. Possible causes include inadequate cooling, a failing motherboard or processor, failing memory, failing I/O devices or controllers, and an inadequate power supply.
Cooling problems are usually obvious upon inspection. A failing motherboard or processor can be identified by swapping it with a functioning part. Memory can be checked by booting a diagnostic tool such as memtest86. Non-essential failing I/O devices and controllers can be identified by unplugging them where possible, or by disabling them, to see whether the problem disappears. If failures tend to occur either soon after the OS boots or not at all for days, this may suggest a power supply issue: with a power supply problem, the failure often occurs when power demand peaks as the OS starts up external devices for use.
For IA-32 and Intel 64 processors, consult the Intel 64 and IA-32 Architectures Software Developer's Manual[6] Chapter 15 (Machine-Check Architecture), or the Microsoft KB Article on Windows Exceptions.[7]