Machine-check exception explained

A machine check exception (MCE) is a type of computer error that occurs when a problem involving the computer's hardware is detected. With most mass-market personal computers, an MCE indicates faulty or misconfigured hardware.

The nature and causes of MCEs vary by architecture and system generation. In some designs, an MCE is always an unrecoverable error that halts the machine and requires a reboot. In other architectures, some MCEs may be non-fatal, such as single-bit errors corrected by ECC memory. On some architectures, such as PowerPC, certain software bugs (for example, an invalid memory access) can cause MCEs. On other architectures, such as x86, MCEs typically originate from hardware only.

Reporting

IBM mainframe operating systems

IBM System/360 Operating System (OS/360) records input/output errors in a dataset called SYS1.LOGREC. Since then IBM has coined the term error recording data set (ERDS) for successor versions that allow the installation to choose the name and for operating systems not derived from OS/360.[1]

OS/360

In OS/360, the installation can choose several levels of support for handling machine checks. The most sophisticated, Machine Check Handler (MCH), records failure data on SYS1.LOGREC and attempts recovery. The installation can print those data using the Environmental Record Editing and Printing Program (EREP) service aid or the stand-alone version SEREP. The MCH can handle memory failures in refreshable nucleus control sections by reading a fresh copy from SYS1.ASRLIB and can handle memory errors in SVC transient areas by reading a fresh copy of the SVC module from SYS1.SVCLIB.

z/OS

In z/OS, the installation can either use an ERDS or define a z/OS System Logger log stream[2] to hold the error data. As with OS/360, the installation uses EREP to print those data; SEREP is no longer available. The MCH is no longer optional, and it handles many more failure modes than the OS/360 MCH.

Microsoft Windows

On Microsoft Windows platforms, in the event of an unrecoverable MCE, the system generates a BugCheck, also called a STOP error or a Blue Screen of Death.

More recent versions of Windows use the Windows Hardware Error Architecture (WHEA), and generate STOP code 0x124, WHEA_UNCORRECTABLE_ERROR. The four parameters (in parentheses) will vary, but the first is always 0x0 for an MCE.[3] Example:

STOP: 0x00000124 (0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000)

Older versions of Windows use the Machine Check Architecture, with STOP code 0x9C, MACHINE_CHECK_EXCEPTION.[4] Example:

STOP: 0x0000009C (0x00000030, 0x00000002, 0x00000001, 0x80003CBA)
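As a rough illustration of how the STOP lines above are structured, the following Python sketch splits such a line into its code and four parameters. The function name `parse_stop_line` and the `is_mce` flag are hypothetical helpers introduced here, not part of any Windows tool; the only facts relied on are the two bug-check names and the rule, stated above, that the first parameter of 0x124 is 0x0 for an MCE.

```python
import re

# Bug-check names as given in the text above.
STOP_NAMES = {
    0x124: "WHEA_UNCORRECTABLE_ERROR",
    0x9C: "MACHINE_CHECK_EXCEPTION",
}

# Matches "STOP: <code> (<p1>, <p2>, <p3>, <p4>)".
STOP_RE = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_stop_line(line):
    """Return code, name and parameters of a STOP line, or None if no match."""
    match = STOP_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",") if p.strip()]
    # Assumption for illustration: 0x9C is always an MCE, and 0x124 is an
    # MCE when its first parameter is 0x0 (as stated in the text).
    is_mce = code == 0x9C or (code == 0x124 and bool(params) and params[0] == 0)
    return {
        "code": code,
        "name": STOP_NAMES.get(code, "unknown"),
        "params": params,
        "is_mce": is_mce,
    }


if __name__ == "__main__":
    example = "STOP: 0x0000009C (0x00000030, 0x00000002, 0x00000001, 0x80003CBA)"
    print(parse_stop_line(example))
```

The meaning of the remaining parameters differs between the two bug checks, so the sketch leaves them as plain integers rather than guessing at their interpretation.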

Linux

On Linux, the kernel writes messages about MCEs to the kernel message log and the system console. When the MCEs are not fatal, they will also typically be copied to the system log and/or systemd journal. For some systems, ECC and other correctable errors may be reported through MCE facilities.[5]

Example:

CPU 0: Machine Check Exception: 0000000000000004
Bank 2: f200200000000863
Kernel panic: CPU context corrupt
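The fields of the example message can be pulled apart with a small Python sketch such as the one below. It assumes the older message layout shown above; the wording of MCE messages differs between kernel versions, so the pattern is illustrative only. The helper name `parse_mce_line` is hypothetical, and labelling the first hexadecimal value as the global machine-check status is an assumption based on the x86 machine-check architecture.

```python
import re

# Pattern for the older message format shown in the example above.
# The "Bank N: <status>" part is optional; \s+ also matches line breaks.
MCE_RE = re.compile(
    r"CPU\s+(?P<cpu>\d+):\s+Machine Check Exception:\s+(?P<mcg>[0-9a-fA-F]+)"
    r"(?:\s+Bank\s+(?P<bank>\d+):\s+(?P<status>[0-9a-fA-F]+))?"
)


def parse_mce_line(text):
    """Extract CPU number, status values and bank from an MCE log message."""
    match = MCE_RE.search(text)
    if match is None:
        return None
    bank = match.group("bank")
    status = match.group("status")
    return {
        "cpu": int(match.group("cpu")),
        # Assumed to be the global machine-check status value.
        "mcg_status": int(match.group("mcg"), 16),
        "bank": int(bank) if bank is not None else None,
        "bank_status": int(status, 16) if status is not None else None,
        "panic": "Kernel panic" in text,
    }


if __name__ == "__main__":
    message = (
        "CPU 0: Machine Check Exception: 0000000000000004 "
        "Bank 2: f200200000000863 Kernel panic: CPU context corrupt"
    )
    print(parse_mce_line(message))
```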

Problem types

Some of the main hardware problems that cause MCEs include:

Possible causes

Machine checks usually indicate a hardware problem rather than a software problem. They are often the result of overclocking or overheating; in some cases, the CPU shuts itself off once it passes a thermal limit to avoid permanent damage. They can also be caused by bus errors introduced by other failing components, such as memory or I/O devices.

Cooling problems are usually obvious upon inspection. A failing motherboard or processor can be identified by swapping it with a known-good part. Memory can be checked by booting into a diagnostic tool, such as memtest86. Failing non-essential I/O devices and controllers can be identified by unplugging or disabling them to see whether the problem disappears. If failures tend to occur either shortly after the OS boots or else not for days, a power supply issue may be the cause: with a power supply problem, the failure often occurs when power demand peaks as the OS starts up external devices.

Decoding MCEs

For IA-32 and Intel 64 processors, consult the Intel 64 and IA-32 Architectures Software Developer's Manual[6] Chapter 15 (Machine-Check Architecture), or the Microsoft KB Article on Windows Exceptions.[7]
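As a minimal sketch of what such decoding involves, the Python below extracts the architectural flag bits from a 64-bit bank status value and a global status value. The bit positions follow the IA32_MCi_STATUS and IA32_MCG_STATUS register layouts described in the Intel manual; other vendors and processor generations may define some bits differently, so treat this as an illustration rather than a full decoder. The function names are hypothetical, and the compound MCA error code is returned as a raw number without interpretation.

```python
# Flag bits of IA32_MCi_STATUS (bank status), per the Intel manual.
BANK_FLAG_BITS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register holds valid data
    "ADDRV": 58,  # ADDR register holds a valid address
    "PCC": 57,    # processor context corrupt
}


def decode_bank_status(status):
    """Split a 64-bit bank status value into flags and error codes."""
    fields = {name: bool((status >> bit) & 1) for name, bit in BANK_FLAG_BITS.items()}
    fields["mca_error_code"] = status & 0xFFFF             # bits 15:0
    fields["model_specific_code"] = (status >> 16) & 0xFFFF  # bits 31:16
    return fields


def decode_mcg_status(value):
    """Decode the three low flag bits of the global status value."""
    return {
        "RIPV": bool(value & 0x1),  # restart instruction pointer valid
        "EIPV": bool(value & 0x2),  # error instruction pointer valid
        "MCIP": bool(value & 0x4),  # machine check in progress
    }


if __name__ == "__main__":
    # Values taken from the Linux example message earlier in this article.
    print(decode_bank_status(0xF200200000000863))
    print(decode_mcg_status(0x4))
```

Applied to the Linux example, the bank status has the PCC bit set, which is consistent with the "CPU context corrupt" panic message.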

Programs to decode Intel and AMD MCEs

Notes and References

  1. Environmental Record Editing and Printing Program (EREP) 3.5 - User's Guide. GC35-0151-50. September 30, 2021. Chapter 1. Introducing EREP. p. 1. February 20, 2023. https://www.ibm.com/servers/resourcelink/svc00100.nsf/pages/zOSV2R5gc350151/$file/ifc1000_v2r5.pdf#page=17
  2. System Programmer's Guide to: z/OS System Logger. SG24-6898-01. July 2007. Second edition. Redbooks. February 20, 2023.
  3. Web site: Bug Check 0x124: WHEA_UNCORRECTABLE_ERROR. Microsoft. 2022-11-03. 2022-12-11.
  4. Web site: Bug Check 0x9C: MACHINE_CHECK_EXCEPTION. Microsoft. 2021-12-14. 2022-12-11.
  5. Web site: mcelog not working with AMD processor family 16 and above on SLES11 SP3. SuSE. 2022-09-27. 2022-12-11.
  6. Book: Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3B: System Programming Guide, Part 2. Intel Corporation. November 2018. Machine Check Architecture. https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-sdm-volume-3b-system-programming-guide-part-2
  7. Web site: Stop error message in Windows XP that you may receive: "0x0000009C (0x00000004, 0x00000000, 0xb2000000, 0x00020151)". 2015-12-07. 2017-07-13.
  8. Web site: rasdaemon is a RAS (Reliability, Availability and Serviceability) logging tool. Mauro Carvalho Chehab (mchehab). github.com. 2023-02-20. 2023-02-20.
  9. Web site: Machine-check exception. wiki.archlinux.org. 2021-05-08. 2023-02-21.
  10. Web site: ECC RAM. wiki.gentoo.org. 2022-12-30. 2023-02-21.
  11. Web site: mcelog: Advanced hardware error handling for x86 Linux. 2015-04-20. 2017-07-13.
  12. Web site: x86/mce: Factor out and deprecate the /dev/mcelog driver. git.kernel.org. 2017-03-28. 2023-02-21.
  13. Web site: x86/mce: Factor out and deprecate the /dev/mcelog driver. github.com/torvalds/linux/. 2017-03-28. 2023-02-21.
  14. Web site: parsemce: Linux Machine check exception handler parser. 2003-07-22. 2017-07-13.