FMA instruction set explained

The FMA instruction set is an extension to the 128- and 256-bit Streaming SIMD Extensions instructions in the x86 microprocessor instruction set that performs fused multiply–add (FMA) operations.[1] There are two variants, FMA3 and FMA4.

Instructions

FMA3 and FMA4 instructions have almost identical functionality but are not compatible. Both contain fused multiply–add (FMA) instructions for floating-point scalar and SIMD operations, but FMA3 instructions have three operands, while FMA4 instructions have four. The FMA operation has the form d = round(a · b + c), where the round function rounds the result to fit the destination register when the exact value has more significant bits than the destination can hold.

The four-operand form (FMA4) allows a, b, c and d to be four different registers, while the three-operand form (FMA3) requires that d be the same register as a, b or c. The three-operand form makes the code shorter and the hardware implementation slightly simpler, while the four-operand form provides more programming flexibility.

See XOP instruction set for more discussion of compatibility issues between Intel and AMD.

FMA3 instruction set

CPUs with FMA3

Excerpt from FMA3

Supported instructions include:

Mnemonic     Operation
VFMADD       result = + a · b + c
VFNMADD      result = − a · b + c
VFMSUB       result = + a · b − c
VFNMSUB      result = − a · b − c
VFMADDSUB    result = a · b + c  for i = 1, 3, ...
             result = a · b − c  for i = 0, 2, ...
VFMSUBADD    result = a · b − c  for i = 1, 3, ...
             result = a · b + c  for i = 0, 2, ...
Note:

Explicit order of operands is included in the mnemonic using numbers "132", "213", and "231":

Postfix 1   Operation        Possible memory operand   Overwrites
132         a = a · c + b    c (factor)                a (other factor)
213         a = b · a + c    c (summand)               a (factor)
231         a = b · c + a    c (factor)                a (summand)

as well as operand format (packed or scalar) and size (single or double).

Postfix 2   Precision   Size           Postfix 2   Precision   Size
SS          Single      32 bit         SD          Double      64 bit
PSx         Single      4× 32 bit      PDx         Double      2× 64 bit
PSy         Single      8× 32 bit      PDy         Double      4× 64 bit
PSz         Single      16× 32 bit     PDz         Double      8× 64 bit

This results in

Encoding                    Mnemonic        Operands              Operation
VEX.256.66.0F38.W1 98 /r    VFMADD132PDy    ymm, ymm, ymm/m256    a = a · c + b
VEX.256.66.0F38.W0 98 /r    VFMADD132PSy    ymm, ymm, ymm/m256    a = a · c + b
VEX.128.66.0F38.W1 98 /r    VFMADD132PDx    xmm, xmm, xmm/m128    a = a · c + b
VEX.128.66.0F38.W0 98 /r    VFMADD132PSx    xmm, xmm, xmm/m128    a = a · c + b
VEX.LIG.66.0F38.W1 99 /r    VFMADD132SD     xmm, xmm, xmm/m64     a = a · c + b
VEX.LIG.66.0F38.W0 99 /r    VFMADD132SS     xmm, xmm, xmm/m32     a = a · c + b
VEX.256.66.0F38.W1 A8 /r    VFMADD213PDy    ymm, ymm, ymm/m256    a = b · a + c
VEX.256.66.0F38.W0 A8 /r    VFMADD213PSy    ymm, ymm, ymm/m256    a = b · a + c
VEX.128.66.0F38.W1 A8 /r    VFMADD213PDx    xmm, xmm, xmm/m128    a = b · a + c
VEX.128.66.0F38.W0 A8 /r    VFMADD213PSx    xmm, xmm, xmm/m128    a = b · a + c
VEX.LIG.66.0F38.W1 A9 /r    VFMADD213SD     xmm, xmm, xmm/m64     a = b · a + c
VEX.LIG.66.0F38.W0 A9 /r    VFMADD213SS     xmm, xmm, xmm/m32     a = b · a + c
VEX.256.66.0F38.W1 B8 /r    VFMADD231PDy    ymm, ymm, ymm/m256    a = b · c + a
VEX.256.66.0F38.W0 B8 /r    VFMADD231PSy    ymm, ymm, ymm/m256    a = b · c + a
VEX.128.66.0F38.W1 B8 /r    VFMADD231PDx    xmm, xmm, xmm/m128    a = b · c + a
VEX.128.66.0F38.W0 B8 /r    VFMADD231PSx    xmm, xmm, xmm/m128    a = b · c + a
VEX.LIG.66.0F38.W1 B9 /r    VFMADD231SD     xmm, xmm, xmm/m64     a = b · c + a
VEX.LIG.66.0F38.W0 B9 /r    VFMADD231SS     xmm, xmm, xmm/m32     a = b · c + a
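
The encodings listed above follow a regular pattern: the operand order selects the opcode byte (98, A8 or B8), the scalar forms use the next byte up, and the VEX.W bit distinguishes double from single precision. A small sketch of that pattern follows; the helper names are hypothetical, and the code covers only the VFMADD rows shown here, not the other FMA3 mnemonics.

```c
/* Opcode byte for VFMADD with the given operand order (132, 213 or 231).
   Scalar forms (SS/SD) use the packed opcode plus one.
   Returns -1 for an operand order that is not in the table. */
int vfmadd_opcode(int order, int is_scalar) {
    int base;
    switch (order) {
    case 132: base = 0x98; break;
    case 213: base = 0xA8; break;
    case 231: base = 0xB8; break;
    default:  return -1;
    }
    return base + (is_scalar ? 1 : 0);
}

/* VEX.W bit: 1 for double precision (PD/SD), 0 for single (PS/SS). */
int vfmadd_vex_w(int is_double) {
    return is_double ? 1 : 0;
}
```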

FMA4 instruction set

CPUs with FMA4

Excerpt from FMA4

Mnemonic (AT&T)    Operands                          Operation
VFMADDPDx          xmm, xmm, xmm/m128, xmm/m128      a = b·c + d
VFMADDPDy          ymm, ymm, ymm/m256, ymm/m256      a = b·c + d
VFMADDPSx          xmm, xmm, xmm/m128, xmm/m128      a = b·c + d
VFMADDPSy          ymm, ymm, ymm/m256, ymm/m256      a = b·c + d
VFMADDSD           xmm, xmm, xmm/m64, xmm/m64        a = b·c + d
VFMADDSS           xmm, xmm, xmm/m32, xmm/m32        a = b·c + d

History

The incompatibility between Intel's FMA3 and AMD's FMA4 is due to both companies changing plans without coordinating coding details with each other. AMD changed their plans from FMA3 to FMA4 while Intel changed their plans from FMA4 to FMA3 almost at the same time. The history can be summarized as follows:

Compiler and assembler support

Different compilers provide different levels of support for FMA:

Notes and References

  1. "FMA3 and FMA4 are not instruction sets, they are individual instructions -- fused multiply add. They could be quite useful depending on how Intel and AMD implement them" Web site: Woltmann. George (Prime95). Intel AVX and GIMPS. mersenneforum.org/index.php. Great Internet Mersenne Prime Search (GIMPS) project. 27 July 2011.
  2. Web site: Maffeo. Robin. AMD and the Visual Studio 11 Beta. AMD. March 1, 2012. https://archive.today/20131109140742/http://developer.amd.com/community/blog/2012/03/01/amd-and-the-visual-studio-11-beta/. November 9, 2013. dead. 2018-11-07.
  3. Web site: CPU-Z - ID : y5z6gq . 2022-05-01.
  4. Web site: CPU-Z - ID : kr2mlx . 2022-05-01.
  5. Web site: AMD64 Architecture Programmer's Manual Volume 6: 128-Bit and 256-Bit XOP, FMA4 and CVT16 Instructions . May 1, 2009 . AMD.
  6. Web site: New "Bulldozer" and "Piledriver" Instructions A step forward for high performance software development . October 2012 . AMD.
  7. Web site: Agner's CPU blog - Test results for AMD Ryzen. 2017-05-02.
  8. Web site: www.amd.com, FMA4 support model list .
  9. Web site: www.amd.com, FMA4 support model list .
  10. Web site: www.amd.com, FMA4 support model list .
  11. Web site: 128-Bit SSE5 Instruction Set . AMD Developer Central . 2008-01-28 . https://web.archive.org/web/20080115163416/http://developer.amd.com/SSE5 . 2008-01-15 . dead .
  12. Web site: Intel Advanced Vector Extensions Programming Reference . . 2008-04-05 .
  13. Web site: Intel Advanced Vector Extensions Programming Reference . . 2009-05-06.
  14. Web site: Striking a balance . May 6, 2009 . Dave Christie, AMD Developer blogs . https://archive.today/20120708101459/http://blogs.amd.com/developer/2009/05/06/striking-a-balance/ . July 8, 2012 . dead . 2018-11-07.
  15. Web site: New Bulldozer and Piledriver Instructions . AMD. 25 July 2013.
  16. Web site: Software Optimization Guide for AMD Family 15h Processors. AMD. 19 April 2012.
  17. Web site: Intel Architecture Instruction Set Extensions Programming Reference. Intel. 25 July 2013.
  18. Web site: The microarchitecture of Intel, AMD and VIA CPUs An optimization guide for assembly programmers and compiler makers . 2017-05-02.
  19. Web site: Gopalasubramanian . Ganesh . [PATCH] add znver1 processor. ]. 2015-03-10 . 2022-05-01.
  20. Web site: Pawar . Amit . [PATCH] Remove CpuFMA4 from Znver1 CPU Flags ]. 2015-08-07 . 2022-05-01.
  21. Web site: Discussion – Ryzen has undocumented support for FMA4. 2017-05-10.
  22. Web site: Stack Overflow comment by Mysticial. 2019-07-16. 2023-09-01. 2019-08-22. https://web.archive.org/web/20190822063407/https://stackoverflow.com/questions/57055756/arbitrary-position-2-input-shuffling-using-sse. bot: unknown.
  23. Web site: AMD Ryzen Machine Crashes to a Sequence of FMA3 Instructions. 16 March 2017 . 2017-09-10.
  24. Web site: Stack Overflow comment by Mysticial. 2019-07-16. 2023-09-01.
  25. Web site: AMD Bulldozer only FMA4 and XOP instructions are supported by GCC Intel still mute. https://web.archive.org/web/20111117001441/http://www.theinquirer.net/inquirer/news/2124866/amd-bulldozer-fma4-xop-instructions-supported-gcc. unfit. November 17, 2011. The Inquirer. Lawrence . Latif. Nov 14, 2011.
  26. Web site: FMA4 Intrinsics Added for Visual Studio 2010 SP1. 4 February 2013 .
  27. Web site: EKOPath man doc. 2013-07-24. https://web.archive.org/web/20160623224118/http://www.pathscale.com/node/272. 2016-06-23. dead.
  28. Web site: LLVM 3.1 Release Notes.
  29. Web site: Enable detection of AVX and AVX2 support through CPUID. 2012-04-26. LLVM.