Multiply–accumulate operation explained

In computing, especially digital signal processing, the multiply–accumulate (MAC) or multiply-add (MAD) operation is a common step that computes the product of two numbers and adds that product to an accumulator. The hardware unit that performs the operation is known as a multiplier–accumulator (MAC unit); the operation itself is also often called a MAC or a MAD operation. The MAC operation modifies an accumulator a: a ← a + (b × c).
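As a minimal sketch of the update a ← a + (b × c), the following C fragment applies one MAC step per pair of inputs. The function names mac and mac_sequence, the 32-bit operands, and the 64-bit accumulator are assumptions made for illustration and are not taken from any particular instruction set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One MAC step: returns a + (b * c). The 64-bit accumulator is an
   assumption of this sketch; overflow of the accumulator itself is
   not handled. */
int64_t mac(int64_t a, int32_t b, int32_t c)
{
    return a + (int64_t)b * (int64_t)c;
}

/* Accumulates the products b[i] * c[i] into a, one MAC step per element. */
int64_t mac_sequence(int64_t a, const int32_t *b, const int32_t *c, size_t n)
{
    assert(n == 0 || (b != NULL && c != NULL));
    for (size_t i = 0; i < n; ++i)
        a = mac(a, b[i], c[i]);
    return a;
}
```

Calling mac_sequence with a starting accumulator of 0 yields the sum of products of the two arrays.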

When done with floating-point numbers, it might be performed with two roundings (typical in many DSPs), or with a single rounding. When performed with a single rounding, it is called a fused multiply–add (FMA) or fused multiply–accumulate (FMAC).

Modern computers may contain a dedicated MAC, consisting of a multiplier implemented in combinational logic followed by an adder and an accumulator register that stores the result. The output of the register is fed back to one input of the adder, so that on each clock cycle, the output of the multiplier is added to the register. Combinational multipliers require a large amount of logic, but can compute a product much more quickly than the method of shifting and adding typical of earlier computers.

Percy Ludgate was the first to conceive a MAC in his Analytical Machine of 1909,[1] and the first to exploit a MAC for division (using multiplication seeded by a reciprocal, via a convergent series). The first modern processors to be equipped with MAC units were digital signal processors, but the technique is now also common in general-purpose processors.[2][3][4][5]
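The datapath described above (a multiplier, an adder, and an accumulator register fed back into the adder) can be modelled in software. The following C sketch is only an illustration: struct mac_unit, mac_clock, and shift_add_mul are hypothetical names, and the operand and register widths are assumptions. The shift-and-add routine shows the sequential method that a combinational multiplier replaces.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical model of a MAC unit: only the accumulator register has state. */
struct mac_unit {
    uint64_t acc;
};

/* Shift-and-add multiplication: one conditional add per bit of c. */
uint64_t shift_add_mul(uint32_t b, uint32_t c)
{
    uint64_t product = 0;
    uint64_t shifted = b;
    while (c != 0) {
        if (c & 1u)
            product += shifted;   /* add the shifted multiplicand */
        shifted <<= 1;
        c >>= 1;
    }
    return product;
}

/* One clock cycle: the multiplier output is added to the value fed back
   from the register, and the sum is stored in the register again. */
void mac_clock(struct mac_unit *u, uint32_t b, uint32_t c)
{
    assert(u != NULL);
    u->acc += (uint64_t)b * c;
}
```

In this model the single multiplication in mac_clock stands in for the combinational multiplier, while shift_add_mul needs up to 32 loop iterations for the same product.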

In floating-point arithmetic

When done with integers, the operation is typically exact (computed modulo some power of two). However, floating-point numbers have only a limited amount of precision, and as a result floating-point arithmetic is generally not associative or distributive. Therefore, it makes a difference to the result whether the multiply–add is performed with two roundings, or in one operation with a single rounding (a fused multiply–add). IEEE 754-2008 specifies that it must be performed with one rounding, yielding a more accurate result.[6]
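Both statements can be checked with a short C sketch. The names mac_u32, sum_left, sum_right, and demo_associativity are hypothetical, the sketch assumes IEEE 754 double precision with round-to-nearest, and the input values were chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Integer MAC on 32-bit unsigned values: exact modulo 2^32. */
uint32_t mac_u32(uint32_t a, uint32_t b, uint32_t c)
{
    return (uint32_t)(a + b * c);
}

/* (x + y) + z; the volatile store forces the intermediate to double. */
double sum_left(double x, double y, double z)
{
    volatile double t = x + y;
    return t + z;
}

/* x + (y + z), same operands, different grouping. */
double sum_right(double x, double y, double z)
{
    volatile double t = y + z;
    return x + t;
}

void demo_associativity(void)
{
    assert(mac_u32(0xFFFFFFFFu, 2u, 1u) == 1u);   /* wraps modulo 2^32 */
    assert(sum_left(1e16, -1e16, 0.5) == 0.5);
    assert(sum_right(1e16, -1e16, 0.5) == 0.0);   /* 0.5 is lost next to -1e16 */
}
```

The two floating-point sums differ because adding 0.5 to -1e16 rounds back to -1e16, whereas the integer result is the same whichever order the additions are done in.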

Fused multiply–add

A fused multiply–add (FMA or fmadd)[7] is a floating-point multiply–add operation performed in one step (fused operation), with a single rounding. That is, where an unfused multiply–add would compute the product b × c, round it to N significant bits, add the result to a, and round back to N significant bits, a fused multiply–add would compute the entire expression a + (b × c) to its full precision before rounding the final result down to N significant bits.

A fast FMA can speed up and improve the accuracy of many computations that involve the accumulation of products, such as dot products and polynomial evaluation with Horner's rule.

Fused multiply–add can usually be relied on to give more accurate results. However, William Kahan has pointed out that it can cause problems if used unthinkingly.[8] If x² − y² is evaluated as ((x × x) − y × y) (following Kahan's suggested notation in which redundant parentheses direct the compiler to round the (x × x) term first) using fused multiply–add, then the result may be negative even when x = y, because the first multiplication discards low-significance bits. This could then lead to an error if, for instance, the square root of the result is then evaluated.

When implemented inside a microprocessor, an FMA can be faster than a multiply operation followed by an add. However, standard industrial implementations based on the original IBM RS/6000 design require a 2N-bit adder to compute the sum properly.[9]

Another benefit of including this instruction is that it allows an efficient software implementation of division (see division algorithm) and square root (see methods of computing square roots) operations, thus eliminating the need for dedicated hardware for those operations.[10]

Dot product instruction

Some machines combine multiple fused multiply–add operations into a single step, e.g. performing a four-element dot product on two 128-bit SIMD registers, a0×b0 + a1×b1 + a2×b2 + a3×b3, with single-cycle throughput.

Support

The FMA operation is included in IEEE 754-2008.

The Digital Equipment Corporation (DEC) VAX's POLY instruction is used for evaluating polynomials with Horner's rule using a succession of multiply and add steps. Instruction descriptions do not specify whether the multiply and add are performed using a single FMA step.[11] This instruction has been a part of the VAX instruction set since its original 11/780 implementation in 1977.

The 1999 standard of the C programming language supports the FMA operation through the fma standard math library function and the automatic transformation of a multiplication followed by an addition (contraction of floating-point expressions), which can be explicitly enabled or disabled with the standard pragma #pragma STDC FP_CONTRACT. The GCC and Clang C compilers do such transformations by default for processor architectures that support FMA instructions. With GCC, which does not support the aforementioned pragma,[12] this can be globally controlled by the -ffp-contract command-line option.[13]
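A minimal sketch of disabling contraction with the C99 pragma #pragma STDC FP_CONTRACT follows. The function name axpy is an assumption of this example. A compiler that does not implement the pragma may ignore it, possibly with a warning; with GCC, passing -ffp-contract=off on the command line serves the same purpose.

```c
#include <assert.h>
#include <stddef.h>

/* Ask the compiler not to fuse the multiply and add below into an FMA. */
#pragma STDC FP_CONTRACT OFF

/* y[i] = a * x[i] + y[i], written as a separate multiply and add. */
void axpy(size_t n, double a, const double *x, double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];   /* two roundings when not contracted */
}
```

Where a single rounding is wanted regardless of compiler settings, calling fma(a, x[i], y[i]) explicitly makes the choice visible in the source.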

The fused multiply–add operation was introduced as "multiply–add fused" in the IBM POWER1 (1990) processor,[14] and has since been added to numerous other processors.

Notes and References

  1. The Feasibility of Ludgate's Analytical Machine. Archived 2019-08-07 at https://web.archive.org/web/20190807233229/http://www.fano.co.uk/ludgate/. Retrieved 2020-08-30.
  2. Lyakhov, Pavel; Valueva, Maria; Valuev, Georgii; Nagornov, Nikolai (January 2020). A Method of Increasing Digital Filter Performance Based on Truncated Multiply-Accumulate Units. Applied Sciences. 10 (24): 9052. doi:10.3390/app10249052.
  3. Tung Thanh Hoang; Sjalander, M.; Larsson-Edefors, P. (May 2009). Double Throughput Multiply-Accumulate unit for FlexCore processor enhancements. 2009 IEEE International Symposium on Parallel & Distributed Processing. pp. 1–7. doi:10.1109/IPDPS.2009.5161212. ISBN 978-1-4244-3751-1. S2CID 14535090. https://ieeexplore.ieee.org/document/5161212.
  4. Kang, Jongsung; Kim, Taewhan (2020-03-01). PV-MAC: Multiply-and-accumulate unit structure exploiting precision variability in on-device convolutional neural networks. Integration. 71: 76–85. doi:10.1016/j.vlsi.2019.11.003. ISSN 0167-9260. S2CID 211264132.
  5. mad - ps. 20 November 2019. Retrieved 2021-08-14.
  6. Whitehead, Nathan; Fit-Florea, Alex (2011). Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs. nvidia. Retrieved 2013-08-31.
  7. fmadd instrs.
  8. Kahan, William (1996-05-31). IEEE Standard 754 for Binary Floating-Point Arithmetic.
  9. Quinnell, Eric (May 2007). Floating-Point Fused Multiply–Add Architectures (PhD). Retrieved 2011-03-28.
  10. Markstein, Peter (November 2004). Software Division and Square Root Using Goldschmidt's Algorithms. 6th Conference on Real Numbers and Computers. CiteSeerX 10.1.1.85.9648.
  11. VAX instruction of the week: POLY. Archived 2020-02-13 at https://web.archive.org/web/20200213093219/http://uranium.vaxpower.org/~isildur/vax/week.html.
  12. Bug 20785 - Pragma STDC * (C99 FP) unimplemented. gcc.gnu.org. Retrieved 2022-02-02.
  13. Optimize Options (Using the GNU Compiler Collection (GCC)). gcc.gnu.org. Retrieved 2022-02-02.
  14. Montoye, R. K.; Hokenek, E.; Runyon, S. L. (January 1990). Design of the IBM RISC System/6000 floating-point execution unit. IBM Journal of Research and Development. 34 (1): 59–70. doi:10.1147/rd.341.0059.
  15. Godson-3 Emulates x86: New MIPS-Compatible Chinese Processor Has Extensions for x86 Translation.
  16. Hollingsworth, Brent (October 2012). New "Bulldozer" and "Piledriver" Instructions. AMD Developer Central.
  17. Intel adds 22nm octo-core 'Haswell' to CPU design roadmap. The Register. 2008-08-19. Archived 2012-02-17 at https://web.archive.org/web/20120217051330/http://www.reghardware.com/2008/08/19/idf_intel_architecture_roadmap/.
  18. STM32 Cortex-M33 MCUs programming manual. ST. Retrieved 2024-05-06.