CLMUL instruction set explained

Carry-less Multiplication (CLMUL) is an extension to the x86 instruction set used by microprocessors from Intel and AMD. It was proposed by Intel in March 2008[1] and made available in the Intel Westmere processors announced in early 2010. Mathematically, the instruction implements multiplication of polynomials over the finite field GF(2), where the bitstring $a_0a_1\ldots a_{63}$ represents the polynomial $a_0 + a_1X + a_2X^2 + \cdots + a_{63}X^{63}$. The CLMUL instruction also allows a more efficient implementation of the closely related multiplication in larger finite fields GF(2^k) than the traditional instruction set does.[2]
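The polynomial view can be made concrete with a small software model. The following is only an illustrative sketch, not the hardware algorithm: the helper names `clmul` and `gf_mul` are hypothetical, bit i of a Python integer is assumed to hold the coefficient of X^i, and the GF(2^8) modulus 0x11B is chosen purely as a small, familiar example.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two non-negative integers.

    Bit i of each operand is read as the coefficient of X**i, so the
    result encodes the product of the two polynomials over GF(2).
    """
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift  # XOR instead of addition: no carries
        b >>= 1
        shift += 1
    return result


def gf_mul(a: int, b: int, modulus: int) -> int:
    """Multiply in GF(2^k): carry-less product reduced modulo `modulus`.

    `modulus` encodes an irreducible polynomial of degree k in the same
    bit convention as the operands.
    """
    product = clmul(a, b)
    k = modulus.bit_length() - 1
    while product.bit_length() > k:
        # Cancel the current leading term with a shifted copy of the modulus.
        product ^= modulus << (product.bit_length() - 1 - k)
    return product


# (1 + X) * (1 + X) = 1 + X^2, because the two X terms cancel modulo 2.
print(bin(clmul(0b11, 0b11)))          # 0b101
# GF(2^8) with modulus X^8 + X^4 + X^3 + X + 1 (0x11B).
print(hex(gf_mul(0x57, 0x83, 0x11B)))  # 0xc1
```

With two 64-bit operands the product has degree at most 126, which is why the result fits in a 128-bit register.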

One use of these instructions is to improve the speed of applications doing block cipher encryption in Galois/Counter Mode, which depends on finite field GF(2^k) multiplication. Another application is the fast calculation of CRC values,[3] including those used to implement the LZ77 sliding window DEFLATE algorithm in zlib and pngcrush.[4]
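The link between CRCs and carry-less multiplication is that a CRC is a polynomial remainder over GF(2), and a long message can be shortened ("folded") by multiplying its leading part by a precomputed constant X^n mod G. The sketch below illustrates that identity only. It assumes a simple CRC-8 (polynomial 0x07, zero initial value, no bit reflection) rather than the CRC-32 used by zlib, and the function names are hypothetical.

```python
CRC8_POLY = 0x107  # G(X) = X^8 + X^2 + X + 1, including the leading term


def clmul(a: int, b: int) -> int:
    """Carry-less product; bit i is the coefficient of X**i."""
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift
        b >>= 1
        shift += 1
    return result


def poly_mod(a: int, modulus: int) -> int:
    """Remainder of a divided by modulus, with coefficients in GF(2)."""
    width = modulus.bit_length()
    while a.bit_length() >= width:
        a ^= modulus << (a.bit_length() - width)
    return a


def crc8_bitwise(data: bytes) -> int:
    """Reference CRC-8, processed one bit at a time."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ 0x07) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc


def crc8_folded(data: bytes, chunk_bytes: int = 4) -> int:
    """Same CRC-8, computed by folding the message chunk by chunk."""
    acc = 0  # invariant: acc is congruent to the message so far, modulo G
    for start in range(0, len(data), chunk_bytes):
        chunk = data[start:start + chunk_bytes]
        fold_const = poly_mod(1 << (8 * len(chunk)), CRC8_POLY)  # X^n mod G
        # (acc * X^n + chunk) mod G == (acc * (X^n mod G) + chunk) mod G
        acc = clmul(acc, fold_const) ^ int.from_bytes(chunk, "big")
    # acc is left unreduced between folds, so it grows by a few bits each time;
    # a fixed-width register implementation has to handle this differently.
    return poly_mod(acc << 8, CRC8_POLY)


message = b"123456789"
print(crc8_folded(message) == crc8_bitwise(message))  # True
```

The point of the folded form is that each step costs one carry-less multiplication per chunk instead of one shift-and-XOR step per bit.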

ARMv8 also has a version of CLMUL. SPARC calls its version XMULX, for "XOR multiplication".

New instructions

The instruction, PCLMULQDQ, computes the 128-bit carry-less product of two 64-bit values. The destination is a 128-bit XMM register. The source may be another XMM register or a memory location. An immediate operand specifies which halves of the 128-bit operands are multiplied. Mnemonics for specific values of the immediate operand are also defined:

| Instruction | Opcode | Description |
|---|---|---|
| PCLMULQDQ xmmreg,xmmrm,imm | [rmi: 66 0f 3a 44 /r ib] | Perform a carry-less multiplication of two 64-bit polynomials over the finite field GF(2). |
| PCLMULLQLQDQ xmmreg,xmmrm | [rm: 66 0f 3a 44 /r 00] | Multiply the low halves of the two registers. |
| PCLMULHQLQDQ xmmreg,xmmrm | [rm: 66 0f 3a 44 /r 01] | Multiply the high half of the destination register by the low half of the source register. |
| PCLMULLQHQDQ xmmreg,xmmrm | [rm: 66 0f 3a 44 /r 10] | Multiply the low half of the destination register by the high half of the source register. |
| PCLMULHQHQDQ xmmreg,xmmrm | [rm: 66 0f 3a 44 /r 11] | Multiply the high halves of the two registers. |
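The way the immediate operand selects the halves can be modelled in software. This is a minimal sketch under stated assumptions: 128-bit registers are represented as Python integers, the trailing opcode byte of each mnemonic is read as a hexadecimal immediate (0x00, 0x01, 0x10, 0x11), and the function `pclmulqdq` and the `PSEUDO_OPS` table are hypothetical names that model only the operand selection, not the instruction encoding.

```python
MASK64 = (1 << 64) - 1

# Immediate values implied by the four convenience mnemonics.
PSEUDO_OPS = {
    "PCLMULLQLQDQ": 0x00,  # low(dest)  * low(src)
    "PCLMULHQLQDQ": 0x01,  # high(dest) * low(src)
    "PCLMULLQHQDQ": 0x10,  # low(dest)  * high(src)
    "PCLMULHQHQDQ": 0x11,  # high(dest) * high(src)
}


def clmul(a: int, b: int) -> int:
    """Carry-less product; bit i is the coefficient of X**i."""
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift
        b >>= 1
        shift += 1
    return result


def pclmulqdq(dest: int, src: int, imm8: int) -> int:
    """Return the 128-bit carry-less product of the selected 64-bit halves.

    Bit 0 of imm8 selects the half of dest and bit 4 selects the half of
    src (0 = low 64 bits, 1 = high 64 bits).
    """
    a = (dest >> 64) & MASK64 if imm8 & 0x01 else dest & MASK64
    b = (src >> 64) & MASK64 if imm8 & 0x10 else src & MASK64
    return clmul(a, b)


dest = (3 << 64) | 5  # high half 3, low half 5
src = (7 << 64) | 9   # high half 7, low half 9
for name, imm8 in PSEUDO_OPS.items():
    print(name, pclmulqdq(dest, src, imm8))
```

Having all four combinations available lets a 128-bit by 128-bit carry-less product be assembled from 64-bit partial products without shuffling the registers first.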

An EVEX-encoded vectorized version (VPCLMULQDQ) is available in AVX-512.

CPUs with CLMUL instruction set

The presence of the CLMUL instruction set can be checked by testing the PCLMULQDQ CPU feature bit (CPUID leaf 1, ECX bit 1).
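On Linux, one way to perform this check without executing CPUID directly is to read the feature flags the kernel exports. The sketch below assumes an x86 Linux system where `/proc/cpuinfo` contains a `flags` line listing `pclmulqdq`; the function names are hypothetical, and other operating systems or architectures need a different mechanism.

```python
from typing import Optional


def flags_include_pclmulqdq(cpuinfo_text: str) -> bool:
    """Return True if the first 'flags' line lists the pclmulqdq feature."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return "pclmulqdq" in value.split()
    return False  # no flags line found (for example, a non-x86 system)


def cpu_has_clmul(path: str = "/proc/cpuinfo") -> Optional[bool]:
    """Check the running CPU; returns None if the file cannot be read."""
    try:
        with open(path, encoding="utf-8") as handle:
            return flags_include_pclmulqdq(handle.read())
    except OSError:
        return None


if __name__ == "__main__":
    print("CLMUL available:", cpu_has_clmul())
```

Returning `None` rather than `False` when the file is missing keeps "not supported" distinct from "could not determine".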

Notes and References

  1. Intel Software Network. Intel. 2008-04-05. Archived from the original on 2008-04-07: https://web.archive.org/web/20080407095317/http://softwareprojects.intel.com/avx/
  2. Shay Gueron; Michael E. Kounavis. Intel Carry-Less Multiplication Instruction and its Usage for Computing the GCM Mode – Rev 2.02. Intel. 2014-04-20. Archived 2019-08-06: https://web.archive.org/web/20190806061845/https://software.intel.com/sites/default/files/managed/72/cc/clmul-wp-rev-2.02-2014-04-20.pdf
  3. Fast CRC Computation for Generic Polynomials Using PCLMULQDQ.
  4. Vlad Krasnov. Fighting Cancer: The Unexpected Benefit Of Open Sourcing Our Code. CloudFlare. 2015-07-08. Retrieved 2016-09-04.
  5. Johan De Gelas. The Intel Xeon E5 v4 Review: Testing Broadwell-EP With Demanding Server Workloads. p. 3. 2017-03-31.
  6. Slide detailing improvements of Jaguar over Bobcat. AMD. 2013-08-03.
  7. Dave Christie. Striking a balance. AMD Developer blogs. 2009-05-06. Retrieved 2011-03-11. Archived from the original on 2013-11-09: https://archive.today/20131109140737/http://developer.amd.com/2009/05/06/striking-a-balance/