CLMUL instruction set explained

Carry-less Multiplication (CLMUL) is an extension to the x86 instruction set used by microprocessors from Intel and AMD. It was proposed by Intel in March 2008^[1] and first made available in the Intel Westmere processors announced in early 2010. Mathematically, the instruction implements multiplication of polynomials over the finite field GF(2), where the bitstring

$a_0a_1\ldots a_{63}$

represents the polynomial

$a_0 + a_1X + a_2X^2 + \cdots + a_{63}X^{63}$.

The CLMUL instruction also allows a more efficient implementation of the closely related multiplication in larger finite fields GF(2^k) than the traditional instruction set does.^[2]
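The polynomial view above can be modelled in a few lines of software. The following is a minimal sketch, not how any processor implements the instruction: it assumes bit i of a Python integer holds the coefficient of X^i, and the names `clmul` and `gf_mul` are hypothetical helpers chosen for illustration. The GF(2^8) modulus 0x11B used in the example is the polynomial from the AES specification, picked only because it is small and well known.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two non-negative integers.

    Bit i of each integer is the coefficient of X^i, so this is polynomial
    multiplication over GF(2): shifted copies of `a` are combined with XOR
    instead of addition, which means no carries propagate.
    """
    result = 0
    shift = 0
    while b >> shift:
        if (b >> shift) & 1:
            result ^= a << shift
        shift += 1
    return result


def gf_mul(a: int, b: int, modulus: int) -> int:
    """Multiply in GF(2^k): carry-less product reduced modulo `modulus`.

    `modulus` is an irreducible polynomial of degree k in the same encoding.
    """
    k = modulus.bit_length() - 1
    product = clmul(a, b)
    # Cancel every term of degree >= k, starting from the highest.
    for degree in range(product.bit_length() - 1, k - 1, -1):
        if (product >> degree) & 1:
            product ^= modulus << (degree - k)
    return product


# (1 + X)^2 = 1 + X^2 over GF(2), whereas ordinary 3 * 3 is 9.
print(bin(clmul(0b11, 0b11)))          # 0b101
# Worked example from the AES specification: {57} * {83} = {c1}.
print(hex(gf_mul(0x57, 0x83, 0x11B)))  # 0xc1
```

The split into a multiply step and a reduce step mirrors how the hardware instruction is typically used: the instruction supplies the unreduced product, and the reduction for a particular field is done separately.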

One use of these instructions is to improve the speed of applications doing block cipher encryption in Galois/Counter Mode, which depends on finite field GF(2^k) multiplication. Another application is the fast calculation of CRC values,^[3] including those used to implement the LZ77 sliding window DEFLATE algorithm in zlib and pngcrush.^[4]
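The link between CRCs and carry-less arithmetic is that a CRC is the remainder of a polynomial division over GF(2). The sketch below illustrates only that relationship, using the CRC-16/XMODEM parameters (generator 0x11021, zero initial value, no bit reflection) as an assumed example; it divides bit by bit and is not the folding technique that PCLMULQDQ-based implementations use. The function names are hypothetical.

```python
import binascii


def poly_mod(value: int, modulus: int) -> int:
    """Remainder of polynomial division over GF(2) (bit i = coefficient of X^i)."""
    k = modulus.bit_length() - 1
    for degree in range(value.bit_length() - 1, k - 1, -1):
        if (value >> degree) & 1:
            value ^= modulus << (degree - k)
    return value


def crc16_xmodem(data: bytes) -> int:
    """CRC-16/XMODEM as the remainder of M(X) * X^16 modulo the generator."""
    message = int.from_bytes(data, "big")
    return poly_mod(message << 16, 0x11021)


# Cross-check against the standard library's CRC-CCITT routine with a zero seed.
sample = b"123456789"
print(crc16_xmodem(sample) == binascii.crc_hqx(sample, 0))  # True
```

Implementations built on carry-less multiplication avoid this bit-serial loop by multiplying large chunks of the message by precomputed powers of X modulo the generator, which is where the instruction provides its speedup.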

ARMv8 also has a version of CLMUL. SPARC calls their version XMULX, for "XOR multiplication".

New instructions

The PCLMULQDQ instruction computes the 128-bit carry-less product of two 64-bit values. The destination is a 128-bit XMM register. The source may be another XMM register or memory. An immediate operand specifies which halves of the 128-bit operands are multiplied. Mnemonics for specific values of the immediate operand are also defined:

Instruction	Opcode	Description
`PCLMULQDQ xmmreg,xmmrm,imm`	`[rmi: 66 0f 3a 44 /r ib]`	Perform a carry-less multiplication of two 64-bit polynomials over the finite field GF(2).
`PCLMULLQLQDQ xmmreg,xmmrm`	`[rm:  66 0f 3a 44 /r 00]`	Multiply the low halves of the two registers.
`PCLMULHQLQDQ xmmreg,xmmrm`	`[rm:  66 0f 3a 44 /r 01]`	Multiply the high half of the destination register by the low half of the source register.
`PCLMULLQHQDQ xmmreg,xmmrm`	`[rm:  66 0f 3a 44 /r 10]`	Multiply the low half of the destination register by the high half of the source register.
`PCLMULHQHQDQ xmmreg,xmmrm`	`[rm:  66 0f 3a 44 /r 11]`	Multiply the high halves of the two registers.
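The half selection in the table can be expressed as a small software model. This is a sketch under stated assumptions: 128-bit registers are represented as Python integers, the immediate values in the opcode column are read as hexadecimal bytes (0x00, 0x01, 0x10, 0x11), bit 0 of the immediate selects the half of the destination, and bit 4 selects the half of the source. The function names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def clmul64(a: int, b: int) -> int:
    """Carry-less product of two 64-bit values (the result fits in 127 bits)."""
    result = 0
    for i in range(64):
        if (b >> i) & 1:
            result ^= a << i
    return result


def pclmulqdq(dst: int, src: int, imm8: int) -> int:
    """Software model of the instruction on 128-bit operands held as ints.

    Bit 0 of imm8 picks the half of `dst`, bit 4 picks the half of `src`
    (0 = low 64 bits, 1 = high 64 bits).
    """
    a = (dst >> 64) & MASK64 if imm8 & 0x01 else dst & MASK64
    b = (src >> 64) & MASK64 if imm8 & 0x10 else src & MASK64
    return clmul64(a, b)


dst = (0b11 << 64) | 0b101   # high half 3, low half 5
src = (0b111 << 64) | 0b10   # high half 7, low half 2
print(pclmulqdq(dst, src, 0x00))  # low * low:   5 (x) 2 = 10
print(pclmulqdq(dst, src, 0x01))  # high * low:  3 (x) 2 = 6
print(pclmulqdq(dst, src, 0x10))  # low * high:  5 (x) 7 = 27
print(pclmulqdq(dst, src, 0x11))  # high * high: 3 (x) 7 = 9
```

Here `(x)` in the comments stands for the carry-less product, which differs from the ordinary product whenever partial products overlap (for example 3 (x) 7 is 9, not 21).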

An EVEX vectorized version (VPCLMULQDQ) is available in AVX-512.

CPUs with CLMUL instruction set

Intel:
- Westmere processor (March 2010)
- Sandy Bridge processor
- Ivy Bridge processor
- Haswell processor
- Broadwell processor (with increased throughput and lower latency^[5])
- Skylake (and later) processor
- Goldmont processor
AMD:
- Jaguar-based processors and newer^[6]
- Puma-based processors and newer
- "Heavy Equipment" processors
  - Bulldozer-based processors^[7]
  - Piledriver-based processors
  - Steamroller-based processors
  - Excavator-based processors and newer
- Zen processors
- Zen+ processors
- Zen2 (and later) processors

The presence of the CLMUL instruction set can be checked by testing one of the CPU feature bits: executing CPUID with EAX=1 reports PCLMULQDQ support in bit 1 of ECX.
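On Linux, the kernel exposes the decoded feature bits by name in /proc/cpuinfo, where this one appears as `pclmulqdq`. The following is a minimal sketch assuming a Linux x86 system; the function names are hypothetical, and other operating systems or architectures would need a different mechanism.

```python
def has_pclmulqdq(cpuinfo_text: str) -> bool:
    """Return True if a `flags` line in /proc/cpuinfo-style text lists pclmulqdq."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "pclmulqdq" in value.split():
            return True
    return False


def cpu_has_pclmulqdq(path: str = "/proc/cpuinfo") -> bool:
    """Check the running machine; returns False if the file cannot be read."""
    try:
        with open(path, encoding="utf-8") as handle:
            return has_pclmulqdq(handle.read())
    except OSError:
        return False


# Parsing is separated from file access so it can be tried on sample text.
sample = "processor\t: 0\nflags\t\t: fpu sse2 pclmulqdq ssse3\n"
print(has_pclmulqdq(sample))  # True
```

Native code would normally query CPUID directly or use a compiler-provided feature test rather than parsing text; the text-based check is shown only because it is easy to run and inspect.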

Notes and References

[1] "Intel Software Network". Intel. Archived from the original on 2008-04-07: https://web.archive.org/web/20080407095317/http://softwareprojects.intel.com/avx/ . Retrieved 2008-04-05.
[2] Shay Gueron; Michael E. Kounavis (2014-04-20). "Intel Carry-Less Multiplication Instruction and its Usage for Computing the GCM Mode – Rev 2.02". Intel. https://web.archive.org/web/20190806061845/https://software.intel.com/sites/default/files/managed/72/cc/clmul-wp-rev-2.02-2014-04-20.pdf . Archived 2019-08-06.
[3] "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ".
[4] Vlad Krasnov (2015-07-08). "Fighting Cancer: The Unexpected Benefit Of Open Sourcing Our Code". CloudFlare. Retrieved 2016-09-04.
[5] Johan De Gelas. "The Intel Xeon E5 v4 Review: Testing Broadwell-EP With Demanding Server Workloads". p. 3. 2017-03-31.
[6] "Slide detailing improvements of Jaguar over Bobcat". AMD. August 3, 2013.
[7] Dave Christie (6 May 2009). "Striking a balance". AMD Developer blogs. Archived from the original on 9 November 2013: https://archive.today/20131109140737/http://developer.amd.com/2009/05/06/striking-a-balance/ . Retrieved 2011-03-11.