X86 SIMD instruction listings explained

The x86 instruction set has been extended several times with SIMD (single instruction, multiple data) instruction set extensions. These extensions, starting with MMX, introduced with the Pentium MMX in 1997, typically define sets of wide registers together with instructions that subdivide these registers into fixed-size lanes and perform a computation on each lane in parallel.
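
As a rough illustration of the lane model described above, the following Python sketch emulates a lane-wise addition on a 64-bit value, in the manner of a packed add such as PADDB. The function name `lanewise_add` and its parameters are hypothetical; this is a scalar model of the behaviour, not a description of how the hardware implements it.

```python
def lanewise_add(a: int, b: int, lane_bits: int = 8, reg_bits: int = 64) -> int:
    """Add two register values lane by lane, wrapping around within each lane."""
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, reg_bits, lane_bits):
        lane_a = (a >> shift) & mask
        lane_b = (b >> shift) & mask
        # Truncating the sum to the lane width keeps carries from
        # crossing into the neighbouring lane.
        result |= ((lane_a + lane_b) & mask) << shift
    return result
```

With 8-bit lanes, `lanewise_add(0x01FF, 0x0001)` wraps the low lane to zero and leaves the next lane untouched, whereas the same call with `lane_bits=16` carries into bit 9 as an ordinary 16-bit addition would.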

Summary of SIMD extensions

The main SIMD instruction set extensions that have been introduced for x86 are:

Extension	Year	Description	Added in
MMX	1997	A set of 57 integer SIMD instructions acting on 64-bit vectors, mostly providing 8/16/32-bit lane-width operations. Repurposed the old x87 FPU register file as a bank of eight 64-bit vector registers, referred to as MM0..MM7 when used for MMX instructions.	Intel Pentium MMX, Intel Pentium II, AMD K6, Rise mP6, IDT WinChip C6, Transmeta Crusoe
SSE	1999	"Katmai New Instructions": a set of 70 SIMD instructions acting on 128-bit vectors, mostly providing scalar and vector operations on 32-bit floating-point values. Introduced a new set of eight vector registers XMM0..XMM7, each 128 bits, and a status/control register MXCSR. This set of eight vector registers would later be extended to 16 registers with the introduction of x86-64.	Intel Pentium III, AMD Athlon XP, VIA C3 "Nehemiah", Transmeta Efficeon
SSE2	2000	Extended SSE with 144 new instructions: mainly additional instructions to work on scalars and vectors of 64-bit floating-point values, as well as 128-bit-vector forms of most of the MMX integer instructions.	Intel Pentium 4, Intel Pentium M, Athlon 64, VIA C7, Transmeta Efficeon
SSE3	2004	"Prescott New Instructions": added a set of 13 new instructions, mostly horizontal add/subtract operations.	Intel Pentium 4 "Prescott", Intel Core "Yonah", Athlon 64 "Venice", Transmeta Efficeon 8800, VIA C7
SSSE3	2006	Added a set of 32 new instructions to extend MMX and SSE, including a byte-shuffle instruction.	Intel Core 2 "Merom", AMD FX "Bulldozer", VIA Nano 2000
SSE4.1	2007	Added a set of 47 instructions, including variants of integer min/max, widening integer conversions, vector lane insert/extract, and dot-product instructions.	Intel Core 2 "Penryn", AMD FX "Bulldozer", VIA Nano 3000
SSE4.2	2008	Added a set of 7 instructions, mostly pertaining to string processing.	Intel Core i7 "Nehalem", AMD FX "Bulldozer", VIA Nano QuadCore C4000
AVX	2011	Extended the XMM0..XMM15 vector registers to 256-bit registers, referred to as YMM0..YMM15 when used as full 256-bit registers. Added three-operand variants of most of the SSE1-4 vector instructions, as well as 256-bit vector variants of most of the SSE1-4 vector instructions acting on 32/64-bit floating-point values. These new instruction variants are all encoded with the new VEX prefix.	Intel Core i7 "Sandy Bridge", AMD FX "Bulldozer", VIA Nano QuadCore C4000
FMA3	2013	Added three-operand floating-point fused multiply-add operations, in scalar and vector variants.	Intel Core i7 "Haswell", AMD FX "Piledriver", Zhaoxin Yongfeng
AVX2	2013	Added 256-bit vector variants of most of the MMX/SSE1-4 vector integer instructions. Also added vector gather instructions.	Intel Core i7 "Haswell", AMD FX "Excavator", VIA Nano QuadCore C4000
AVX-512	2016	Extended the YMM0..YMM15 vector registers to a set of 32 registers, each 512 bits wide, referred to as ZMM0..ZMM31 when used as 512-bit registers. Also added eight opmask registers K0..K7. Added 512-bit versions of most of the MMX/SSE/AVX vector instructions, as well as a substantial number of additional instructions. These are mostly encoded with the new EVEX prefix (except for opmask management instructions, which continue to use the VEX prefix). Added the ability to perform per-vector-lane masking of the operation of most of its vector instructions, by using the opmask registers. Also added embedded rounding controls for floating-point instructions and a scalar-to-vector broadcast function for most instructions that can accept memory operands.	(See AVX-512, "New instructions by sets", for additional subsets.)
AMX	2023	Added a set of eight new tile registers, referred to as TMM0..TMM7. Each of these tile registers has a size of 8192 bits (16 rows of 64 bytes each). Also added instructions to perform matrix multiplication on these registers with various data formats.	
AVX10.1	2024	Reformulation of AVX-512 that includes most of the optional AVX-512 subsets as baseline functionality, but also allows for implementations to reduce their maximum supported vector-register width to 256 bits.	Intel Xeon 6 "Granite Rapids"
AVX10.2	(2025)	Adds support for rounding modifiers for 256-bit floating-point numbers, as well as a handful of added instructions.	(Intel Diamond Rapids)
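
The register widths quoted in the table determine how many lanes of a given size fit in one register. The short Python sketch below works through that arithmetic and checks the stated tile size of 16 rows of 64 bytes. The dictionary and helper names are illustrative only and are not part of any specification.

```python
# Vector register widths in bits, as stated in the summary table.
REGISTER_BITS = {"MMX": 64, "SSE": 128, "AVX": 256, "AVX-512": 512}

def lane_count(register_bits: int, lane_bits: int) -> int:
    """Number of fixed-size lanes that fit in one vector register."""
    return register_bits // lane_bits

# A tile register is described as 16 rows of 64 bytes each.
TILE_ROWS = 16
TILE_ROW_BYTES = 64
tile_bits = TILE_ROWS * TILE_ROW_BYTES * 8

for name, bits in REGISTER_BITS.items():
    print(name, "holds", lane_count(bits, 32), "lanes of 32 bits")
print("tile size in bits:", tile_bits)
```

For example, a 128-bit XMM register holds four 32-bit floating-point values, while a 512-bit ZMM register holds sixteen.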

MMX instructions

MMX instructions operate on the mm registers (MM0..MM7), which are 64 bits wide and are aliased onto the x87 FPU registers.

Original MMX instructions

Added with Pentium MMX

Instruction	Opcode	Meaning	Notes
EMMS	0F 77	Empty MMX Technology State	Marks all x87 FPU registers for use by FPU
MOVD mm, r/m32	0F 6E /r	Move doubleword
MOVD r/m32, mm	0F 7E /r	Move doubleword
MOVQ mm/m64, mm	0F 7F /r	Move quadword
MOVQ mm, mm/m64	0F 6F /r	Move quadword
MOVQ mm, r/m64	REX.W + 0F 6E /r	Move quadword
MOVQ r/m64, mm	REX.W + 0F 7E /r	Move quadword
PACKSSDW mm1, mm2/m64	0F 6B /r	Pack doublewords to words (signed with saturation)
PACKSSWB mm1, mm2/m64	0F 63 /r	Pack words to bytes (signed with saturation)
PACKUSWB mm, mm/m64	0F 67 /r	Pack words to bytes (unsigned with saturation)
PADDB mm, mm/m64	0F FC /r	Add packed byte integers
PADDW mm, mm/m64	0F FD /r	Add packed word integers
PADDD mm, mm/m64	0F FE /r	Add packed doubleword integers
PADDSB mm, mm/m64	0F EC /r	Add packed signed byte integers and saturate
PADDSW mm, mm/m64	0F ED /r	Add packed signed word integers and saturate
PADDUSB mm, mm/m64	0F DC /r	Add packed unsigned byte integers and saturate
PADDUSW mm, mm/m64	0F DD /r	Add packed unsigned word integers and saturate
PAND mm, mm/m64	0F DB /r	Bitwise AND
PANDN mm, mm/m64	0F DF /r	Bitwise AND NOT
POR mm, mm/m64	0F EB /r	Bitwise OR
PXOR mm, mm/m64	0F EF /r	Bitwise XOR
PCMPEQB mm, mm/m64	0F 74 /r	Compare packed bytes for equality
PCMPEQW mm, mm/m64	0F 75 /r	Compare packed words for equality
PCMPEQD mm, mm/m64	0F 76 /r	Compare packed doublewords for equality
PCMPGTB mm, mm/m64	0F 64 /r	Compare packed signed byte integers for greater than
PCMPGTW mm, mm/m64	0F 65 /r	Compare packed signed word integers for greater than
PCMPGTD mm, mm/m64	0F 66 /r	Compare packed signed doubleword integers for greater than
PMADDWD mm, mm/m64	0F F5 /r	Multiply packed words, add adjacent doubleword results
PMULHW mm, mm/m64	0F E5 /r	Multiply packed signed word integers, store high 16 bits of results
PMULLW mm, mm/m64	0F D5 /r	Multiply packed signed word integers, store low 16 bits of results
PSLLW mm1, imm8	0F 71 /6 ib	Shift left words, shift in zeros
PSLLW mm, mm/m64	0F F1 /r	Shift left words, shift in zeros
PSLLD mm, imm8	0F 72 /6 ib	Shift left doublewords, shift in zeros
PSLLD mm, mm/m64	0F F2 /r	Shift left doublewords, shift in zeros
PSLLQ mm, imm8	0F 73 /6 ib	Shift left quadword, shift in zeros
PSLLQ mm, mm/m64	0F F3 /r	Shift left quadword, shift in zeros
PSRAD mm, imm8	0F 72 /4 ib	Shift right doublewords, shift in sign bits
PSRAD mm, mm/m64	0F E2 /r	Shift right doublewords, shift in sign bits
PSRAW mm, imm8	0F 71 /4 ib	Shift right words, shift in sign bits
PSRAW mm, mm/m64	0F E1 /r	Shift right words, shift in sign bits
PSRLW mm, imm8	0F 71 /2 ib	Shift right words, shift in zeros
PSRLW mm, mm/m64	0F D1 /r	Shift right words, shift in zeros
PSRLD mm, imm8	0F 72 /2 ib	Shift right doublewords, shift in zeros
PSRLD mm, mm/m64	0F D2 /r	Shift right doublewords, shift in zeros
PSRLQ mm, imm8	0F 73 /2 ib	Shift right quadword, shift in zeros
PSRLQ mm, mm/m64	0F D3 /r	Shift right quadword, shift in zeros
PSUBB mm, mm/m64	0F F8 /r	Subtract packed byte integers
PSUBW mm, mm/m64	0F F9 /r	Subtract packed word integers
PSUBD mm, mm/m64	0F FA /r	Subtract packed doubleword integers
PSUBSB mm, mm/m64	0F E8 /r	Subtract signed packed bytes with saturation
PSUBSW mm, mm/m64	0F E9 /r	Subtract signed packed words with saturation
PSUBUSB mm, mm/m64	0F D8 /r	Subtract unsigned packed bytes with saturation
PSUBUSW mm, mm/m64	0F D9 /r	Subtract unsigned packed words with saturation
PUNPCKHBW mm, mm/m64	0F 68 /r	Unpack and interleave high-order bytes
PUNPCKHWD mm, mm/m64	0F 69 /r	Unpack and interleave high-order words
PUNPCKHDQ mm, mm/m64	0F 6A /r	Unpack and interleave high-order doublewords
PUNPCKLBW mm, mm/m32	0F 60 /r	Unpack and interleave low-order bytes
PUNPCKLWD mm, mm/m32	0F 61 /r	Unpack and interleave low-order words
PUNPCKLDQ mm, mm/m32	0F 62 /r	Unpack and interleave low-order doublewords
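
Several of the instructions above come in wrapping and saturating forms (for example PADDB versus PADDUSB and PADDSB). The following Python sketch contrasts the three behaviours on lists of byte-sized lanes. It is a scalar model under the assumption that lanes are passed as Python integers already in range; the function names are hypothetical.

```python
def add_wrap(a, b):
    """Wrapping byte add (PADDB-style): results are taken modulo 256."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]

def add_unsigned_saturate(a, b):
    """Unsigned saturating byte add (PADDUSB-style): results clamp at 255."""
    return [min(x + y, 0xFF) for x, y in zip(a, b)]

def add_signed_saturate(a, b):
    """Signed saturating byte add (PADDSB-style): results clamp to -128..127."""
    return [max(-128, min(x + y, 127)) for x, y in zip(a, b)]
```

Adding 250 and 10 gives 4 with wrapping but 255 with unsigned saturation, which is why the saturating forms are often preferred for pixel data.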

MMX instructions added in specific processors

MMX instructions added with MMX+ and SSE

The following MMX instructions were added with SSE. They are also available on the Athlon under the name MMX+.

Instruction	Opcode	Meaning
MASKMOVQ mm1, mm2	0F F7 /r	Masked Move of Quadword
MOVNTQ m64, mm	0F E7 /r	Move Quadword Using Non-Temporal Hint
PSHUFW mm1, mm2/m64, imm8	0F 70 /r ib	Shuffle Packed Words
PINSRW mm, r32/m16, imm8	0F C4 /r ib	Insert Word
PEXTRW reg, mm, imm8	0F C5 /r ib	Extract Word
PMOVMSKB reg, mm	0F D7 /r	Move Byte Mask
PMINUB mm1, mm2/m64	0F DA /r	Minimum of Packed Unsigned Byte Integers
PMAXUB mm1, mm2/m64	0F DE /r	Maximum of Packed Unsigned Byte Integers
PAVGB mm1, mm2/m64	0F E0 /r	Average Packed Integers
PAVGW mm1, mm2/m64	0F E3 /r	Average Packed Integers
PMULHUW mm1, mm2/m64	0F E4 /r	Multiply Packed Unsigned Integers and Store High Result
PMINSW mm1, mm2/m64	0F EA /r	Minimum of Packed Signed Word Integers
PMAXSW mm1, mm2/m64	0F EE /r	Maximum of Packed Signed Word Integers
PSADBW mm1, mm2/m64	0F F6 /r	Compute Sum of Absolute Differences
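
The "Move Byte Mask" operation (PMOVMSKB) collects the most significant bit of each byte lane into the low bits of a general-purpose register. A minimal Python sketch of that behaviour follows, assuming the 64-bit MMX form with eight byte lanes; the function name and the `num_bytes` parameter are illustrative.

```python
def pmovmskb(value: int, num_bytes: int = 8) -> int:
    """Gather the top bit of each byte lane into a compact bit mask."""
    mask = 0
    for i in range(num_bytes):
        byte = (value >> (8 * i)) & 0xFF
        # Bit i of the result is the sign bit of byte lane i.
        mask |= (byte >> 7) << i
    return mask
```

Because the packed compare instructions produce all-ones or all-zeros lanes, a byte mask of this kind is a common way to turn a vector comparison result into a value that scalar code can branch on.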

MMX instructions added with SSE2

The following MMX instructions were added with SSE2:

Instruction	Opcode	Meaning
PADDQ mm, mm/m64	0F D4 /r	Add packed quadword integers
PSUBQ mm1, mm2/m64	0F FB /r	Subtract packed quadword integers
PMULUDQ mm1, mm2/m64	0F F4 /r	Multiply unsigned doubleword integer

MMX instructions added with SSSE3

Instruction	Opcode	Meaning
PSIGNB mm1, mm2/m64	0F 38 08 /r	Negate/zero/preserve packed byte integers depending on corresponding sign
PSIGNW mm1, mm2/m64	0F 38 09 /r	Negate/zero/preserve packed word integers depending on corresponding sign
PSIGND mm1, mm2/m64	0F 38 0A /r	Negate/zero/preserve packed doubleword integers depending on corresponding sign
PSHUFB mm1, mm2/m64	0F 38 00 /r	Shuffle bytes
PMULHRSW mm1, mm2/m64	0F 38 0B /r	Multiply 16-bit signed words, scale and round signed doublewords, pack high 16 bits
PMADDUBSW mm1, mm2/m64	0F 38 04 /r	Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed-words
PHSUBW mm1, mm2/m64	0F 38 05 /r	Subtract and pack 16-bit signed integers horizontally
PHSUBSW mm1, mm2/m64	0F 38 07 /r	Subtract and pack 16-bit signed integers horizontally with saturation
PHSUBD mm1, mm2/m64	0F 38 06 /r	Subtract and pack 32-bit signed integers horizontally
PHADDSW mm1, mm2/m64	0F 38 03 /r	Add and pack 16-bit signed integers horizontally with saturation
PHADDW mm1, mm2/m64	0F 38 01 /r	Add and pack 16-bit integers horizontally
PHADDD mm1, mm2/m64	0F 38 02 /r	Add and pack 32-bit integers horizontally
PALIGNR mm1, mm2/m64, imm8	0F 3A 0F /r ib	Concatenate destination and source operands, extract byte-aligned result shifted to the right
PABSB mm1, mm2/m64	0F 38 1C /r	Compute the absolute value of bytes and store unsigned result
PABSW mm1, mm2/m64	0F 38 1D /r	Compute the absolute value of 16-bit integers and store unsigned result
PABSD mm1, mm2/m64	0F 38 1E /r	Compute the absolute value of 32-bit integers and store unsigned result

SSE instructions

Added with Pentium III

SSE instructions operate on xmm registers, which are 128 bits wide.

SSE consists of the following SIMD floating-point instructions:

Instruction	Opcode	Meaning
ANDPS* xmm1, xmm2/m128	0F 54 /r	Bitwise Logical AND of Packed Single-Precision Floating-Point Values
ANDNPS* xmm1, xmm2/m128	0F 55 /r	Bitwise Logical AND NOT of Packed Single-Precision Floating-Point Values
ORPS* xmm1, xmm2/m128	0F 56 /r	Bitwise Logical OR of Single-Precision Floating-Point Values
XORPS* xmm1, xmm2/m128	0F 57 /r	Bitwise Logical XOR for Single-Precision Floating-Point Values
MOVUPS xmm1, xmm2/m128	0F 10 /r	Move Unaligned Packed Single-Precision Floating-Point Values
MOVSS xmm1, xmm2/m32	F3 0F 10 /r	Move Scalar Single-Precision Floating-Point Values
MOVUPS xmm2/m128, xmm1	0F 11 /r	Move Unaligned Packed Single-Precision Floating-Point Values
MOVSS xmm2/m32, xmm1	F3 0F 11 /r	Move Scalar Single-Precision Floating-Point Values
MOVLPS xmm, m64	0F 12 /r	Move Low Packed Single-Precision Floating-Point Values
MOVHLPS xmm1, xmm2	0F 12 /r	Move Packed Single-Precision Floating-Point Values High to Low
MOVLPS m64, xmm	0F 13 /r	Move Low Packed Single-Precision Floating-Point Values
UNPCKLPS xmm1, xmm2/m128	0F 14 /r	Unpack and Interleave Low Packed Single-Precision Floating-Point Values
UNPCKHPS xmm1, xmm2/m128	0F 15 /r	Unpack and Interleave High Packed Single-Precision Floating-Point Values
MOVHPS xmm, m64	0F 16 /r	Move High Packed Single-Precision Floating-Point Values
MOVLHPS xmm1, xmm2	0F 16 /r	Move Packed Single-Precision Floating-Point Values Low to High
MOVHPS m64, xmm	0F 17 /r	Move High Packed Single-Precision Floating-Point Values
MOVAPS xmm1, xmm2/m128	0F 28 /r	Move Aligned Packed Single-Precision Floating-Point Values
MOVAPS xmm2/m128, xmm1	0F 29 /r	Move Aligned Packed Single-Precision Floating-Point Values
MOVNTPS m128, xmm1	0F 2B /r	Move Aligned Four Packed Single-FP Non Temporal
MOVMSKPS reg, xmm	0F 50 /r	Extract Packed Single-Precision Floating-Point 4-bit Sign Mask. The upper bits of the register are filled with zeros.
CVTPI2PS xmm, mm/m64	0F 2A /r	Convert Packed Dword Integers to Packed Single-Precision FP Values
CVTSI2SS xmm, r/m32	F3 0F 2A /r	Convert Dword Integer to Scalar Single-Precision FP Value
CVTSI2SS xmm, r/m64	F3 REX.W 0F 2A /r	Convert Qword Integer to Scalar Single-Precision FP Value
CVTTPS2PI mm, xmm/m64	0F 2C /r	Convert with Truncation Packed Single-Precision FP Values to Packed Dword Integers
CVTTSS2SI r32, xmm/m32	F3 0F 2C /r	Convert with Truncation Scalar Single-Precision FP Value to Dword Integer
CVTTSS2SI r64, xmm1/m32	F3 REX.W 0F 2C /r	Convert with Truncation Scalar Single-Precision FP Value to Qword Integer
CVTPS2PI mm, xmm/m64	0F 2D /r	Convert Packed Single-Precision FP Values to Packed Dword Integers
CVTSS2SI r32, xmm/m32	F3 0F 2D /r	Convert Scalar Single-Precision FP Value to Dword Integer
CVTSS2SI r64, xmm1/m32	F3 REX.W 0F 2D /r	Convert Scalar Single-Precision FP Value to Qword Integer
UCOMISS xmm1, xmm2/m32	0F 2E /r	Unordered Compare Scalar Single-Precision Floating-Point Values and Set EFLAGS
COMISS xmm1, xmm2/m32	0F 2F /r	Compare Scalar Ordered Single-Precision Floating-Point Values and Set EFLAGS
SQRTPS xmm1, xmm2/m128	0F 51 /r	Compute Square Roots of Packed Single-Precision Floating-Point Values
SQRTSS xmm1, xmm2/m32	F3 0F 51 /r	Compute Square Root of Scalar Single-Precision Floating-Point Value
RSQRTPS xmm1, xmm2/m128	0F 52 /r	Compute Reciprocal of Square Root of Packed Single-Precision Floating-Point Value
RSQRTSS xmm1, xmm2/m32	F3 0F 52 /r	Compute Reciprocal of Square Root of Scalar Single-Precision Floating-Point Value
RCPPS xmm1, xmm2/m128	0F 53 /r	Compute Reciprocal of Packed Single-Precision Floating-Point Values
RCPSS xmm1, xmm2/m32	F3 0F 53 /r	Compute Reciprocal of Scalar Single-Precision Floating-Point Values
ADDPS xmm1, xmm2/m128	0F 58 /r	Add Packed Single-Precision Floating-Point Values
ADDSS xmm1, xmm2/m32	F3 0F 58 /r	Add Scalar Single-Precision Floating-Point Values
MULPS xmm1, xmm2/m128	0F 59 /r	Multiply Packed Single-Precision Floating-Point Values
MULSS xmm1, xmm2/m32	F3 0F 59 /r	Multiply Scalar Single-Precision Floating-Point Values
SUBPS xmm1, xmm2/m128	0F 5C /r	Subtract Packed Single-Precision Floating-Point Values
SUBSS xmm1, xmm2/m32	F3 0F 5C /r	Subtract Scalar Single-Precision Floating-Point Values
MINPS xmm1, xmm2/m128	0F 5D /r	Return Minimum Packed Single-Precision Floating-Point Values
MINSS xmm1, xmm2/m32	F3 0F 5D /r	Return Minimum Scalar Single-Precision Floating-Point Values
DIVPS xmm1, xmm2/m128	0F 5E /r	Divide Packed Single-Precision Floating-Point Values
DIVSS xmm1, xmm2/m32	F3 0F 5E /r	Divide Scalar Single-Precision Floating-Point Values
MAXPS xmm1, xmm2/m128	0F 5F /r	Return Maximum Packed Single-Precision Floating-Point Values
MAXSS xmm1, xmm2/m32	F3 0F 5F /r	Return Maximum Scalar Single-Precision Floating-Point Values
LDMXCSR m32	0F AE /2	Load MXCSR Register State
STMXCSR m32	0F AE /3	Store MXCSR Register State
CMPPS xmm1, xmm2/m128, imm8	0F C2 /r ib	Compare Packed Single-Precision Floating-Point Values
CMPSS xmm1, xmm2/m32, imm8	F3 0F C2 /r ib	Compare Scalar Single-Precision Floating-Point Values
SHUFPS xmm1, xmm2/m128, imm8	0F C6 /r ib	Shuffle Packed Single-Precision Floating-Point Values

* The single-precision floating-point bitwise operations ANDPS, ANDNPS, ORPS and XORPS produce the same results as the SSE2 integer operations (PAND, PANDN, POR, PXOR) and the double-precision ones (ANDPD, ANDNPD, ORPD, XORPD), but can introduce extra latency for domain changes when applied to values of the wrong type.^[1]
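
The bitwise instructions give identical results regardless of their nominal data type because they act on the raw bit pattern of each lane. The Python sketch below models a single 32-bit lane with the standard `struct` module: XOR with the sign-bit mask negates a value and AND with the inverted mask yields its absolute value. The helper names are hypothetical, and the sketch says nothing about the latency effects mentioned in the note.

```python
import struct

SIGN_MASK = 0x80000000  # sign bit of a 32-bit IEEE 754 value

def float_bits(x: float) -> int:
    """Reinterpret a single-precision value as a 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_float(bits: int) -> float:
    """Reinterpret a 32-bit integer as a single-precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def negate_with_xor(x: float) -> float:
    """Per-lane effect of XOR with the sign mask (XORPS or PXOR alike)."""
    return bits_float(float_bits(x) ^ SIGN_MASK)

def abs_with_and(x: float) -> float:
    """Per-lane effect of AND with the inverted sign mask (ANDPS or PAND alike)."""
    return bits_float(float_bits(x) & (~SIGN_MASK & 0xFFFFFFFF))
```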

SSE2 instructions

Added with Pentium 4

SSE2 SIMD floating-point instructions

SSE2 data movement instructions

Instruction	Opcode	Meaning
MOVAPD xmm1, xmm2/m128	66 0F 28 /r	Move Aligned Packed Double-Precision Floating-Point Values
MOVAPD xmm2/m128, xmm1	66 0F 29 /r	Move Aligned Packed Double-Precision Floating-Point Values
MOVNTPD m128, xmm1	66 0F 2B /r	Store Packed Double-Precision Floating-Point Values Using Non-Temporal Hint
MOVHPD xmm1, m64	66 0F 16 /r	Move High Packed Double-Precision Floating-Point Value
MOVHPD m64, xmm1	66 0F 17 /r	Move High Packed Double-Precision Floating-Point Value
MOVLPD xmm1, m64	66 0F 12 /r	Move Low Packed Double-Precision Floating-Point Value
MOVLPD m64, xmm1	66 0F 13 /r	Move Low Packed Double-Precision Floating-Point Value
MOVUPD xmm1, xmm2/m128	66 0F 10 /r	Move Unaligned Packed Double-Precision Floating-Point Values
MOVUPD xmm2/m128, xmm1	66 0F 11 /r	Move Unaligned Packed Double-Precision Floating-Point Values
MOVMSKPD reg, xmm	66 0F 50 /r	Extract Packed Double-Precision Floating-Point Sign Mask
MOVSD* xmm1, xmm2/m64	F2 0F 10 /r	Move or Merge Scalar Double-Precision Floating-Point Value
MOVSD xmm1/m64, xmm2	F2 0F 11 /r	Move or Merge Scalar Double-Precision Floating-Point Value

SSE2 packed arithmetic instructions

Instruction	Opcode	Meaning
ADDPD xmm1, xmm2/m128	66 0F 58 /r	Add Packed Double-Precision Floating-Point Values
ADDSD xmm1, xmm2/m64	F2 0F 58 /r	Add Low Double-Precision Floating-Point Value
DIVPD xmm1, xmm2/m128	66 0F 5E /r	Divide Packed Double-Precision Floating-Point Values
DIVSD xmm1, xmm2/m64	F2 0F 5E /r	Divide Scalar Double-Precision Floating-Point Value
MAXPD xmm1, xmm2/m128	66 0F 5F /r	Maximum of Packed Double-Precision Floating-Point Values
MAXSD xmm1, xmm2/m64	F2 0F 5F /r	Return Maximum Scalar Double-Precision Floating-Point Value
MINPD xmm1, xmm2/m128	66 0F 5D /r	Minimum of Packed Double-Precision Floating-Point Values
MINSD xmm1, xmm2/m64	F2 0F 5D /r	Return Minimum Scalar Double-Precision Floating-Point Value
MULPD xmm1, xmm2/m128	66 0F 59 /r	Multiply Packed Double-Precision Floating-Point Values
MULSD xmm1, xmm2/m64	F2 0F 59 /r	Multiply Scalar Double-Precision Floating-Point Value
SQRTPD xmm1, xmm2/m128	66 0F 51 /r	Square Root of Double-Precision Floating-Point Values
SQRTSD xmm1, xmm2/m64	F2 0F 51 /r	Compute Square Root of Scalar Double-Precision Floating-Point Value
SUBPD xmm1, xmm2/m128	66 0F 5C /r	Subtract Packed Double-Precision Floating-Point Values
SUBSD xmm1, xmm2/m64	F2 0F 5C /r	Subtract Scalar Double-Precision Floating-Point Value
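
Comparing the opcode columns of the SSE and SSE2 arithmetic tables shows a regular pattern: the four variants of an operation share the same opcode byte and differ only in a leading prefix (none for packed single, 66 for packed double, F3 for scalar single, F2 for scalar double). The Python sketch below reproduces the listed byte sequences from that pattern. It covers only the operations named in the dictionary, omits the ModRM byte and any REX prefix, and uses hypothetical names throughout.

```python
# Prefix byte selecting the data type, as seen in the opcode columns.
PREFIX = {
    "PS": b"",       # packed single-precision: no prefix
    "PD": b"\x66",   # packed double-precision
    "SS": b"\xf3",   # scalar single-precision
    "SD": b"\xf2",   # scalar double-precision
}

# Opcode byte shared by all four variants of each operation.
BASE_OPCODE = {
    "SQRT": 0x51, "ADD": 0x58, "MUL": 0x59, "SUB": 0x5C,
    "MIN": 0x5D, "DIV": 0x5E, "MAX": 0x5F,
}

def opcode_bytes(operation: str, suffix: str) -> bytes:
    """Return prefix + 0F + opcode, without the ModRM byte or operands."""
    return PREFIX[suffix] + bytes([0x0F, BASE_OPCODE[operation]])

print(opcode_bytes("ADD", "SD").hex())  # f20f58
```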

SSE2 logical instructions

Instruction	Opcode	Meaning
ANDPD xmm1, xmm2/m128	66 0F 54 /r	Bitwise Logical AND of Packed Double Precision Floating-Point Values
ANDNPD xmm1, xmm2/m128	66 0F 55 /r	Bitwise Logical AND NOT of Packed Double Precision Floating-Point Values
ORPD xmm1, xmm2/m128	66 0F 56 /r	Bitwise Logical OR of Packed Double Precision Floating-Point Values
XORPD xmm1, xmm2/m128	66 0F 57 /r	Bitwise Logical XOR of Packed Double Precision Floating-Point Values

SSE2 compare instructions

Instruction	Opcode	Meaning
CMPPD xmm1, xmm2/m128, imm8	66 0F C2 /r ib	Compare Packed Double-Precision Floating-Point Values
CMPSD* xmm1, xmm2/m64, imm8	F2 0F C2 /r ib	Compare Low Double-Precision Floating-Point Values
COMISD xmm1, xmm2/m64	66 0F 2F /r	Compare Scalar Ordered Double-Precision Floating-Point Values and Set EFLAGS
UCOMISD xmm1, xmm2/m64	66 0F 2E /r	Unordered Compare Scalar Double-Precision Floating-Point Values and Set EFLAGS

SSE2 shuffle and unpack instructions

Instruction	Opcode	Meaning
SHUFPD xmm1, xmm2/m128, imm8	66 0F C6 /r ib	Packed Interleave Shuffle of Pairs of Double-Precision Floating-Point Values
UNPCKHPD xmm1, xmm2/m128	66 0F 15 /r	Unpack and Interleave High Packed Double-Precision Floating-Point Values
UNPCKLPD xmm1, xmm2/m128	66 0F 14 /r	Unpack and Interleave Low Packed Double-Precision Floating-Point Values

SSE2 conversion instructions

Instruction	Opcode	Meaning
CVTDQ2PD xmm1, xmm2/m64	F3 0F E6 /r	Convert Packed Doubleword Integers to Packed Double-Precision Floating-Point Values
CVTDQ2PS xmm1, xmm2/m128	0F 5B /r	Convert Packed Doubleword Integers to Packed Single-Precision Floating-Point Values
CVTPD2DQ xmm1, xmm2/m128	F2 0F E6 /r	Convert Packed Double-Precision Floating-Point Values to Packed Doubleword Integers
CVTPD2PI mm, xmm/m128	66 0F 2D /r	Convert Packed Double-Precision FP Values to Packed Dword Integers
CVTPD2PS xmm1, xmm2/m128	66 0F 5A /r	Convert Packed Double-Precision Floating-Point Values to Packed Single-Precision Floating-Point Values
CVTPI2PD xmm, mm/m64	66 0F 2A /r	Convert Packed Dword Integers to Packed Double-Precision FP Values
CVTPS2DQ xmm1, xmm2/m128	66 0F 5B /r	Convert Packed Single-Precision Floating-Point Values to Packed Signed Doubleword Integer Values
CVTPS2PD xmm1, xmm2/m64	0F 5A /r	Convert Packed Single-Precision Floating-Point Values to Packed Double-Precision Floating-Point Values
CVTSD2SI r32, xmm1/m64	F2 0F 2D /r	Convert Scalar Double-Precision Floating-Point Value to Doubleword Integer
CVTSD2SI r64, xmm1/m64	F2 REX.W 0F 2D /r	Convert Scalar Double-Precision Floating-Point Value to Quadword Integer With Sign Extension
CVTSD2SS xmm1, xmm2/m64	F2 0F 5A /r	Convert Scalar Double-Precision Floating-Point Value to Scalar Single-Precision Floating-Point Value
CVTSI2SD xmm1, r32/m32	F2 0F 2A /r	Convert Doubleword Integer to Scalar Double-Precision Floating-Point Value
CVTSI2SD xmm1, r/m64	F2 REX.W 0F 2A /r	Convert Quadword Integer to Scalar Double-Precision Floating-Point value
CVTSS2SD xmm1, xmm2/m32	F3 0F 5A /r	Convert Scalar Single-Precision Floating-Point Value to Scalar Double-Precision Floating-Point Value
CVTTPD2DQ xmm1, xmm2/m128	66 0F E6 /r	Convert with Truncation Packed Double-Precision Floating-Point Values to Packed Doubleword Integers
CVTTPD2PI mm, xmm/m128	66 0F 2C /r	Convert with Truncation Packed Double-Precision FP Values to Packed Dword Integers
CVTTPS2DQ xmm1, xmm2/m128	F3 0F 5B /r	Convert with Truncation Packed Single-Precision Floating-Point Values to Packed Signed Doubleword Integer Values
CVTTSD2SI r32, xmm1/m64	F2 0F 2C /r	Convert with Truncation Scalar Double-Precision Floating-Point Value to Signed Dword Integer
CVTTSD2SI r64, xmm1/m64	F2 REX.W 0F 2C /r	Convert with Truncation Scalar Double-Precision Floating-Point Value To Signed Qword Integer
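
The conversion instructions come in two flavours: the plain forms (such as CVTSD2SI) round according to the current MXCSR rounding mode, while the "with Truncation" forms (such as CVTTSD2SI) always round toward zero. The Python sketch below contrasts the two for a scalar value, assuming the default rounding mode of round-to-nearest-even and ignoring out-of-range inputs and NaNs. The function names simply mirror the mnemonics and are not a real API.

```python
import math

def cvtsd2si(x: float) -> int:
    """Convert using round-to-nearest-even (the default MXCSR mode)."""
    # Python's built-in round() also rounds halfway cases to the even integer.
    return round(x)

def cvttsd2si(x: float) -> int:
    """Convert with truncation, i.e. rounding toward zero."""
    return math.trunc(x)
```

The truncating forms match the semantics of a C-style cast from floating point to integer, which is one reason compilers use them for such casts.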

* CMPSD and MOVSD have the same names as the string instruction mnemonics CMPSD (CMPS) and MOVSD (MOVS); however, the former operate on scalar double-precision floating-point values whereas the latter operate on doubleword strings. Assemblers disambiguate them based on the presence or absence of operands.

SSE2 SIMD integer instructions

SSE2 MMX-like instructions extended to SSE registers

SSE2 allows execution of MMX instructions on SSE registers, processing twice the amount of data at once.

Instruction	Opcode	Meaning
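MOVD xmm, r/m32	66 0F 6E /r	Move doubleword
MOVD r/m32, xmm	66 0F 7E /r	Move doubleword
MOVQ xmm1, xmm2/m64	F3 0F 7E /r	Move quadword
MOVQ xmm2/m64, xmm1	66 0F D6 /r	Move quadword
MOVQ r/m64, xmm	66 REX.W 0F 7E /r	Move quadword
MOVQ xmm, r/m64	66 REX.W 0F 6E /r	Move quadword
PMOVMSKB reg, xmm	66 0F D7 /r	Move a byte mask, zeroing the upper bits of the register
PEXTRW reg, xmm, imm8	66 0F C5 /r ib	Extract specified word and move it to reg, setting bits 15-0 and zeroing the rest
PINSRW xmm, r32/m16, imm8	66 0F C4 /r ib	Move low word at the specified word position
PACKSSDW xmm1, xmm2/m128	66 0F 6B /r	Converts 4 packed signed doubleword integers into 8 packed signed word integers with saturation
PACKSSWB xmm1, xmm2/m128	66 0F 63 /r	Converts 8 packed signed word integers into 16 packed signed byte integers with saturation
PACKUSWB xmm1, xmm2/m128	66 0F 67 /r	Converts 8 signed word integers into 16 unsigned byte integers with saturation
PADDB xmm1, xmm2/m128	66 0F FC /r	Add packed byte integers
PADDW xmm1, xmm2/m128	66 0F FD /r	Add packed word integers
PADDD xmm1, xmm2/m128	66 0F FE /r	Add packed doubleword integers
PADDQ xmm1, xmm2/m128	66 0F D4 /r	Add packed quadword integers
PADDSB xmm1, xmm2/m128	66 0F EC /r	Add packed signed byte integers with saturation
PADDSW xmm1, xmm2/m128	66 0F ED /r	Add packed signed word integers with saturation
PADDUSB xmm1, xmm2/m128	66 0F DC /r	Add packed unsigned byte integers with saturation
PADDUSW xmm1, xmm2/m128	66 0F DD /r	Add packed unsigned word integers with saturation
PAND xmm1, xmm2/m128	66 0F DB /r	Bitwise AND
PANDN xmm1, xmm2/m128	66 0F DF /r	Bitwise AND NOT
POR xmm1, xmm2/m128	66 0F EB /r	Bitwise OR
PXOR xmm1, xmm2/m128	66 0F EF /r	Bitwise XOR
PCMPEQB xmm1, xmm2/m128	66 0F 74 /r	Compare packed bytes for equality
PCMPEQW xmm1, xmm2/m128	66 0F 75 /r	Compare packed words for equality
PCMPEQD xmm1, xmm2/m128	66 0F 76 /r	Compare packed doublewords for equality
PCMPGTB xmm1, xmm2/m128	66 0F 64 /r	Compare packed signed byte integers for greater than
PCMPGTW xmm1, xmm2/m128	66 0F 65 /r	Compare packed signed word integers for greater than
PCMPGTD xmm1, xmm2/m128	66 0F 66 /r	Compare packed signed doubleword integers for greater than
PMULLW xmm1, xmm2/m128	66 0F D5 /r	Multiply packed signed word integers, store the low 16 bits of the results
PMULHW xmm1, xmm2/m128	66 0F E5 /r	Multiply the packed signed word integers, store the high 16 bits of the results
PMULHUW xmm1, xmm2/m128	66 0F E4 /r	Multiply packed unsigned word integers, store the high 16 bits of the results
PMULUDQ xmm1, xmm2/m128	66 0F F4 /r	Multiply packed unsigned doubleword integers
PSLLW xmm1, xmm2/m128	66 0F F1 /r	Shift words left while shifting in 0s
PSLLW xmm1, imm8	66 0F 71 /6 ib	Shift words left while shifting in 0s
PSLLD xmm1, xmm2/m128	66 0F F2 /r	Shift doublewords left while shifting in 0s
PSLLD xmm1, imm8	66 0F 72 /6 ib	Shift doublewords left while shifting in 0s
PSLLQ xmm1, xmm2/m128	66 0F F3 /r	Shift quadwords left while shifting in 0s
PSLLQ xmm1, imm8	66 0F 73 /6 ib	Shift quadwords left while shifting in 0s
PSRAD xmm1, xmm2/m128	66 0F E2 /r	Shift doublewords right while shifting in sign bits
PSRAD xmm1, imm8	66 0F 72 /4 ib	Shift doublewords right while shifting in sign bits
PSRAW xmm1, xmm2/m128	66 0F E1 /r	Shift words right while shifting in sign bits
PSRAW xmm1, imm8	66 0F 71 /4 ib	Shift words right while shifting in sign bits
PSRLW xmm1, xmm2/m128	66 0F D1 /r	Shift words right while shifting in 0s
PSRLW xmm1, imm8	66 0F 71 /2 ib	Shift words right while shifting in 0s
PSRLD xmm1, xmm2/m128	66 0F D2 /r	Shift doublewords right while shifting in 0s
PSRLD xmm1, imm8	66 0F 72 /2 ib	Shift doublewords right while shifting in 0s
PSRLQ xmm1, xmm2/m128	66 0F D3 /r	Shift quadwords right while shifting in 0s
PSRLQ xmm1, imm8	66 0F 73 /2 ib	Shift quadwords right while shifting in 0s
PSUBB xmm1, xmm2/m128	66 0F F8 /r	Subtract packed byte integers
PSUBW xmm1, xmm2/m128	66 0F F9 /r	Subtract packed word integers
PSUBD xmm1, xmm2/m128	66 0F FA /r	Subtract packed doubleword integers
PSUBQ xmm1, xmm2/m128	66 0F FB /r	Subtract packed quadword integers
PSUBSB xmm1, xmm2/m128	66 0F E8 /r	Subtract packed signed byte integers with saturation
PSUBSW xmm1, xmm2/m128	66 0F E9 /r	Subtract packed signed word integers with saturation
PMADDWD xmm1, xmm2/m128	66 0F F5 /r	Multiply the packed word integers, add adjacent doubleword results
PSUBUSB xmm1, xmm2/m128	66 0F D8 /r	Subtract packed unsigned byte integers with saturation
PSUBUSW xmm1, xmm2/m128	66 0F D9 /r	Subtract packed unsigned word integers with saturation
PUNPCKHBW xmm1, xmm2/m128	66 0F 68 /r	Unpack and interleave high-order bytes
PUNPCKHWD xmm1, xmm2/m128	66 0F 69 /r	Unpack and interleave high-order words
PUNPCKHDQ xmm1, xmm2/m128	66 0F 6A /r	Unpack and interleave high-order doublewords
PUNPCKLBW xmm1, xmm2/m128	66 0F 60 /r	Interleave low-order bytes
PUNPCKLWD xmm1, xmm2/m128	66 0F 61 /r	Interleave low-order words
PUNPCKLDQ xmm1, xmm2/m128	66 0F 62 /r	Interleave low-order doublewords
PAVGB xmm1, xmm2/m128	66 0F E0 /r	Average packed unsigned byte integers with rounding
PAVGW xmm1, xmm2/m128	66 0F E3 /r	Average packed unsigned word integers with rounding
PMINUB xmm1, xmm2/m128	66 0F DA /r	Compare packed unsigned byte integers and store packed minimum values
PMINSW xmm1, xmm2/m128	66 0F EA /r	Compare packed signed word integers and store packed minimum values
PMAXSW xmm1, xmm2/m128	66 0F EE /r	Compare packed signed word integers and store maximum packed values
PMAXUB xmm1, xmm2/m128	66 0F DE /r	Compare packed unsigned byte integers and store packed maximum values
PSADBW xmm1, xmm2/m128	66 0F F6 /r	Computes the absolute differences of the packed unsigned byte integers; the 8 low differences and 8 high differences are then summed separately to produce two unsigned word integer results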
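
The sum-of-absolute-differences operation in the last row (PSADBW in its MMX form) is easier to follow with a worked example. The Python sketch below models the 128-bit form on two 16-byte inputs and returns the two partial sums, low half first. The function name and the list-based return value are illustrative choices rather than the register layout of the real result.

```python
def psadbw(a: bytes, b: bytes):
    """Sum of absolute byte differences, computed separately for each 8-byte half."""
    if len(a) != 16 or len(b) != 16:
        raise ValueError("expected two 16-byte vectors")
    sums = []
    for start in (0, 8):
        half_a = a[start:start + 8]
        half_b = b[start:start + 8]
        # Bytes are treated as unsigned, so each difference lies in 0..255.
        sums.append(sum(abs(x - y) for x, y in zip(half_a, half_b)))
    return sums
```

Each partial sum is at most 8 × 255 = 2040, so it always fits in an unsigned 16-bit word.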

SSE2 integer instructions for SSE registers only

The following instructions can be used only on SSE registers, since by their nature they do not work on MMX registers.

Instruction	Opcode	Meaning
MASKMOVDQU xmm1, xmm2	66 0F F7 /r	Non-Temporal Store of Selected Bytes from an XMM Register into Memory
MOVDQ2Q mm, xmm	F2 0F D6 /r	Move low quadword from XMM to MMX register.
MOVDQA xmm1, xmm2/m128	66 0F 6F /r	Move aligned double quadword
MOVDQA xmm2/m128, xmm1	66 0F 7F /r	Move aligned double quadword
MOVDQU xmm1, xmm2/m128	F3 0F 6F /r	Move unaligned double quadword
MOVDQU xmm2/m128, xmm1	F3 0F 7F /r	Move unaligned double quadword
MOVQ2DQ xmm, mm	F3 0F D6 /r	Move quadword from MMX register to low quadword of XMM register
MOVNTDQ m128, xmm1	66 0F E7 /r	Store Packed Integers Using Non-Temporal Hint
PSHUFHW xmm1, xmm2/m128, imm8	F3 0F 70 /r ib	Shuffle packed high words.
PSHUFLW xmm1, xmm2/m128, imm8	F2 0F 70 /r ib	Shuffle packed low words.
PSHUFD xmm1, xmm2/m128, imm8	66 0F 70 /r ib	Shuffle packed doublewords.
PSLLDQ xmm1, imm8	66 0F 73 /7 ib	Packed shift left logical double quadwords.
PSRLDQ xmm1, imm8	66 0F 73 /3 ib	Packed shift right logical double quadwords.
PUNPCKHQDQ xmm1, xmm2/m128	66 0F 6D /r	Unpack and interleave high-order quadwords
PUNPCKLQDQ xmm1, xmm2/m128	66 0F 6C /r	Interleave low quadwords

SSE3 instructions

Added with Pentium 4 supporting SSE3

SSE3 SIMD floating-point instructions

Instruction	Opcode	Meaning	Notes
ADDSUBPS xmm1, xmm2/m128	F2 0F D0 /r	Add/subtract single-precision floating-point values	for Complex Arithmetic
ADDSUBPD xmm1, xmm2/m128	66 0F D0 /r	Add/subtract double-precision floating-point values
MOVDDUP xmm1, xmm2/m64	F2 0F 12 /r	Move double-precision floating-point value and duplicate
MOVSLDUP xmm1, xmm2/m128	F3 0F 12 /r	Move and duplicate even index single-precision floating-point values
MOVSHDUP xmm1, xmm2/m128	F3 0F 16 /r	Move and duplicate odd index single-precision floating-point values
HADDPS xmm1, xmm2/m128	F2 0F 7C /r	Horizontal add packed single-precision floating-point values	for Graphics
HADDPD xmm1, xmm2/m128	66 0F 7C /r	Horizontal add packed double-precision floating-point values
HSUBPS xmm1, xmm2/m128	F2 0F 7D /r	Horizontal subtract packed single-precision floating-point values
HSUBPD xmm1, xmm2/m128	66 0F 7D /r	Horizontal subtract packed double-precision floating-point values
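
The horizontal and add/subtract operations differ from ordinary packed arithmetic in which lanes they combine. The Python sketch below models HADDPS and ADDSUBPS on four-element lists, with index 0 standing for the lowest lane. It ignores single-precision rounding, and the function names simply mirror the mnemonics.

```python
def haddps(a, b):
    """Horizontal add: sum adjacent pairs within a, then within b."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]

def addsubps(a, b):
    """Subtract in even-indexed lanes, add in odd-indexed lanes."""
    return [a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]]
```

The alternating subtract/add pattern matches the real and imaginary parts of a complex product stored as interleaved pairs, which is why ADDSUBPS is noted as being for complex arithmetic.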

SSE3 SIMD integer instructions

Instruction	Opcode	Meaning
LDDQU xmm1, m128	F2 0F F0 /r	Load unaligned 128-bit integer data

SSSE3 instructions

Added with Xeon 5100 series and initial Core 2

The following MMX-like instructions extended to SSE registers were added with SSSE3:

Instruction	Opcode	Meaning
PSIGNB xmm1, xmm2/m128	66 0F 38 08 /r	Negate/zero/preserve packed byte integers depending on corresponding sign
PSIGNW xmm1, xmm2/m128	66 0F 38 09 /r	Negate/zero/preserve packed word integers depending on corresponding sign
PSIGND xmm1, xmm2/m128	66 0F 38 0A /r	Negate/zero/preserve packed doubleword integers depending on corresponding sign
PSHUFB xmm1, xmm2/m128	66 0F 38 00 /r	Shuffle bytes
PMULHRSW xmm1, xmm2/m128	66 0F 38 0B /r	Multiply 16-bit signed words, scale and round signed doublewords, pack high 16 bits
PMADDUBSW xmm1, xmm2/m128	66 0F 38 04 /r	Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed-words
PHSUBW xmm1, xmm2/m128	66 0F 38 05 /r	Subtract and pack 16-bit signed integers horizontally
PHSUBSW xmm1, xmm2/m128	66 0F 38 07 /r	Subtract and pack 16-bit signed integers horizontally with saturation
PHSUBD xmm1, xmm2/m128	66 0F 38 06 /r	Subtract and pack 32-bit signed integers horizontally
PHADDSW xmm1, xmm2/m128	66 0F 38 03 /r	Add and pack 16-bit signed integers horizontally with saturation
PHADDW xmm1, xmm2/m128	66 0F 38 01 /r	Add and pack 16-bit integers horizontally
PHADDD xmm1, xmm2/m128	66 0F 38 02 /r	Add and pack 32-bit integers horizontally
PALIGNR xmm1, xmm2/m128, imm8	66 0F 3A 0F /r ib	Concatenate destination and source operands, extract byte-aligned result shifted to the right
PABSB xmm1, xmm2/m128	66 0F 38 1C /r	Compute the absolute value of bytes and store unsigned result
PABSW xmm1, xmm2/m128	66 0F 38 1D /r	Compute the absolute value of 16-bit integers and store unsigned result
PABSD xmm1, xmm2/m128	66 0F 38 1E /r	Compute the absolute value of 32-bit integers and store unsigned result
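
The byte shuffle (PSHUFB) is the most flexible instruction in this table, so a small model may help. In the Python sketch below, each control byte either selects a source byte by index or, when its top bit is set, forces the corresponding result byte to zero. The function name is illustrative, and the index mask is derived from the vector length so that the same code covers the 8-byte MMX form and the 16-byte SSE form.

```python
def pshufb(src: bytes, control: bytes) -> bytes:
    """Shuffle the bytes of src according to the control vector."""
    if len(src) != len(control) or len(src) not in (8, 16):
        raise ValueError("expected two vectors of 8 or 16 bytes")
    index_mask = len(src) - 1  # 0x07 for 64-bit vectors, 0x0F for 128-bit
    out = bytearray(len(src))
    for i, c in enumerate(control):
        if c & 0x80:
            out[i] = 0                    # top bit set: zero this byte
        else:
            out[i] = src[c & index_mask]  # otherwise pick a source byte
    return bytes(out)
```

For instance, a control vector counting down from 15 to 0 reverses the byte order of a 128-bit vector, while a control vector of all 0x80 bytes clears it.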

SSE4 instructions

Added with Core 2 processors manufactured on the 45 nm process

SSE4.1 SIMD floating-point instructions

Instruction	Opcode	Meaning
DPPS xmm1, xmm2/m128, imm8	66 0F 3A 40 /r ib	Selectively multiply packed single precision floating-point values, add and selectively store
DPPD xmm1, xmm2/m128, imm8	66 0F 3A 41 /r ib	Selectively multiply packed double precision floating-point values, add and selectively store
BLENDPS xmm1, xmm2/m128, imm8	66 0F 3A 0C /r ib	Select packed single precision floating-point values according to the immediate mask
BLENDVPS xmm1, xmm2/m128, <XMM0>	66 0F 38 14 /r	Select packed single precision floating-point values according to the mask in XMM0
BLENDPD xmm1, xmm2/m128, imm8	66 0F 3A 0D /r ib	Select packed double precision floating-point values according to the immediate mask
BLENDVPD xmm1, xmm2/m128, <XMM0>	66 0F 38 15 /r	Select packed double precision floating-point values according to the mask in XMM0
ROUNDPS xmm1, xmm2/m128, imm8	66 0F 3A 08 /r ib	Round packed single precision floating-point values
ROUNDSS xmm1, xmm2/m32, imm8	66 0F 3A 0A /r ib	Round the low packed single precision floating-point value
ROUNDPD xmm1, xmm2/m128, imm8	66 0F 3A 09 /r ib	Round packed double precision floating-point values
ROUNDSD xmm1, xmm2/m64, imm8	66 0F 3A 0B /r ib	Round the low packed double precision floating-point value
INSERTPS xmm1, xmm2/m32, imm8	66 0F 3A 21 /r ib	Insert a selected single-precision floating-point value at the specified destination element and zero out destination elements
EXTRACTPS reg/m32, xmm1, imm8	66 0F 3A 17 /r ib	Extract one single-precision floating-point value at specified offset and store the result (zero-extended, if applicable)
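
The "selectively multiply, add and selectively store" behaviour of DPPS is controlled entirely by its immediate byte. A minimal Python sketch of the 128-bit form follows; the function name is hypothetical, and the model uses Python floats (double precision), so it does not reproduce the single-precision rounding of the real instruction.

```python
def dpps(a, b, imm8):
    """Model of 128-bit DPPS on two lists of four floats."""
    # imm8[7:4]: which lane products take part in the sum.
    products = [a[i] * b[i] if (imm8 >> (4 + i)) & 1 else 0.0 for i in range(4)]
    total = (products[0] + products[1]) + (products[2] + products[3])
    # imm8[3:0]: which destination lanes receive the sum; the others become 0.0.
    return [total if (imm8 >> i) & 1 else 0.0 for i in range(4)]
```

For example, an immediate of `0xF1` computes a full four-element dot product and stores it in lane 0 only, while `0x71` computes a three-element dot product.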

SSE4.1 SIMD integer instructions

Instruction	Opcode	Meaning
MPSADBW xmm1, xmm2/m128, imm8	66 0F 3A 42 /r ib	Sums absolute 8-bit integer difference of adjacent groups of 4 byte integers with starting offset
PHMINPOSUW xmm1, xmm2/m128	66 0F 38 41 /r	Find the minimum unsigned word
PMULLD xmm1, xmm2/m128	66 0F 38 40 /r	Multiply the packed dword signed integers and store the low 32 bits
PMULDQ xmm1, xmm2/m128	66 0F 38 28 /r	Multiply packed signed doubleword integers and store quadword result
PBLENDVB xmm1, xmm2/m128, <XMM0>	66 0F 38 10 /r	Select byte values from specified mask
PBLENDW xmm1, xmm2/m128, imm8	66 0F 3A 0E /r ib	Select words from specified mask
PMINSB xmm1, xmm2/m128	66 0F 38 38 /r	Compare packed signed byte integers and store the minimum values
PMINUW xmm1, xmm2/m128	66 0F 38 3A /r	Compare packed unsigned word integers and store the minimum values
PMINSD xmm1, xmm2/m128	66 0F 38 39 /r	Compare packed signed dword integers and store the minimum values
PMINUD xmm1, xmm2/m128	66 0F 38 3B /r	Compare packed unsigned dword integers and store the minimum values
PMAXSB xmm1, xmm2/m128	66 0F 38 3C /r	Compare packed signed byte integers and store the maximum values
PMAXUW xmm1, xmm2/m128	66 0F 38 3E /r	Compare packed unsigned word integers and store the maximum values
PMAXSD xmm1, xmm2/m128	66 0F 38 3D /r	Compare packed signed dword integers and store the maximum values
PMAXUD xmm1, xmm2/m128	66 0F 38 3F /r	Compare packed unsigned dword integers and store the maximum values
PINSRB xmm1, r32/m8, imm8	66 0F 3A 20 /r ib	Insert a byte integer value at specified destination element
PINSRD xmm1, r/m32, imm8	66 0F 3A 22 /r ib	Insert a dword integer value at specified destination element
PINSRQ xmm1, r/m64, imm8	66 REX.W 0F 3A 22 /r ib	Insert a qword integer value at specified destination element
PEXTRB reg/m8, xmm2, imm8	66 0F 3A 14 /r ib	Extract a byte integer value at source byte offset, upper bits are zeroed.
PEXTRW reg/m16, xmm, imm8	66 0F 3A 15 /r ib	Extract word and copy to lowest 16 bits, zero-extended
PEXTRD r/m32, xmm2, imm8	66 0F 3A 16 /r ib	Extract a dword integer value at source dword offset
PEXTRQ r/m64, xmm2, imm8	66 REX.W 0F 3A 16 /r ib	Extract a qword integer value at source qword offset
PMOVSXBW xmm1, xmm2/m64	66 0F 38 20 /r	Sign extend 8 packed 8-bit integers to 8 packed 16-bit integers
PMOVZXBW xmm1, xmm2/m64	66 0F 38 30 /r	Zero extend 8 packed 8-bit integers to 8 packed 16-bit integers
PMOVSXBD xmm1, xmm2/m32	66 0F 38 21 /r	Sign extend 4 packed 8-bit integers to 4 packed 32-bit integers
PMOVZXBD xmm1, xmm2/m32	66 0F 38 31 /r	Zero extend 4 packed 8-bit integers to 4 packed 32-bit integers
PMOVSXBQ xmm1, xmm2/m16	66 0F 38 22 /r	Sign extend 2 packed 8-bit integers to 2 packed 64-bit integers
PMOVZXBQ xmm1, xmm2/m16	66 0F 38 32 /r	Zero extend 2 packed 8-bit integers to 2 packed 64-bit integers
PMOVSXWD xmm1, xmm2/m64	66 0F 38 23 /r	Sign extend 4 packed 16-bit integers to 4 packed 32-bit integers
PMOVZXWD xmm1, xmm2/m64	66 0F 38 33 /r	Zero extend 4 packed 16-bit integers to 4 packed 32-bit integers
PMOVSXWQ xmm1, xmm2/m32	66 0F 38 24 /r	Sign extend 2 packed 16-bit integers to 2 packed 64-bit integers
PMOVZXWQ xmm1, xmm2/m32	66 0F 38 34 /r	Zero extend 2 packed 16-bit integers to 2 packed 64-bit integers
PMOVSXDQ xmm1, xmm2/m64	66 0F 38 25 /r	Sign extend 2 packed 32-bit integers to 2 packed 64-bit integers
PMOVZXDQ xmm1, xmm2/m64	66 0F 38 35 /r	Zero extend 2 packed 32-bit integers to 2 packed 64-bit integers
PTEST xmm1, xmm2/m128	66 0F 38 17 /r	Set ZF if AND result is all 0s, set CF if AND NOT result is all 0s
PCMPEQQ xmm1, xmm2/m128	66 0F 38 29 /r	Compare packed qwords for equality
PACKUSDW xmm1, xmm2/m128	66 0F 38 2B /r	Convert 2 × 4 packed signed doubleword integers into 8 packed unsigned word integers with saturation
MOVNTDQA xmm1, m128	66 0F 38 2A /r	Move double quadword using non-temporal hint if WC memory type
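
The flag results of PTEST can be modelled in a few lines of Python. In this sketch the two 128-bit operands are represented as non-negative Python integers, and the function name and return convention are assumptions made for the example.

```python
MASK128 = (1 << 128) - 1


def ptest(xmm1, xmm2):
    """Model of PTEST: return (ZF, CF) for two 128-bit unsigned integers."""
    zf = int((xmm1 & xmm2) == 0)              # AND result is all zeroes
    cf = int((~xmm1 & xmm2 & MASK128) == 0)   # AND NOT result is all zeroes
    return zf, cf
```

Testing a register against itself (`ptest(x, x)`) sets ZF exactly when the register is all zeroes, which is a common use of the instruction.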

SSE4a: added with AMD Phenom processors

Instruction	Opcode	Meaning
EXTRQ	66 0F 78 /0 ib ib	Extract Field From Register
EXTRQ	66 0F 79 /r	Extract Field From Register
INSERTQ	F2 0F 78 /r ib ib	Insert Field
INSERTQ	F2 0F 79 /r	Insert Field
MOVNTSD	F2 0F 2B /r	Move Non-Temporal Scalar Double-Precision Floating-Point
MOVNTSS	F3 0F 2B /r	Move Non-Temporal Scalar Single-Precision Floating-Point

SSE4.2: added with Nehalem processors

Instruction	Opcode	Meaning
PCMPESTRI xmm1, xmm2/m128, imm8	66 0F 3A 61 /r ib	Packed comparison of string data with explicit lengths, generating an index
PCMPESTRM xmm1, xmm2/m128, imm8	66 0F 3A 60 /r ib	Packed comparison of string data with explicit lengths, generating a mask
PCMPISTRI xmm1, xmm2/m128, imm8	66 0F 3A 63 /r ib	Packed comparison of string data with implicit lengths, generating an index
PCMPISTRM xmm1, xmm2/m128, imm8	66 0F 3A 62 /r ib	Packed comparison of string data with implicit lengths, generating a mask
PCMPGTQ xmm1, xmm2/m128	66 0F 38 37 /r	Compare packed signed qwords for greater than
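
The string-comparison instructions have many modes selected by imm8. As a rough illustration, the Python sketch below models only one of them: PCMPISTRI with imm8 = 0 (unsigned bytes, "equal any" aggregation, least-significant index). The function name is hypothetical, and both operands are assumed to be 16-byte `bytes` objects.

```python
def pcmpistri_equal_any(charset, text):
    """Index of the first byte of text found in charset, or 16 if none."""
    def valid_prefix(v):
        # Implicit length: only bytes before the first zero byte are valid.
        return v[:v.index(0)] if 0 in v else v

    chars = set(valid_prefix(charset))
    for j, byte in enumerate(valid_prefix(text)):
        if byte in chars:
            return j
    return 16   # no match within the valid part of text
```

In this mode the first operand acts as a set of characters and the second as the string to scan, so a single instruction can locate, for instance, the first vowel in a 16-byte block.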

Half-precision floating-point conversion.

Instruction	Meaning
VCVTPH2PS xmm1, xmm2/m64	Convert four half-precision floating-point values in memory or the bottom half of an XMM register to four single-precision floating-point values in an XMM register
VCVTPH2PS ymm1, xmm2/m128	Convert eight half-precision floating-point values in memory or an XMM register (the bottom half of a YMM register) to eight single-precision floating-point values in a YMM register
VCVTPS2PH xmm1/m64, xmm2, imm8	Convert four single-precision floating-point values in an XMM register to half-precision floating-point values in memory or the bottom half of an XMM register
VCVTPS2PH xmm1/m128, ymm2, imm8	Convert eight single-precision floating-point values in a YMM register to half-precision floating-point values in memory or an XMM register
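
The conversions described above can be imitated in Python with the IEEE 754 binary16 support in the `struct` module. This is only a sketch under stated assumptions: the function names are made up for the example, rounding is always to nearest-even, and values too large for half precision raise `OverflowError` here rather than following the hardware's rounding-mode-dependent behaviour.

```python
import struct


def single_to_half_bits(values):
    """Convert floats to 16-bit half-precision bit patterns."""
    return [struct.unpack("<H", struct.pack("<e", v))[0] for v in values]


def half_bits_to_single(bits):
    """Convert 16-bit half-precision bit patterns back to floats."""
    return [struct.unpack("<e", struct.pack("<H", b))[0] for b in bits]
```

Widening from half precision is always exact, whereas narrowing generally loses precision: 0.1, for example, does not survive a round trip unchanged.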

AVX was first supported by Intel with Sandy Bridge and by AMD with Bulldozer.

Vector operations on 256 bit registers.

Instruction	Description
VBROADCASTSS	Copy a 32-bit, 64-bit or 128-bit memory operand to all elements of an XMM or YMM vector register.
VBROADCASTSD
VBROADCASTF128
VINSERTF128	Replaces either the lower half or the upper half of a 256-bit YMM register with the value of a 128-bit source operand. The other half of the destination is unchanged.
VEXTRACTF128	Extracts either the lower half or the upper half of a 256-bit YMM register and copies the value to a 128-bit destination operand.
VMASKMOVPS	Conditionally reads any number of elements from a SIMD vector memory operand into a destination register, leaving the remaining vector elements unread and setting the corresponding elements in the destination register to zero. Alternatively, conditionally writes any number of elements from a SIMD vector register operand to a vector memory operand, leaving the remaining elements of the memory operand unchanged. On the AMD Jaguar processor architecture, this instruction with a memory source operand takes more than 300 clock cycles when the mask is zero, in which case the instruction should do nothing. This appears to be a design flaw.^[2]
VMASKMOVPD
VPERMILPS	Permute In-Lane. Shuffle the 32-bit or 64-bit vector elements of one input operand. These are in-lane 256-bit instructions, meaning that they operate on all 256 bits with two separate 128-bit shuffles, so they cannot shuffle across the 128-bit lanes.^[3]
VPERMILPD
VPERM2F128	Shuffle the four 128-bit vector elements of two 256-bit source operands into a 256-bit destination operand, with an immediate constant as selector.
VZEROALL	Set all YMM registers to zero and tag them as unused. Used when switching between 128-bit use and 256-bit use.
VZEROUPPER	Set the upper half of all YMM registers to zero. Used when switching between 128-bit use and 256-bit use.
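
The immediate selector of VPERM2F128 can be illustrated with a short Python model. Each 256-bit operand is represented as a `(low, high)` pair of 128-bit halves; the function name and this representation are assumptions made for the example.

```python
def vperm2f128(src1, src2, imm8):
    """Model of VPERM2F128: sources are (low, high) pairs; returns (low, high)."""
    halves = [src1[0], src1[1], src2[0], src2[1]]

    def pick(field):
        # Bit 3 of each 4-bit field zeroes the half; bits 1:0 select a source half.
        return 0 if field & 0x8 else halves[field & 0x3]

    # imm8[3:0] controls the low half of the result, imm8[7:4] the high half.
    return pick(imm8 & 0xF), pick((imm8 >> 4) & 0xF)
```

With this encoding, `0x20` concatenates the two low halves, `0x31` concatenates the two high halves, and `0x01` swaps the halves of the first source.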

Introduced in Intel's Haswell microarchitecture and AMD's Excavator.

Expansion of most vector integer SSE and AVX instructions to 256 bits

Instruction	Description
VBROADCASTSS	Copy a 32-bit or 64-bit register operand to all elements of an XMM or YMM vector register. These are register versions of the same instructions in AVX1. There is no 128-bit version, but the same effect can be achieved using VINSERTF128.
VBROADCASTSD
VPBROADCASTB	Copy an 8, 16, 32 or 64-bit integer register or memory operand to all elements of an XMM or YMM vector register.
VPBROADCASTW
VPBROADCASTD
VPBROADCASTQ
VBROADCASTI128	Copy a 128-bit memory operand to all elements of a YMM vector register.
VINSERTI128	Replaces either the lower half or the upper half of a 256-bit YMM register with the value of a 128-bit source operand. The other half of the destination is unchanged.
VEXTRACTI128	Extracts either the lower half or the upper half of a 256-bit YMM register and copies the value to a 128-bit destination operand.
VGATHERDPD	Gathers single or double precision floating point values using either 32 or 64-bit indices and scale.
VGATHERQPD
VGATHERDPS
VGATHERQPS
VPGATHERDD	Gathers 32 or 64-bit integer values using either 32 or 64-bit indices and scale.
VPGATHERDQ
VPGATHERQD
VPGATHERQQ
VPMASKMOVD	Conditionally reads any number of elements from a SIMD vector memory operand into a destination register, leaving the remaining vector elements unread and setting the corresponding elements in the destination register to zero. Alternatively, conditionally writes any number of elements from a SIMD vector register operand to a vector memory operand, leaving the remaining elements of the memory operand unchanged.
VPMASKMOVQ
VPERMPS	Shuffle the eight 32-bit vector elements of one 256-bit source operand into a 256-bit destination operand, with a register or memory operand as selector.
VPERMD
VPERMPD	Shuffle the four 64-bit vector elements of one 256-bit source operand into a 256-bit destination operand, with a register or memory operand as selector.
VPERMQ
VPERM2I128	Shuffle (two of) the four 128-bit vector elements of two 256-bit source operands into a 256-bit destination operand, with an immediate constant as selector.
VPBLENDD	Doubleword immediate version of the PBLEND instructions from SSE4.
VPSLLVD	Shift left logical. Allows variable shifts where each element is shifted according to the packed input.
VPSLLVQ
VPSRLVD	Shift right logical. Allows variable shifts where each element is shifted according to the packed input.
VPSRLVQ
VPSRAVD	Shift right arithmetically. Allows variable shifts where each element is shifted according to the packed input.
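
The per-element variable shifts differ from C-style shifts in how they treat out-of-range counts. The following Python sketch models the doubleword forms; lanes are unsigned 32-bit integers held in lists, and the function names simply mirror the mnemonics for readability.

```python
def vpsllvd(values, counts):
    """Logical left shift per lane; counts of 32 or more give 0."""
    return [(v << c) & 0xFFFFFFFF if c < 32 else 0 for v, c in zip(values, counts)]


def vpsrlvd(values, counts):
    """Logical right shift per lane; counts of 32 or more give 0."""
    return [v >> c if c < 32 else 0 for v, c in zip(values, counts)]


def vpsravd(values, counts):
    """Arithmetic right shift per lane; large counts fill the lane with the sign bit."""
    out = []
    for v, c in zip(values, counts):
        signed = v - (1 << 32) if v & 0x80000000 else v
        out.append((signed >> min(c, 31)) & 0xFFFFFFFF)
    return out
```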

FMA3 and FMA4 instructions

See main article: FMA instruction set. Floating-point fused multiply-add instructions are introduced in x86 as two instruction set extensions, "FMA3" and "FMA4", both of which build on top of AVX to provide a set of scalar/vector instructions using the xmm/ymm/zmm vector registers. FMA3 defines a set of 3-operand fused-multiply-add instructions that take three input operands and write their result back to the first of them. FMA4 defines a set of 4-operand fused-multiply-add instructions that take four input operands – a destination operand and three source operands.

FMA3 is supported on Intel CPUs starting with Haswell, on AMD CPUs starting with Piledriver, and on Zhaoxin CPUs starting with YongFeng. FMA4 was only supported on AMD Family 15h (Bulldozer) CPUs and has been abandoned from AMD Zen onwards. The FMA3/FMA4 extensions are not considered to be an intrinsic part of AVX or AVX2, although all Intel and AMD (but not Zhaoxin) processors that support AVX2 also support FMA3. FMA3 instructions (in EVEX-encoded form) are, however, AVX-512 foundation instructions.
The FMA3 and FMA4 instruction sets both define a set of 10 fused-multiply-add operations, all available in FP32 and FP64 variants. For each of these variants, FMA3 defines three operand orderings while FMA4 defines two.
FMA3 encoding
FMA3 instructions are encoded with the VEX or EVEX prefixes, in the form VEX.66.0F38 xy /r or EVEX.66.0F38 xy /r. The VEX.W/EVEX.W bit selects floating-point format (W=0 means FP32, W=1 means FP64). The opcode byte xy consists of two nibbles, where the top nibble x selects operand ordering (9='132', A='213', B='231') and the bottom nibble y (values 6..F) selects which one of the 10 fused-multiply-add operations to perform. (x and y outside the given ranges will result in something that is not an FMA3 instruction.)
At the assembly language level, the operand ordering is specified in the mnemonic of the instruction:

vfmadd132sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm1*xmm3)+xmm2
vfmadd213sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm2*xmm1)+xmm3
vfmadd231sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm2*xmm3)+xmm1
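
The three orderings can be checked with a small Python model. It is a sketch rather than a faithful emulation: it assumes finite FP64 inputs and round-to-nearest, the helper names are hypothetical, and the fused (single-rounding) behaviour is obtained by doing the arithmetic exactly with `fractions.Fraction` before converting back to a float.

```python
from fractions import Fraction


def fma_exact(a, b, c):
    """Compute a*b+c exactly, then round once to a float."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def vfmadd(order, op1, op2, op3):
    """Apply the digits of the mnemonic: '132' means op1*op3 + op2, and so on."""
    ops = {"1": op1, "2": op2, "3": op3}
    x, y, z = (ops[d] for d in order)
    return fma_exact(x, y, z)
```

With operands 2.0, 3.0 and 4.0, the orderings 132, 213 and 231 give 11.0, 10.0 and 14.0 respectively. The single rounding matters for inputs such as `fma_exact(1 + 2**-52, 1 - 2**-52, -1.0)`, where a separately rounded multiply and add would return zero.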

For all FMA3 variants, the first two arguments must be xmm/ymm/zmm vector register arguments, while the last argument may be either a vector register or memory argument. Under AVX-512 and AVX10, the EVEX-encoded variants support EVEX-prefix-encoded broadcast, opmasks and rounding-controls.
The AVX512-FP16 extension, introduced in Sapphire Rapids, adds FP16 variants of the FMA3 instructions – these all take the form EVEX.66.MAP6.W0 xy /r with the opcode byte working in the same way as for the FP32/FP64 variants. The AVX10.2 extension, published in 2024,^[4] similarly adds BF16 variants of the packed (but not scalar) FMA3 instructions – these all take the form EVEX.NP.MAP6.W0 xy /r with the opcode byte again working similarly to the FP32/FP64 variants. (For the FMA4 instructions, no FP16 or BF16 variants are defined.)
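
The opcode-byte layout described in this section can be sketched as a small decoder for the FP32/FP64 forms. The table and function names below are made up for the example, and the decoder looks only at the opcode byte and the W bit, ignoring prefixes and operands.

```python
ORDER = {0x9: "132", 0xA: "213", 0xB: "231"}

# Bottom nibble -> (operation name, packed "P" or scalar "S").
OPERATION = {
    0x6: ("VFMADDSUB", "P"), 0x7: ("VFMSUBADD", "P"),
    0x8: ("VFMADD", "P"),    0x9: ("VFMADD", "S"),
    0xA: ("VFMSUB", "P"),    0xB: ("VFMSUB", "S"),
    0xC: ("VFNMADD", "P"),   0xD: ("VFNMADD", "S"),
    0xE: ("VFNMSUB", "P"),   0xF: ("VFNMSUB", "S"),
}


def fma3_mnemonic(opcode, w):
    """Return the FMA3 mnemonic for an opcode byte and W bit, or None."""
    order = ORDER.get(opcode >> 4)
    operation = OPERATION.get(opcode & 0xF)
    if order is None or operation is None:
        return None          # outside the FMA3 opcode ranges
    name, kind = operation
    return f"{name}{order}{kind}{'D' if w else 'S'}"
```

For instance, `fma3_mnemonic(0xA9, 1)` yields `VFMADD213SD`, and bytes such as `0x95` or `0xC8` fall outside the ranges and return `None`.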
FMA4 encoding
FMA4 instructions are encoded with the VEX prefix, in the form VEX.66.0F3A xx /r ib (no EVEX encodings are defined). The opcode byte xx uses its bottom bit to select floating-point format (0=FP32, 1=FP64) and the remaining bits to select one of the 10 fused-multiply-add operations to perform.

For FMA4, operand ordering is controlled by the VEX.W bit. If VEX.W=0, then the third operand is the r/m operand specified by the instruction's ModR/M byte and the fourth operand is a register operand, specified by bits 7:4 of the ib (8-bit immediate) part of the instruction. If VEX.W=1, then these two operands are swapped. For example:

vfmaddsd xmm1,xmm2,[mem],xmm3 will perform xmm1 ← (xmm2*[mem])+xmm3 and require a W=0 encoding.
vfmaddsd xmm1,xmm2,xmm3,[mem] will perform xmm1 ← (xmm2*xmm3)+[mem] and require a W=1 encoding.
vfmaddsd xmm1,xmm2,xmm3,xmm4 will perform xmm1 ← (xmm2*xmm3)+xmm4 and can be encoded with either W=0 or W=1.
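
The role of VEX.W in choosing the third and fourth operands can be written out as a short Python sketch. The function name and the use of strings for operands are assumptions made for illustration; only the operand-swapping rule described above is modelled.

```python
def fma4_third_fourth(vex_w, rm_operand, imm8):
    """Return the (third, fourth) assembly operands of an FMA4 instruction."""
    is4 = f"xmm{(imm8 >> 4) & 0xF}"     # register encoded in bits 7:4 of ib
    if vex_w == 0:
        return rm_operand, is4          # r/m operand third, ib register fourth
    return is4, rm_operand              # W=1 swaps the two
```

This also shows why the all-register form has two valid encodings: `fma4_third_fourth(0, "xmm3", 0x40)` and `fma4_third_fourth(1, "xmm4", 0x30)` describe the same operand list.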

Opcode table
The 10 fused-multiply-add operations give rise to 122 instruction variants across FMA3 and FMA4. The following table lists the FMA3 variants by opcode byte:

Basic operation	Opcode byte	FP32 instructions	FP64 instructions	FP16 instructions (AVX512-FP16)	BF16 instructions (AVX10.2)
Packed alternating multiply-add/subtract: (A*B)-C in even-numbered lanes, (A*B)+C in odd-numbered lanes	`96`	`VFMADDSUB132PS`	`VFMADDSUB132PD`	`VFMADDSUB132PH`
	`A6`	`VFMADDSUB213PS`	`VFMADDSUB213PD`	`VFMADDSUB213PH`
	`B6`	`VFMADDSUB231PS`	`VFMADDSUB231PD`	`VFMADDSUB231PH`

Packed alternating multiply-subtract/add: (A*B)+C in even-numbered lanes, (A*B)-C in odd-numbered lanes	`97`	`VFMSUBADD132PS`	`VFMSUBADD132PD`	`VFMSUBADD132PH`
	`A7`	`VFMSUBADD213PS`	`VFMSUBADD213PD`	`VFMSUBADD213PH`
	`B7`	`VFMSUBADD231PS`	`VFMSUBADD231PD`	`VFMSUBADD231PH`

Packed multiply-add (A*B)+C	`98`	`VFMADD132PS`	`VFMADD132PD`	`VFMADD132PH`	`VFMADD132NEPBF16`
	`A8`	`VFMADD213PS`	`VFMADD213PD`	`VFMADD213PH`	`VFMADD213NEPBF16`
	`B8`	`VFMADD231PS`	`VFMADD231PD`	`VFMADD231PH`	`VFMADD231NEPBF16`

Scalar multiply-add (A*B)+C	`99`	`VFMADD132SS`	`VFMADD132SD`	`VFMADD132SH`
	`A9`	`VFMADD213SS`	`VFMADD213SD`	`VFMADD213SH`
	`B9`	`VFMADD231SS`	`VFMADD231SD`	`VFMADD231SH`

Packed multiply-subtract (A*B)-C	`9A`	`VFMSUB132PS`	`VFMSUB132PD`	`VFMSUB132PH`	`VFMSUB132NEPBF16`
	`AA`	`VFMSUB213PS`	`VFMSUB213PD`	`VFMSUB213PH`	`VFMSUB213NEPBF16`
	`BA`	`VFMSUB231PS`	`VFMSUB231PD`	`VFMSUB231PH`	`VFMSUB231NEPBF16`

Scalar multiply-subtract (A*B)-C	`9B`	`VFMSUB132SS`	`VFMSUB132SD`	`VFMSUB132SH`
	`AB`	`VFMSUB213SS`	`VFMSUB213SD`	`VFMSUB213SH`
	`BB`	`VFMSUB231SS`	`VFMSUB231SD`	`VFMSUB231SH`

Packed negative-multiply-add (-A*B)+C	`9C`	`VFNMADD132PS`	`VFNMADD132PD`	`VFNMADD132PH`	`VFNMADD132NEPBF16`
	`AC`	`VFNMADD213PS`	`VFNMADD213PD`	`VFNMADD213PH`	`VFNMADD213NEPBF16`
	`BC`	`VFNMADD231PS`	`VFNMADD231PD`	`VFNMADD231PH`	`VFNMADD231NEPBF16`

Scalar negative-multiply-add (-A*B)+C	`9D`	`VFNMADD132SS`	`VFNMADD132SD`	`VFNMADD132SH`
	`AD`	`VFNMADD213SS`	`VFNMADD213SD`	`VFNMADD213SH`
	`BD`	`VFNMADD231SS`	`VFNMADD231SD`	`VFNMADD231SH`

Packed negative-multiply-subtract (-A*B)-C	`9E`	`VFNMSUB132PS`	`VFNMSUB132PD`	`VFNMSUB132PH`	`VFNMSUB132NEPBF16`
	`AE`	`VFNMSUB213PS`	`VFNMSUB213PD`	`VFNMSUB213PH`	`VFNMSUB213NEPBF16`
	`BE`	`VFNMSUB231PS`	`VFNMSUB231PD`	`VFNMSUB231PH`	`VFNMSUB231NEPBF16`

Scalar negative-multiply-subtract (-A*B)-C	`9F`	`VFNMSUB132SS`	`VFNMSUB132SD`	`VFNMSUB132SH`
	`AF`	`VFNMSUB213SS`	`VFNMSUB213SD`	`VFNMSUB213SH`
	`BF`	`VFNMSUB231SS`	`VFNMSUB231SD`	`VFNMSUB231SH`

AVX-512

AVX-512, introduced in 2014, adds 512-bit wide vector registers (extending the 256-bit registers, which become the new registers' lower halves) and doubles their count to 32; the new registers are thus named zmm0 through zmm31. It adds eight mask registers, named k0 through k7, which may be used to restrict operations to specific parts of a vector register. Unlike previous instruction set extensions, AVX-512 is implemented in several groups; only the foundation ("AVX-512F") extension is mandatory.^[5] Most of the added instructions may also be used with the 256- and 128-bit registers.
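
The effect of a mask register on a vector operation can be sketched in Python. AVX-512 defines two behaviours for lanes whose mask bit is clear: merging (the destination lane is left unchanged) and zeroing (the lane is set to zero). The function below is a hypothetical model of a masked lane-wise addition; it ignores lane width and integer wraparound.

```python
def masked_add(dst, a, b, k, zeroing=False):
    """Lane-wise a+b under mask k; bit i of k enables lane i."""
    out = []
    for i, (x, y) in enumerate(zip(a, b)):
        if (k >> i) & 1:
            out.append(x + y)                       # lane enabled
        else:
            out.append(0 if zeroing else dst[i])    # zeroing or merging
    return out
```

For example, with `k = 0b0101` only lanes 0 and 2 are updated; the other lanes either keep their previous destination values or become zero.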

AMX

See main article: Advanced Matrix Extensions. Intel AMX adds eight new tile-registers, tmm0-tmm7, each holding a matrix, with a maximum capacity of 16 rows of 64 bytes per tile-register. It also adds a TILECFG register to configure the sizes of the actual matrices held in each of the eight tile-registers, and a set of instructions to perform matrix multiplications on these registers.

Instruction mnemonics	Opcode	Instruction description	Added in

`LDTILECFG m512`	`VEX.128.NP.0F38.W0 49 /0`	Load AMX tile configuration data structure from memory as a 64-byte data structure.
`STTILECFG m512`	`VEX.128.66.0F38.W0 49 /0`	Store AMX tile configuration data structure to memory.
`TILERELEASE`	`VEX.128.NP.0F38.W0 49 C0`	Initialize `TILECFG` and tile data registers (`tmm0` to `tmm7`) to the INIT state (all-zeroes).
`TILEZERO tmm`	`VEX.128.F2.0F38.W0 49 /r`	Zero out contents of one tile register.
`TILELOADD tmm, sibmem`	`VEX.128.F2.0F38.W0 4B /r`	Load a data tile from memory into AMX tile register.
`TILELOADDT1 tmm, sibmem`	`VEX.128.66.0F38.W0 4B /r`	Load a data tile from memory into AMX tile register, with a hint that data should not be kept in the nearest cache levels.
`TILESTORED sibmem, tmm`	`VEX.128.F3.0F38.W0 4B /r`	Store a data tile to memory from AMX tile register.

`TDPBSSD tmm1,tmm2,tmm3`	`VEX.128.F2.0F38.W0 5E /r`	Matrix multiply signed bytes from tmm2 with signed bytes from tmm3, accumulating result in tmm1.
`TDPBSUD tmm1,tmm2,tmm3`	`VEX.128.F3.0F38.W0 5E /r`	Matrix multiply signed bytes from tmm2 with unsigned bytes from tmm3, accumulating result in tmm1.
`TDPBUSD tmm1,tmm2,tmm3`	`VEX.128.66.0F38.W0 5E /r`	Matrix multiply unsigned bytes from tmm2 with signed bytes from tmm3, accumulating result in tmm1.
`TDPBUUD tmm1,tmm2,tmm3`	`VEX.128.NP.0F38.W0 5E /r`	Matrix multiply unsigned bytes from tmm2 with unsigned bytes from tmm3, accumulating result in tmm1.

`TDPBF16PS tmm1,tmm2,tmm3`	`VEX.128.F3.0F38.W0 5C /r`	Matrix multiply BF16 values from tmm2 with BF16 values from tmm3, accumulating result in tmm1.

`TDPFP16PS tmm1,tmm2,tmm3`	`VEX.128.F2.0F38.W0 5C /r`	Matrix multiply FP16 values from tmm2 with FP16 values from tmm3, accumulating result in tmm1.	(Granite Rapids)

`TCMMRLFP16PS tmm1,tmm2,tmm3`	`VEX.128.NP.0F38.W0 6C /r`	Matrix multiply complex numbers from tmm2 with complex numbers from tmm3, accumulating real part of result in tmm1.
`TCMMIMFP16PS tmm1,tmm2,tmm3`	`VEX.128.66.0F38.W0 6C /r`	Matrix multiply complex numbers from tmm2 with complex numbers from tmm3, accumulating imaginary part of result in tmm1.
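
As a rough guide to what the byte matrix-multiply instructions compute, the following Python sketch models TDPBSSD on small tiles. Tiles are nested lists, the function name simply mirrors the mnemonic, and the tile dimensions are taken from the list shapes rather than from TILECFG; all of these are simplifications made for the example.

```python
def tdpbssd(c, a, b):
    """c: M x N int32 accumulators; a: M x 4K signed bytes; b: K x 4N signed bytes."""
    m_rows, n_cols, k_dim = len(c), len(c[0]), len(b)
    for m in range(m_rows):
        for n in range(n_cols):
            acc = c[m][n]
            for k in range(k_dim):
                # Each 32-bit element holds four bytes; multiply them pairwise.
                for i in range(4):
                    acc += a[m][4 * k + i] * b[k][4 * n + i]
            acc &= 0xFFFFFFFF                     # wrap to 32 bits, no saturation
            c[m][n] = acc - (1 << 32) if acc & 0x80000000 else acc
    return c
```

Each 32-bit accumulator therefore receives a sum of four-byte dot products, which is why the source tiles are most naturally viewed as matrices of byte quadruples.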

External links

Intel Intrinsics Guide - searchable reference for Intel MMX/SSE/AVX/AVX512 SIMD intrinsics

Notes and References

1. Intel, Intel® 64 and IA-32 Architectures Optimization Reference Manual (order no. 248966-044, June 2021), section 3.5.2.3.
2. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers. October 17, 2016.
3. Chess programming AVX2. October 17, 2016. Archived July 10, 2017 at https://web.archive.org/web/20170710034021/http://chessprogramming.wikispaces.com/AVX2 (original link dead).
4. Intel, Advanced Vector Extensions 10.2 Architecture Specification, order no. 361050-001, rev 1.0, July 2024. Archived on 1 Aug 2024.
5. Intel AVX-512 Instructions. Intel. 21 June 2022.