The x86 instruction set has been extended several times with SIMD (single instruction, multiple data) instruction set extensions. These extensions, starting with the MMX extension introduced with the Pentium MMX in 1997, typically define sets of wide registers and instructions that subdivide these registers into fixed-size lanes and perform a computation on each lane in parallel.
MMX instructions operate on the mm registers, which are 64 bits wide. These registers are aliased onto the x87 FPU registers rather than being separate storage.
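The lane model described above can be illustrated with a portable C sketch. It is not an MMX intrinsic and runs on ordinary integer registers; `padd_model` is a hypothetical name chosen here, and the function only imitates what a wraparound packed add (PADDB, PADDW or PADDD) would compute on one 64-bit value.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a packed wraparound add on a 64-bit value.
   lane_bits = 8, 16 or 32 corresponds to PADDB, PADDW or PADDD.
   Each lane wraps independently; no carry crosses a lane boundary. */
uint64_t padd_model(uint64_t a, uint64_t b, unsigned lane_bits)
{
    assert(lane_bits == 8 || lane_bits == 16 || lane_bits == 32);
    uint64_t lane_mask = (1ULL << lane_bits) - 1;
    uint64_t result = 0;
    for (unsigned shift = 0; shift < 64; shift += lane_bits) {
        /* Add the two lanes and discard anything above the lane width. */
        uint64_t sum = ((a >> shift) + (b >> shift)) & lane_mask;
        result |= sum << shift;
    }
    return result;
}
```

With 8-bit lanes, adding 0x01 to a lane holding 0xFF yields 0x00 in that lane and leaves the neighbouring lane untouched, whereas the same inputs with 16-bit lanes yield 0x0100.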
Added with Pentium MMX
Instruction | Opcode | Meaning | Notes | |
---|---|---|---|---|
EMMS | 0F 77 | Empty MMX Technology State | Marks all x87 FPU registers for use by FPU | |
MOVD mm, r/m32 | 0F 6E /r | Move doubleword | ||
MOVD r/m32, mm | 0F 7E /r | Move doubleword | ||
MOVQ mm/m64, mm | 0F 7F /r | Move quadword | ||
MOVQ mm, mm/m64 | 0F 6F /r | Move quadword | ||
MOVQ mm, r/m64 | REX.W + 0F 6E /r | Move quadword | ||
MOVQ r/m64, mm | REX.W + 0F 7E /r | Move quadword | ||
PACKSSDW mm1, mm2/m64 | 0F 6B /r | Pack doublewords to words (signed with saturation) | ||
PACKSSWB mm1, mm2/m64 | 0F 63 /r | Pack words to bytes (signed with saturation) | ||
PACKUSWB mm, mm/m64 | 0F 67 /r | Pack words to bytes (unsigned with saturation) | ||
PADDB mm, mm/m64 | 0F FC /r | Add packed byte integers | ||
PADDW mm, mm/m64 | 0F FD /r | Add packed word integers | ||
PADDD mm, mm/m64 | 0F FE /r | Add packed doubleword integers | ||
PADDSB mm, mm/m64 | 0F EC /r | Add packed signed byte integers and saturate | ||
PADDSW mm, mm/m64 | 0F ED /r | Add packed signed word integers and saturate | ||
PADDUSB mm, mm/m64 | 0F DC /r | Add packed unsigned byte integers and saturate | ||
PADDUSW mm, mm/m64 | 0F DD /r | Add packed unsigned word integers and saturate | ||
PAND mm, mm/m64 | 0F DB /r | Bitwise AND | ||
PANDN mm, mm/m64 | 0F DF /r | Bitwise AND NOT | ||
POR mm, mm/m64 | 0F EB /r | Bitwise OR | ||
PXOR mm, mm/m64 | 0F EF /r | Bitwise XOR | ||
PCMPEQB mm, mm/m64 | 0F 74 /r | Compare packed bytes for equality | ||
PCMPEQW mm, mm/m64 | 0F 75 /r | Compare packed words for equality | ||
PCMPEQD mm, mm/m64 | 0F 76 /r | Compare packed doublewords for equality | ||
PCMPGTB mm, mm/m64 | 0F 64 /r | Compare packed signed byte integers for greater than | ||
PCMPGTW mm, mm/m64 | 0F 65 /r | Compare packed signed word integers for greater than | ||
PCMPGTD mm, mm/m64 | 0F 66 /r | Compare packed signed doubleword integers for greater than | ||
PMADDWD mm, mm/m64 | 0F F5 /r | Multiply packed words, add adjacent doubleword results | ||
PMULHW mm, mm/m64 | 0F E5 /r | Multiply packed signed word integers, store high 16 bits of results | ||
PMULLW mm, mm/m64 | 0F D5 /r | Multiply packed signed word integers, store low 16 bits of results | ||
PSLLW mm1, imm8 | 0F 71 /6 ib | Shift left words, shift in zeros | ||
PSLLW mm, mm/m64 | 0F F1 /r | Shift left words, shift in zeros | ||
PSLLD mm, imm8 | 0F 72 /6 ib | Shift left doublewords, shift in zeros | ||
PSLLD mm, mm/m64 | 0F F2 /r | Shift left doublewords, shift in zeros | ||
PSLLQ mm, imm8 | 0F 73 /6 ib | Shift left quadword, shift in zeros | ||
PSLLQ mm, mm/m64 | 0F F3 /r | Shift left quadword, shift in zeros | ||
PSRAD mm, imm8 | 0F 72 /4 ib | Shift right doublewords, shift in sign bits | ||
PSRAD mm, mm/m64 | 0F E2 /r | Shift right doublewords, shift in sign bits | ||
PSRAW mm, imm8 | 0F 71 /4 ib | Shift right words, shift in sign bits | ||
PSRAW mm, mm/m64 | 0F E1 /r | Shift right words, shift in sign bits | ||
PSRLW mm, imm8 | 0F 71 /2 ib | Shift right words, shift in zeros | ||
PSRLW mm, mm/m64 | 0F D1 /r | Shift right words, shift in zeros | ||
PSRLD mm, imm8 | 0F 72 /2 ib | Shift right doublewords, shift in zeros | ||
PSRLD mm, mm/m64 | 0F D2 /r | Shift right doublewords, shift in zeros | ||
PSRLQ mm, imm8 | 0F 73 /2 ib | Shift right quadword, shift in zeros | ||
PSRLQ mm, mm/m64 | 0F D3 /r | Shift right quadword, shift in zeros | ||
PSUBB mm, mm/m64 | 0F F8 /r | Subtract packed byte integers | ||
PSUBW mm, mm/m64 | 0F F9 /r | Subtract packed word integers | ||
PSUBD mm, mm/m64 | 0F FA /r | Subtract packed doubleword integers | ||
PSUBSB mm, mm/m64 | 0F E8 /r | Subtract signed packed bytes with saturation | ||
PSUBSW mm, mm/m64 | 0F E9 /r | Subtract signed packed words with saturation | ||
PSUBUSB mm, mm/m64 | 0F D8 /r | Subtract unsigned packed bytes with saturation | ||
PSUBUSW mm, mm/m64 | 0F D9 /r | Subtract unsigned packed words with saturation | ||
PUNPCKHBW mm, mm/m64 | 0F 68 /r | Unpack and interleave high-order bytes | ||
PUNPCKHWD mm, mm/m64 | 0F 69 /r | Unpack and interleave high-order words | ||
PUNPCKHDQ mm, mm/m64 | 0F 6A /r | Unpack and interleave high-order doublewords | ||
PUNPCKLBW mm, mm/m32 | 0F 60 /r | Unpack and interleave low-order bytes | ||
PUNPCKLWD mm, mm/m32 | 0F 61 /r | Unpack and interleave low-order words | ||
PUNPCKLDQ mm, mm/m32 | 0F 62 /r | Unpack and interleave low-order doublewords |
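The difference between the wraparound adds (PADDB) and the saturating adds (PADDUSB, PADDSB) listed above is easiest to see on a single lane. The following is a minimal sketch in portable C; the function names are hypothetical and each function models one 8-bit lane only, not a whole register.

```c
#include <stdint.h>

/* One lane of PADDB: the sum wraps modulo 256. */
uint8_t paddb_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* One lane of PADDUSB: the unsigned sum is clamped to 255. */
uint8_t paddusb_lane(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* One lane of PADDSB: the signed sum is clamped to [-128, 127]. */
int8_t paddsb_lane(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;
    if (sum > 127)
        sum = 127;
    if (sum < -128)
        sum = -128;
    return (int8_t)sum;
}
```

Saturation is what makes these instructions convenient for pixel and audio data, where clipping at the extreme value is usually preferable to wrapping around.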
The following MMX instructions were added with SSE. They are also available on the Athlon under the name MMX+.
Instruction | Opcode | Meaning | |
---|---|---|---|
MASKMOVQ mm1, mm2 | 0F F7 /r | Masked Move of Quadword | |
MOVNTQ m64, mm | 0F E7 /r | Move Quadword Using Non-Temporal Hint | |
PSHUFW mm1, mm2/m64, imm8 | 0F 70 /r ib | Shuffle Packed Words | |
PINSRW mm, r32/m16, imm8 | 0F C4 /r ib | Insert Word | |
PEXTRW reg, mm, imm8 | 0F C5 /r ib | Extract Word | |
PMOVMSKB reg, mm | 0F D7 /r | Move Byte Mask | |
PMINUB mm1, mm2/m64 | 0F DA /r | Minimum of Packed Unsigned Byte Integers | |
PMAXUB mm1, mm2/m64 | 0F DE /r | Maximum of Packed Unsigned Byte Integers | |
PAVGB mm1, mm2/m64 | 0F E0 /r | Average Packed Integers | |
PAVGW mm1, mm2/m64 | 0F E3 /r | Average Packed Integers | |
PMULHUW mm1, mm2/m64 | 0F E4 /r | Multiply Packed Unsigned Integers and Store High Result | |
PMINSW mm1, mm2/m64 | 0F EA /r | Minimum of Packed Signed Word Integers | |
PMAXSW mm1, mm2/m64 | 0F EE /r | Maximum of Packed Signed Word Integers | |
PSADBW mm1, mm2/m64 | 0F F6 /r | Compute Sum of Absolute Differences |
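Two of the operations above have semantics that are not obvious from their names: the averages round halves upward, and PSADBW reduces eight lanes to a single sum. The sketch below models both in portable C for the 64-bit MMX form; `pavgb_lane` and `psadbw_model` are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* One lane of PAVGB: unsigned average computed as (a + b + 1) >> 1,
   so a fractional result of .5 rounds up. */
uint8_t pavgb_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* PSADBW on 64-bit operands: the sum of |a[i] - b[i]| over eight
   unsigned bytes. The maximum is 8 * 255 = 2040, which fits in the
   16-bit word that the instruction writes to the low end of the result. */
uint16_t psadbw_model(const uint8_t a[8], const uint8_t b[8])
{
    unsigned total = 0;
    for (int i = 0; i < 8; i++) {
        total += (a[i] > b[i]) ? (unsigned)(a[i] - b[i])
                               : (unsigned)(b[i] - a[i]);
    }
    return (uint16_t)total;
}
```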
The following MMX instructions were added with SSE2:
Instruction | Opcode | Meaning | |
---|---|---|---|
PADDQ mm, mm/m64 | 0F D4 /r | Add packed quadword integers | |
PSUBQ mm1, mm2/m64 | 0F FB /r | Subtract packed quadword integers | |
PMULUDQ mm1, mm2/m64 | 0F F4 /r | Multiply unsigned doubleword integer |
The following MMX instructions were added with SSSE3:
Instruction | Opcode | Meaning | |
---|---|---|---|
PSIGNB mm1, mm2/m64 | 0F 38 08 /r | Negate/zero/preserve packed byte integers depending on corresponding sign | |
PSIGNW mm1, mm2/m64 | 0F 38 09 /r | Negate/zero/preserve packed word integers depending on corresponding sign | |
PSIGND mm1, mm2/m64 | 0F 38 0A /r | Negate/zero/preserve packed doubleword integers depending on corresponding sign | |
PSHUFB mm1, mm2/m64 | 0F 38 00 /r | Shuffle bytes | |
PMULHRSW mm1, mm2/m64 | 0F 38 0B /r | Multiply 16-bit signed words, scale and round signed doublewords, pack high 16 bits | |
PMADDUBSW mm1, mm2/m64 | 0F 38 04 /r | Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed-words | |
PHSUBW mm1, mm2/m64 | 0F 38 05 /r | Subtract and pack 16-bit signed integers horizontally | |
PHSUBSW mm1, mm2/m64 | 0F 38 07 /r | Subtract and pack 16-bit signed integer horizontally with saturation | |
PHSUBD mm1, mm2/m64 | 0F 38 06 /r | Subtract and pack 32-bit signed integers horizontally | |
PHADDSW mm1, mm2/m64 | 0F 38 03 /r | Add and pack 16-bit signed integers horizontally with saturation | |
PHADDW mm1, mm2/m64 | 0F 38 01 /r | Add and pack 16-bit integers horizontally | |
PHADDD mm1, mm2/m64 | 0F 38 02 /r | Add and pack 32-bit integers horizontally | |
PALIGNR mm1, mm2/m64, imm8 | 0F 3A 0F /r ib | Concatenate destination and source operands, extract byte-aligned result shifted to the right | |
PABSB mm1, mm2/m64 | 0F 38 1C /r | Compute the absolute value of bytes and store unsigned result | |
PABSW mm1, mm2/m64 | 0F 38 1D /r | Compute the absolute value of 16-bit integers and store unsigned result | |
PABSD mm1, mm2/m64 | 0F 38 1E /r | Compute the absolute value of 32-bit integers and store unsigned result |
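PSHUFB is described above only as "Shuffle bytes". As a rough illustration, the portable C sketch below models the 64-bit form: each control byte either selects a source byte by index or, if its top bit is set, forces the result byte to zero. The name `pshufb_model` is hypothetical, and the array layout (index 0 is the lowest byte) is an assumption of this sketch.

```c
#include <stdint.h>
#include <string.h>

/* Model of PSHUFB mm1, mm2/m64.
   dst plays the role of mm1 (data in, shuffled result out),
   ctrl plays the role of mm2/m64 (the shuffle control mask). */
void pshufb_model(uint8_t dst[8], const uint8_t ctrl[8])
{
    uint8_t src[8];
    memcpy(src, dst, sizeof src);   /* the shuffle reads the old contents */
    for (int i = 0; i < 8; i++) {
        if (ctrl[i] & 0x80)
            dst[i] = 0;                    /* top bit set: zero the byte */
        else
            dst[i] = src[ctrl[i] & 0x07];  /* low 3 bits index the source */
    }
}
```

A control mask of 7, 6, 5, 4, 3, 2, 1, 0 reverses the byte order. In the 128-bit form, the index would use the low 4 bits of each control byte instead of the low 3.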
Added with Pentium III
SSE instructions operate on xmm registers, which are 128 bits wide.
SSE consists of the following SIMD floating-point instructions:
Instruction | Opcode | Meaning | |
---|---|---|---|
ANDPS* xmm1, xmm2/m128 | 0F 54 /r | Bitwise Logical AND of Packed Single-Precision Floating-Point Values | |
ANDNPS* xmm1, xmm2/m128 | 0F 55 /r | Bitwise Logical AND NOT of Packed Single-Precision Floating-Point Values | |
ORPS* xmm1, xmm2/m128 | 0F 56 /r | Bitwise Logical OR of Single-Precision Floating-Point Values | |
XORPS* xmm1, xmm2/m128 | 0F 57 /r | Bitwise Logical XOR for Single-Precision Floating-Point Values | |
MOVUPS xmm1, xmm2/m128 | 0F 10 /r | Move Unaligned Packed Single-Precision Floating-Point Values | |
MOVSS xmm1, xmm2/m32 | F3 0F 10 /r | Move Scalar Single-Precision Floating-Point Values | |
MOVUPS xmm2/m128, xmm1 | 0F 11 /r | Move Unaligned Packed Single-Precision Floating-Point Values | |
MOVSS xmm2/m32, xmm1 | F3 0F 11 /r | Move Scalar Single-Precision Floating-Point Values | |
MOVLPS xmm, m64 | 0F 12 /r | Move Low Packed Single-Precision Floating-Point Values | |
MOVHLPS xmm1, xmm2 | 0F 12 /r | Move Packed Single-Precision Floating-Point Values High to Low | |
MOVLPS m64, xmm | 0F 13 /r | Move Low Packed Single-Precision Floating-Point Values | |
UNPCKLPS xmm1, xmm2/m128 | 0F 14 /r | Unpack and Interleave Low Packed Single-Precision Floating-Point Values | |
UNPCKHPS xmm1, xmm2/m128 | 0F 15 /r | Unpack and Interleave High Packed Single-Precision Floating-Point Values | |
MOVHPS xmm, m64 | 0F 16 /r | Move High Packed Single-Precision Floating-Point Values | |
MOVLHPS xmm1, xmm2 | 0F 16 /r | Move Packed Single-Precision Floating-Point Values Low to High | |
MOVHPS m64, xmm | 0F 17 /r | Move High Packed Single-Precision Floating-Point Values | |
MOVAPS xmm1, xmm2/m128 | 0F 28 /r | Move Aligned Packed Single-Precision Floating-Point Values | |
MOVAPS xmm2/m128, xmm1 | 0F 29 /r | Move Aligned Packed Single-Precision Floating-Point Values | |
MOVNTPS m128, xmm1 | 0F 2B /r | Move Aligned Four Packed Single-FP Non Temporal | |
MOVMSKPS reg, xmm | 0F 50 /r | Extract Packed Single-Precision Floating-Point 4-bit Sign Mask. The upper bits of the register are filled with zeros. | |
CVTPI2PS xmm, mm/m64 | 0F 2A /r | Convert Packed Dword Integers to Packed Single-Precision FP Values | |
CVTSI2SS xmm, r/m32 | F3 0F 2A /r | Convert Dword Integer to Scalar Single-Precision FP Value | |
CVTSI2SS xmm, r/m64 | F3 REX.W 0F 2A /r | Convert Qword Integer to Scalar Single-Precision FP Value | |
CVTTPS2PI mm, xmm/m64 | 0F 2C /r | Convert with Truncation Packed Single-Precision FP Values to Packed Dword Integers | |
CVTTSS2SI r32, xmm/m32 | F3 0F 2C /r | Convert with Truncation Scalar Single-Precision FP Value to Dword Integer | |
CVTTSS2SI r64, xmm1/m32 | F3 REX.W 0F 2C /r | Convert with Truncation Scalar Single-Precision FP Value to Qword Integer | |
CVTPS2PI mm, xmm/m64 | 0F 2D /r | Convert Packed Single-Precision FP Values to Packed Dword Integers | |
CVTSS2SI r32, xmm/m32 | F3 0F 2D /r | Convert Scalar Single-Precision FP Value to Dword Integer | |
CVTSS2SI r64, xmm1/m32 | F3 REX.W 0F 2D /r | Convert Scalar Single-Precision FP Value to Qword Integer | |
UCOMISS xmm1, xmm2/m32 | 0F 2E /r | Unordered Compare Scalar Single-Precision Floating-Point Values and Set EFLAGS | |
COMISS xmm1, xmm2/m32 | 0F 2F /r | Compare Scalar Ordered Single-Precision Floating-Point Values and Set EFLAGS | |
SQRTPS xmm1, xmm2/m128 | 0F 51 /r | Compute Square Roots of Packed Single-Precision Floating-Point Values | |
SQRTSS xmm1, xmm2/m32 | F3 0F 51 /r | Compute Square Root of Scalar Single-Precision Floating-Point Value | |
RSQRTPS xmm1, xmm2/m128 | 0F 52 /r | Compute Reciprocal of Square Root of Packed Single-Precision Floating-Point Value | |
RSQRTSS xmm1, xmm2/m32 | F3 0F 52 /r | Compute Reciprocal of Square Root of Scalar Single-Precision Floating-Point Value | |
RCPPS xmm1, xmm2/m128 | 0F 53 /r | Compute Reciprocal of Packed Single-Precision Floating-Point Values | |
RCPSS xmm1, xmm2/m32 | F3 0F 53 /r | Compute Reciprocal of Scalar Single-Precision Floating-Point Values | |
ADDPS xmm1, xmm2/m128 | 0F 58 /r | Add Packed Single-Precision Floating-Point Values | |
ADDSS xmm1, xmm2/m32 | F3 0F 58 /r | Add Scalar Single-Precision Floating-Point Values | |
MULPS xmm1, xmm2/m128 | 0F 59 /r | Multiply Packed Single-Precision Floating-Point Values | |
MULSS xmm1, xmm2/m32 | F3 0F 59 /r | Multiply Scalar Single-Precision Floating-Point Values | |
SUBPS xmm1, xmm2/m128 | 0F 5C /r | Subtract Packed Single-Precision Floating-Point Values | |
SUBSS xmm1, xmm2/m32 | F3 0F 5C /r | Subtract Scalar Single-Precision Floating-Point Values | |
MINPS xmm1, xmm2/m128 | 0F 5D /r | Return Minimum Packed Single-Precision Floating-Point Values | |
MINSS xmm1, xmm2/m32 | F3 0F 5D /r | Return Minimum Scalar Single-Precision Floating-Point Values | |
DIVPS xmm1, xmm2/m128 | 0F 5E /r | Divide Packed Single-Precision Floating-Point Values | |
DIVSS xmm1, xmm2/m32 | F3 0F 5E /r | Divide Scalar Single-Precision Floating-Point Values | |
MAXPS xmm1, xmm2/m128 | 0F 5F /r | Return Maximum Packed Single-Precision Floating-Point Values | |
MAXSS xmm1, xmm2/m32 | F3 0F 5F /r | Return Maximum Scalar Single-Precision Floating-Point Values | |
LDMXCSR m32 | 0F AE /2 | Load MXCSR Register State | |
STMXCSR m32 | 0F AE /3 | Store MXCSR Register State | |
CMPPS xmm1, xmm2/m128, imm8 | 0F C2 /r ib | Compare Packed Single-Precision Floating-Point Values | |
CMPSS xmm1, xmm2/m32, imm8 | F3 0F C2 /r ib | Compare Scalar Single-Precision Floating-Point Values | |
SHUFPS xmm1, xmm2/m128, imm8 | 0F C6 /r ib | Shuffle Packed Single-Precision Floating-Point Values |
* The single-precision bitwise operations ANDPS, ANDNPS, ORPS and XORPS produce the same results as the SSE2 integer operations (PAND, PANDN, POR, PXOR) and the double-precision ones (ANDPD, ANDNPD, ORPD, XORPD), but can incur extra latency for domain changes when applied to values of the wrong type.[1]
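The bitwise floating-point operations act on the raw bit pattern of each lane, which is why they give the same result as their integer counterparts. A common use is manipulating the sign bit. The portable C sketch below shows the idea on a single lane; it assumes `float` is an IEEE 754 binary32 value, and the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern and back (assumes binary32). */
static uint32_t float_to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

static float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* What one lane of ANDPS with the mask 0x7FFFFFFF computes:
   the sign bit is cleared, giving the absolute value. */
float abs_via_and(float x)
{
    return bits_to_float(float_to_bits(x) & 0x7FFFFFFFu);
}

/* What one lane of XORPS with the mask 0x80000000 computes:
   the sign bit is flipped, negating the value. */
float negate_via_xor(float x)
{
    return bits_to_float(float_to_bits(x) ^ 0x80000000u);
}
```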
Added with Pentium 4
Instruction | Opcode | Meaning | |
---|---|---|---|
MOVAPD xmm1, xmm2/m128 | 66 0F 28 /r | Move Aligned Packed Double-Precision Floating-Point Values | |
MOVAPD xmm2/m128, xmm1 | 66 0F 29 /r | Move Aligned Packed Double-Precision Floating-Point Values | |
MOVNTPD m128, xmm1 | 66 0F 2B /r | Store Packed Double-Precision Floating-Point Values Using Non-Temporal Hint | |
MOVHPD xmm1, m64 | 66 0F 16 /r | Move High Packed Double-Precision Floating-Point Value | |
MOVHPD m64, xmm1 | 66 0F 17 /r | Move High Packed Double-Precision Floating-Point Value | |
MOVLPD xmm1, m64 | 66 0F 12 /r | Move Low Packed Double-Precision Floating-Point Value | |
MOVLPD m64, xmm1 | 66 0F 13 /r | Move Low Packed Double-Precision Floating-Point Value | |
MOVUPD xmm1, xmm2/m128 | 66 0F 10 /r | Move Unaligned Packed Double-Precision Floating-Point Values | |
MOVUPD xmm2/m128, xmm1 | 66 0F 11 /r | Move Unaligned Packed Double-Precision Floating-Point Values | |
MOVMSKPD reg, xmm | 66 0F 50 /r | Extract Packed Double-Precision Floating-Point Sign Mask | |
MOVSD* xmm1, xmm2/m64 | F2 0F 10 /r | Move or Merge Scalar Double-Precision Floating-Point Value | |
MOVSD xmm1/m64, xmm2 | F2 0F 11 /r | Move or Merge Scalar Double-Precision Floating-Point Value |
Instruction | Opcode | Meaning | |
---|---|---|---|
ADDPD xmm1, xmm2/m128 | 66 0F 58 /r | Add Packed Double-Precision Floating-Point Values | |
ADDSD xmm1, xmm2/m64 | F2 0F 58 /r | Add Low Double-Precision Floating-Point Value | |
DIVPD xmm1, xmm2/m128 | 66 0F 5E /r | Divide Packed Double-Precision Floating-Point Values | |
DIVSD xmm1, xmm2/m64 | F2 0F 5E /r | Divide Scalar Double-Precision Floating-Point Value | |
MAXPD xmm1, xmm2/m128 | 66 0F 5F /r | Maximum of Packed Double-Precision Floating-Point Values | |
MAXSD xmm1, xmm2/m64 | F2 0F 5F /r | Return Maximum Scalar Double-Precision Floating-Point Value | |
MINPD xmm1, xmm2/m128 | 66 0F 5D /r | Minimum of Packed Double-Precision Floating-Point Values | |
MINSD xmm1, xmm2/m64 | F2 0F 5D /r | Return Minimum Scalar Double-Precision Floating-Point Value | |
MULPD xmm1, xmm2/m128 | 66 0F 59 /r | Multiply Packed Double-Precision Floating-Point Values | |
MULSD xmm1, xmm2/m64 | F2 0F 59 /r | Multiply Scalar Double-Precision Floating-Point Value | |
SQRTPD xmm1, xmm2/m128 | 66 0F 51 /r | Square Root of Double-Precision Floating-Point Values | |
SQRTSD xmm1, xmm2/m64 | F2 0F 51 /r | Compute Square Root of Scalar Double-Precision Floating-Point Value | |
SUBPD xmm1, xmm2/m128 | 66 0F 5C /r | Subtract Packed Double-Precision Floating-Point Values | |
SUBSD xmm1, xmm2/m64 | F2 0F 5C /r | Subtract Scalar Double-Precision Floating-Point Value |
Instruction | Opcode | Meaning | |
---|---|---|---|
ANDPD xmm1, xmm2/m128 | 66 0F 54 /r | Bitwise Logical AND of Packed Double Precision Floating-Point Values | |
ANDNPD xmm1, xmm2/m128 | 66 0F 55 /r | Bitwise Logical AND NOT of Packed Double Precision Floating-Point Values | |
ORPD xmm1, xmm2/m128 | 66 0F 56 /r | Bitwise Logical OR of Packed Double Precision Floating-Point Values | |
XORPD xmm1, xmm2/m128 | 66 0F 57 /r | Bitwise Logical XOR of Packed Double Precision Floating-Point Values |
Instruction | Opcode | Meaning | |
---|---|---|---|
CMPPD xmm1, xmm2/m128, imm8 | 66 0F C2 /r ib | Compare Packed Double-Precision Floating-Point Values | |
CMPSD* xmm1, xmm2/m64, imm8 | F2 0F C2 /r ib | Compare Low Double-Precision Floating-Point Values | |
COMISD xmm1, xmm2/m64 | 66 0F 2F /r | Compare Scalar Ordered Double-Precision Floating-Point Values and Set EFLAGS | |
UCOMISD xmm1, xmm2/m64 | 66 0F 2E /r | Unordered Compare Scalar Double-Precision Floating-Point Values and Set EFLAGS |
Instruction | Opcode | Meaning | |
---|---|---|---|
SHUFPD xmm1, xmm2/m128, imm8 | 66 0F C6 /r ib | Packed Interleave Shuffle of Pairs of Double-Precision Floating-Point Values | |
UNPCKHPD xmm1, xmm2/m128 | 66 0F 15 /r | Unpack and Interleave High Packed Double-Precision Floating-Point Values | |
UNPCKLPD xmm1, xmm2/m128 | 66 0F 14 /r | Unpack and Interleave Low Packed Double-Precision Floating-Point Values |
Instruction | Opcode | Meaning | |
---|---|---|---|
CVTDQ2PD xmm1, xmm2/m64 | F3 0F E6 /r | Convert Packed Doubleword Integers to Packed Double-Precision Floating-Point Values | |
CVTDQ2PS xmm1, xmm2/m128 | 0F 5B /r | Convert Packed Doubleword Integers to Packed Single-Precision Floating-Point Values | |
CVTPD2DQ xmm1, xmm2/m128 | F2 0F E6 /r | Convert Packed Double-Precision Floating-Point Values to Packed Doubleword Integers | |
CVTPD2PI mm, xmm/m128 | 66 0F 2D /r | Convert Packed Double-Precision FP Values to Packed Dword Integers | |
CVTPD2PS xmm1, xmm2/m128 | 66 0F 5A /r | Convert Packed Double-Precision Floating-Point Values to Packed Single-Precision Floating-Point Values | |
CVTPI2PD xmm, mm/m64 | 66 0F 2A /r | Convert Packed Dword Integers to Packed Double-Precision FP Values | |
CVTPS2DQ xmm1, xmm2/m128 | 66 0F 5B /r | Convert Packed Single-Precision Floating-Point Values to Packed Signed Doubleword Integer Values | |
CVTPS2PD xmm1, xmm2/m64 | 0F 5A /r | Convert Packed Single-Precision Floating-Point Values to Packed Double-Precision Floating-Point Values | |
CVTSD2SI r32, xmm1/m64 | F2 0F 2D /r | Convert Scalar Double-Precision Floating-Point Value to Doubleword Integer | |
CVTSD2SI r64, xmm1/m64 | F2 REX.W 0F 2D /r | Convert Scalar Double-Precision Floating-Point Value to Quadword Integer With Sign Extension | |
CVTSD2SS xmm1, xmm2/m64 | F2 0F 5A /r | Convert Scalar Double-Precision Floating-Point Value to Scalar Single-Precision Floating-Point Value | |
CVTSI2SD xmm1, r32/m32 | F2 0F 2A /r | Convert Doubleword Integer to Scalar Double-Precision Floating-Point Value | |
CVTSI2SD xmm1, r/m64 | F2 REX.W 0F 2A /r | Convert Quadword Integer to Scalar Double-Precision Floating-Point value | |
CVTSS2SD xmm1, xmm2/m32 | F3 0F 5A /r | Convert Scalar Single-Precision Floating-Point Value to Scalar Double-Precision Floating-Point Value | |
CVTTPD2DQ xmm1, xmm2/m128 | 66 0F E6 /r | Convert with Truncation Packed Double-Precision Floating-Point Values to Packed Doubleword Integers | |
CVTTPD2PI mm, xmm/m128 | 66 0F 2C /r | Convert with Truncation Packed Double-Precision FP Values to Packed Dword Integers | |
CVTTPS2DQ xmm1, xmm2/m128 | F3 0F 5B /r | Convert with Truncation Packed Single-Precision Floating-Point Values to Packed Signed Doubleword Integer Values | |
CVTTSD2SI r32, xmm1/m64 | F2 0F 2C /r | Convert with Truncation Scalar Double-Precision Floating-Point Value to Signed Dword Integer | |
CVTTSD2SI r64, xmm1/m64 | F2 REX.W 0F 2C /r | Convert with Truncation Scalar Double-Precision Floating-Point Value To Signed Qword Integer |
SSE2 allows execution of MMX instructions on SSE registers, processing twice the amount of data at once.
Instruction | Opcode | Meaning | |
---|---|---|---|
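MOVD xmm, r/m32 | 66 0F 6E /r | Move doubleword | |
MOVD r/m32, xmm | 66 0F 7E /r | Move doubleword | |
MOVQ xmm1, xmm2/m64 | F3 0F 7E /r | Move quadword | |
MOVQ xmm2/m64, xmm1 | 66 0F D6 /r | Move quadword | |
MOVQ r/m64, xmm | 66 REX.W 0F 7E /r | Move quadword | |
MOVQ xmm, r/m64 | 66 REX.W 0F 6E /r | Move quadword | |
PMOVMSKB reg, xmm | 66 0F D7 /r | Move a byte mask, zeroing the upper bits of the register | |
PEXTRW reg, xmm, imm8 | 66 0F C5 /r ib | Extract specified word and move it to reg, setting bits 15-0 and zeroing the rest | |
PINSRW xmm, r32/m16, imm8 | 66 0F C4 /r ib | Move low word at the specified word position | |
PACKSSDW xmm1, xmm2/m128 | 66 0F 6B /r | Converts 4 packed signed doubleword integers into 8 packed signed word integers with saturation | |
PACKSSWB xmm1, xmm2/m128 | 66 0F 63 /r | Converts 8 packed signed word integers into 16 packed signed byte integers with saturation | |
PACKUSWB xmm1, xmm2/m128 | 66 0F 67 /r | Converts 8 signed word integers into 16 unsigned byte integers with saturation | |
PADDB xmm1, xmm2/m128 | 66 0F FC /r | Add packed byte integers | |
PADDW xmm1, xmm2/m128 | 66 0F FD /r | Add packed word integers | |
PADDD xmm1, xmm2/m128 | 66 0F FE /r | Add packed doubleword integers | |
PADDQ xmm1, xmm2/m128 | 66 0F D4 /r | Add packed quadword integers | |
PADDSB xmm1, xmm2/m128 | 66 0F EC /r | Add packed signed byte integers with saturation | |
PADDSW xmm1, xmm2/m128 | 66 0F ED /r | Add packed signed word integers with saturation | |
PADDUSB xmm1, xmm2/m128 | 66 0F DC /r | Add packed unsigned byte integers with saturation | |
PADDUSW xmm1, xmm2/m128 | 66 0F DD /r | Add packed unsigned word integers with saturation | |
PAND xmm1, xmm2/m128 | 66 0F DB /r | Bitwise AND | |
PANDN xmm1, xmm2/m128 | 66 0F DF /r | Bitwise AND NOT | |
POR xmm1, xmm2/m128 | 66 0F EB /r | Bitwise OR | |
PXOR xmm1, xmm2/m128 | 66 0F EF /r | Bitwise XOR | |
PCMPEQB xmm1, xmm2/m128 | 66 0F 74 /r | Compare packed bytes for equality | |
PCMPEQW xmm1, xmm2/m128 | 66 0F 75 /r | Compare packed words for equality | |
PCMPEQD xmm1, xmm2/m128 | 66 0F 76 /r | Compare packed doublewords for equality | |
PCMPGTB xmm1, xmm2/m128 | 66 0F 64 /r | Compare packed signed byte integers for greater than | |
PCMPGTW xmm1, xmm2/m128 | 66 0F 65 /r | Compare packed signed word integers for greater than | |
PCMPGTD xmm1, xmm2/m128 | 66 0F 66 /r | Compare packed signed doubleword integers for greater than | |
PMULLW xmm1, xmm2/m128 | 66 0F D5 /r | Multiply packed signed word integers, store the low 16 bits of the results | |
PMULHW xmm1, xmm2/m128 | 66 0F E5 /r | Multiply the packed signed word integers, store the high 16 bits of the results | |
PMULHUW xmm1, xmm2/m128 | 66 0F E4 /r | Multiply packed unsigned word integers, store the high 16 bits of the results | |
PMULUDQ xmm1, xmm2/m128 | 66 0F F4 /r | Multiply packed unsigned doubleword integers | |
PSLLW xmm1, xmm2/m128 | 66 0F F1 /r | Shift words left while shifting in 0s | |
PSLLW xmm1, imm8 | 66 0F 71 /6 ib | Shift words left while shifting in 0s | |
PSLLD xmm1, xmm2/m128 | 66 0F F2 /r | Shift doublewords left while shifting in 0s | |
PSLLD xmm1, imm8 | 66 0F 72 /6 ib | Shift doublewords left while shifting in 0s | |
PSLLQ xmm1, xmm2/m128 | 66 0F F3 /r | Shift quadwords left while shifting in 0s | |
PSLLQ xmm1, imm8 | 66 0F 73 /6 ib | Shift quadwords left while shifting in 0s | |
PSRAD xmm1, xmm2/m128 | 66 0F E2 /r | Shift doublewords right while shifting in sign bits | |
PSRAD xmm1, imm8 | 66 0F 72 /4 ib | Shift doublewords right while shifting in sign bits | |
PSRAW xmm1, xmm2/m128 | 66 0F E1 /r | Shift words right while shifting in sign bits | |
PSRAW xmm1, imm8 | 66 0F 71 /4 ib | Shift words right while shifting in sign bits | |
PSRLW xmm1, xmm2/m128 | 66 0F D1 /r | Shift words right while shifting in 0s | |
PSRLW xmm1, imm8 | 66 0F 71 /2 ib | Shift words right while shifting in 0s | |
PSRLD xmm1, xmm2/m128 | 66 0F D2 /r | Shift doublewords right while shifting in 0s | |
PSRLD xmm1, imm8 | 66 0F 72 /2 ib | Shift doublewords right while shifting in 0s | |
PSRLQ xmm1, xmm2/m128 | 66 0F D3 /r | Shift quadwords right while shifting in 0s | |
PSRLQ xmm1, imm8 | 66 0F 73 /2 ib | Shift quadwords right while shifting in 0s | |
PSUBB xmm1, xmm2/m128 | 66 0F F8 /r | Subtract packed byte integers | |
PSUBW xmm1, xmm2/m128 | 66 0F F9 /r | Subtract packed word integers | |
PSUBD xmm1, xmm2/m128 | 66 0F FA /r | Subtract packed doubleword integers | |
PSUBQ xmm1, xmm2/m128 | 66 0F FB /r | Subtract packed quadword integers | |
PSUBSB xmm1, xmm2/m128 | 66 0F E8 /r | Subtract packed signed byte integers with saturation | |
PSUBSW xmm1, xmm2/m128 | 66 0F E9 /r | Subtract packed signed word integers with saturation | |
PMADDWD xmm1, xmm2/m128 | 66 0F F5 /r | Multiply the packed word integers, add adjacent doubleword results | |
PSUBUSB xmm1, xmm2/m128 | 66 0F D8 /r | Subtract packed unsigned byte integers with saturation | |
PSUBUSW xmm1, xmm2/m128 | 66 0F D9 /r | Subtract packed unsigned word integers with saturation | |
PUNPCKHBW xmm1, xmm2/m128 | 66 0F 68 /r | Unpack and interleave high-order bytes | |
PUNPCKHWD xmm1, xmm2/m128 | 66 0F 69 /r | Unpack and interleave high-order words | |
PUNPCKHDQ xmm1, xmm2/m128 | 66 0F 6A /r | Unpack and interleave high-order doublewords | |
PUNPCKLBW xmm1, xmm2/m128 | 66 0F 60 /r | Interleave low-order bytes | |
PUNPCKLWD xmm1, xmm2/m128 | 66 0F 61 /r | Interleave low-order words | |
PUNPCKLDQ xmm1, xmm2/m128 | 66 0F 62 /r | Interleave low-order doublewords | |
PAVGB xmm1, xmm2/m128 | 66 0F E0 /r | Average packed unsigned byte integers with rounding | |
PAVGW xmm1, xmm2/m128 | 66 0F E3 /r | Average packed unsigned word integers with rounding | |
PMINUB xmm1, xmm2/m128 | 66 0F DA /r | Compare packed unsigned byte integers and store packed minimum values | |
PMINSW xmm1, xmm2/m128 | 66 0F EA /r | Compare packed signed word integers and store packed minimum values | |
PMAXSW xmm1, xmm2/m128 | 66 0F EE /r | Compare packed signed word integers and store maximum packed values | |
PMAXUB xmm1, xmm2/m128 | 66 0F DE /r | Compare packed unsigned byte integers and store packed maximum values | |
PSADBW xmm1, xmm2/m128 | 66 0F F6 /r | Computes the absolute differences of the packed unsigned byte integers; the 8 low differences and 8 high differences are then summed separately to produce two unsigned word integer results |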
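A frequently cited use of the 128-bit integer operations is searching 16 bytes at once: a byte-wise equality compare (the 128-bit form of PCMPEQB) produces 0xFF or 0x00 per lane, and the byte-mask move (the 128-bit form of PMOVMSKB) collects the top bit of each lane into a 16-bit integer. The following portable C sketch models that combination; the function names are hypothetical and no SSE2 instructions are actually executed.

```c
#include <stdint.h>

/* Model of PCMPEQB followed by PMOVMSKB on a 16-byte block.
   Bit i of the result is set when block[i] equals needle. */
uint16_t byte_match_mask(const uint8_t block[16], uint8_t needle)
{
    uint16_t mask = 0;
    for (int i = 0; i < 16; i++) {
        /* PCMPEQB lane: all ones on equality, all zeros otherwise. */
        uint8_t eq = (block[i] == needle) ? 0xFF : 0x00;
        /* PMOVMSKB: take bit 7 of the lane and place it at bit i. */
        mask |= (uint16_t)((eq >> 7) << i);
    }
    return mask;
}

/* Index of the first matching byte, or -1 if there is none. */
int first_match(const uint8_t block[16], uint8_t needle)
{
    uint16_t mask = byte_match_mask(block, needle);
    for (int i = 0; i < 16; i++) {
        if (mask & (1u << i))
            return i;
    }
    return -1;
}
```

In real code the final loop would typically be replaced by a bit-scan instruction applied to the mask.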
The following instructions can be used only on SSE registers, since by their nature they do not work on MMX registers:
Instruction | Opcode | Meaning | |
---|---|---|---|
MASKMOVDQU xmm1, xmm2 | 66 0F F7 /r | Non-Temporal Store of Selected Bytes from an XMM Register into Memory | |
MOVDQ2Q mm, xmm | F2 0F D6 /r | Move low quadword from XMM to MMX register. | |
MOVDQA xmm1, xmm2/m128 | 66 0F 6F /r | Move aligned double quadword | |
MOVDQA xmm2/m128, xmm1 | 66 0F 7F /r | Move aligned double quadword | |
MOVDQU xmm1, xmm2/m128 | F3 0F 6F /r | Move unaligned double quadword | |
MOVDQU xmm2/m128, xmm1 | F3 0F 7F /r | Move unaligned double quadword | |
MOVQ2DQ xmm, mm | F3 0F D6 /r | Move quadword from MMX register to low quadword of XMM register | |
MOVNTDQ m128, xmm1 | 66 0F E7 /r | Store Packed Integers Using Non-Temporal Hint | |
PSHUFHW xmm1, xmm2/m128, imm8 | F3 0F 70 /r ib | Shuffle packed high words. | |
PSHUFLW xmm1, xmm2/m128, imm8 | F2 0F 70 /r ib | Shuffle packed low words. | |
PSHUFD xmm1, xmm2/m128, imm8 | 66 0F 70 /r ib | Shuffle packed doublewords. | |
PSLLDQ xmm1, imm8 | 66 0F 73 /7 ib | Packed shift left logical double quadwords. | |
PSRLDQ xmm1, imm8 | 66 0F 73 /3 ib | Packed shift right logical double quadwords. | |
PUNPCKHQDQ xmm1, xmm2/m128 | 66 0F 6D /r | Unpack and interleave high-order quadwords | |
PUNPCKLQDQ xmm1, xmm2/m128 | 66 0F 6C /r | Interleave low quadwords |
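The shuffle instructions in this table take an imm8 operand whose bit fields select source elements. For PSHUFD, each of the four 2-bit fields picks the source doubleword for one destination position. The portable C sketch below models that decoding; `pshufd_model` is a hypothetical name, and array index 0 is assumed to be the lowest doubleword.

```c
#include <stdint.h>
#include <string.h>

/* Model of PSHUFD xmm1, xmm2/m128, imm8.
   Bits 1:0 of imm8 select the source for dst[0], bits 3:2 for dst[1],
   bits 5:4 for dst[2] and bits 7:6 for dst[3]. */
void pshufd_model(uint32_t dst[4], const uint32_t src[4], uint8_t imm8)
{
    uint32_t tmp[4];
    for (int i = 0; i < 4; i++)
        tmp[i] = src[(imm8 >> (2 * i)) & 0x3];
    memcpy(dst, tmp, sizeof tmp);   /* a temporary keeps dst == src safe */
}
```

An imm8 of 0x1B (binary 00 01 10 11) reverses the four doublewords, and 0x00 broadcasts the lowest doubleword to all four positions.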
Added with Pentium 4 supporting SSE3
Instruction | Opcode | Meaning | Notes |
---|---|---|---|
ADDSUBPS xmm1, xmm2/m128 | F2 0F D0 /r | Add/subtract single-precision floating-point values | for Complex Arithmetic |
ADDSUBPD xmm1, xmm2/m128 | 66 0F D0 /r | Add/subtract double-precision floating-point values | |
MOVDDUP xmm1, xmm2/m64 | F2 0F 12 /r | Move double-precision floating-point value and duplicate | |
MOVSLDUP xmm1, xmm2/m128 | F3 0F 12 /r | Move and duplicate even index single-precision floating-point values | |
MOVSHDUP xmm1, xmm2/m128 | F3 0F 16 /r | Move and duplicate odd index single-precision floating-point values | |
HADDPS xmm1, xmm2/m128 | F2 0F 7C /r | Horizontal add packed single-precision floating-point values | for Graphics |
HADDPD xmm1, xmm2/m128 | 66 0F 7C /r | Horizontal add packed double-precision floating-point values | |
HSUBPS xmm1, xmm2/m128 | F2 0F 7D /r | Horizontal subtract packed single-precision floating-point values | |
HSUBPD xmm1, xmm2/m128 | 66 0F 7D /r | Horizontal subtract packed double-precision floating-point values |
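Unlike most SIMD operations, the SSE3 horizontal and add/subtract instructions combine or treat neighbouring lanes differently. The following portable C sketch models HADDPS and ADDSUBPS on four-element float arrays; the function names are hypothetical, and index 0 is assumed to be the lowest lane.

```c
#include <string.h>

/* Model of HADDPS xmm1, xmm2/m128: adjacent pairs are summed.
   The low half of the result comes from dst, the high half from src. */
void haddps_model(float dst[4], const float src[4])
{
    float r[4] = {
        dst[0] + dst[1],
        dst[2] + dst[3],
        src[0] + src[1],
        src[2] + src[3],
    };
    memcpy(dst, r, sizeof r);
}

/* Model of ADDSUBPS xmm1, xmm2/m128: even lanes subtract, odd lanes add.
   This alternating pattern matches the real and imaginary parts of a
   complex multiplication when complex numbers are stored as (re, im) pairs. */
void addsubps_model(float dst[4], const float src[4])
{
    dst[0] -= src[0];
    dst[1] += src[1];
    dst[2] -= src[2];
    dst[3] += src[3];
}
```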
Added with Xeon 5100 series and initial Core 2
The following MMX-like instructions extended to SSE registers were added with SSSE3:
Instruction | Opcode | Meaning | |
---|---|---|---|
PSIGNB xmm1, xmm2/m128 | 66 0F 38 08 /r | Negate/zero/preserve packed byte integers depending on corresponding sign | |
PSIGNW xmm1, xmm2/m128 | 66 0F 38 09 /r | Negate/zero/preserve packed word integers depending on corresponding sign | |
PSIGND xmm1, xmm2/m128 | 66 0F 38 0A /r | Negate/zero/preserve packed doubleword integers depending on corresponding sign | |
PSHUFB xmm1, xmm2/m128 | 66 0F 38 00 /r | Shuffle bytes | |
PMULHRSW xmm1, xmm2/m128 | 66 0F 38 0B /r | Multiply 16-bit signed words, scale and round signed doublewords, pack high 16 bits | |
PMADDUBSW xmm1, xmm2/m128 | 66 0F 38 04 /r | Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed-words | |
PHSUBW xmm1, xmm2/m128 | 66 0F 38 05 /r | Subtract and pack 16-bit signed integers horizontally | |
PHSUBSW xmm1, xmm2/m128 | 66 0F 38 07 /r | Subtract and pack 16-bit signed integer horizontally with saturation | |
PHSUBD xmm1, xmm2/m128 | 66 0F 38 06 /r | Subtract and pack 32-bit signed integers horizontally | |
PHADDSW xmm1, xmm2/m128 | 66 0F 38 03 /r | Add and pack 16-bit signed integers horizontally with saturation | |
PHADDW xmm1, xmm2/m128 | 66 0F 38 01 /r | Add and pack 16-bit integers horizontally | |
PHADDD xmm1, xmm2/m128 | 66 0F 38 02 /r | Add and pack 32-bit integers horizontally | |
PALIGNR xmm1, xmm2/m128, imm8 | 66 0F 3A 0F /r ib | Concatenate destination and source operands, extract byte-aligned result shifted to the right | |
PABSB xmm1, xmm2/m128 | 66 0F 38 1C /r | Compute the absolute value of bytes and store unsigned result | |
PABSW xmm1, xmm2/m128 | 66 0F 38 1D /r | Compute the absolute value of 16-bit integers and store unsigned result | |
PABSD xmm1, xmm2/m128 | 66 0F 38 1E /r | Compute the absolute value of 32-bit integers and store unsigned result |
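The description of PALIGNR above is compact, so a small model may help. The portable C sketch below treats the destination as the high 16 bytes and the source as the low 16 bytes of a 32-byte value, shifts that value right by imm8 bytes, and keeps the low 16 bytes. The name `palignr_model` is hypothetical, and index 0 is assumed to be the lowest byte.

```c
#include <stdint.h>
#include <string.h>

/* Model of PALIGNR xmm1, xmm2/m128, imm8. */
void palignr_model(uint8_t dst[16], const uint8_t src[16], uint8_t imm8)
{
    uint8_t concat[32];
    memcpy(concat, src, 16);        /* source forms the low half */
    memcpy(concat + 16, dst, 16);   /* destination forms the high half */
    for (unsigned i = 0; i < 16; i++) {
        unsigned idx = i + imm8;    /* byte-granular right shift */
        dst[i] = (idx < 32) ? concat[idx] : 0;  /* zeros shift in from the top */
    }
}
```

With imm8 = 4, the result holds the top 12 bytes of the source followed by the bottom 4 bytes of the destination, which is how the instruction is used to read a window that straddles two aligned 16-byte blocks.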
Added with Core 2 processors manufactured on the 45 nm process
Instruction | Opcode | Meaning | |
---|---|---|---|
DPPS xmm1, xmm2/m128, imm8 | 66 0F 3A 40 /r ib | Selectively multiply packed SP floating-point values, add and selectively store | |
DPPD xmm1, xmm2/m128, imm8 | 66 0F 3A 41 /r ib | Selectively multiply packed DP floating-point values, add and selectively store | |
BLENDPS xmm1, xmm2/m128, imm8 | 66 0F 3A 0C /r ib | Select packed single precision floating-point values from specified mask | |
BLENDVPS xmm1, xmm2/m128, <XMM0> | 66 0F 38 14 /r | Select packed single precision floating-point values from specified mask | |
BLENDPD xmm1, xmm2/m128, imm8 | 66 0F 3A 0D /r ib | Select packed DP-FP values from specified mask | |
BLENDVPD xmm1, xmm2/m128, <XMM0> | 66 0F 38 15 /r | Select packed DP FP values from specified mask | |
ROUNDPS xmm1, xmm2/m128, imm8 | 66 0F 3A 08 /r ib | Round packed single precision floating-point values | |
ROUNDSS xmm1, xmm2/m32, imm8 | 66 0F 3A 0A /r ib | Round the low packed single precision floating-point value | |
ROUNDPD xmm1, xmm2/m128, imm8 | 66 0F 3A 09 /r ib | Round packed double precision floating-point values | |
ROUNDSD xmm1, xmm2/m64, imm8 | 66 0F 3A 0B /r ib | Round the low packed double precision floating-point value | |
INSERTPS xmm1, xmm2/m32, imm8 | 66 0F 3A 21 /r ib | Insert a selected single-precision floating-point value at the specified destination element and zero out destination elements | |
EXTRACTPS reg/m32, xmm1, imm8 | 66 0F 3A 17 /r ib | Extract one single-precision floating-point value at specified offset and store the result (zero-extended, if applicable) |
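The phrase "selectively multiply ... add and selectively store" for DPPS refers to the two halves of its imm8 operand. As a sketch in portable C (the name `dpps_model` is hypothetical and index 0 is assumed to be the lowest lane): the high four bits choose which element products enter the sum, and the low four bits choose which destination elements receive the sum, with the rest set to zero.

```c
#include <stdint.h>

/* Model of DPPS xmm1, xmm2/m128, imm8. */
void dpps_model(float dst[4], const float src[4], uint8_t imm8)
{
    float prod[4];
    /* Bits 7:4 of imm8 enable the individual products. */
    for (int i = 0; i < 4; i++)
        prod[i] = (imm8 & (0x10 << i)) ? dst[i] * src[i] : 0.0f;

    /* Products are summed pairwise, which fixes the rounding order. */
    float sum = (prod[0] + prod[1]) + (prod[2] + prod[3]);

    /* Bits 3:0 of imm8 select which result elements get the sum. */
    for (int i = 0; i < 4; i++)
        dst[i] = (imm8 & (1 << i)) ? sum : 0.0f;
}
```

An imm8 of 0xF1 computes a full four-element dot product and stores it in the lowest element only, while 0x7F computes a three-element dot product and broadcasts it to all four elements.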
Instruction | Opcode | Meaning | |
---|---|---|---|
MPSADBW xmm1, xmm2/m128, imm8 | 66 0F 3A 42 /r ib | Sums absolute 8-bit integer difference of adjacent groups of 4 byte integers with starting offset | |
PHMINPOSUW xmm1, xmm2/m128 | 66 0F 38 41 /r | Find the minimum unsigned word | |
PMULLD xmm1, xmm2/m128 | 66 0F 38 40 /r | Multiply the packed dword signed integers and store the low 32 bits | |
PMULDQ xmm1, xmm2/m128 | 66 0F 38 28 /r | Multiply packed signed doubleword integers and store quadword result | |
PBLENDVB xmm1, xmm2/m128, <XMM0> | 66 0F 38 10 /r | Select byte values from specified mask | |
PBLENDW xmm1, xmm2/m128, imm8 | 66 0F 3A 0E /r ib | Select words from specified mask | |
PMINSB xmm1, xmm2/m128 | 66 0F 38 38 /r | Compare packed signed byte integers and store packed minimum values | |
PMINUW xmm1, xmm2/m128 | 66 0F 38 3A /r | Compare packed unsigned word integers and store packed minimum values | |
PMINSD xmm1, xmm2/m128 | 66 0F 38 39 /r | Compare packed signed dword integers and store packed minimum values | |
PMINUD xmm1, xmm2/m128 | 66 0F 38 3B /r | Compare packed unsigned dword integers and store packed minimum values | |
PMAXSB xmm1, xmm2/m128 | 66 0F 38 3C /r | Compare packed signed byte integers and store packed maximum values | |
PMAXUW xmm1, xmm2/m128 | 66 0F 38 3E /r | Compare packed unsigned word integers and store packed maximum values | |
PMAXSD xmm1, xmm2/m128 | 66 0F 38 3D /r | Compare packed signed dword integers and store packed maximum values | |
PMAXUD xmm1, xmm2/m128 | 66 0F 38 3F /r | Compare packed unsigned dword integers and store packed maximum values | |
PINSRB xmm1, r32/m8, imm8 | 66 0F 3A 20 /r ib | Insert a byte integer value at specified destination element | |
PINSRD xmm1, r/m32, imm8 | 66 0F 3A 22 /r ib | Insert a dword integer value at specified destination element | |
PINSRQ xmm1, r/m64, imm8 | 66 REX.W 0F 3A 22 /r ib | Insert a qword integer value at specified destination element | |
PEXTRB reg/m8, xmm2, imm8 | 66 0F 3A 14 /r ib | Extract a byte integer value at source byte offset, upper bits are zeroed. | |
PEXTRW reg/m16, xmm, imm8 | 66 0F 3A 15 /r ib | Extract word and copy to lowest 16 bits, zero-extended | |
PEXTRD r/m32, xmm2, imm8 | 66 0F 3A 16 /r ib | Extract a dword integer value at source dword offset | |
PEXTRQ r/m64, xmm2, imm8 | 66 REX.W 0F 3A 16 /r ib | Extract a qword integer value at source qword offset | |
PMOVSXBW xmm1, xmm2/m64 | 66 0F 38 20 /r | Sign extend 8 packed 8-bit integers to 8 packed 16-bit integers | |
PMOVZXBW xmm1, xmm2/m64 | 66 0F 38 30 /r | Zero extend 8 packed 8-bit integers to 8 packed 16-bit integers | |
PMOVSXBD xmm1, xmm2/m32 | 66 0F 38 21 /r | Sign extend 4 packed 8-bit integers to 4 packed 32-bit integers | |
PMOVZXBD xmm1, xmm2/m32 | 66 0F 38 31 /r | Zero extend 4 packed 8-bit integers to 4 packed 32-bit integers | |
PMOVSXBQ xmm1, xmm2/m16 | 66 0F 38 22 /r | Sign extend 2 packed 8-bit integers to 2 packed 64-bit integers | |
PMOVZXBQ xmm1, xmm2/m16 | 66 0F 38 32 /r | Zero extend 2 packed 8-bit integers to 2 packed 64-bit integers | |
PMOVSXWD xmm1, xmm2/m64 | 66 0F 38 23 /r | Sign extend 4 packed 16-bit integers to 4 packed 32-bit integers | |
PMOVZXWD xmm1, xmm2/m64 | 66 0F 38 33 /r | Zero extend 4 packed 16-bit integers to 4 packed 32-bit integers | |
PMOVSXWQ xmm1, xmm2/m32 | 66 0F 38 24 /r | Sign extend 2 packed 16-bit integers to 2 packed 64-bit integers | |
PMOVZXWQ xmm1, xmm2/m32 | 66 0F 38 34 /r | Zero extend 2 packed 16-bit integers to 2 packed 64-bit integers | |
PMOVSXDQ xmm1, xmm2/m64 | 66 0F 38 25 /r | Sign extend 2 packed 32-bit integers to 2 packed 64-bit integers | |
PMOVZXDQ xmm1, xmm2/m64 | 66 0F 38 35 /r | Zero extend 2 packed 32-bit integers to 2 packed 64-bit integers | |
PTEST xmm1, xmm2/m128 | 66 0F 38 17 /r | Set ZF if AND result is all 0s, set CF if AND NOT result is all 0s | |
PCMPEQQ xmm1, xmm2/m128 | 66 0F 38 29 /r | Compare packed qwords for equality | |
PACKUSDW xmm1, xmm2/m128 | 66 0F 38 2B /r | Convert 2 × 4 packed signed doubleword integers into 8 packed unsigned word integers with saturation | |
MOVNTDQA xmm1, m128 | 66 0F 38 2A /r | Move double quadword using non-temporal hint if WC memory type |
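The flag behaviour listed for PTEST in the table above can be modelled in a few lines. The following is an illustrative Python sketch only, not a description of how hardware implements the instruction; the function name `ptest` and the use of Python integers to stand in for 128-bit register values are assumptions made for the example.

```python
MASK128 = (1 << 128) - 1  # keep values within 128 bits

def ptest(xmm1: int, xmm2: int) -> tuple:
    """Return (ZF, CF) as PTEST xmm1, xmm2/m128 would set them.

    xmm1 is the first (destination-position) operand, xmm2 the second.
    """
    a = xmm1 & MASK128
    b = xmm2 & MASK128
    zf = int((a & b) == 0)               # ZF: the AND result is all zeros
    cf = int((~a & b & MASK128) == 0)    # CF: the (NOT xmm1) AND xmm2 result is all zeros
    return zf, cf
```

In this model, `ptest(x, x)` yields ZF=1 exactly when `x` is zero, which is the usual idiom for testing whether a whole register is zero.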
Added with Phenom processors
Instruction | Opcode | Meaning | |
---|---|---|---|
EXTRQ | 66 0F 78 /0 ib ib | Extract Field From Register | |
66 0F 79 /r | |||
INSERTQ | F2 0F 78 /r ib ib | Insert Field | |
F2 0F 79 /r | |||
MOVNTSD | F2 0F 2B /r | Move Non-Temporal Scalar Double-Precision Floating-Point | |
MOVNTSS | F3 0F 2B /r | Move Non-Temporal Scalar Single-Precision Floating-Point |
Added with Nehalem processors
Instruction | Opcode | Meaning | |
---|---|---|---|
PCMPESTRI xmm1, xmm2/m128, imm8 | 66 0F 3A 61 /r imm8 | Packed comparison of string data with explicit lengths, generating an index | |
PCMPESTRM xmm1, xmm2/m128, imm8 | 66 0F 3A 60 /r imm8 | Packed comparison of string data with explicit lengths, generating a mask | |
PCMPISTRI xmm1, xmm2/m128, imm8 | 66 0F 3A 63 /r imm8 | Packed comparison of string data with implicit lengths, generating an index | |
PCMPISTRM xmm1, xmm2/m128, imm8 | 66 0F 3A 62 /r imm8 | Packed comparison of string data with implicit lengths, generating a mask | |
PCMPGTQ xmm1,xmm2/m128 | 66 0F 38 37 /r | Compare packed signed qwords for greater than. |
Half-precision floating-point conversion.
Instruction | Meaning | |
---|---|---|
VCVTPH2PS xmm1, xmm2/m64 | Convert four half-precision floating-point values in memory or the bottom half of an XMM register to four single-precision floating-point values in an XMM register | |
VCVTPH2PS ymm1, xmm2/m128 | Convert eight half-precision floating-point values in memory or an XMM register (the bottom half of a YMM register) to eight single-precision floating-point values in a YMM register | |
VCVTPS2PH xmm1/m64, xmm2, imm8 | Convert four single-precision floating-point values in an XMM register to half-precision floating-point values in memory or the bottom half of an XMM register | |
VCVTPS2PH xmm1/m128, ymm2, imm8 | Convert eight single-precision floating-point values in a YMM register to half-precision floating-point values in memory or an XMM register |
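The effect of these conversions on individual values can be approximated in Python, whose `struct` module supports the IEEE 754 binary16 format through the `e` format code. This is a rough sketch under stated assumptions: the helper names `cvtps2ph` and `cvtph2ps` are made up for the example, `struct` always rounds to nearest-even (whereas the hardware conversion to half precision takes its rounding mode from an immediate operand), and `struct` raises `OverflowError` for values too large for binary16 instead of producing an infinity.

```python
import struct

def cvtps2ph(values):
    """Round each float to binary16 and return the raw 16-bit patterns."""
    return [struct.unpack("<H", struct.pack("<e", v))[0] for v in values]

def cvtph2ps(bits):
    """Expand raw binary16 bit patterns to Python floats (always exact)."""
    return [struct.unpack("<e", struct.pack("<H", b))[0] for b in bits]
```

For instance, `cvtps2ph([1.0, -2.0])` gives the bit patterns `0x3C00` and `0xC000`, and passing those back through `cvtph2ps` recovers the original values.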
AVX was first supported by Intel with Sandy Bridge and by AMD with Bulldozer.
Vector operations on 256 bit registers.
Instruction | Description | |
---|---|---|
VBROADCASTSS | Copy a 32-bit, 64-bit or 128-bit memory operand to all elements of a XMM or YMM vector register. | |
VBROADCASTSD | ||
VBROADCASTF128 | ||
VINSERTF128 | Replaces either the lower half or the upper half of a 256-bit YMM register with the value of a 128-bit source operand. The other half of the destination is unchanged. | |
VEXTRACTF128 | Extracts either the lower half or the upper half of a 256-bit YMM register and copies the value to a 128-bit destination operand. | |
VMASKMOVPS | Conditionally reads any number of elements from a SIMD vector memory operand into a destination register, leaving the remaining vector elements unread and setting the corresponding elements in the destination register to zero. Alternatively, conditionally writes any number of elements from a SIMD vector register operand to a vector memory operand, leaving the remaining elements of the memory operand unchanged. On the AMD Jaguar processor architecture, this instruction with a memory source operand takes more than 300 clock cycles when the mask is zero, in which case the instruction should do nothing. This appears to be a design flaw.[2] | |
VMASKMOVPD | ||
VPERMILPS | Permute In-Lane. Shuffle the 32-bit or 64-bit vector elements of one input operand. These are in-lane 256-bit instructions, meaning that they operate on all 256 bits as two separate 128-bit shuffles, so they cannot shuffle across the 128-bit lanes.[3] | |
VPERMILPD | ||
VPERM2F128 | Shuffle the four 128-bit vector elements of two 256-bit source operands into a 256-bit destination operand, with an immediate constant as selector. | |
VZEROALL | Set all YMM registers to zero and tag them as unused. Used when switching between 128-bit use and 256-bit use. | |
VZEROUPPER | Set the upper half of all YMM registers to zero. Used when switching between 128-bit use and 256-bit use. |
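The "in-lane" restriction described for VPERMILPS can be made concrete with a small model of its immediate-controlled 256-bit form. This is a sketch for illustration: the function name `vpermilps_imm` is hypothetical, and a Python list of eight items stands in for the eight 32-bit elements of a YMM register (index 0 is the lowest element).

```python
def vpermilps_imm(src, imm8):
    """Model of VPERMILPS ymm, ymm, imm8 on a list of eight elements."""
    assert len(src) == 8
    dst = []
    for lane_base in (0, 4):               # first element of each 128-bit lane
        for i in range(4):
            sel = (imm8 >> (2 * i)) & 3    # 2-bit selector for destination element i
            dst.append(src[lane_base + sel])  # the source is always taken from the same lane
    return dst
```

With `imm8 = 0x1B` each lane is reversed independently, giving `[3, 2, 1, 0, 7, 6, 5, 4]` for the input `[0, 1, ..., 7]`; no selector value can move an element from the upper lane into the lower one.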
Introduced in Intel's Haswell microarchitecture and AMD's Excavator.
Expansion of most vector integer SSE and AVX instructions to 256 bits
Instruction | Description | |
---|---|---|
VBROADCASTSS | Copy a 32-bit or 64-bit register operand to all elements of an XMM or YMM vector register. These are register versions of the same instructions in AVX1. There is no 128-bit register version, but the same effect can be achieved using VINSERTF128. | |
VBROADCASTSD | ||
VPBROADCASTB | Copy an 8, 16, 32 or 64-bit integer register or memory operand to all elements of a XMM or YMM vector register. | |
VPBROADCASTW | ||
VPBROADCASTD | ||
VPBROADCASTQ | ||
VBROADCASTI128 | Copy a 128-bit memory operand to all elements of a YMM vector register. | |
VINSERTI128 | Replaces either the lower half or the upper half of a 256-bit YMM register with the value of a 128-bit source operand. The other half of the destination is unchanged. | |
VEXTRACTI128 | Extracts either the lower half or the upper half of a 256-bit YMM register and copies the value to a 128-bit destination operand. | |
VGATHERDPD | Gathers single or double precision floating point values using either 32 or 64-bit indices and scale. | |
VGATHERQPD | ||
VGATHERDPS | ||
VGATHERQPS | ||
VPGATHERDD | Gathers 32 or 64-bit integer values using either 32 or 64-bit indices and scale. | |
VPGATHERDQ | ||
VPGATHERQD | ||
VPGATHERQQ | ||
VPMASKMOVD | Conditionally reads any number of elements from a SIMD vector memory operand into a destination register, leaving the remaining vector elements unread and setting the corresponding elements in the destination register to zero. Alternatively, conditionally writes any number of elements from a SIMD vector register operand to a vector memory operand, leaving the remaining elements of the memory operand unchanged. | |
VPMASKMOVQ | ||
VPERMPS | Shuffle the eight 32-bit vector elements of one 256-bit source operand into a 256-bit destination operand, with a register or memory operand as selector. | |
VPERMD | ||
VPERMPD | Shuffle the four 64-bit vector elements of one 256-bit source operand into a 256-bit destination operand, with a register or memory operand as selector. | |
VPERMQ | ||
VPERM2I128 | Shuffle (two of) the four 128-bit vector elements of two 256-bit source operands into a 256-bit destination operand, with an immediate constant as selector. | |
VPBLENDD | Doubleword immediate version of the PBLEND instructions from SSE4. | |
VPSLLVD | Shift left logical. Allows variable shifts where each element is shifted according to the packed input. | |
VPSLLVQ | ||
VPSRLVD | Shift right logical. Allows variable shifts where each element is shifted according to the packed input. | |
VPSRLVQ | ||
VPSRAVD | Shift right arithmetically. Allows variable shifts where each element is shifted according to the packed input. |
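The per-element variable shifts at the end of the table differ from the older SSE shifts in that every element has its own count. A minimal Python model of the 32-bit forms follows; the function names mirror the mnemonics but are otherwise just illustrative helpers, and elements are represented as unsigned 32-bit integers. The handling of counts of 32 or more (zero for logical shifts, sign fill for the arithmetic shift) follows the documented behaviour of these instructions.

```python
M32 = 0xFFFFFFFF

def vpsllvd(values, counts):
    """Per-element logical left shift; counts of 32 or more give 0."""
    return [((v << c) & M32) if c < 32 else 0 for v, c in zip(values, counts)]

def vpsrlvd(values, counts):
    """Per-element logical right shift; counts of 32 or more give 0."""
    return [((v & M32) >> c) if c < 32 else 0 for v, c in zip(values, counts)]

def vpsravd(values, counts):
    """Per-element arithmetic right shift; large counts fill with the sign bit."""
    out = []
    for v, c in zip(values, counts):
        signed = v - (1 << 32) if v & 0x80000000 else v  # reinterpret as signed
        out.append((signed >> min(c, 31)) & M32)
    return out
```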
See main article: FMA instruction set. Floating-point fused multiply-add instructions were introduced in x86 as two instruction set extensions, "FMA3" and "FMA4", both of which build on top of AVX to provide a set of scalar/vector instructions using the xmm/ymm/zmm vector registers. FMA3 defines a set of 3-operand fused-multiply-add instructions that take three input operands and write their result back to the first of them. FMA4 defines a set of 4-operand fused-multiply-add instructions that take four operands: a destination operand and three source operands.
FMA3 is supported on Intel CPUs starting with Haswell, on AMD CPUs starting with Piledriver, and on Zhaoxin CPUs starting with YongFeng. FMA4 was only supported on AMD Family 15h (Bulldozer) CPUs and has been abandoned from AMD Zen onwards. The FMA3/FMA4 extensions are not considered to be an intrinsic part of AVX or AVX2, although all Intel and AMD (but not Zhaoxin) processors that support AVX2 also support FMA3. FMA3 instructions (in EVEX-encoded form) are, however, AVX-512 foundation instructions.
The FMA3 and FMA4 instruction sets both define a set of 10 fused-multiply-add operations, all available in FP32 and FP64 variants. For each of these variants, FMA3 defines three operand orderings while FMA4 defines two.
FMA3 encoding
FMA3 instructions are encoded with the VEX or EVEX prefixes, in the form VEX.66.0F38 xy /r or EVEX.66.0F38 xy /r. The VEX.W/EVEX.W bit selects the floating-point format (W=0 means FP32, W=1 means FP64). The opcode byte xy consists of two nibbles, where the top nibble x selects the operand ordering (9='132', A='213', B='231') and the bottom nibble y (values 6..F) selects which one of the 10 fused-multiply-add operations to perform. (Values of x and y outside the given ranges result in something that is not an FMA3 instruction.)
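The nibble scheme just described can be turned into a small decoder. The sketch below is illustrative only: the function name `fma3_mnemonic` and the dictionary layout are assumptions for the example, and the mapping from the bottom nibble to an operation is read off the opcode table later in this section. It covers the FP32/FP64 forms selected by the W bit, not the FP16 or BF16 variants.

```python
# Top nibble of the opcode byte: operand ordering.
ORDERINGS = {0x9: "132", 0xA: "213", 0xB: "231"}

# Bottom nibble: (operation, "P" for packed or "S" for scalar).
OPERATIONS = {
    0x6: ("VFMADDSUB", "P"), 0x7: ("VFMSUBADD", "P"),
    0x8: ("VFMADD", "P"),    0x9: ("VFMADD", "S"),
    0xA: ("VFMSUB", "P"),    0xB: ("VFMSUB", "S"),
    0xC: ("VFNMADD", "P"),   0xD: ("VFNMADD", "S"),
    0xE: ("VFNMSUB", "P"),   0xF: ("VFNMSUB", "S"),
}

def fma3_mnemonic(opcode, w):
    """Map an FMA3 opcode byte and W bit to a mnemonic, or None if not FMA3."""
    x, y = opcode >> 4, opcode & 0xF
    if x not in ORDERINGS or y not in OPERATIONS:
        return None
    base, kind = OPERATIONS[y]
    fmt = "D" if w else "S"   # W=1 selects FP64, W=0 selects FP32
    return f"{base}{ORDERINGS[x]}{kind}{fmt}"
```

For example, `fma3_mnemonic(0xB9, 1)` returns `"VFMADD231SD"`, while an opcode byte such as `0x95` falls outside the ranges and yields `None`.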
At the assembly language level, the operand ordering is specified in the mnemonic of the instruction:
vfmadd132sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm1*xmm3)+xmm2
vfmadd213sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm2*xmm1)+xmm3
vfmadd231sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm2*xmm3)+xmm1
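The three orderings can be checked with a short scalar model. This is a sketch under assumptions: `fma` and `vfmadd_sd` are hypothetical helper names, Python floats stand in for FP64 register contents, and the single rounding of a fused multiply-add is imitated by doing the arithmetic exactly with `fractions.Fraction` and rounding once at the end (valid for finite inputs whose result does not overflow).

```python
from fractions import Fraction

def fma(a, b, c):
    """(a*b)+c computed exactly, then rounded once to the nearest double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def vfmadd_sd(order, xmm1, xmm2, xmm3):
    """Value written to xmm1 by vfmadd<order>sd xmm1,xmm2,xmm3.

    The digits of `order` name the operands in the pattern (A*B)+C,
    so "132" means (operand1 * operand3) + operand2.
    """
    operands = {"1": xmm1, "2": xmm2, "3": xmm3}
    a, b, c = (operands[digit] for digit in order)
    return fma(a, b, c)
```

With `xmm1=2.0, xmm2=3.0, xmm3=4.0`, the orderings '132', '213' and '231' give 11.0, 10.0 and 14.0 respectively. The single rounding matters in cases such as `fma(1.0 + 2**-30, 1.0 - 2**-30, -1.0)`, where a separately rounded multiply followed by an add would return 0.0 but the fused result is `-(2.0**-60)`.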
For all FMA3 variants, the first two arguments must be xmm/ymm/zmm vector register arguments, while the last argument may be either a vector register or memory argument. Under AVX-512 and AVX10, the EVEX-encoded variants support EVEX-prefix-encoded broadcast, opmasks and rounding-controls.
The AVX512-FP16 extension, introduced in Sapphire Rapids, adds FP16 variants of the FMA3 instructions; these all take the form EVEX.66.MAP6.W0 xy /r, with the opcode byte working in the same way as for the FP32/FP64 variants. The AVX10.2 extension, published in 2024,[4] similarly adds BF16 variants of the packed (but not scalar) FMA3 instructions; these all take the form EVEX.NP.MAP6.W0 xy /r, with the opcode byte again working similarly to the FP32/FP64 variants. (For the FMA4 instructions, no FP16 or BF16 variants are defined.)
FMA4 encoding
FMA4 instructions are encoded with the VEX prefix, in the form VEX.66.0F3A xx /r ib (no EVEX encodings are defined). The opcode byte xx uses its bottom bit to select the floating-point format (0=FP32, 1=FP64) and the remaining bits to select one of the 10 fused-multiply-add operations to perform.
For FMA4, operand ordering is controlled by the VEX.W bit. If VEX.W=0, then the third operand is the r/m operand specified by the instruction's ModR/M byte and the fourth operand is a register operand, specified by bits 7:4 of the ib (8-bit immediate) part of the instruction. If VEX.W=1, then these two operands are swapped. For example:
vfmaddsd xmm1,xmm2,[mem],xmm3 will perform xmm1 ← (xmm2*[mem])+xmm3 and requires a W=0 encoding.
vfmaddsd xmm1,xmm2,xmm3,[mem] will perform xmm1 ← (xmm2*xmm3)+[mem] and requires a W=1 encoding.
vfmaddsd xmm1,xmm2,xmm3,xmm4 will perform xmm1 ← (xmm2*xmm3)+xmm4 and can be encoded with either W=0 or W=1.
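The way VEX.W swaps the last two FMA4 operands can be sketched as a small helper. The function name `fma4_third_fourth` and the representation of operands as strings are assumptions made for illustration; only the rule stated above (r/m operand versus the register in bits 7:4 of the immediate) is modelled.

```python
def fma4_third_fourth(vex_w, rm_operand, imm8):
    """Return the (third, fourth) operands of an FMA4 instruction.

    rm_operand: operand selected by the ModR/M byte, e.g. "xmm3" or "[mem]"
    imm8:       trailing immediate; bits 7:4 hold a register number
    """
    imm_register = f"xmm{(imm8 >> 4) & 0xF}"
    if vex_w == 0:
        return rm_operand, imm_register   # W=0: r/m is third, immediate register is fourth
    return imm_register, rm_operand       # W=1: the two are swapped
```

Under this model `fma4_third_fourth(0, "[mem]", 0x30)` corresponds to `vfmaddsd xmm1,xmm2,[mem],xmm3`, and the same fields with W=1 correspond to `vfmaddsd xmm1,xmm2,xmm3,[mem]`. When both operands are registers, either W value can express the instruction by placing the registers in the appropriate fields.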
Opcode table
The 10 fused-multiply-add operations and the 122 instruction variants they give rise to are given by the following table, with FMA4 instructions marked with * and FMA3 instructions unmarked:
Basic operation | Opcode byte | FP32 instructions | FP64 instructions | FP16 instructions (AVX512-FP16) | BF16 instructions (AVX10.2) |
---|---|---|---|---|---|
Packed alternating multiply-add/subtract | 96 | VFMADDSUB132PS | VFMADDSUB132PD | VFMADDSUB132PH | |
Packed alternating multiply-add/subtract | A6 | VFMADDSUB213PS | VFMADDSUB213PD | VFMADDSUB213PH | |
Packed alternating multiply-add/subtract | B6 | VFMADDSUB231PS | VFMADDSUB231PD | VFMADDSUB231PH | |
Packed alternating multiply-subtract/add | 97 | VFMSUBADD132PS | VFMSUBADD132PD | VFMSUBADD132PH | |
Packed alternating multiply-subtract/add | A7 | VFMSUBADD213PS | VFMSUBADD213PD | VFMSUBADD213PH | |
Packed alternating multiply-subtract/add | B7 | VFMSUBADD231PS | VFMSUBADD231PD | VFMSUBADD231PH | |
Packed multiply-add (A*B)+C | 98 | VFMADD132PS | VFMADD132PD | VFMADD132PH | VFMADD132NEPBF16 |
Packed multiply-add (A*B)+C | A8 | VFMADD213PS | VFMADD213PD | VFMADD213PH | VFMADD213NEPBF16 |
Packed multiply-add (A*B)+C | B8 | VFMADD231PS | VFMADD231PD | VFMADD231PH | VFMADD231NEPBF16 |
Scalar multiply-add (A*B)+C | 99 | VFMADD132SS | VFMADD132SD | VFMADD132SH | |
Scalar multiply-add (A*B)+C | A9 | VFMADD213SS | VFMADD213SD | VFMADD213SH | |
Scalar multiply-add (A*B)+C | B9 | VFMADD231SS | VFMADD231SD | VFMADD231SH | |
Packed multiply-subtract (A*B)-C | 9A | VFMSUB132PS | VFMSUB132PD | VFMSUB132PH | VFMSUB132NEPBF16 |
Packed multiply-subtract (A*B)-C | AA | VFMSUB213PS | VFMSUB213PD | VFMSUB213PH | VFMSUB213NEPBF16 |
Packed multiply-subtract (A*B)-C | BA | VFMSUB231PS | VFMSUB231PD | VFMSUB231PH | VFMSUB231NEPBF16 |
Scalar multiply-subtract (A*B)-C | 9B | VFMSUB132SS | VFMSUB132SD | VFMSUB132SH | |
Scalar multiply-subtract (A*B)-C | AB | VFMSUB213SS | VFMSUB213SD | VFMSUB213SH | |
Scalar multiply-subtract (A*B)-C | BB | VFMSUB231SS | VFMSUB231SD | VFMSUB231SH | |
Packed negative-multiply-add (-A*B)+C | 9C | VFNMADD132PS | VFNMADD132PD | VFNMADD132PH | VFNMADD132NEPBF16 |
Packed negative-multiply-add (-A*B)+C | AC | VFNMADD213PS | VFNMADD213PD | VFNMADD213PH | VFNMADD213NEPBF16 |
Packed negative-multiply-add (-A*B)+C | BC | VFNMADD231PS | VFNMADD231PD | VFNMADD231PH | VFNMADD231NEPBF16 |
Scalar negative-multiply-add (-A*B)+C | 9D | VFNMADD132SS | VFNMADD132SD | VFNMADD132SH | |
Scalar negative-multiply-add (-A*B)+C | AD | VFNMADD213SS | VFNMADD213SD | VFNMADD213SH | |
Scalar negative-multiply-add (-A*B)+C | BD | VFNMADD231SS | VFNMADD231SD | VFNMADD231SH | |
Packed negative-multiply-subtract (-A*B)-C | 9E | VFNMSUB132PS | VFNMSUB132PD | VFNMSUB132PH | VFNMSUB132NEPBF16 |
Packed negative-multiply-subtract (-A*B)-C | AE | VFNMSUB213PS | VFNMSUB213PD | VFNMSUB213PH | VFNMSUB213NEPBF16 |
Packed negative-multiply-subtract (-A*B)-C | BE | VFNMSUB231PS | VFNMSUB231PD | VFNMSUB231PH | VFNMSUB231NEPBF16 |
Scalar negative-multiply-subtract (-A*B)-C | 9F | VFNMSUB132SS | VFNMSUB132SD | VFNMSUB132SH | |
Scalar negative-multiply-subtract (-A*B)-C | AF | VFNMSUB213SS | VFNMSUB213SD | VFNMSUB213SH | |
Scalar negative-multiply-subtract (-A*B)-C | BF | VFNMSUB231SS | VFNMSUB231SD | VFNMSUB231SH | |
AVX-512, introduced in 2014, adds 512-bit wide vector registers (extending the 256-bit registers, which become the new registers' lower halves) and doubles their count to 32; the new registers are thus named zmm0 through zmm31. It adds eight mask registers, named k0 through k7, which may be used to restrict operations to specific parts of a vector register. Unlike previous instruction set extensions, AVX-512 is implemented in several groups; only the foundation ("AVX-512F") extension is mandatory.[5] Most of the added instructions may also be used with the 256- and 128-bit registers.
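How a mask register restricts an operation to part of a vector can be illustrated with a model of a masked 32-bit integer add. This is a sketch, not a full description: the function name `masked_add` is hypothetical, Python lists stand in for vector registers, and the two behaviours shown for unselected elements (keeping the old destination value, or writing zero) are the merge-masking and zero-masking modes of AVX-512, a detail not spelled out in the paragraph above.

```python
def masked_add(dst, a, b, k, zeroing=False):
    """Model of a masked packed 32-bit add.

    Bit i of the mask k decides whether element i is computed.
    Unselected elements keep their old destination value (merge-masking)
    or are set to zero when zeroing=True (zero-masking).
    """
    out = []
    for i, (old, x, y) in enumerate(zip(dst, a, b)):
        if (k >> i) & 1:
            out.append((x + y) & 0xFFFFFFFF)  # wrap to 32 bits
        else:
            out.append(0 if zeroing else old)
    return out
```

With the mask `0b0101`, only elements 0 and 2 are updated; the other elements either retain their previous contents or become zero depending on the masking mode.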
See main article: Advanced Matrix Extensions. Intel AMX adds eight new tile registers, tmm0-tmm7, each holding a matrix, with a maximum capacity of 16 rows of 64 bytes per tile register. It also adds a TILECFG register to configure the sizes of the actual matrices held in each of the eight tile registers, and a set of instructions to perform matrix multiplications on these registers.
AMX subset | Instruction mnemonics | Opcode | Instruction description | Added in |
---|---|---|---|---|
LDTILECFG m512 | VEX.128.NP.0F38.W0 49 /0 | Load AMX tile configuration data structure from memory as a 64-byte data structure. | ||
STTILECFG m512 | VEX.128.66.0F38.W0 49 /0 | Store AMX tile configuration data structure to memory. | ||
TILERELEASE | VEX.128.NP.0F38.W0 49 C0 | Initialize TILECFG and tile data registers (tmm0 to tmm7 ) to the INIT state (all-zeroes). | ||
TILEZERO tmm | Zero out contents of one tile register. | |||
TILELOADD tmm, sibmem | Load a data tile from memory into AMX tile register. | |||
TILELOADDT1 tmm, sibmem | VEX.128.66.0F38.W0 4B /r | Load a data tile from memory into AMX tile register, with a hint that data should not be kept in the nearest cache levels. | ||
TILESTORED sibmem, tmm | VEX.128.F3.0F38.W0 4B /r | Store a data tile to memory from AMX tile register. | ||
TDPBSSD tmm1,tmm2,tmm3 | VEX.128.F2.0F38.W0 5E /r | Matrix multiply signed bytes from tmm2 with signed bytes from tmm3, accumulating result in tmm1. | ||
TDPBSUD tmm1,tmm2,tmm3 | VEX.128.F3.0F38.W0 5E /r | Matrix multiply signed bytes from tmm2 with unsigned bytes from tmm3, accumulating result in tmm1. | ||
TDPBUSD tmm1,tmm2,tmm3 | VEX.128.66.0F38.W0 5E /r | Matrix multiply unsigned bytes from tmm2 with signed bytes from tmm3, accumulating result in tmm1. | ||
TDPBUUD tmm1,tmm2,tmm3 | VEX.128.NP.0F38.W0 5E /r | Matrix multiply unsigned bytes from tmm2 with unsigned bytes from tmm3, accumulating result in tmm1. | ||
TDPBF16PS tmm1,tmm2,tmm3 | VEX.128.F3.0F38.W0 5C /r | Matrix multiply BF16 values from tmm2 with BF16 values from tmm3, accumulating result in tmm1. | ||
TDPFP16PS tmm1,tmm2,tmm3 | VEX.128.F2.0F38.W0 5C /r | Matrix multiply FP16 values from tmm2 with FP16 values from tmm3, accumulating result in tmm1. | (Granite Rapids) | |
TCMMRLFP16PS tmm1,tmm2,tmm3 | VEX.128.NP.0F38.W0 6C /r | Matrix multiply complex numbers from tmm2 with complex numbers from tmm3, accumulating real part of result in tmm1. | ||
TCMMIMFP16PS tmm1,tmm2,tmm3 | VEX.128.66.0F38.W0 6C /r | Matrix multiply complex numbers from tmm2 with complex numbers from tmm3, accumulating imaginary part of result in tmm1. |
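The byte matrix multiplications in the table (TDPBSSD and its unsigned variants) treat each 32-bit element of the source tiles as a group of four bytes and accumulate 4-element dot products into 32-bit results. The following Python model of TDPBSSD is a sketch under assumptions: the function name `tdpbssd` and the list-of-lists tile representation are made up for the example, tile shapes are taken from the list sizes instead of from TILECFG, and inputs are assumed to already be valid signed byte values.

```python
def _wrap_i32(x):
    """Wrap an integer to the signed 32-bit range."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def tdpbssd(c, a, b):
    """Model of TDPBSSD tmm1(c), tmm2(a), tmm3(b).

    c: M rows of N signed 32-bit accumulators
    a: M rows of 4*K signed bytes
    b: K rows of 4*N signed bytes (four bytes per 32-bit column)
    """
    rows, cols, depth = len(c), len(c[0]), len(b)
    for m in range(rows):
        for n in range(cols):
            acc = c[m][n]
            for k in range(depth):
                for i in range(4):  # four bytes packed in each 32-bit element
                    acc += a[m][4 * k + i] * b[k][4 * n + i]
            c[m][n] = _wrap_i32(acc)
    return c
```

At the maximum tile size mentioned above (16 rows of 64 bytes), `a` would have 16 rows of 64 bytes and `c` would have 16 rows of 16 accumulators.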