Intel Intrinsics Guide
Intrinsics Reference
Designers must not rely on the absence or characteristics of any features or instructions
marked "reserved" or "undefined." Intel reserves these for future definition and shall have
no responsibility whatsoever for conflicts or incompatibilities arising from future changes
to them.
The software described in this document may contain software defects which may cause
the product to deviate from published specifications. Current characterized software
defects are available on request.
Intel, the Intel logo, Intel SpeedStep, Intel NetBurst, Intel NetStructure, MMX, Intel386,
Intel486, Celeron, Intel Centrino, Intel Xeon, Intel XScale, Itanium, Pentium, Pentium II
Xeon, Pentium III Xeon, Pentium M, and VTune are trademarks or registered trademarks
of Intel Corporation or its subsidiaries in the United States and other countries.
Table Of Contents
Intel(R) C++ Intrinsics Reference
Registers
Data Types
References
Time Stamp
Intel® C++ Compiler for Linux* Intrinsics Reference
Intrinsics to Read and Write Registers for Streaming SIMD Extensions
Intrinsics for Dual-Core Intel® Itanium® 2 Processor 9000 Sequence
Data Alignment, Memory Allocation Intrinsics, and Inline Assembly
Overview: Data Alignment, Memory Allocation Intrinsics, and Inline Assembly
Inline Assembly
Intel(R) C++ Intrinsics Reference
Introduction to Intel® C++ Compiler Intrinsics
Several Intel® processors enable development of optimized multimedia applications
through extensions to previously implemented instructions. Applications with media-rich
bit streams can significantly improve performance by using single instruction, multiple
data (SIMD) instructions to process data elements in parallel.
The most direct way to use these instructions is to inline the assembly language
instructions into your source code. However, this process can be time-consuming and
tedious. In addition, your compiler may not support inline assembly language
programming. The Intel® C++ Compiler enables easy implementation of these
instructions through the use of API extension sets built into the compiler. These
extension sets are referred to as intrinsic functions or intrinsics.
The Intel C++ Compiler supports both intrinsics that work on specific architectures and
intrinsics that work across all IA-32 and Itanium®-based platforms.
On processors that do not support Streaming SIMD Extensions 2 (SSE2) instructions but
do support MMX Technology, you can use the sse2mmx.h emulation pack to enable
support for SSE2 instructions. You can use the sse2mmx.h header file for the following
processors:
Processor Supported
Pentium® 4 Supported Supported Supported Supported Not Supported
Processor
Pentium® III Supported Supported Not Not Supported
Processor Supported
Pentium® II Supported Not Not Not Supported
Processor Supported Supported
Pentium® with Supported Not Not Not Supported
MMX Supported Supported
Technology
Pentium® Pro Not Supported Not Not Not Supported
Processor Supported Supported
Pentium® Not Supported Not Not Not Supported
Processor Supported Supported
Registers
Intel® processors provide special register sets.
The MMX instructions use eight 64-bit registers (mm0 to mm7) which are aliased on the
floating-point stack registers.
The Streaming SIMD Extensions use eight 128-bit registers (xmm0 to xmm7).
Because each of these registers can hold more than one data element, the processor
can process more than one data element simultaneously. This processing capability is
also known as single-instruction multiple data processing (SIMD).
For each computational and data manipulation instruction in the new extension sets,
there is a corresponding C intrinsic that implements that instruction directly. This frees
you from managing registers and assembly programming. Further, the compiler
optimizes the instruction scheduling so that your executable runs faster.
Note
The MM and XMM registers are the SIMD registers used by the IA-32 platforms to
implement MMX technology and SSE or SSE2 intrinsics. On the Itanium-based
platforms, the MMX and SSE intrinsics use the 64-bit general registers and the 64-bit
significand of the 80-bit floating-point register.
Data Types
Intrinsic functions use four new C data types as operands, representing the new
registers that are used as the operands to these intrinsic functions.
The __m128d data type can hold two 64-bit floating-point values.
The __m128i data type can hold sixteen 8-bit, eight 16-bit, four 32-bit, or two 64-bit
integer values.
The compiler aligns __m128d and __m128i local and global data to 16-byte boundaries
on the stack. To align integer, float, or double arrays, you can use the declspec align
statement.
_mm_<intrin_op>_<suffix>
<intrin_op> Indicates the basic operation of the intrinsic; for example, add for addition
and sub for subtraction.
<suffix> Denotes the type of data the instruction operates on. The first one or two
letters of each suffix denote whether the data is packed (p), extended
packed (ep), or scalar (s). The remaining letters and numbers denote the
type, with notation as follows:
A number appended to a variable name indicates the element of a packed object. For
example, r0 is the lowest word of r. Some intrinsics are "composites" because they
require more than one instruction to implement them.
The packed values are represented in right-to-left order, with the lowest value being
used for scalar operations. Consider the following example operation:
__m128d t = _mm_load_pd(a);
In other words, the xmm register that holds the value t appears as follows:
The "scalar" element is 1.0. Due to the nature of the instruction, some intrinsics require
their arguments to be immediates (constant integer literals).
References
See the following publications and internet locations for more information about intrinsics
and the Intel® architectures that support them. You can find all publications on the Intel
website.
Code Samples
Dot Product
This code sample demonstrates how to use C, MMX™ technology, and Streaming SIMD
Extensions 3 (SSE3) intrinsics to calculate the dot product of two vectors. Typical output
is 506.000000 when the dot product is computed by C or by SSE3 intrinsics, and 506 when
it is computed by MMX intrinsics. Output may vary depending on your compiler
version and the components of your computing platform.
SSE3 intrinsics do not run on processors from the Pentium® III family or earlier.
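/*
 * [Description]
 * [Compile]
 * [Output]
 */
#include <stdio.h>
#include <pmmintrin.h>

/* 12 elements: a multiple of 4, and 0*0 + 1*1 + ... + 11*11 = 506 */
#define SIZE 12

float dot_product(float *a, float *b);
float dot_product_intrin(float *a, float *b);
short MMX_dot_product(short *a, short *b);

int main()
{
    float x[SIZE], y[SIZE];
    short a[SIZE], b[SIZE];
    int i;
    float product;
    short mmx_product;

    for (i = 0; i < SIZE; i++)
    {
        x[i]=i;
        y[i]=i;
        a[i]=i;
        b[i]=i;
    }
    product = dot_product(x, y);
    printf("Dot product computed by C: %f\n", product);
#if __INTEL_COMPILER
    product =dot_product_intrin(x,y);
    printf("Dot product computed by SSE3 intrinsics: %f\n", product);
    mmx_product =MMX_dot_product(a,b);
    _mm_empty();    /* empty the multimedia state before further floating-point use */
    printf("Dot product computed by MMX intrinsics: %d\n", mmx_product);
#else
    printf("Use the Intel compiler to calculate the dot product\n");
    printf("using intrinsics\n");
#endif
    return 0;
}

/* Dot product in plain C */
float dot_product(float *a, float *b)
{
    int i;
    float sum=0;
    for (i = 0; i < SIZE; i++)
        sum += a[i]*b[i];
    return sum;
}

#if __INTEL_COMPILER
/* Dot product using SSE3 intrinsics */
float dot_product_intrin(float *a, float *b)
{
    float total;
    int i;
    __m128 num1, num2, num3, num4;

    num4 = _mm_setzero_ps();             /* four running partial sums */
    for (i = 0; i < SIZE; i += 4)
    {
        num1 = _mm_loadu_ps(a+i);
        num2 = _mm_loadu_ps(b+i);
        num3 = _mm_mul_ps(num1, num2);
        num4 = _mm_add_ps(num4, num3);
    }
    num4 = _mm_hadd_ps(num4, num4);      /* SSE3 horizontal add, applied twice, */
    num4 = _mm_hadd_ps(num4, num4);      /* leaves the total in every element   */
    _mm_store_ss(&total,num4);
    return total;
}

/* Dot product using MMX intrinsics */
short MMX_dot_product(short *a, short *b)
{
    int i;
    short result, data;
    __m64 num3, sum;
    __m64 *ptr1, *ptr2;

    sum = _mm_setzero_si64();
    for (i = 0; i < SIZE; i += 4)
    {
        // view four 16-bit elements at a time as the contents of MMX
        //registers
        ptr1 = (__m64*)&a[i];
        ptr2 = (__m64*)&b[i];
        num3 = _m_pmaddwd(*ptr1, *ptr2); /* two 32-bit partial sums */
        sum = _m_paddd(sum, num3);
    }
    data = _m_to_int(sum);               /* lower 32-bit partial sum */
    sum = _m_psrlqi(sum, 32);            /* shift the upper partial sum down */
    result = _m_to_int(sum);
    result= result+data;
    return result;
}
#endif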
Double Complex
This code sample demonstrates how to use C, Streaming SIMD Extensions 2 (SSE2)
and Streaming SIMD Extensions 3 (SSE3) intrinsics to multiply two complex numbers.
The following output is typical of this code: 23.00+ -2.00i. Output may vary depending on
your compiler version and the components of your computing platform.
SSE3 intrinsics do not run on processors from the Pentium® III family or earlier.
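/*
 * [Description]
 * [Compile]
 * [Output]
 */
#include <stdio.h>
#include <pmmintrin.h>

typedef struct {
    double real;
    double img;
} complex_num;

/* Multiply complex numbers in plain C */
void multiply_C(complex_num x, complex_num y, complex_num *z)
{
    z->real = (x.real*y.real) - (x.img*y.img);
    z->img  = (x.img*y.real) + (y.img*x.real);
}

#if __INTEL_COMPILER
/* Multiply complex numbers using SSE3 intrinsics */
void multiply_SSE3(complex_num x, complex_num y, complex_num *z)
{
    __m128d num1, num2, num3;

    // num1: [x.real, x.real]
    num1 = _mm_loaddup_pd(&x.real);
    // num2: [y.img, y.real]
    num2 = _mm_set_pd(y.img, y.real);
    // num3: [(x.real*y.img), (x.real*y.real)]
    num3 = _mm_mul_pd(num2, num1);
    // num1: [x.img, x.img]
    num1 = _mm_loaddup_pd(&x.img);
    // num2: [y.real, y.img]
    num2 = _mm_shuffle_pd(num2, num2, 1);
    // num2: [(x.img*y.real), (x.img*y.img)]
    num2 = _mm_mul_pd(num2, num1);
    // num3: [((x.real*y.img)+(x.img*y.real)),
    // ((x.real*y.real)-(x.img*y.img))]
    num3 = _mm_addsub_pd(num3, num2);
    _mm_storeu_pd((double *)z, num3);
}
#endif

#if __INTEL_COMPILER
/* Multiply complex numbers using SSE2 intrinsics */
void multiply_SSE2(complex_num x, complex_num y, complex_num *z)
{
    __m128d num1, num2, num3, num4;

    num1 = _mm_load1_pd(&x.real);
    num2 = _mm_set_pd(y.img, y.real);
    num3 = _mm_mul_pd(num2, num1);
    num1 = _mm_load1_pd(&x.img);
    num2 = _mm_shuffle_pd(num2, num2, 1);
    num2 = _mm_mul_pd(num2, num1);
    num4 = _mm_add_pd(num3, num2);          // upper element: imaginary part
    num3 = _mm_sub_pd(num3, num2);          // lower element: real part
    num3 = _mm_shuffle_pd(num3, num4, 2);   // [imaginary, real]
    _mm_storeu_pd((double *)z, num3);
}
#endif

int main()
{
    complex_num a, b, c;
    a.real = 3;
    a.img = 2;
    b.real = 5;
    b.img = -4;
    multiply_C(a, b, &c);
    printf("C: %2.2f+ %2.2fi\n", c.real, c.img);
#if __INTEL_COMPILER
    multiply_SSE3(a, b, &c);
    printf("SSE3: %2.2f+ %2.2fi\n", c.real, c.img);
    multiply_SSE2(a, b, &c);
    printf("SSE2: %2.2f+ %2.2fi\n", c.real, c.img);
#endif
    return 0;
}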
Time Stamp
This code sample demonstrates how to use the _rdtsc() intrinsic to read the time stamp
counter. The output depends on the current value of the 64-bit time stamp counter, and
therefore varies each time you run the code.
/*
 * [Description]
 * [Compile]
 * [Output]
 * <varies>
 */
#include <stdio.h>
int main()
{
#if __INTEL_COMPILER
    __int64 start, stop, elapsed;
    int i;
    int arr[10000];

    start= _rdtsc();
    for (i = 0; i < 10000; i++)
    {
        arr[i]=i;
    }
    stop= _rdtsc();
    elapsed = stop - start;
    printf("Processor cycles: %lld\n", (long long)elapsed);
#else
    printf("Use the Intel compiler to call the\n");
    printf("_rdtsc intrinsic\n");
#endif
    return 0;
}
Intrinsic                                               Description
int abs(int)                                            Returns the absolute value of an integer.
long labs(long)                                         Returns the absolute value of a long integer.
unsigned long _lrotl(unsigned long value, int shift)    Rotates bits left for an unsigned long integer.
unsigned long _lrotr(unsigned long value, int shift)    Rotates bits right for an unsigned long integer.
unsigned int _rotl(unsigned int value, int shift)       Rotates bits left for an unsigned integer.
unsigned int _rotr(unsigned int value, int shift)       Rotates bits right for an unsigned integer.
Note
Passing a constant shift value in the rotate intrinsics results in higher performance.
Floating-point Intrinsics
The following table lists and describes floating point intrinsics that you can use across all
Intel® architectures.
Intrinsic               Description
double fabs(double)     Returns the absolute value of a floating-point value.
double log(double)      Returns the natural logarithm ln(x), x>0, with double precision.
float logf(float)       Returns the natural logarithm ln(x), x>0, with single precision.
double log10(double)    Returns the base 10 logarithm log10(x), x>0, with double precision.
The string and block copy intrinsics are not implemented as intrinsics on Itanium®-based
platforms.
Intrinsic                                               Description
char *_strset(char *, _int32)                           Sets all characters in a string to a fixed value.
int memcmp(const void *cs, const void *ct, size_t n)    Compares two regions of memory. Returns <0 if
                                                        cs<ct, 0 if cs=ct, or >0 if cs>ct.
void *memcpy(void *s, const void *ct, size_t n)         Copies n bytes from ct to s. Returns s.
void *memset(void * s, int c, size_t n)                 Sets n bytes of memory at s to the value c.
                                                        Returns s.
char *strcat(char * s, const char * ct)                 Appends ct to the string s. Returns s.
int strcmp(const char *cs, const char *ct)              Compares two strings. Returns <0 if cs<ct,
                                                        0 if cs=ct, or >0 if cs>ct.
char *strcpy(char * s, const char * ct)                 Copies the string ct to s. Returns s.
size_t strlen(const char * cs)                          Returns the length of string cs.
int strncmp(char *, char *, int)                        Compares two strings, but only up to the
                                                        specified number of characters.
int strncpy(char *, char *, int)                        Copies a string, but only up to the specified
                                                        number of characters.
Miscellaneous Intrinsics
The following table lists and describes intrinsics that you can use across all Intel®
architectures, except where noted.
Intrinsic                       Description
_abnormal_termination(void)     Can be invoked only by termination handlers. Returns TRUE if
                                the termination handler is invoked as a result of a premature
                                exit of the corresponding try-finally region.
__cpuid                         Queries the processor for information about processor type and
                                supported features. The Intel® C++ Compiler supports the
                                Microsoft* implementation of this intrinsic. See the Microsoft
                                documentation for details.
void *_alloca(int)              Allocates memory in the local stack frame. The memory is
                                automatically freed upon return from the function.
int _bit_scan_forward(int x)    Returns the bit index of the least significant set bit of x.
                                If x is 0, the result is undefined.
int _bit_scan_reverse(int x)    Returns the bit index of the most significant set bit of x.
                                If x is 0, the result is undefined.
int _bswap(int x)               Reverses the byte order of x. Bits 0-7 are swapped with bits
                                24-31, and bits 8-15 are swapped with bits 16-23.
_exception_code(void)           Returns the exception code.
_exception_info(void)           Returns the exception information.
void _enable(void)              Enables the interrupt.
void _disable(void)             Disables the interrupt.
int _in_byte(int)               Maps to the IA-32 instruction IN. Transfers a data byte from
                                the port specified by the argument.
int _in_dword(int)              Maps to the IA-32 instruction IN. Transfers a double word from
                                the port specified by the argument.
int _in_word(int)               Maps to the IA-32 instruction IN. Transfers a word from the
                                port specified by the argument.
int _inp(int)                   Same as _in_byte.
int _inpd(int)                  Same as _in_dword.
int _inpw(int)                  Same as _in_word.
int _out_byte(int, int)         Maps to the IA-32 instruction OUT. Transfers the data byte in
                                the second argument to the port specified by the first argument.
int _out_dword(int, int)        Maps to the IA-32 instruction OUT. Transfers the double word in
                                the second argument to the port specified by the first argument.
int _out_word(int, int)         Maps to the IA-32 instruction OUT. Transfers the word in the
                                second argument to the port specified by the first argument.
The prototypes for MMX technology intrinsics are in the mmintrin.h header file.
Caution
Failure to empty the multimedia state after using an MMX instruction and before using a
floating-point instruction can result in unexpected execution or poor performance.
void _mm_empty(void)
__m64 _mm_cvtsi32_si64(int i)
Convert the integer object i to a 64-bit __m64 object. The integer value is zero-extended
to 64 bits.
int _mm_cvtsi64_si32(__m64 m)
__m64 _mm_cvtsi64_m64(__int64 i)
__int64 _mm_cvtm64_si64(__m64 m)
Pack the four 16-bit values from m1 into the lower four 8-bit values of the result with
signed saturation, and pack the four 16-bit values from m2 into the upper four 8-bit
values of the result with signed saturation.
Pack the two 32-bit values from m1 into the lower two 16-bit values of the result with
signed saturation, and pack the two 32-bit values from m2 into the upper two 16-bit
values of the result with signed saturation.
Pack the four 16-bit values from m1 into the lower four 8-bit values of the result with
unsigned saturation, and pack the four 16-bit values from m2 into the upper four 8-bit
values of the result with unsigned saturation.
Interleave the four 8-bit values from the high half of m1 with the four values from the high
half of m2. The interleaving begins with the data from m1.
Interleave the two 16-bit values from the high half of m1 with the two values from the high
half of m2. The interleaving begins with the data from m1.
Interleave the 32-bit value from the high half of m1 with the 32-bit value from the high half
of m2. The interleaving begins with the data from m1.
Interleave the four 8-bit values from the low half of m1 with the four values from the low
half of m2. The interleaving begins with the data from m1.
Interleave the two 16-bit values from the low half of m1 with the two values from the low
half of m2. The interleaving begins with the data from m1.
Interleave the 32-bit value from the low half of m1 with the 32-bit value from the low half
of m2. The interleaving begins with the data from m1.
Add the eight 8-bit values in m1 to the eight 8-bit values in m2.
Add the four 16-bit values in m1 to the four 16-bit values in m2.
Add the two 32-bit values in m1 to the two 32-bit values in m2.
Add the eight signed 8-bit values in m1 to the eight signed 8-bit values in m2 using
saturating arithmetic.
Add the four signed 16-bit values in m1 to the four signed 16-bit values in m2 using
saturating arithmetic.
Add the eight unsigned 8-bit values in m1 to the eight unsigned 8-bit values in m2 using
saturating arithmetic.
Add the four unsigned 16-bit values in m1 to the four unsigned 16-bit values in m2 using
saturating arithmetic.
Subtract the eight 8-bit values in m2 from the eight 8-bit values in m1.
Subtract the four 16-bit values in m2 from the four 16-bit values in m1.
Subtract the two 32-bit values in m2 from the two 32-bit values in m1.
Subtract the eight signed 8-bit values in m2 from the eight signed 8-bit values in m1 using
saturating arithmetic.
Subtract the four signed 16-bit values in m2 from the four signed 16-bit values in m1 using
saturating arithmetic.
Subtract the eight unsigned 8-bit values in m2 from the eight unsigned 8-bit values in m1
using saturating arithmetic.
Subtract the four unsigned 16-bit values in m2 from the four unsigned 16-bit values in m1
using saturating arithmetic.
Multiply four 16-bit values in m1 by four 16-bit values in m2 producing four 32-bit
intermediate results, which are then summed by pairs to produce two 32-bit results.
Multiply four signed 16-bit values in m1 by four signed 16-bit values in m2 and produce
the high 16 bits of the four results.
Multiply four 16-bit values in m1 by four 16-bit values in m2 and produce the low 16 bits of
the four results.
Shift four 16-bit values in m left the amount specified by count while shifting in zeros.
Shift four 16-bit values in m left the amount specified by count while shifting in zeros. For
the best performance, count should be a constant.
Shift two 32-bit values in m left the amount specified by count while shifting in zeros.
Shift two 32-bit values in m left the amount specified by count while shifting in zeros. For
the best performance, count should be a constant.
Shift the 64-bit value in m left the amount specified by count while shifting in zeros.
Shift the 64-bit value in m left the amount specified by count while shifting in zeros. For
the best performance, count should be a constant.
Shift four 16-bit values in m right the amount specified by count while shifting in the sign
bit.
Shift four 16-bit values in m right the amount specified by count while shifting in the sign
bit. For the best performance, count should be a constant.
Shift two 32-bit values in m right the amount specified by count while shifting in the sign
bit.
Shift two 32-bit values in m right the amount specified by count while shifting in the sign
bit. For the best performance, count should be a constant.
Shift four 16-bit values in m right the amount specified by count while shifting in zeros.
Shift four 16-bit values in m right the amount specified by count while shifting in zeros.
For the best performance, count should be a constant.
Shift two 32-bit values in m right the amount specified by count while shifting in zeros.
Shift two 32-bit values in m right the amount specified by count while shifting in zeros.
For the best performance, count should be a constant.
Shift the 64-bit value in m right the amount specified by count while shifting in zeros.
Shift the 64-bit value in m right the amount specified by count while shifting in zeros. For
the best performance, count should be a constant.
Perform a bitwise AND of the 64-bit value in m1 with the 64-bit value in m2.
Perform a bitwise NOT on the 64-bit value in m1 and use the result in a bitwise AND with
the 64-bit value in m2.
Perform a bitwise OR of the 64-bit value in m1 with the 64-bit value in m2.
Perform a bitwise XOR of the 64-bit value in m1 with the 64-bit value in m2.
The intrinsics described below perform compare operations.
If the respective 8-bit values in m1 are equal to the respective 8-bit values in m2, set the
respective 8-bit resulting values to all ones; otherwise set them to all zeros.
If the respective 16-bit values in m1 are equal to the respective 16-bit values in m2, set the
respective 16-bit resulting values to all ones; otherwise set them to all zeros.
If the respective 32-bit values in m1 are equal to the respective 32-bit values in m2, set the
respective 32-bit resulting values to all ones; otherwise set them to all zeros.
If the respective 8-bit signed values in m1 are greater than the respective 8-bit signed
values in m2, set the respective 8-bit resulting values to all ones; otherwise set them to all
zeros.
If the respective 16-bit signed values in m1 are greater than the respective 16-bit signed
values in m2, set the respective 16-bit resulting values to all ones; otherwise set them to
all zeros.
If the respective 32-bit signed values in m1 are greater than the respective 32-bit signed
values in m2, set the respective 32-bit resulting values to all ones; otherwise set them to
all zeros.
Note
In the descriptions regarding the bits of the MMX register, bit 0 is the least significant
and bit 63 is the most significant.
__m64 _mm_setzero_si64()
Sets the 64-bit value to zero.
R
0x0
R0 R1
i0 i1
R0 R1 R2 R3
w0 w1 w2 w3
__m64 _mm_set_pi8(char b7, char b6, char b5, char b4, char b3, char b2,
char b1, char b0)
R0 R1 ... R7
b0 b1 ... b7
__m64 _mm_set1_pi32(int i)
R0 R1
i i
__m64 _mm_set1_pi16(short s)
R0 R1 R2 R3
w w w w
__m64 _mm_set1_pi8(char b)
R0 R1 ... R7
b b ... b
R0 R1
i1 i0
R0 R1 R2 R3
w3 w2 w1 w0
__m64 _mm_setr_pi8(char b7, char b6, char b5, char b4, char b3, char b2,
char b1, char b0)
R0 R1 ... R7
b7 b6 ... b0
The prototypes for MMX technology intrinsics are in the mmintrin.h header file.
Data Types
The C data type __m64 is used when using MMX technology intrinsics. It can hold eight
8-bit values, four 16-bit values, two 32-bit values, or one 64-bit value.
The __m64 data type is not a basic ANSI C data type. Therefore, observe the following
usage restrictions:
• Use the new data type only on the left-hand side of an assignment, as a return
value, or as a parameter. You cannot use it with other arithmetic expressions (" +
", " - ", and so on).
• Use the new data type as objects in aggregates, such as unions, to access the
byte elements and structures; the address of an __m64 object may be taken.
• Use new data types only with the respective intrinsics described in this
documentation.
For complete details of the hardware instructions, see the Intel® Architecture MMX™
Technology Programmer's Reference Manual. For descriptions of data types, see the
Intel® Architecture Software Developer's Manual, Volume 2.
The prototypes for SSE intrinsics are in the xmmintrin.h header file.
Note
You can also use the single ia32intrin.h header file for any IA-32 intrinsics.
The results of each intrinsic operation are placed in a register. This register is illustrated
for each intrinsic with R0-R3. R0, R1, R2 and R3 each represent one of the 4 32-bit
pieces of the result register.
Adds the lower single-precision, floating-point (SP FP) values of a and b; the upper 3 SP
FP values are passed through from a.
R0 R1 R2 R3
a0 + b0 a1 a2 a3
R0 R1 R2 R3
a0 + b0 a1 + b1 a2 + b2 a3 + b3
Subtracts the lower SP FP values of a and b. The upper 3 SP FP values are passed
through from a.
R0 R1 R2 R3
a0 - b0 a1 a2 a3
R0 R1 R2 R3
a0 - b0 a1 - b1 a2 - b2 a3 - b3
Multiplies the lower SP FP values of a and b; the upper 3 SP FP values are passed
through from a.
R0 R1 R2 R3
a0 * b0 a1 a2 a3
R0 R1 R2 R3
a0 * b0 a1 * b1 a2 * b2 a3 * b3
Divides the lower SP FP values of a and b; the upper 3 SP FP values are passed
through from a.
R0 R1 R2 R3
a0 / b0 a1 a2 a3
R0 R1 R2 R3
a0 / b0 a1 / b1 a2 / b2 a3 / b3
__m128 _mm_sqrt_ss(__m128 a)
Computes the square root of the lower SP FP value of a ; the upper 3 SP FP values are
passed through.
R0 R1 R2 R3
sqrt(a0) a1 a2 a3
__m128 _mm_sqrt_ps(__m128 a)
R0 R1 R2 R3
sqrt(a0) sqrt(a1) sqrt(a2) sqrt(a3)
__m128 _mm_rcp_ss(__m128 a)
Computes the approximation of the reciprocal of the lower SP FP value of a; the upper 3
SP FP values are passed through.
R0 R1 R2 R3
recip(a0) a1 a2 a3
__m128 _mm_rcp_ps(__m128 a)
R0 R1 R2 R3
recip(a0) recip(a1) recip(a2) recip(a3)
__m128 _mm_rsqrt_ss(__m128 a)
Computes the approximation of the reciprocal of the square root of the lower SP FP
value of a; the upper 3 SP FP values are passed through.
R0 R1 R2 R3
recip(sqrt(a0)) a1 a2 a3
__m128 _mm_rsqrt_ps(__m128 a)
Computes the approximations of the reciprocals of the square roots of the four SP FP
values of a.
R0 R1 R2 R3
recip(sqrt(a0)) recip(sqrt(a1)) recip(sqrt(a2)) recip(sqrt(a3))
Computes the minimum of the lower SP FP values of a and b; the upper 3 SP FP values
are passed through from a.
R0 R1 R2 R3
min(a0, b0) a1 a2 a3
R0 R1 R2 R3
min(a0, b0) min(a1, b1) min(a2, b2) min(a3, b3)
R0 R1 R2 R3
max(a0, b0) a1 a2 a3
R0 R1 R2 R3
max(a0, b0) max(a1, b1) max(a2, b2) max(a3, b3)
R0 R1 R2 R3
a0 & b0 a1 & b1 a2 & b2 a3 & b3
R0 R1 R2 R3
~a0 & b0 ~a1 & b1 ~a2 & b2 ~a3 & b3
R0 R1 R2 R3
a0 | b0 a1 | b1 a2 | b2 a3 | b3
R0 R1 R2 R3
a0 ^ b0 a1 ^ b1 a2 ^ b2 a3 ^ b3
The results of each intrinsic operation are placed in a register. This register is illustrated
for each intrinsic with R or R0-R3. R0, R1, R2 and R3 each represent one of the 4 32-bit
pieces of the result register.
The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h
header file.
R0 R1 R2 R3
(a0 == b0) ? 0xffffffff : 0x0 a1 a2 a3
R0 R1 R2 R3
(a0 == b0) ? (a1 == b1) ? (a2 == b2) ? (a3 == b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
(a0 < b0) ? 0xffffffff : 0x0 a1 a2 a3
R0 R1 R2 R3
(a0 < b0) ? (a1 < b1) ? (a2 < b2) ? (a3 < b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
(a0 <= b0) ? 0xffffffff : 0x0 a1 a2 a3
R0 R1 R2 R3
(a0 <= b0) ? (a1 <= b1) ? (a2 <= b2) ? (a3 <= b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
R0 R1 R2 R3
(a0 > b0) ? (a1 > b1) ? (a2 > b2) ? (a3 > b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
(a0 >= b0) ? 0xffffffff : 0x0 a1 a2 a3
R0 R1 R2 R3
(a0 >= b0) ? (a1 >= b1) ? (a2 >= b2) ? (a3 >= b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
(a0 != b0) ? 0xffffffff : 0x0 a1 a2 a3
R0 R1 R2 R3
(a0 != b0) ? (a1 != b1) ? (a2 != b2) ? (a3 != b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
!(a0 < b0) ? 0xffffffff : 0x0 a1 a2 a3
R0 R1 R2 R3
!(a0 < b0) ? !(a1 < b1) ? !(a2 < b2) ? !(a3 < b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
!(a0 <= b0) ? 0xffffffff : 0x0 a1 a2 a3
R0 R1 R2 R3
!(a0 <= b0) ? !(a1 <= b1) ? !(a2 <= b2) ? !(a3 <= b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
!(a0 > b0) ? 0xffffffff : 0x0 a1 a2 a3
R0 R1 R2 R3
!(a0 > b0) ? !(a1 > b1) ? !(a2 > b2) ? !(a3 > b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
!(a0 >= b0) ? 0xffffffff : 0x0 a1 a2 a3
R0 R1 R2 R3
!(a0 >= b0) ? !(a1 >= b1) ? !(a2 >= b2) ? !(a3 >= b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
(a0 ord? b0) ? 0xffffffff : 0x0 a1 a2 a3
R0 R1 R2 R3
(a0 ord? b0) ? (a1 ord? b1) ? (a2 ord? b2) ? (a3 ord? b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
(a0 unord? b0) ? 0xffffffff : 0x0 a1 a2 a3
R0 R1 R2 R3
(a0 unord? b0) ? (a1 unord? b1) ? (a2 unord? b2) ? (a3 unord? b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
Compares the lower SP FP value of a and b for a equal to b. If a and b are equal, 1 is
returned. Otherwise 0 is returned.
R
(a0 == b0) ? 0x1 : 0x0
Compares the lower SP FP value of a and b for a less than b. If a is less than b, 1 is
returned. Otherwise 0 is returned.
Compares the lower SP FP value of a and b for a less than or equal to b. If a is less than
or equal to b, 1 is returned. Otherwise 0 is returned.
R
(a0 <= b0) ? 0x1 : 0x0
Compares the lower SP FP value of a and b for a greater than b. If a is greater than b,
1 is returned. Otherwise 0 is returned.
R
(a0 > b0) ? 0x1 : 0x0
R
(a0 >= b0) ? 0x1 : 0x0
Compares the lower SP FP value of a and b for a not equal to b. If a and b are not equal,
1 is returned. Otherwise 0 is returned.
R
(a0 != b0) ? 0x1 : 0x0
Compares the lower SP FP value of a and b for a equal to b. If a and b are equal, 1 is
returned. Otherwise 0 is returned.
R
(a0 == b0) ? 0x1 : 0x0
Compares the lower SP FP value of a and b for a less than b. If a is less than b, 1 is
returned. Otherwise 0 is returned.
R
(a0 < b0) ? 0x1 : 0x0
Compares the lower SP FP value of a and b for a less than or equal to b. If a is less than
or equal to b, 1 is returned. Otherwise 0 is returned.
R
(a0 <= b0) ? 0x1 : 0x0
Compares the lower SP FP value of a and b for a greater than b. If a is greater than b,
1 is returned. Otherwise 0 is returned.
R
(a0 > b0) ? 0x1 : 0x0
R
(a0 >= b0) ? 0x1 : 0x0
Compares the lower SP FP value of a and b for a not equal to b. If a and b are not equal,
1 is returned. Otherwise 0 is returned.
R
(a0 != b0) ? 0x1 : 0x0
The results of each intrinsic operation are placed in a register. This register is illustrated
for each intrinsic with R or R0-R3. R0, R1, R2 and R3 each represent one of the 4 32-bit
pieces of the result register.
The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h
header file.
int _mm_cvtss_si32(__m128 a)
Convert the lower SP FP value of a to a 32-bit integer according to the current rounding
mode.
R
(int)a0
__int64 _mm_cvtss_si64(__m128 a)
Convert the lower SP FP value of a to a 64-bit signed integer according to the current
rounding mode.
R
(__int64)a0
__m64 _mm_cvtps_pi32(__m128 a)
Convert the two lower SP FP values of a to two 32-bit integers according to the current
rounding mode, returning the integers in packed form.
R0 R1
(int)a0 (int)a1
int _mm_cvttss_si32(__m128 a)
Convert the lower SP FP value of a to a 32-bit integer with truncation.
R
(int)a0
__int64 _mm_cvttss_si64(__m128 a)
Convert the lower SP FP value of a to a 64-bit signed integer with truncation.
R
(__int64)a0
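The (int)a0 shorthand in these tables hides the difference between the two families: the cvt forms round according to the current rounding mode, while the cvtt forms always truncate toward zero, as a C cast does. The following is a minimal sketch, assuming the default round-to-nearest-even mode is in effect. The wrapper names are illustrative only.

```c
#include <xmmintrin.h>

/* Converts using the current MXCSR rounding mode (round to nearest even
   unless the program has changed it). */
int convert_rounded(float x)
{
    return _mm_cvtss_si32(_mm_set_ss(x));
}

/* Converts with truncation toward zero, regardless of the rounding mode. */
int convert_truncated(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x));
}
```

Under that assumption, 2.7f converts to 3 with the rounding form and to 2 with the truncating form, and the halfway case 2.5f rounds to the even value 2.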
__m64 _mm_cvttps_pi32(__m128 a)
Convert the two lower SP FP values of a to two 32-bit integers with truncation, returning
the integers in packed form.
R0 R1
(int)a0 (int)a1
Convert the 32-bit integer value b to an SP FP value; the upper three SP FP values are
passed through from a.
R0 R1 R2 R3
(float)b a1 a2 a3
Convert the signed 64-bit integer value b to an SP FP value; the upper three SP FP
values are passed through from a.
R0 R1 R2 R3
(float)b a1 a2 a3
Convert the two 32-bit integer values in packed form in b to two SP FP values; the upper
two SP FP values are passed through from a.
R0 R1 R2 R3
(float)b0 (float)b1 a2 a3
__m128 _mm_cvtpi16_ps(__m64 a)
Convert the four 16-bit signed integer values in a to four single precision FP values.
R0 R1 R2 R3
(float)a0 (float)a1 (float)a2 (float)a3
__m128 _mm_cvtpu16_ps(__m64 a)
Convert the four 16-bit unsigned integer values in a to four single precision FP values.
R0 R1 R2 R3
(float)a0 (float)a1 (float)a2 (float)a3
__m128 _mm_cvtpi8_ps(__m64 a)
Convert the lower four 8-bit signed integer values in a to four single precision FP values.
R0 R1 R2 R3
(float)a0 (float)a1 (float)a2 (float)a3
__m128 _mm_cvtpu8_ps(__m64 a)
Convert the lower four 8-bit unsigned integer values in a to four single precision FP
values.
R0 R1 R2 R3
(float)a0 (float)a1 (float)a2 (float)a3
Convert the two 32-bit signed integer values in a and the two 32-bit signed integer
values in b to four single precision FP values.
R0 R1 R2 R3
(float)a0 (float)a1 (float)b0 (float)b1
__m64 _mm_cvtps_pi16(__m128 a)
Convert the four single precision FP values in a to four signed 16-bit integer values.
R0 R1 R2 R3
(short)a0 (short)a1 (short)a2 (short)a3
__m64 _mm_cvtps_pi8(__m128 a)
Convert the four single precision FP values in a to the lower four signed 8-bit integer
values of the result.
R0 R1 R2 R3
(char)a0 (char)a1 (char)a2 (char)a3
float _mm_cvtss_f32(__m128 a)
This intrinsic extracts a single precision floating point value from the first vector element
of an __m128. It does so in the most efficient manner possible in the context used.
The results of each intrinsic operation are placed in a register. This register is illustrated
for each intrinsic with R0-R3. R0, R1, R2 and R3 each represent one of the 4 32-bit
pieces of the result register.
Sets the upper two SP FP values with 64 bits of data loaded from the address p.
R0 R1 R2 R3
a0 a1 *p0 *p1
Sets the lower two SP FP values with 64 bits of data loaded from the address p; the
upper two values are passed through from a.
R0 R1 R2 R3
*p0 *p1 a2 a3
__m128 _mm_load_ss(float * p )
Loads an SP FP value into the low word and clears the upper three words.
R0 R1 R2 R3
*p 0.0 0.0 0.0
__m128 _mm_load1_ps(float * p )
R0 R1 R2 R3
*p *p *p *p
__m128 _mm_load_ps(float * p )
R0 R1 R2 R3
p[0] p[1] p[2] p[3]
__m128 _mm_loadu_ps(float * p)
R0 R1 R2 R3
p[0] p[1] p[2] p[3]
__m128 _mm_loadr_ps(float * p)
R0 R1 R2 R3
p[3] p[2] p[1] p[0]
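The load tables above can be exercised with a short sketch. It assumes a C11 compiler (for _Alignas) and relies on the usual alignment rules: _mm_load_ps and _mm_loadr_ps expect a 16-byte-aligned address, while _mm_loadu_ps and the single-element loads do not. The function name and output parameters are illustrative only.

```c
#include <xmmintrin.h>

/* Writes the result elements R0..R3 of four load variants to plain arrays. */
void load_variants(float same[4], float reversed[4],
                   float splat[4], float low_only[4])
{
    _Alignas(16) float src[4] = {1.0f, 2.0f, 3.0f, 4.0f};

    _mm_storeu_ps(same,     _mm_load_ps(src));       /* p[0] p[1] p[2] p[3] */
    _mm_storeu_ps(reversed, _mm_loadr_ps(src));      /* p[3] p[2] p[1] p[0] */
    _mm_storeu_ps(splat,    _mm_load1_ps(&src[2]));  /* *p   *p   *p   *p   */
    _mm_storeu_ps(low_only, _mm_load_ss(&src[1]));   /* *p   0.0  0.0  0.0  */
}
```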
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed explanation of
each intrinsic. R0, R1, R2 and R3 represent the registers in which results are placed.
__m128 _mm_set_ss(float w )
Sets the low word of an SP FP value to w and clears the upper three words.
R0 R1 R2 R3
w 0.0 0.0 0.0
__m128 _mm_set1_ps(float w )
R0 R1 R2 R3
w w w w
R0 R1 R2 R3
w x y z
R0 R1 R2 R3
z y x w
R0 R1 R2 R3
0.0 0.0 0.0 0.0
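The argument order of the four-value set operations is easy to get backwards. The sketch below assumes that the "w x y z" and "z y x w" tables above belong to the standard _mm_set_ps and _mm_setr_ps intrinsics, each called with the arguments (z, y, x, w). The function name is illustrative only.

```c
#include <xmmintrin.h>

/* Stores the same four arguments through _mm_set_ps and _mm_setr_ps so the
   resulting element order can be compared in memory. */
void set_orders(float via_set[4], float via_setr[4])
{
    /* _mm_set_ps: the LAST argument becomes element R0. */
    _mm_storeu_ps(via_set,  _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f));
    /* _mm_setr_ps: the FIRST argument becomes element R0. */
    _mm_storeu_ps(via_setr, _mm_setr_ps(3.0f, 2.0f, 1.0f, 0.0f));
}
```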
The detailed description of each intrinsic contains a table showing what is stored. In these
tables, p[n] denotes the nth element of the result.
The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h
header file.
*p0 *p1
a2 a3
*p0 *p1
a0 a1
*p
a0
The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h
header file.
Loads one cache line of data from address a to a location "closer" to the processor. The
value sel specifies the type of prefetch operation: the constants _MM_HINT_T0,
_MM_HINT_T1, _MM_HINT_T2, and _MM_HINT_NTA should be used for IA-32,
corresponding to the type of prefetch instruction. The constants _MM_HINT_T1,
_MM_HINT_NT1, _MM_HINT_NT2, and _MM_HINT_NTA should be used for Itanium®-based
systems.
Stores the data in a to the address p without polluting the caches. This intrinsic requires
you to empty the multimedia state for the MMX(TM) technology register. See The EMMS
Instruction: Why You Need It.
Stores the data in a to the address p without polluting the caches. The address must be
16-byte-aligned.
void _mm_sfence(void)
Guarantees that every preceding store is globally visible before any subsequent store.
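A common pairing of the cache-bypassing store with the store fence can be sketched as follows. The sketch assumes that the 16-byte-aligned store described above is the standard _mm_stream_ps intrinsic. The function name, parameters, and the requirement that count be a multiple of 4 are illustrative assumptions.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Fills dst[0..count-1] with value using non-temporal stores.
   Assumptions: dst is 16-byte aligned and count is a multiple of 4. */
void fill_streaming(float *dst, size_t count, float value)
{
    __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < count; i += 4)
        _mm_stream_ps(dst + i, v);   /* store without polluting the caches */
    _mm_sfence();                    /* make the stores globally visible
                                        before any later store */
}
```

Issuing the fence once after the loop, rather than after every store, keeps the ordering guarantee for code that follows without serializing the individual stores.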
The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h
header file.
Before using these intrinsics, you must empty the multimedia state for the MMX(TM)
technology register. See The EMMS Instruction: Why You Need It for more details.
R
(n==0) ? a0 : ( (n==1) ? a1 : ( (n==2) ? a2 : a3 ) )
Inserts word d into one of four words of a. The selector n must be an immediate.
R0 R1 R2 R3
(n==0) ? d : a0; (n==1) ? d : a1; (n==2) ? d : a2; (n==3) ? d : a3;
R0 R1 R2 R3
min(a0, b0) min(a1, b1) min(a2, b2) min(a3, b3)
R0 R1 ... R7
min(a0, b0) min(a1, b1) ... min(a7, b7)
R0 R1 R2 R3
min(a0, b0) min(a1, b1) min(a2, b2) min(a3, b3)
R0 R1 ... R7
min(a0, b0) min(a1, b1) ... min(a7, b7)
int _mm_movemask_pi8(__m64 a)
Creates an 8-bit mask from the most significant bits of the bytes in a.
R
sign(a7)<<7 | sign(a6)<<6 |... | sign(a0)
Multiplies the unsigned words in a and b, returning the upper 16 bits of the 32-bit
intermediate results.
R0 R1 R2 R3
hiword(a0 * b0) hiword(a1 * b1) hiword(a2 * b2) hiword(a3 * b3)
R0 R1 R2 R3
word (n&0x3) of a   word ((n>>2)&0x3) of a   word ((n>>4)&0x3) of a   word ((n>>6)&0x3) of a
Conditionally store byte elements of d to address p. The high bit of each byte in the
selector n determines whether the corresponding byte in d will be stored.
R0 R1 ... R7
R0: (t >> 1) | (t & 0x01), where t = (unsigned char)a0 + (unsigned char)b0
R1: (t >> 1) | (t & 0x01), where t = (unsigned char)a1 + (unsigned char)b1
...
R7: (t >> 1) | (t & 0x01), where t = (unsigned char)a7 + (unsigned char)b7
R0 R1 ... R7
R0: (t >> 1) | (t & 0x01), where t = (unsigned int)a0 + (unsigned int)b0
R1: (t >> 1) | (t & 0x01), where t = (unsigned int)a1 + (unsigned int)b1
...
R7: (t >> 1) | (t & 0x01), where t = (unsigned int)a7 + (unsigned int)b7
Computes the sum of the absolute differences of the unsigned bytes in a and b,
returning the value in the lower word. The upper three words are cleared.
R0 R1 R2 R3
abs(a0-b0) +... + abs(a7-b7) 0 0 0
The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the xmmintrin.h
header file.
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed explanation of
each intrinsic. R, R0, R1, R2 and R3 represent the registers in which results are placed.
Selects four specific SP FP values from a and b, based on the mask imm8. The mask
must be an immediate. See Macro Function for Shuffle Using Streaming SIMD
Extensions for a description of the shuffle semantics.
R0 R1 R2 R3
a2 b2 a3 b3
R0 R1 R2 R3
a0 b0 a1 b1
Sets the low word to the SP FP value of b. The upper 3 SP FP values are
passed through from a.
R0 R1 R2 R3
b0 a1 a2 a3
Moves the upper 2 SP FP values of b to the lower 2 SP FP values of the result. The
upper 2 SP FP values of a are passed through to the result.
R0 R1 R2 R3
b2 b3 a2 a3
Moves the lower 2 SP FP values of b to the upper 2 SP FP values of the result. The
lower 2 SP FP values of a are passed through to the result.
R0 R1 R2 R3
a0 a1 b0 b1
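The five rearrangements tabulated above can be checked side by side. The sketch assumes that the tables correspond, in order, to the standard intrinsics _mm_unpackhi_ps, _mm_unpacklo_ps, _mm_move_ss, _mm_movehl_ps, and _mm_movelh_ps. The function name and sample values are illustrative only.

```c
#include <xmmintrin.h>

/* out[k] receives the result elements R0..R3 of the k-th operation. */
void rearrange(float out[5][4])
{
    __m128 a = _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f);      /* a0..a3 */
    __m128 b = _mm_setr_ps(10.0f, 11.0f, 12.0f, 13.0f);  /* b0..b3 */

    _mm_storeu_ps(out[0], _mm_unpackhi_ps(a, b));  /* a2 b2 a3 b3 */
    _mm_storeu_ps(out[1], _mm_unpacklo_ps(a, b));  /* a0 b0 a1 b1 */
    _mm_storeu_ps(out[2], _mm_move_ss(a, b));      /* b0 a1 a2 a3 */
    _mm_storeu_ps(out[3], _mm_movehl_ps(a, b));    /* b2 b3 a2 a3 */
    _mm_storeu_ps(out[4], _mm_movelh_ps(a, b));    /* a0 a1 b0 b1 */
}
```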
int _mm_movemask_ps(__m128 a)
Creates a 4-bit mask from the most significant bits of the four SP FP values.
R
sign(a3)<<3 | sign(a2)<<2 | sign(a1)<<1 | sign(a0)
To write programs with the intrinsics, you should be familiar with the hardware features
provided by SSE. Keep the following issues in mind:
• Certain intrinsics are provided only for compatibility with previously-defined IA-32
intrinsics. Using them on Itanium-based systems probably leads to performance
degradation.
• Floating-point (FP) data loaded or stored as __m128 objects must be 16-byte-aligned.
• Some intrinsics require that their arguments be immediates -- that is, constant
integers (literals), due to the nature of the instruction.
Data Types
The new data type __m128 is used with the SSE intrinsics. It represents a 128-bit
quantity composed of four single-precision FP values. This corresponds to the 128-bit
IA-32 Streaming SIMD Extensions register.
The compiler aligns __m128 local data to 16-byte boundaries on the stack. Global data of
these types is also 16-byte-aligned. To align integer, float, or double arrays, you can
use the __declspec(align(n)) specifier.
Because Itanium instructions treat the SSE registers in the same way whether you are
using packed or scalar data, there is no __m32 data type to represent scalar data. For
scalar operations, use the __m128 objects and the "scalar" forms of the intrinsics; the
compiler and the processor implement these operations with 32-bit memory references.
For better performance, however, substitute the packed form for the scalar form
whenever possible.
For more information, see Intel Architecture Software Developer's Manual, Volume 2:
Instruction Set Reference Manual, Intel Corporation, doc. number 243191.
SSE intrinsics are defined for the __m128 data type, a 128-bit quantity consisting of four
single-precision FP values. SIMD instructions for Itanium-based systems operate on 64-
bit FP register quantities containing two single-precision floating-point values. Thus,
each __m128 operand is actually a pair of FP registers and therefore each intrinsic
corresponds to at least one pair of Itanium instructions operating on the pair of FP
register operands.
The following intrinsics are likely to reduce performance and should only be used to
initially port legacy code or in non-critical code sections:
• Any SSE scalar intrinsic (_ss variety) - use packed (_ps) version if possible
• comi and ucomi SSE comparisons - these correspond to IA-32 COMISS and
UCOMISS instructions only. A sequence of Itanium instructions is required to
implement these.
• Conversions in general are multi-instruction operations. These are particularly
expensive: _mm_cvtpi16_ps, _mm_cvtpu16_ps, _mm_cvtpi8_ps,
_mm_cvtpu8_ps, _mm_cvtpi32x2_ps, _mm_cvtps_pi16, _mm_cvtps_pi8
• SSE utility intrinsic _mm_movemask_ps
If the inaccuracy is acceptable, the SIMD reciprocal and reciprocal square root
approximation intrinsics (rcp and rsqrt) are much faster than the true div and sqrt
intrinsics.
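To judge whether the inaccuracy is acceptable, the approximation can be compared against a true division. The following is a minimal sketch using _mm_rcp_ps and _mm_div_ps. The function name and sample inputs are illustrative, and any tolerance applied to the returned value is an assumption of the caller, not a figure from this reference.

```c
#include <xmmintrin.h>

/* Returns the largest relative error of _mm_rcp_ps against 1.0f / x
   over four sample inputs. */
float max_rcp_relative_error(void)
{
    __m128 x = _mm_setr_ps(1.5f, 3.0f, 7.0f, 100.0f);
    float exact[4], approx[4];
    float worst = 0.0f;

    _mm_storeu_ps(exact,  _mm_div_ps(_mm_set1_ps(1.0f), x));
    _mm_storeu_ps(approx, _mm_rcp_ps(x));

    for (int i = 0; i < 4; ++i) {
        float err = (approx[i] - exact[i]) / exact[i];
        if (err < 0.0f)
            err = -err;
        if (err > worst)
            worst = err;
    }
    return worst;
}
```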
Macro Functions
The Streaming SIMD Extensions (SSE) provide a macro function to help create
constants that describe shuffle operations. The macro takes four small integers (in the
range of 0 to 3) and combines them into an 8-bit immediate value used by the SHUFPS
instruction.
You can view the four integers as selectors for choosing which two words from the first
input operand and which two words from the second are to be put into the result word.
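As a worked case, the selector built by _MM_SHUFFLE(z, y, x, w) places w and x in bits 0-3 (choosing R0 and R1 from the first operand) and y and z in bits 4-7 (choosing R2 and R3 from the second operand). The sketch below reverses a vector by passing the same register as both operands to _mm_shuffle_ps. The function name is illustrative only.

```c
#include <xmmintrin.h>

/* Reverses the four elements of in[] into out[]. */
void reverse4(const float in[4], float out[4])
{
    __m128 v = _mm_loadu_ps(in);
    /* R0 = v[3], R1 = v[2] (from the first operand);
       R2 = v[1], R3 = v[0] (from the second operand). */
    __m128 r = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_ps(out, r);
}
```

Here _MM_SHUFFLE(0, 1, 2, 3) evaluates to the immediate 0x1B, that is (0 << 6) | (1 << 4) | (2 << 2) | 3.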
The following example masks the overflow and underflow exceptions and unmasks all
other exceptions.
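A minimal sketch of such a call, assuming the _MM_SET_EXCEPTION_MASK and _MM_GET_EXCEPTION_MASK macros and the _MM_MASK_OVERFLOW and _MM_MASK_UNDERFLOW constants from xmmintrin.h. The wrapper function, and its saving and restoring of the control register with _mm_getcsr and _mm_setcsr, are illustrative additions so that the change does not leak into surrounding code.

```c
#include <xmmintrin.h>

/* Masks overflow and underflow, unmasks every other exception, reads the
   resulting mask bits back, then restores the caller's control register. */
unsigned int mask_overflow_and_underflow_only(void)
{
    unsigned int saved = _mm_getcsr();
    unsigned int active;

    _MM_SET_EXCEPTION_MASK(_MM_MASK_OVERFLOW | _MM_MASK_UNDERFLOW);
    active = _MM_GET_EXCEPTION_MASK();

    /* Restore before any floating-point work runs with traps enabled. */
    _mm_setcsr(saved);
    return active;
}
```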
The following example tests the rounding mode for round toward zero.
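A minimal sketch of such a test, assuming the _MM_GET_ROUNDING_MODE and _MM_SET_ROUNDING_MODE macros and the _MM_ROUND_TOWARD_ZERO constant from xmmintrin.h. Both function names are illustrative, and the second function exists only to show the test succeeding after the mode has been switched.

```c
#include <xmmintrin.h>

/* Returns 1 when the current rounding mode is round toward zero. */
int rounding_is_toward_zero(void)
{
    return _MM_GET_ROUNDING_MODE() == _MM_ROUND_TOWARD_ZERO;
}

/* Switches to round toward zero, runs the test, then restores the mode. */
int check_after_switching(void)
{
    unsigned int saved = _MM_GET_ROUNDING_MODE();
    int result;

    _MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO);
    result = rounding_is_toward_zero();
    _MM_SET_ROUNDING_MODE(saved);
    return result;
}
```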
The arguments row0, row1, row2, and row3 are __m128 values whose elements form the
corresponding rows of a 4 by 4 matrix. The matrix transposition is returned in arguments
row0, row1, row2, and row3 where row0 now holds column 0 of the original matrix, row1
now holds column 1 of the original matrix, and so on.
The transposition function of this macro is illustrated in the "Matrix Transposition Using
the _MM_TRANSPOSE4_PS" figure.
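The in-place behavior of the macro can be sketched on a row-major array of 16 floats. The function name and the use of unaligned loads and stores are illustrative choices, made so the sketch places no alignment requirement on the caller.

```c
#include <xmmintrin.h>

/* Transposes a 4 by 4 row-major matrix in place. */
void transpose4x4(float m[16])
{
    __m128 row0 = _mm_loadu_ps(m + 0);
    __m128 row1 = _mm_loadu_ps(m + 4);
    __m128 row2 = _mm_loadu_ps(m + 8);
    __m128 row3 = _mm_loadu_ps(m + 12);

    /* The macro overwrites its four arguments with the columns. */
    _MM_TRANSPOSE4_PS(row0, row1, row2, row3);

    _mm_storeu_ps(m + 0,  row0);
    _mm_storeu_ps(m + 4,  row1);
    _mm_storeu_ps(m + 8,  row2);
    _mm_storeu_ps(m + 12, row3);
}
```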
Note
There are no intrinsics for floating-point move operations. To move data from one
register to another, a simple assignment, A = B, suffices, where A and B are the target
and source registers, respectively, for the move operation.
Note
On processors that do not support SSE2 instructions but do support MMX Technology,
you can use the sse2mmx.h emulation pack to enable support for SSE2 instructions.
You can use the sse2mmx.h header file for the following processors:
• Itanium® Processor
• Pentium® III Processor
• Pentium® II Processor
• Pentium® with MMX™ Technology
You should be familiar with the hardware features provided by the SSE2 when writing
programs with the intrinsics. The following are three important issues to keep in mind:
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
Note
You can also use the single ia32intrin.h header file for any IA-32 intrinsics.
Floating-point Intrinsics
The results of each intrinsic operation are placed in a register. This register is illustrated
for each intrinsic with R0 and R1. R0 and R1 each represent one piece of the result
register.
The Double Complex code sample contains examples of how to use several of these
intrinsics.
R0 R1
a0 + b0 a1
R0 R1
a0 + b0 a1 + b1
Subtracts the lower DP FP value of b from a. The upper DP FP value is passed through
from a.
R0 R1
a0 - b0 a1
R0 R1
a0 - b0 a1 - b1
Multiplies the lower DP FP values of a and b. The upper DP FP value is passed through
from a.
R0 R1
a0 * b0 a1
R0 R1
a0 * b0 a1 * b1
Divides the lower DP FP values of a and b. The upper DP FP value is passed through
from a.
R0 R1
a0 / b0 a1
R0 R1
a0 / b0 a1 / b1
Computes the square root of the lower DP FP value of b. The upper DP FP value is
passed through from a.
R0 R1
sqrt(b0) a1
__m128d _mm_sqrt_pd(__m128d a)
R0 R1
sqrt(a0) sqrt(a1)
Computes the minimum of the lower DP FP values of a and b. The upper DP FP value is
passed through from a.
R0 R1
min (a0, b0) a1
R0 R1
min (a0, b0) min(a1, b1)
Computes the maximum of the lower DP FP values of a and b. The upper DP FP value
is passed through from a.
R0 R1
max (a0, b0) a1
R0 R1
max (a0, b0) max (a1, b1)
The prototypes for Streaming SIMD Extensions 2 (SSE2) intrinsics are in the
emmintrin.h header file.
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed explanation of
each intrinsic. R0 and R1 represent the registers in which results are placed.
R0 R1
a0 & b0 a1 & b1
Computes the bitwise AND of the 128-bit value in b and the bitwise NOT of the 128-bit
value in a.
R0 R1
(~a0) & b0 (~a1) & b1
R0 R1
a0 | b0 a1 | b1
R0 R1
a0 ^ b0 a1 ^ b1
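The operand order of the AND-NOT operation matters because it is the first operand that is complemented. One use, sketched below with the standard _mm_andnot_pd intrinsic, is clearing the sign bit to obtain absolute values. The function name is illustrative only.

```c
#include <emmintrin.h>

/* Computes the absolute value of two doubles. */
void abs2(const double in[2], double out[2])
{
    __m128d sign = _mm_set1_pd(-0.0);   /* only the sign bit is set */
    __m128d x = _mm_loadu_pd(in);

    /* (~sign) & x keeps every bit of x except the sign bit. */
    _mm_storeu_pd(out, _mm_andnot_pd(sign, x));
}
```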
The results of each intrinsic operation are placed in a register. This register is illustrated
for each intrinsic with R, R0 and R1. R, R0 and R1 each represent one piece of the
result register.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
R0 R1
(a0 == b0) ? 0xffffffffffffffff : 0x0   (a1 == b1) ? 0xffffffffffffffff : 0x0
R0 R1
(a0 < b0) ? 0xffffffffffffffff : 0x0   (a1 < b1) ? 0xffffffffffffffff : 0x0
R0 R1
(a0 <= b0) ? 0xffffffffffffffff : 0x0   (a1 <= b1) ? 0xffffffffffffffff : 0x0
R0 R1
(a0 > b0) ? 0xffffffffffffffff : 0x0   (a1 > b1) ? 0xffffffffffffffff : 0x0
R0 R1
(a0 >= b0) ? 0xffffffffffffffff : 0x0   (a1 >= b1) ? 0xffffffffffffffff : 0x0
R0 R1
(a0 ord b0) ? 0xffffffffffffffff : 0x0   (a1 ord b1) ? 0xffffffffffffffff : 0x0
R0 R1
(a0 unord b0) ? 0xffffffffffffffff : 0x0   (a1 unord b1) ? 0xffffffffffffffff : 0x0
R0 R1
(a0 != b0) ? 0xffffffffffffffff : 0x0   (a1 != b1) ? 0xffffffffffffffff : 0x0
R0 R1
!(a0 < b0) ? 0xffffffffffffffff : 0x0   !(a1 < b1) ? 0xffffffffffffffff : 0x0
Compares the two DP FP values of a and b for a not less than or equal to b.
R0 R1
!(a0 <= b0) ? 0xffffffffffffffff : 0x0   !(a1 <= b1) ? 0xffffffffffffffff : 0x0
R0 R1
!(a0 > b0) ? 0xffffffffffffffff : 0x0   !(a1 > b1) ? 0xffffffffffffffff : 0x0
Compares the two DP FP values of a and b for a not greater than or equal to b.
R0 R1
!(a0 >= b0) ? 0xffffffffffffffff : 0x0   !(a1 >= b1) ? 0xffffffffffffffff : 0x0
Compares the lower DP FP value of a and b for equality. The upper DP FP value is
passed through from a.
R0 R1
(a0 == b0) ? 0xffffffffffffffff : 0x0 a1
Compares the lower DP FP value of a and b for a less than b. The upper DP FP value is
passed through from a.
R0 R1
(a0 < b0) ? 0xffffffffffffffff : 0x0 a1
Compares the lower DP FP value of a and b for a less than or equal to b. The upper DP
FP value is passed through from a.
R0 R1
(a0 <= b0) ? 0xffffffffffffffff : 0x0 a1
Compares the lower DP FP value of a and b for a greater than b. The upper DP FP
value is passed through from a.
R0 R1
(a0 > b0) ? 0xffffffffffffffff : 0x0 a1
Compares the lower DP FP value of a and b for a greater than or equal to b. The upper
DP FP value is passed through from a.
R0 R1
(a0 >= b0) ? 0xffffffffffffffff : 0x0 a1
Compares the lower DP FP value of a and b for ordered. The upper DP FP value is
passed through from a.
R0 R1
(a0 ord b0) ? 0xffffffffffffffff : 0x0 a1
Compares the lower DP FP value of a and b for unordered. The upper DP FP value is
passed through from a.
R0 R1
(a0 unord b0) ? 0xffffffffffffffff : 0x0 a1
Compares the lower DP FP value of a and b for inequality. The upper DP FP value is
passed through from a.
R0 R1
(a0 != b0) ? 0xffffffffffffffff : 0x0 a1
Compares the lower DP FP value of a and b for a not less than b. The upper DP FP
value is passed through from a.
R0 R1
!(a0 < b0) ? 0xffffffffffffffff : 0x0 a1
Compares the lower DP FP value of a and b for a not less than or equal to b. The upper
DP FP value is passed through from a.
R0 R1
!(a0 <= b0) ? 0xffffffffffffffff : 0x0 a1
Compares the lower DP FP value of a and b for a not greater than b. The upper DP FP
value is passed through from a.
R0 R1
!(a0 > b0) ? 0xffffffffffffffff : 0x0 a1
R0 R1
!(a0 >= b0) ? 0xffffffffffffffff : 0x0 a1
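Because each comparison produces an all-ones or all-zeros mask per element, the result is typically combined with the logical intrinsics to select values without branching. The following is a minimal sketch of that idiom, assuming the standard _mm_cmpgt_pd, _mm_and_pd, _mm_andnot_pd, and _mm_or_pd intrinsics from emmintrin.h. The function name is illustrative only.

```c
#include <emmintrin.h>

/* For each element, writes a where a > b and b otherwise. */
void select_greater(const double a_in[2], const double b_in[2], double out[2])
{
    __m128d a = _mm_loadu_pd(a_in);
    __m128d b = _mm_loadu_pd(b_in);
    __m128d mask = _mm_cmpgt_pd(a, b);           /* all ones where a > b */

    /* (mask & a) | (~mask & b) */
    __m128d r = _mm_or_pd(_mm_and_pd(mask, a), _mm_andnot_pd(mask, b));
    _mm_storeu_pd(out, r);
}
```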
R
(a0 == b0) ? 0x1 : 0x0
R
(a0 < b0) ? 0x1 : 0x0
R
(a0 <= b0) ? 0x1 : 0x0
R
(a0 > b0) ? 0x1 : 0x0
R
(a0 >= b0) ? 0x1 : 0x0
R
(a0 != b0) ? 0x1 : 0x0
R
(a0 == b0) ? 0x1 : 0x0
R
(a0 < b0) ? 0x1 : 0x0
R
(a0 <= b0) ? 0x1 : 0x0
R
(a0 > b0) ? 0x1 : 0x0
R
(a0 >= b0) ? 0x1 : 0x0
R
(a0 != b0) ? 0x1 : 0x0
The conversion-operation intrinsics for Streaming SIMD Extensions 2 (SSE2) are listed
in the following table followed by detailed descriptions.
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed explanation of
each intrinsic. R, R0, R1, R2 and R3 represent the registers in which results are placed.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
__m128 _mm_cvtpd_ps(__m128d a)
R0 R1 R2 R3
__m128d _mm_cvtps_pd(__m128 a)
R0 R1
(double) a0 (double) a1
__m128d _mm_cvtepi32_pd(__m128i a)
R0 R1
(double) a0 (double) a1
__m128i _mm_cvtpd_epi32(__m128d a)
R0 R1 R2 R3
(int) a0 (int) a1 0x0 0x0
int _mm_cvtsd_si32(__m128d a)
R
(int) a0
R0 R1 R2 R3
(float) b0 a1 a2 a3
R0 R1
(double) b a1
R0 R1
(double) b0 a1
__m128i _mm_cvttpd_epi32(__m128d a)
R0 R1 R2 R3
(int) a0 (int) a1 0x0 0x0
int _mm_cvttsd_si32(__m128d a)
R
(int) a0
__m64 _mm_cvtpd_pi32(__m128d a)
R0 R1
(int)a0 (int) a1
__m64 _mm_cvttpd_pi32(__m128d a)
Converts the two DP FP values of a to 32-bit signed integer values using truncation.
R0 R1
(int)a0 (int) a1
__m128d _mm_cvtpi32_pd(__m64 a)
R0 R1
(double)a0 (double)a1
double _mm_cvtsd_f64(__m128d a)
This intrinsic extracts a double precision floating point value from the first vector element
of an __m128d. It does so in the most efficient manner possible in the context used. This
intrinsic does not map to any specific SSE2 instruction.
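The packed double-to-integer conversions differ only in rounding, and both clear the upper two result elements. The sketch below contrasts _mm_cvtpd_epi32 with _mm_cvttpd_epi32 and extracts a scalar with _mm_cvtsd_f64. It assumes the default round-to-nearest mode, and the function names are illustrative only.

```c
#include <emmintrin.h>

/* Converts two doubles to 32-bit integers, once with the current rounding
   mode and once with truncation. Elements 2 and 3 of each result are 0. */
void pd_to_epi32(const double in[2], int rounded[4], int truncated[4])
{
    __m128d v = _mm_loadu_pd(in);
    _mm_storeu_si128((__m128i *)rounded,   _mm_cvtpd_epi32(v));
    _mm_storeu_si128((__m128i *)truncated, _mm_cvttpd_epi32(v));
}

/* Builds a vector holding {lo, hi} and returns its first element. */
double low_element(double lo, double hi)
{
    return _mm_cvtsd_f64(_mm_set_pd(hi, lo));
}
```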
The load and set operations are similar in that both initialize __m128d data. However,
the set operations take a double argument and are intended for initialization with
constants, while the load operations take a double pointer argument and are intended to
mimic the instructions for loading data from memory.
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed explanation of
each intrinsic. R0 and R1 represent the registers in which results are placed.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
The Double Complex code sample contains examples of how to use several of these
intrinsics.
R0 R1
p[0] p[1]
Loads a single DP FP value, copying to both elements. The address p need not be 16-
byte aligned.
R0 R1
*p *p
Loads two DP FP values in reverse order. The address p must be 16-byte aligned.
R0 R1
p[1] p[0]
R0 R1
p[0] p[1]
Loads a DP FP value. The upper DP FP value is set to zero. The address p need not be
16-byte aligned.
R0 R1
*p 0.0
Loads a DP FP value as the upper DP FP value of the result. The lower DP FP value is
passed through from a. The address p need not be 16-byte aligned.
R0 R1
a0 *p
Loads a DP FP value as the lower DP FP value of the result. The upper DP FP value is
passed through from a. The address p need not be 16-byte aligned.
R0 R1
*p a1
The load and set operations are similar in that both initialize __m128d data. However,
the set operations take a double argument and are intended for initialization with
constants, while the load operations take a double pointer argument and are intended to
mimic the instructions for loading data from memory.
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed explanation of
each intrinsic. R0 and R1 represent the registers in which results are placed.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
__m128d _mm_set_sd(double w)
Sets the lower DP FP value to w and sets the upper DP FP value to zero.
R0 R1
w 0.0
__m128d _mm_set1_pd(double w)
R0 R1
w w
R0 R1
x w
R0 R1
w x
__m128d _mm_setzero_pd(void)
R0 R1
0.0 0.0
Sets the lower DP FP value to the lower DP FP value of b. The upper DP FP value is
passed through from a.
R0 R1
b0 a1
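The element order of the two-value set operations, and the pass-through behavior of the scalar move, can be sketched as follows. The sketch assumes that the "x w" and "w x" tables above belong to the standard _mm_set_pd and _mm_setr_pd intrinsics called with the arguments (w, x), and that the last table belongs to _mm_move_sd. The function name is illustrative only.

```c
#include <emmintrin.h>

/* Stores the results of three initialization forms for comparison. */
void set_pd_orders(double via_set[2], double via_setr[2], double via_move[2])
{
    __m128d a = _mm_setr_pd(10.0, 20.0);   /* a0 = 10, a1 = 20 */
    __m128d b = _mm_setr_pd(30.0, 40.0);   /* b0 = 30, b1 = 40 */

    _mm_storeu_pd(via_set,  _mm_set_pd(1.0, 2.0));   /* R0 = 2, R1 = 1 */
    _mm_storeu_pd(via_setr, _mm_setr_pd(1.0, 2.0));  /* R0 = 1, R1 = 2 */
    _mm_storeu_pd(via_move, _mm_move_sd(a, b));      /* R0 = b0, R1 = a1 */
}
```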
The detailed description of each intrinsic contains a table showing what is stored. In these
tables, dp[n] denotes the nth element of the result.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
The Double Complex code sample contains an example of how to use the _mm_store_pd
intrinsic.
Stores the lower DP FP value of a. The address dp need not be 16-byte aligned.
*dp
a0
Stores the lower DP FP value of a twice. The address dp must be 16-byte aligned.
dp[0] dp[1]
a0 a0
dp[0] dp[1]
a0 a1
dp[0] dp[1]
a0 a1
Stores two DP FP values in reverse order. The address dp must be 16-byte aligned.
dp[0] dp[1]
a1 a0
*dp
a1
*dp
a0
Integer Intrinsics
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed explanation of
each intrinsic. R, R0, R1...R15 represent the registers in which results are placed.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
Adds the 16 signed or unsigned 8-bit integers in a to the 16 signed or unsigned 8-bit
integers in b.
R0 R1 ... R15
a0 + b0 a1 + b1 ... a15 + b15
Adds the 8 signed or unsigned 16-bit integers in a to the 8 signed or unsigned 16-bit
integers in b.
R0 R1 ... R7
a0 + b0 a1 + b1 ... a7 + b7
Adds the 4 signed or unsigned 32-bit integers in a to the 4 signed or unsigned 32-bit
integers in b.
R0 R1 R2 R3
a0 + b0 a1 + b1 a2 + b2 a3 + b3
Adds the signed or unsigned 64-bit integer a to the signed or unsigned 64-bit integer b.
R0
a + b
Adds the 2 signed or unsigned 64-bit integers in a to the 2 signed or unsigned 64-bit
integers in b.
R0 R1
a0 + b0 a1 + b1
Adds the 16 signed 8-bit integers in a to the 16 signed 8-bit integers in b using saturating
arithmetic.
R0 R1 ... R15
SignedSaturate (a0 + b0)   SignedSaturate (a1 + b1)   ...   SignedSaturate (a15 + b15)
Adds the 8 signed 16-bit integers in a to the 8 signed 16-bit integers in b using saturating
arithmetic.
R0 R1 ... R7
SignedSaturate (a0 + b0)   SignedSaturate (a1 + b1)   ...   SignedSaturate (a7 + b7)
Adds the 16 unsigned 8-bit integers in a to the 16 unsigned 8-bit integers in b using
saturating arithmetic.
R0 R1 ... R15
UnsignedSaturate (a0 + b0)   UnsignedSaturate (a1 + b1)   ...   UnsignedSaturate (a15 + b15)
Adds the 8 unsigned 16-bit integers in a to the 8 unsigned 16-bit integers in b using
saturating arithmetic.
R0 R1 ... R7
UnsignedSaturate (a0 + b0)   UnsignedSaturate (a1 + b1)   ...   UnsignedSaturate (a7 + b7)
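The difference between the plain and the saturating additions shows up only when a sum leaves the representable range. The following is a minimal sketch, assuming the standard _mm_add_epi8, _mm_adds_epu8, and _mm_adds_epi8 intrinsics from emmintrin.h. The helper and function names are illustrative, and the casts between signed and unsigned bytes rely on the two's-complement behavior of common IA-32 compilers.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Returns element 0 of a vector of sixteen bytes. */
static uint8_t first_byte(__m128i v)
{
    uint8_t tmp[16];
    _mm_storeu_si128((__m128i *)tmp, v);
    return tmp[0];
}

/* Plain addition: the result wraps modulo 256. */
uint8_t add_u8_wrapping(uint8_t a, uint8_t b)
{
    return first_byte(_mm_add_epi8(_mm_set1_epi8((char)a),
                                   _mm_set1_epi8((char)b)));
}

/* Unsigned saturation: the result is clamped to 0..255. */
uint8_t add_u8_saturating(uint8_t a, uint8_t b)
{
    return first_byte(_mm_adds_epu8(_mm_set1_epi8((char)a),
                                    _mm_set1_epi8((char)b)));
}

/* Signed saturation: the result is clamped to -128..127. */
int8_t add_i8_saturating(int8_t a, int8_t b)
{
    return (int8_t)first_byte(_mm_adds_epi8(_mm_set1_epi8(a),
                                            _mm_set1_epi8(b)));
}
```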
Computes the average of the 16 unsigned 8-bit integers in a and the 16 unsigned 8-bit
integers in b and rounds.
R0 R1 ... R15
(a0 + b0) / 2 (a1 + b1) / 2 ... (a15 + b15) / 2
Computes the average of the 8 unsigned 16-bit integers in a and the 8 unsigned 16-bit
integers in b and rounds.
R0 R1 ... R7
(a0 + b0) / 2 (a1 + b1) / 2 ... (a7 + b7) / 2
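The "/ 2 ... and rounds" wording in these tables means that halves round up: the underlying instructions compute (a + b + 1) >> 1 in a wider intermediate, so the sum cannot overflow. The following is a minimal sketch using the standard _mm_avg_epu8 intrinsic. The function name is illustrative only.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Returns the rounded average of two unsigned bytes, taken from element 0
   of a vector in which every element holds the same pair of inputs. */
uint8_t avg_u8(uint8_t a, uint8_t b)
{
    uint8_t tmp[16];
    __m128i r = _mm_avg_epu8(_mm_set1_epi8((char)a), _mm_set1_epi8((char)b));
    _mm_storeu_si128((__m128i *)tmp, r);
    return tmp[0];
}
```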
Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b. Adds
the signed 32-bit integer results pairwise and packs the 4 signed 32-bit integer results.
R0 R1 R2 R3
(a0 * b0) + (a1 * b1)   (a2 * b2) + (a3 * b3)   (a4 * b4) + (a5 * b5)   (a6 * b6) + (a7 * b7)
Computes the pairwise maxima of the 8 signed 16-bit integers from a and the 8 signed
16-bit integers from b.
R0 R1 ... R7
max(a0, b0) max(a1, b1) ... max(a7, b7)
Computes the pairwise maxima of the 16 unsigned 8-bit integers from a and the 16
unsigned 8-bit integers from b.
R0 R1 ... R15
max(a0, b0) max(a1, b1) ... max(a15, b15)
Computes the pairwise minima of the 8 signed 16-bit integers from a and the 8 signed
16-bit integers from b.
R0 R1 ... R7
min(a0, b0) min(a1, b1) ... min(a7, b7)
Computes the pairwise minima of the 16 unsigned 8-bit integers from a and the 16
unsigned 8-bit integers from b.
R0 R1 ... R15
min(a0, b0) min(a1, b1) ... min(a15, b15)
Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b.
Packs the upper 16 bits of the 8 signed 32-bit results.
R0 R1 ... R7
(a0 * b0)[31:16] (a1 * b1)[31:16] ... (a7 * b7)[31:16]
Multiplies the 8 unsigned 16-bit integers from a by the 8 unsigned 16-bit integers from b.
Packs the upper 16-bits of the 8 unsigned 32-bit results.
R0 R1 ... R7
(a0 * b0)[31:16] (a1 * b1)[31:16] ... (a7 * b7)[31:16]
__m128i _mm_mullo_epi16(__m128i a, __m128i b)
Multiplies the 8 signed or unsigned 16-bit integers from a by the 8 signed or unsigned
16-bit integers from b. Packs the lower 16-bits of the 8 signed or unsigned 32-bit results.
R0 R1 ... R7
(a0 * b0)[15:0] (a1 * b1)[15:0] ... (a7 * b7)[15:0]
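The high and low halves can be recombined when the full 32-bit products are needed. The sketch below assumes the usual emmintrin.h names _mm_mulhi_epi16 and _mm_unpacklo_epi16 (not shown in the extracted text); the helper mul4_i16_to_i32 is hypothetical.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical helper: full 32-bit products of the low four signed 16-bit
   lanes. x and y must each point to 8 readable int16 values; out receives
   4 int32 values. */
void mul4_i16_to_i32(const int16_t *x, const int16_t *y, int32_t *out)
{
    __m128i a  = _mm_loadu_si128((const __m128i *)x);
    __m128i b  = _mm_loadu_si128((const __m128i *)y);
    __m128i lo = _mm_mullo_epi16(a, b);   /* bits [15:0] of each product  */
    __m128i hi = _mm_mulhi_epi16(a, b);   /* bits [31:16] of each product */
    /* Interleaving lo and hi rebuilds each 32-bit product in lanes 0..3. */
    _mm_storeu_si128((__m128i *)out, _mm_unpacklo_epi16(lo, hi));
}
```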
Multiplies the lower 32-bit integer from a by the lower 32-bit integer from b, and returns
the 64-bit integer result.
R0
a0 * b0
Multiplies 2 unsigned 32-bit integers from a by 2 unsigned 32-bit integers from b. Packs
the 2 unsigned 64-bit integer results.
R0 R1
a0 * b0 a2 * b2
Computes the absolute difference of the 16 unsigned 8-bit integers from a and the 16
unsigned 8-bit integers from b. Sums the upper 8 differences and lower 8 differences,
and packs the resulting 2 unsigned 16-bit integers into the upper and lower 64-bit
elements.
R0                                              R1   R2   R3   R4                                                R5   R6   R7
abs(a0 - b0) + abs(a1 - b1) +...+ abs(a7 - b7)  0x0  0x0  0x0  abs(a8 - b8) + abs(a9 - b9) +...+ abs(a15 - b15)  0x0  0x0  0x0
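Because the two partial sums land in words R0 and R4, a full 16-byte sum of absolute differences needs one more addition. A minimal sketch, assuming the usual emmintrin.h names _mm_sad_epu8 and _mm_extract_epi16; the helper sad16_u8 is hypothetical.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical helper: sum of absolute differences over 16 unsigned bytes. */
uint32_t sad16_u8(const uint8_t *x, const uint8_t *y)
{
    __m128i a = _mm_loadu_si128((const __m128i *)x);
    __m128i b = _mm_loadu_si128((const __m128i *)y);
    __m128i s = _mm_sad_epu8(a, b);                       /* sums in R0 and R4 */
    uint32_t low  = (uint32_t)_mm_cvtsi128_si32(s);       /* R0, with R1 == 0  */
    uint32_t high = (uint32_t)_mm_extract_epi16(s, 4);    /* R4, zero extended */
    return low + high;
}
```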
Subtracts the 16 signed or unsigned 8-bit integers of b from the 16 signed or unsigned 8-
bit integers of a.
R0 R1 ... R15
a0 - b0 a1 - b1 ... a15 - b15
__m128i _mm_sub_epi16(__m128i a, __m128i b)
Subtracts the 8 signed or unsigned 16-bit integers of b from the 8 signed or unsigned 16-
bit integers of a.
R0 R1 ... R7
a0 - b0 a1 - b1 ... a7 - b7
Subtracts the 4 signed or unsigned 32-bit integers of b from the 4 signed or unsigned 32-
bit integers of a.
R0 R1 R2 R3
a0 - b0 a1 - b1 a2 - b2 a3 - b3
Subtracts the signed or unsigned 64-bit integer b from the signed or unsigned 64-bit
integer a.
R
a - b
Subtracts the 2 signed or unsigned 64-bit integers in b from the 2 signed or unsigned 64-
bit integers in a.
R0 R1
a0 - b0 a1 - b1
Subtracts the 16 signed 8-bit integers of b from the 16 signed 8-bit integers of a using
saturating arithmetic.
R0 R1 ... R15
SignedSaturate(a0 - b0)  SignedSaturate(a1 - b1)  ...  SignedSaturate(a15 - b15)
Subtracts the 8 signed 16-bit integers of b from the 8 signed 16-bit integers of a using
saturating arithmetic.
R0 R1 ... R7
SignedSaturate(a0 - b0)  SignedSaturate(a1 - b1)  ...  SignedSaturate(a7 - b7)
Subtracts the 16 unsigned 8-bit integers of b from the 16 unsigned 8-bit integers of a
using saturating arithmetic.
R0 R1 ... R15
UnsignedSaturate(a0 - b0)  UnsignedSaturate(a1 - b1)  ...  UnsignedSaturate(a15 - b15)
Subtracts the 8 unsigned 16-bit integers of b from the 8 unsigned 16-bit integers of a
using saturating arithmetic.
R0 R1 ... R7
UnsignedSaturate(a0 - b0)  UnsignedSaturate(a1 - b1)  ...  UnsignedSaturate(a7 - b7)
For detailed information about an intrinsic, click on that intrinsic name in the following
table.
The results of each intrinsic operation are placed in register R. The information about
what is placed in each register appears in the tables below, in the detailed explanation of
each intrinsic.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
Computes the bitwise AND of the 128-bit value in a and the 128-bit value in b.
R0
a & b
Computes the bitwise AND of the 128-bit value in b and the bitwise NOT of the 128-bit
value in a.
R0
(~a) & b
Computes the bitwise OR of the 128-bit value in a and the 128-bit value in b.
R0
a | b
Computes the bitwise XOR of the 128-bit value in a and the 128-bit value in b.
R0
a ^ b
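The AND, AND-NOT, and OR operations combine into a bitwise select, which is a typical reason the (~a) & b form exists. The sketch below assumes the usual emmintrin.h names (_mm_and_si128, _mm_andnot_si128, _mm_or_si128); select_bits and select_lane0 are hypothetical helpers.

```c
#include <emmintrin.h>

/* Hypothetical helper: takes bits of x where mask is 1 and bits of y
   where mask is 0. */
__m128i select_bits(__m128i mask, __m128i x, __m128i y)
{
    return _mm_or_si128(_mm_and_si128(mask, x),
                        _mm_andnot_si128(mask, y));   /* (~mask) & y */
}

/* Hypothetical helper for checking a single 32-bit lane. */
int select_lane0(int mask, int x, int y)
{
    return _mm_cvtsi128_si32(select_bits(_mm_set1_epi32(mask),
                                         _mm_set1_epi32(x),
                                         _mm_set1_epi32(y)));
}
```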
For detailed information about an intrinsic, click on that intrinsic name in the following
table.
The results of each intrinsic operation are placed in a register. This register is illustrated
for each intrinsic with R and R0-R7. R and R0-R7 each represent one of the pieces of
the result register.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
Note
The count argument is one shift count that applies to all elements of the operand being
shifted. It is not a vector shift count that shifts each element by a different amount.
Shifts the 128-bit value in a left by imm bytes while shifting in zeros. imm must be an
immediate.
R
a << (imm * 8)
R0 R1 ... R7
a0 << count a1 << count ... a7 << count
R0 R1 ... R7
a0 << count a1 << count ... a7 << count
R0 R1 R2 R3
a0 << count a1 << count a2 << count a3 << count
R0 R1 R2 R3
a0 << count a1 << count a2 << count a3 << count
R0 R1
a0 << count a1 << count
R0 R1
a0 << count a1 << count
R0 R1 ... R7
a0 >> count a1 >> count ... a7 >> count
R0 R1 ... R7
a0 >> count a1 >> count ... a7 >> count
R0 R1 R2 R3
a0 >> count a1 >> count a2 >> count a3 >> count
R0 R1 R2 R3
a0 >> count a1 >> count a2 >> count a3 >> count
R
srl(a, imm*8)
R0 R1 ... R7
srl(a0, count) srl(a1, count) ... srl(a7, count)
R0 R1 ... R7
srl(a0, count) srl(a1, count) ... srl(a7, count)
R0 R1 R2 R3
srl(a0, count) srl(a1, count) srl(a2, count) srl(a3, count)
R0 R1 R2 R3
srl(a0, count) srl(a1, count) srl(a2, count) srl(a3, count)
R0 R1
srl(a0, count) srl(a1, count)
R0 R1
srl(a0, count) srl(a1, count)
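In the tables above, a >> count denotes an arithmetic right shift (sign bits shifted in) and srl(a, count) a logical one (zeros shifted in). A small sketch of the difference, assuming the usual emmintrin.h names _mm_sra_epi16 and _mm_srl_epi16; the helpers sra16 and srl16 are hypothetical. Note that the count is a single value taken from the low bits of a vector and applies to every element.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical helper: arithmetic right shift of a 16-bit lane. */
int16_t sra16(int16_t x, int count)
{
    __m128i r = _mm_sra_epi16(_mm_set1_epi16(x), _mm_cvtsi32_si128(count));
    return (int16_t)_mm_cvtsi128_si32(r);    /* lane 0 */
}

/* Hypothetical helper: logical right shift of a 16-bit lane. */
uint16_t srl16(int16_t x, int count)
{
    __m128i r = _mm_srl_epi16(_mm_set1_epi16(x), _mm_cvtsi32_si128(count));
    return (uint16_t)_mm_cvtsi128_si32(r);   /* lane 0 */
}
```

Shifting -16 (0xfff0) right by 2 gives -4 arithmetically and 0x3ffc logically.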
For detailed information about an intrinsic, click on that intrinsic name in the following
table.
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed explanation of
each intrinsic. R, R0, R1...R15 represent the registers in which results are placed.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
Compares the 16 signed or unsigned 8-bit integers in a and the 16 signed or unsigned 8-
bit integers in b for equality.
R0 R1 ... R15
(a0 == b0) ? 0xff : 0x0  (a1 == b1) ? 0xff : 0x0  ...  (a15 == b15) ? 0xff : 0x0
Compares the 8 signed or unsigned 16-bit integers in a and the 8 signed or unsigned 16-
bit integers in b for equality.
R0 R1 ... R7
(a0 == b0) ? 0xffff : 0x0  (a1 == b1) ? 0xffff : 0x0  ...  (a7 == b7) ? 0xffff : 0x0
Compares the 4 signed or unsigned 32-bit integers in a and the 4 signed or unsigned 32-
bit integers in b for equality.
R0 R1 R2 R3
(a0 == b0) ? 0xffffffff : 0x0  (a1 == b1) ? 0xffffffff : 0x0  (a2 == b2) ? 0xffffffff : 0x0  (a3 == b3) ? 0xffffffff : 0x0
Compares the 16 signed 8-bit integers in a and the 16 signed 8-bit integers in b for
greater than.
R0 R1 ... R15
(a0 > b0) ? 0xff : 0x0 (a1 > b1) ? 0xff : 0x0 ... (a15 > b15) ? 0xff : 0x0
Compares the 8 signed 16-bit integers in a and the 8 signed 16-bit integers in b for
greater than.
R0 R1 ... R7
(a0 > b0) ? 0xffff : 0x0  (a1 > b1) ? 0xffff : 0x0  ...  (a7 > b7) ? 0xffff : 0x0
Compares the 4 signed 32-bit integers in a and the 4 signed 32-bit integers in b for
greater than.
R0 R1 R2 R3
(a0 > b0) ? 0xffffffff : 0x0  (a1 > b1) ? 0xffffffff : 0x0  (a2 > b2) ? 0xffffffff : 0x0  (a3 > b3) ? 0xffffffff : 0x0
Compares the 16 signed 8-bit integers in a and the 16 signed 8-bit integers in b for less
than.
R0 R1 ... R15
(a0 < b0) ? 0xff : 0x0 (a1 < b1) ? 0xff : 0x0 ... (a15 < b15) ? 0xff : 0x0
Compares the 8 signed 16-bit integers in a and the 8 signed 16-bit integers in b for less
than.
R0 R1 ... R7
(a0 < b0) ? 0xffff : 0x0  (a1 < b1) ? 0xffff : 0x0  ...  (a7 < b7) ? 0xffff : 0x0
Compares the 4 signed 32-bit integers in a and the 4 signed 32-bit integers in b for less
than.
R0 R1 R2 R3
(a0 < b0) ? 0xffffffff : 0x0  (a1 < b1) ? 0xffffffff : 0x0  (a2 < b2) ? 0xffffffff : 0x0  (a3 < b3) ? 0xffffffff : 0x0
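Because each comparison yields an all-ones or all-zeros element, the result can be used directly as a selection mask. As an illustration, the sketch below builds a signed 32-bit lanewise maximum from the greater-than comparison. It assumes the usual emmintrin.h name _mm_cmpgt_epi32; the helper max4_i32 is hypothetical.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical helper: out[i] = max(x[i], y[i]) for four signed 32-bit values. */
void max4_i32(const int32_t *x, const int32_t *y, int32_t *out)
{
    __m128i a  = _mm_loadu_si128((const __m128i *)x);
    __m128i b  = _mm_loadu_si128((const __m128i *)y);
    __m128i gt = _mm_cmpgt_epi32(a, b);            /* 0xffffffff where a > b */
    /* Keep a where the mask is set, b elsewhere. */
    __m128i r  = _mm_or_si128(_mm_and_si128(gt, a), _mm_andnot_si128(gt, b));
    _mm_storeu_si128((__m128i *)out, r);
}
```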
For detailed information about an intrinsic, click on that intrinsic name in the following
table.
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed explanation of
each intrinsic. R, R0, R1, R2 and R3 represent the registers in which results are placed.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
Converts the signed 64-bit integer value in b to a DP FP value. The upper DP FP value
in a is passed through.
R0 R1
(double)b a1
__int64 _mm_cvtsd_si64(__m128d a)
Converts the lower DP FP value of a to a 64-bit signed integer value according to the
current rounding mode.
R
(__int64) a0
__int64 _mm_cvttsd_si64(__m128d a)
Converts the lower DP FP value of a to a 64-bit signed integer value using truncation.
R
(__int64) a0
__m128 _mm_cvtepi32_ps(__m128i a)
R0 R1 R2 R3
(float) a0 (float) a1 (float) a2 (float) a3
__m128i _mm_cvtps_epi32(__m128 a)
R0 R1 R2 R3
(int) a0 (int) a1 (int) a2 (int) a3
__m128i _mm_cvttps_epi32(__m128 a)
R0 R1 R2 R3
(int) a0 (int) a1 (int) a2 (int) a3
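The two float-to-integer conversions share the same table notation but differ in rounding: _mm_cvtps_epi32 follows the current rounding mode, while _mm_cvttps_epi32 truncates toward zero. A minimal sketch, assuming the rounding mode is left at the default round-to-nearest-even; the helpers cvt_round and cvt_trunc are hypothetical.

```c
#include <emmintrin.h>

/* Hypothetical helper: converts using the current rounding mode
   (assumed to be the default, round to nearest even). */
int cvt_round(float x)
{
    return _mm_cvtsi128_si32(_mm_cvtps_epi32(_mm_set1_ps(x)));
}

/* Hypothetical helper: converts with truncation toward zero. */
int cvt_trunc(float x)
{
    return _mm_cvtsi128_si32(_mm_cvttps_epi32(_mm_set1_ps(x)));
}
```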
For detailed information about an intrinsic, click on that intrinsic name in the following
table.
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed explanation of
each intrinsic. R, R0, R1, R2 and R3 represent the registers in which results are placed.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
__m128i _mm_cvtsi32_si128(int a)
Moves 32-bit integer a to the least significant 32 bits of an __m128i object. Zeroes the
upper 96 bits of the __m128i object.
R0 R1 R2 R3
a 0x0 0x0 0x0
__m128i _mm_cvtsi64_si128(__int64 a)
Moves 64-bit integer a to the lower 64 bits of an __m128i object, zeroing the upper bits.
R0 R1
a 0x0
int _mm_cvtsi128_si32(__m128i a)
R
a0
__int64 _mm_cvtsi128_si64(__m128i a)
R
a0
For detailed information about an intrinsic, click on that intrinsic name in the following
table.
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed explanation of
each intrinsic. R, R0 and R1 represent the registers in which results are placed.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
R
*p
R
*p
Load the lower 64 bits of the value pointed to by p into the lower 64 bits of the result,
zeroing the upper 64 bits of the result.
R0 R1
*p[63:0] 0x0
For detailed information about an intrinsic, click on that intrinsic name in the following
table.
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed explanation of
each intrinsic. R, R0, R1...R15 represent the registers in which results are placed.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
R0 R1
q0 q1
R0 R1 R2 R3
i0 i1 i2 i3
__m128i _mm_set_epi16(short w7, short w6, short w5, short w4, short w3,
short w2, short w1, short w0)
R0 R1 ... R7
w0 w1 ... w7
__m128i _mm_set_epi8(char b15, char b14, char b13, char b12, char b11,
char b10, char b9, char b8, char b7, char b6, char b5, char b4, char b3,
char b2, char b1, char b0)
R0 R1 ... R15
b0 b1 ... b15
__m128i _mm_set1_epi64(__m64 q)
R0 R1
q q
__m128i _mm_set1_epi32(int i)
R0 R1 R2 R3
i i i i
__m128i _mm_set1_epi16(short w)
R0 R1 ... R7
w w ... w
__m128i _mm_set1_epi8(char b)
R0 R1 ... R15
b b ... b
R0 R1
q0 q1
R0 R1 R2 R3
i0 i1 i2 i3
__m128i _mm_setr_epi16(short w0, short w1, short w2, short w3, short w4,
short w5, short w6, short w7)
R0 R1 ... R7
w0 w1 ... w7
__m128i _mm_setr_epi8(char b0, char b1, char b2, char b3, char b4, char b5,
char b6, char b7, char b8, char b9, char b10, char b11, char b12, char b13,
char b14, char b15)
R0 R1 ... R15
b0 b1 ... b15
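The set and setr forms differ only in argument order: set takes the highest element first, setr takes the lowest element first. The sketch below stores both results to memory to show that they match; the helper set_order is hypothetical, and _mm_set_epi32 and _mm_setr_epi32 are the usual emmintrin.h names.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical helper: builds the same vector two ways and stores each
   to a 4-element array (element R0 lands at index 0). */
void set_order(int32_t *via_set, int32_t *via_setr)
{
    _mm_storeu_si128((__m128i *)via_set,  _mm_set_epi32(3, 2, 1, 0));
    _mm_storeu_si128((__m128i *)via_setr, _mm_setr_epi32(0, 1, 2, 3));
}
```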
__m128i _mm_setzero_si128()
R
0x0
The detailed description of each intrinsic contains a table detailing the returns. In these
tables, p is an access to the result.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
*p
a
*p
a
Conditionally store byte elements of d to address p. The high bit of each byte in the
selector n determines whether the corresponding byte in d will be stored. Address p
need not be 16-byte aligned.
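The conditional byte store can be sketched as follows. The intrinsic name _mm_maskmoveu_si128 is the usual emmintrin.h name and is an assumption here, since the prototype is not shown above; the helper store_selected is hypothetical.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical helper: overwrites only those bytes of dst whose selector
   byte in sel has its high bit set; other bytes of dst are left unchanged. */
void store_selected(uint8_t *dst, const uint8_t *src, const uint8_t *sel)
{
    __m128i d = _mm_loadu_si128((const __m128i *)src);
    __m128i n = _mm_loadu_si128((const __m128i *)sel);
    _mm_maskmoveu_si128(d, n, (char *)dst);   /* dst need not be aligned */
}
```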
*p[63:0]
a0
The prototypes for Streaming SIMD Extensions 2 (SSE2) intrinsics are in the
emmintrin.h header file.
Stores the data in a to the address p without polluting caches. The address p must be
16-byte aligned. If the cache line containing address p is already in the cache, the cache
will be updated.
p[0] := a0
p[1] := a1
p[0] p[1]
a0 a1
Stores the data in a to the address p without polluting the caches. If the cache line
containing address p is already in the cache, the cache will be updated. Address p must
be 16-byte aligned.
*p
a
Stores the data in a to the address p without polluting the caches. If the cache line
containing address p is already in the cache, the cache will be updated.
*p
a
Cache line containing p is flushed and invalidated from all caches in the coherency
domain.
void _mm_lfence(void)
Guarantees that every load instruction that precedes, in program order, the load fence
instruction is globally visible before any load instruction which follows the fence in
program order.
void _mm_mfence(void)
Guarantees that every memory access that precedes, in program order, the memory
fence instruction is globally visible before any memory instruction which follows the
fence in program order.
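Streaming stores are usually paired with a fence so that later memory accesses observe the stored data in order. The following sketch assumes the usual emmintrin.h name _mm_stream_si128 for the 128-bit streaming store described earlier; the helper fill_stream_i32 is hypothetical.

```c
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fills n_vec 16-byte blocks at dst with value using
   non-temporal stores. dst must be 16-byte aligned. */
void fill_stream_i32(int32_t *dst, size_t n_vec, int32_t value)
{
    __m128i v = _mm_set1_epi32(value);
    for (size_t i = 0; i < n_vec; i++)
        _mm_stream_si128((__m128i *)dst + i, v);
    _mm_mfence();   /* order the streaming stores before later accesses */
}
```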
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
Packs the 16 signed 16-bit integers from a and b into 8-bit integers and saturates.
Packs the 8 signed 32-bit integers from a and b into signed 16-bit integers and
saturates.
R0 ... R3 R4 ... R7
SignedSaturate(a0)  ...  SignedSaturate(a3)  SignedSaturate(b0)  ...  SignedSaturate(b3)
Packs the 16 signed 16-bit integers from a and b into 8-bit unsigned integers and
saturates.
Extracts the selected signed or unsigned 16-bit integer from a and zero extends. The
selector imm must be an immediate.
R0
(imm == 0) ? a0 : ((imm == 1) ? a1 : ... (imm == 7) ? a7)
Inserts the least significant 16 bits of b into the selected 16-bit integer of a. The selector
imm must be an immediate.
R0 R1 ... R7
(imm == 0) ? b : a0; (imm == 1) ? b : a1; ... (imm == 7) ? b : a7;
int _mm_movemask_epi8(__m128i a)
Creates a 16-bit mask from the most significant bits of the 16 signed or unsigned 8-bit
integers in a and zero extends the upper bits.
R0
a15[7] << 15 | a14[7] << 14 | ... a1[7] << 1 | a0[7]
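Combined with a byte comparison, the 16-bit mask gives a compact way to locate a byte in a 16-byte block. A minimal sketch; the helper find_byte16 is hypothetical, and _mm_cmpeq_epi8 is the usual emmintrin.h name for the 8-bit equality comparison.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical helper: index of the first byte equal to c in the 16 bytes
   at p, or -1 if there is none. */
int find_byte16(const uint8_t *p, uint8_t c)
{
    __m128i v  = _mm_loadu_si128((const __m128i *)p);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_set1_epi8((char)c));
    int mask   = _mm_movemask_epi8(eq);   /* bit i is set when p[i] == c */
    for (int i = 0; i < 16; i++)
        if (mask & (1 << i))
            return i;
    return -1;
}
```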
Shuffles the 4 signed or unsigned 32-bit integers in a as specified by imm. The shuffle
value, imm, must be an immediate. See Macro Function for Shuffle for a description of
shuffle semantics.
Shuffles the upper 4 signed or unsigned 16-bit integers in a as specified by imm. The
shuffle value, imm, must be an immediate. See Macro Function for Shuffle for a
description of shuffle semantics.
Shuffles the lower 4 signed or unsigned 16-bit integers in a as specified by imm. The
shuffle value, imm, must be an immediate. See Macro Function for Shuffle for a
description of shuffle semantics.
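As an illustration of the immediate shuffle value, the sketch below reverses the four 32-bit elements of a vector. It assumes the usual names _mm_shuffle_epi32 and the _MM_SHUFFLE macro (available through emmintrin.h); the helper reverse4_i32 is hypothetical.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical helper: writes the four 32-bit values of in to out in
   reverse order. */
void reverse4_i32(const int32_t *in, int32_t *out)
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* _MM_SHUFFLE lists selectors from R3 down to R0, so (0, 1, 2, 3)
       picks a0 for R3, a1 for R2, a2 for R1, and a3 for R0. */
    __m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_si128((__m128i *)out, r);
}
```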
Interleaves the upper 8 signed or unsigned 8-bit integers in a with the upper 8 signed or
unsigned 8-bit integers in b.
Interleaves the upper 4 signed or unsigned 16-bit integers in a with the upper 4 signed or
unsigned 16-bit integers in b.
R0 R1 R2 R3 R4 R5 R6 R7
a4 b4 a5 b5 a6 b6 a7 b7
Interleaves the upper 2 signed or unsigned 32-bit integers in a with the upper 2 signed or
unsigned 32-bit integers in b.
R0 R1 R2 R3
a2 b2 a3 b3
Interleaves the upper signed or unsigned 64-bit integer in a with the upper signed or
unsigned 64-bit integer in b.
R0 R1
a1 b1
Interleaves the lower 8 signed or unsigned 8-bit integers in a with the lower 8 signed or
unsigned 8-bit integers in b.
Interleaves the lower 4 signed or unsigned 16-bit integers in a with the lower 4 signed or
unsigned 16-bit integers in b.
R0 R1 R2 R3 R4 R5 R6 R7
a0 b0 a1 b1 a2 b2 a3 b3
Interleaves the lower 2 signed or unsigned 32-bit integers in a with the lower 2 signed or
unsigned 32-bit integers in b.
R0 R1 R2 R3
a0 b0 a1 b1
Interleaves the lower signed or unsigned 64-bit integer in a with the lower signed or
unsigned 64-bit integer in b.
R0 R1
a0 b0
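Interleaving with a zero vector is a common way to widen unsigned elements. The sketch below zero-extends 8 bytes to 8 16-bit integers; it assumes the usual emmintrin.h names _mm_unpacklo_epi8 and _mm_loadl_epi64, and the helper widen8_u8_to_u16 is hypothetical.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical helper: widens 8 unsigned bytes at src to 8 unsigned
   16-bit integers at dst. */
void widen8_u8_to_u16(const uint8_t *src, uint16_t *dst)
{
    __m128i v = _mm_loadl_epi64((const __m128i *)src);      /* 8 bytes, upper half zero */
    __m128i w = _mm_unpacklo_epi8(v, _mm_setzero_si128());  /* a0 0 a1 0 ... a7 0 */
    _mm_storeu_si128((__m128i *)dst, w);
}
```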
__m64 _mm_movepi64_pi64(__m128i a)
R0
a0
__m128i _mm_movpi64_epi64(__m64 a)
Moves the 64 bits of a to the lower 64 bits of the result, zeroing the upper bits.
R0 R1
a0 0x0
__m128i _mm_move_epi64(__m128i a)
Moves the lower 64 bits of a to the lower 64 bits of the result, zeroing the upper bits.
R0 R1
a0 0x0
R0 R1
a1 b1
R0 R1
a0 b0
int _mm_movemask_pd(__m128d a)
Creates a two-bit mask from the sign bits of the two DP FP values of a.
R
sign(a1) << 1 | sign(a0)
Selects two specific DP FP values from a and b, based on the mask i. The mask must
be an immediate. See Macro Function for Shuffle for a description of the shuffle
semantics.
This version of the Intel® C++ Compiler supports casting between various SP, DP, and
INT vector types. These intrinsics do not convert values; they change one data type to
another without changing the value.
The intrinsics for casting support do not correspond to any Streaming SIMD Extensions
2 (SSE2) instructions.
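The distinction between casting and converting can be seen by applying both to the same value. The sketch below assumes the cast intrinsic _mm_castps_si128 from emmintrin.h, whose name is not shown in the text above; float_bits and float_value are hypothetical helpers.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical helper: reinterprets a float as its 32-bit pattern using
   a cast intrinsic. No value conversion takes place. */
uint32_t float_bits(float x)
{
    return (uint32_t)_mm_cvtsi128_si32(_mm_castps_si128(_mm_set_ss(x)));
}

/* Hypothetical helper: converts the float value to an integer value. */
int float_value(float x)
{
    return _mm_cvtsi128_si32(_mm_cvtps_epi32(_mm_set_ss(x)));
}
```

For 1.0f, the cast yields the IEEE 754 bit pattern 0x3f800000, while the conversion yields the integer 1.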
void _mm_pause(void)
PAUSE Intrinsic
The PAUSE intrinsic is used in spin-wait loops with the processors implementing dynamic
execution (especially out-of-order execution). In the spin-wait loop, PAUSE improves the
speed at which the code detects the release of the lock. For dynamic scheduling, the
PAUSE instruction reduces the penalty of exiting from the spin-loop.
spin_loop: pause
           cmp eax, A
           jne spin_loop
In this example, the program spins until memory location A matches the value in register
eax. The code sequence that follows shows a test-and-test-and-set. In this example, the
spin occurs only after the attempt to get a lock has failed.
Critical Section
           // critical_section code
           mov A, 0        ; Release lock
           jmp continue
spin_loop: pause           ; spin-loop hint
           cmp 0, A        ; check lock availability
           jne spin_loop
           jmp get_lock
continue:  // other code
Note that the first branch is predicted to fall through to the critical section in anticipation
of successfully gaining access to the lock. It is highly recommended that all spin-wait
loops include the PAUSE instruction. Since PAUSE is backwards compatible with all existing
IA-32 processor generations, a test for processor type (a CPUID test) is not needed. All
legacy processors will execute PAUSE as a NOP, but in processors which use PAUSE
as a hint there can be a significant performance benefit.
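The same test-and-test-and-set pattern can be sketched in C with the _mm_pause intrinsic. This is an illustration under stated assumptions, not the compiler's own lock implementation: it uses C11 <stdatomic.h> for the exchange and load, and the type tts_lock and the functions spin_lock and spin_unlock are hypothetical names.

```c
#include <emmintrin.h>    /* _mm_pause */
#include <stdatomic.h>

/* Hypothetical lock type: 0 means free, 1 means held. */
typedef struct { atomic_int locked; } tts_lock;

void spin_lock(tts_lock *l)
{
    for (;;) {
        /* Try to get the lock; exchange returns the previous value. */
        if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0)
            return;
        /* The attempt failed: spin on plain loads with the PAUSE hint
           until the lock looks free, then retry the exchange. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
            _mm_pause();
    }
}

void spin_unlock(tts_lock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

As in the assembly sequence, the spin occurs only after an attempt to get the lock has failed.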
You can view the two integers as selectors for choosing which two words from the first
input operand and which two words from the second are to be put into the result word.
The prototypes for these intrinsics are in the pmmintrin.h header file.
Note
You can also use the single ia32intrin.h header file for any IA-32 intrinsics.
Loads an unaligned 128-bit value. This differs from movdqu in that it can provide higher
performance in some cases. However, it also may provide lower performance than
movdqu if the memory value being read was just previously written.
*p;
The results of each intrinsic operation are placed in the registers R0, R1, R2, and R3.
To see detailed information about an intrinsic, click on that intrinsic name in the following
table.
The prototypes for these intrinsics are in the pmmintrin.h header file.
R0 R1 R2 R3
a0 - b0; a1 + b1; a2 - b2; a3 + b3;
R0 R1 R2 R3
a0 + a1; a2 + a3; b0 + b1; b2 + b3;
R0 R1 R2 R3
a0 - a1; a2 - a3; b0 - b1; b2 - b3;
R0 R1 R2 R3
a1; a1; a3; a3;
a0;
The floating-point intrinsics listed here are designed for the Intel® Pentium® 4
processor with Streaming SIMD Extensions 3 (SSE3).
The results of each intrinsic operation are placed in the registers R0 and R1.
To see detailed information about an intrinsic, click on that intrinsic name in the following table.
The prototypes for these intrinsics are in the pmmintrin.h header file.
R0 R1
a0 - b0; a1 + b1;
R0 R1
a0 + a1; b0 + b1;
R0 R1
a0 - a1; b0 - b1;
R0 R1
*dp; *dp;
R0 R1
a0; a0;
The prototypes for these intrinsics are in the pmmintrin.h header file.
_MM_SET_DENORMALS_ZERO_MODE(x)
_MM_GET_DENORMALS_ZERO_MODE()
No arguments. This returns the current value of the denormals are zero mode bit of the
control register.
The prototypes for these intrinsics are in the pmmintrin.h header file.
Generates the MONITOR instruction. This sets up an address range for the monitor
hardware using p to provide the logical address, and will be passed to the monitor
instruction in register eax. The extensions parameter contains optional extensions to the
monitor hardware which will be passed in ecx. The hints parameter will contain hints to
the monitor hardware, which will be passed in edx. A non-zero value for extensions will
cause a general protection fault.
Generates the MWAIT instruction. This instruction is a hint that allows the processor to
stop execution and enter an implementation-dependent optimized state until occurrence
of a class of events. In future processor designs extensions and hints parameters may
be used to convey additional information to the processor. All non-zero values of
extensions and hints are reserved. A non-zero value for extensions will cause a general
protection fault.
The prototypes for these intrinsics are in the ia64intrin.h header file.
Itanium processors do not support SSE2 intrinsics. However, you can use the sse2mmx.h
emulation pack to enable support for SSE2 instructions on Itanium architecture.
For information on how to use SSE intrinsics on Itanium architecture, see Using
Streaming SIMD Extensions on Itanium(R) Architecture.
For information on how to use MMX(TM) technology intrinsics on Itanium architecture,
see MMX(TM) Technology Intrinsics on Itanium Architecture.
Integer Operations
FSR Operations
Intrinsic Description
void _fsetc(int amask, int omask)
Sets the control bits of FPSR.sf0. Maps to the fsetc.sf0 r, r instruction. There is no
corresponding instruction to read the control bits. Use _mm_getfpsr().

void _fclrf(void)
Clears the floating point status flags (the 6-bit flags of FPSR.sf0). Maps to the
fclrf.sf0 instruction.
The right-justified 64-bit value r is deposited into the value in s at an arbitrary bit position
and the result is returned. The deposited bit field begins at bit position pos and extends
to the left (toward the most significant bit) the number of bits specified by len.
The sign-extended value v (either all 1s or all 0s) is deposited into the value in s at an
arbitrary bit position and the result is returned. The deposited bit field begins at bit
position p and extends to the left (toward the most significant bit) the number of bits
specified by len.
The right-justified 64-bit value s is deposited into a 64-bit field of all zeros at an arbitrary
bit position and the result is returned. The deposited bit field begins at bit position pos
and extends to the left (toward the most significant bit) the number of bits specified by
len.
The sign-extended value v (either all 1s or all 0s) is deposited into a 64-bit field of all
zeros at an arbitrary bit position and the result is returned. The deposited bit field begins
at bit position pos and extends to the left (toward the most significant bit) the number of
bits specified by len.
A field is extracted from the 64-bit value r and is returned right-justified and sign
extended. The extracted field begins at position pos and extends len bits to the left. The
sign is taken from the most significant bit of the extracted field.
A field is extracted from the 64-bit value r and is returned right-justified and zero
extended. The extracted field begins at position pos and extends len bits to the left.
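The deposit and extract semantics described above can be modeled in portable C. The sketch below is a reference model only, not the Itanium intrinsics themselves: all function names are hypothetical, and it assumes 1 <= len <= 64, pos < 64, and pos + len <= 64.

```c
#include <stdint.h>

/* Mask with the low len bits set (len assumed to be 1..64). */
uint64_t low_mask(unsigned len)
{
    return len >= 64 ? ~UINT64_C(0) : (UINT64_C(1) << len) - 1;
}

/* Zero-extended extract: len bits of r starting at bit pos. */
uint64_t extract_u(uint64_t r, unsigned pos, unsigned len)
{
    return (r >> pos) & low_mask(len);
}

/* Sign-extended extract: the most significant bit of the field is the sign. */
int64_t extract_s(uint64_t r, unsigned pos, unsigned len)
{
    uint64_t f    = extract_u(r, pos, len);
    uint64_t sign = UINT64_C(1) << (len - 1);
    return (int64_t)((f ^ sign) - sign);
}

/* Deposit: the low len bits of r replace bits [pos+len-1 : pos] of s.
   Passing s = 0 models the deposit into a field of all zeros. */
uint64_t deposit(uint64_t r, uint64_t s, unsigned pos, unsigned len)
{
    uint64_t m = low_mask(len) << pos;
    return (s & ~m) | ((r << pos) & m);
}
```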
The 64-bit values a and b are treated as signed integers and multiplied to produce a full
128-bit signed result. The 64-bit value c is zero-extended and added to the product. The
least significant 64 bits of the sum are then returned.
The 64-bit values a and b are treated as unsigned integers and multiplied to produce a full
128-bit unsigned result. The 64-bit value c is zero-extended and added to the product.
The least significant 64 bits of the sum are then returned.
The 64-bit values a and b are treated as signed integers and multiplied to produce a full
128-bit signed result. The 64-bit value c is zero-extended and added to the product. The
most significant 64 bits of the sum are then returned.
The 64-bit values a and b are treated as unsigned integers and multiplied to produce a
full 128-bit unsigned result. The 64-bit value c is zero-extended and added to the
product. The most significant 64 bits of the sum are then returned.
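The unsigned multiply-add forms can be modeled in C as a reference for the low-half and high-half results. This sketch is not the Itanium intrinsic; the function names are hypothetical, and it relies on the unsigned __int128 type, which is a GCC and Clang extension rather than standard C.

```c
#include <stdint.h>

/* Model of the unsigned low-half form: least significant 64 bits of
   a * b + c. Unsigned wraparound keeps exactly the low 64 bits. */
uint64_t xma_low_u(uint64_t a, uint64_t b, uint64_t c)
{
    return a * b + c;
}

/* Model of the unsigned high-half form: most significant 64 bits of the
   128-bit sum a * b + c, with c zero-extended. */
uint64_t xma_high_u(uint64_t a, uint64_t b, uint64_t c)
{
    unsigned __int128 sum = (unsigned __int128)a * b + c;   /* compiler extension */
    return (uint64_t)(sum >> 64);
}
```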
__int64 _m64_popcnt(__int64 a)
Counts the bits in the 64-bit integer a that have the value 1 and returns the
resulting sum.
a is shifted to the left by count bits and then added to b. The result is returned.
a and b are concatenated to form a 128-bit value and shifted to the right count bits. The
least significant 64 bits of the result are returned.
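A portable model of the shift-and-add and shift-right-pair operations may help when checking results. The function names are hypothetical, count is assumed to be in the range 0 to 63, and the model assumes that a forms the upper half of the concatenated 128-bit value.

```c
#include <stdint.h>

/* Model: a shifted left by count bits, then added to b. */
uint64_t shladd_model(uint64_t a, unsigned count, uint64_t b)
{
    return (a << count) + b;
}

/* Model: low 64 bits of the 128-bit value a:b shifted right by count bits. */
uint64_t shrp_model(uint64_t a, uint64_t b, unsigned count)
{
    if (count == 0)
        return b;                       /* avoid an undefined shift by 64 */
    return (a << (64 - count)) | (b >> count);
}
```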
Intrinsic Description
unsigned __int64 _InterlockedExchange8(volatile unsigned char *Target,
unsigned __int64 value)
Maps to the xchg1 instruction. Atomically writes the least significant byte of its 2nd
argument to the address specified by its 1st argument.

unsigned __int64 _InterlockedCompareExchange8_rel(volatile unsigned char *Destination,
unsigned __int64 Exchange, unsigned __int64 Comparand)
Compares and exchanges atomically the least significant byte at the address specified
by its 1st argument. Maps to the cmpxchg1.rel instruction with appropriate setup.

unsigned __int64 _InterlockedCompareExchange8_acq(volatile unsigned char *Destination,
unsigned __int64 Exchange, unsigned __int64 Comparand)
Same as the previous intrinsic, but using acquire semantics.

unsigned __int64 _InterlockedExchange16(volatile unsigned short *Target,
unsigned __int64 value)
Maps to the xchg2 instruction. Atomically writes the least significant word of its 2nd
Note
Uses cmpxchg to do an atomic sub of the incr value to the target. Maps to a loop with
the cmpxchg instruction to guarantee atomicity.
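The compare-exchange loop mentioned in the note can be sketched in portable C11. This illustrates the general technique only, not the code the compiler generates for the intrinsic; the function name atomic_sub_cas is hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: atomically subtracts incr from *target with a
   compare-exchange loop and returns the value held before the subtraction. */
uint64_t atomic_sub_cas(_Atomic uint64_t *target, uint64_t incr)
{
    uint64_t old = atomic_load(target);
    /* If *target changed since it was read, the exchange fails, old is
       refreshed with the current value, and the loop retries. */
    while (!atomic_compare_exchange_weak(target, &old, old - incr))
        ;
    return old;
}
```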
The prototypes for these intrinsics are in the ia64intrin.h header file.
Intrinsic Description
unsigned __int64 __getReg(const int whichReg)
Gets the value from a hardware register based on the index passed in. Produces a
corresponding mov = r instruction. Provides access to the following registers:
See Register Names for getReg() and setReg().

void __setReg(const int whichReg, unsigned __int64 value)
Sets the value for a hardware register based on the index passed in. Produces a
corresponding mov = r instruction.
See Register Names for getReg() and setReg().

unsigned __int64 __getIndReg(const int whichIndReg, __int64 index)
Returns the value of an indexed register. The index is the 2nd argument; the register
file is the first argument.

void __setIndReg(const int whichIndReg, __int64 index, unsigned __int64 value)
Copies a value into an indexed register. The index is the 2nd argument; the register
file is the first argument.

void *__ptr64 _rdteb(void)
Gets the TEB address. The TEB address is kept in r13 and maps to the move r=tp
instruction.

void __isrlz(void)
Executes the serialize instruction. Maps to the srlz.i instruction.

void __dsrlz(void)
Serializes the data. Maps to the srlz.d instruction.

unsigned __int64 __fetchadd4_acq(unsigned int *addend, const int increment)
Maps to the fetchadd4.acq instruction.

unsigned __int64 __fetchadd4_rel(unsigned int *addend, const int increment)
Maps to the fetchadd4.rel instruction.

unsigned __int64 __fetchadd8_acq(unsigned __int64 *addend, const int increment)
Maps to the fetchadd8.acq instruction.

unsigned __int64 __fetchadd8_rel(unsigned __int64 *addend, const int increment)
Maps to the fetchadd8.rel instruction.

void __fwb(void)
Flushes the write buffers. Maps to the fwb instruction.

void __ldfs(const int whichFloatReg, void *src)
Maps to the ldfs instruction. Loads a single precision value to the specified register.

void __ldfd(const int whichFloatReg, void *src)
Maps to the ldfd instruction. Loads a double precision value to the specified register.

void __ldfe(const int whichFloatReg, void *src)
Maps to the ldfe instruction. Loads an extended
void __ptrd(__int64 va, __int64 pagesz)
Purges the translation register. Maps to the ptr.d r, r instruction.
__int64 __tpa(__int64 va)
Map the tpa instruction.
void __invalat(void)
Invalidates ALAT. Maps to the invala instruction.
void __invala(void)
Same as void __invalat(void)
void __invala_gr(const int whichGeneralReg)
whichGeneralReg = 0-127
void __invala_fr(const int whichFloatReg)
whichFloatReg = 0-127
void __break(const int)
Generates a break instruction with an immediate.
void __nop(const int)
Generate a nop instruction.
void __debugbreak(void)
Generates a Debug Break Instruction fault.
void __fc(__int64)
Flushes a cache line associated with the address given by the argument. Maps to the fc
instruction.
void __sum(int mask)
Sets the user mask bits of PSR. Maps to the sum imm24 instruction.
void __rum(int mask)
Resets the user mask.
__int64 _ReturnAddress(void)
Get the caller's address.
void __lfetch(int lfhint, void *y)
Generate the lfetch.lfhint instruction. The value of the first argument specifies the
hint type.
void __lfetch_fault(int lfhint, void *y)
Generate the lfetch.fault.lfhint instruction. The value of the first argument specifies
the hint type.
void __lfetch_excl(int lfhint, void *y)
Generate the lfetch.excl.lfhint instruction. The value {0|1|2|3} of the first argument
specifies the hint type.
void __lfetch_fault_excl(int lfhint, void *y)
Generate the lfetch.fault.excl.lfhint instruction. The value of the first argument
specifies the hint type.
unsigned int __cacheSize(unsigned int cacheLevel)
__cacheSize(n) returns the size in bytes of the cache at level n. 1 represents the
first-level cache. 0 is returned for a non-existent cache level. For example, an
application may query the cache size and use it to select block sizes in algorithms that
operate on matrices.
void __memory_barrier(void)
Creates a barrier across which the compiler will not schedule any data access
instruction. The compiler may allocate local data in registers across a memory barrier,
but not global data.
void __ssm(int mask)
Sets the system mask. Maps to the ssm imm24 instruction.
void __rsm(int mask)
Resets the system mask bits of PSR. Maps to the rsm imm24 instruction.
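The __cacheSize description mentions choosing block sizes for matrix algorithms from the cache size. The following is a minimal sketch of that idea, not code from this reference: choose_tile_dim is a hypothetical helper, and the rule that three square tiles of doubles should fit in the cache is an assumption made for illustration. The cache size is passed in as a plain argument so the sketch compiles on any system.

```c
#include <stddef.h>

/* Hypothetical helper: pick a square tile dimension (in elements) such that
 * three tiles of doubles (as in C += A * B) fit in cache_bytes bytes.
 * Returns fallback when cache_bytes is 0, which is what __cacheSize reports
 * for a non-existent cache level. */
size_t choose_tile_dim(unsigned int cache_bytes, size_t fallback)
{
    size_t dim = 1;

    if (cache_bytes == 0)
        return fallback;

    /* Grow dim while three (dim+1) x (dim+1) tiles still fit. */
    while (3 * (dim + 1) * (dim + 1) * sizeof(double) <= cache_bytes)
        dim++;

    return dim;
}

/* On an Itanium-based system with ia64intrin.h, a call might look like:
 *     size_t tile = choose_tile_dim(__cacheSize(2), 32);
 */
```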
Conversion Intrinsics
The prototypes for these intrinsics are in the ia64intrin.h header file.
Intrinsic Description
__int64 _m_to_int64(__m64 a)
Convert a of type __m64 to type __int64. Translates to nop since both types reside in
the same register on Itanium-based systems.
__m64 _m_from_int64(__int64 a)
Convert a of type __int64 to type __m64. Translates to nop since both types reside in
the same register on Itanium-based systems.
__int64 __round_double_to_int64(double d)
Convert its double precision argument to a signed integer.
Name whichReg
_IA64_REG_IP 1016
_IA64_REG_PSR 1019
_IA64_REG_PSR_L 1019
Application Registers
Name whichReg
_IA64_REG_AR_KR0 3072
_IA64_REG_AR_KR1 3073
_IA64_REG_AR_KR2 3074
_IA64_REG_AR_KR3 3075
_IA64_REG_AR_KR4 3076
_IA64_REG_AR_KR5 3077
_IA64_REG_AR_KR6 3078
_IA64_REG_AR_KR7 3079
_IA64_REG_AR_RSC 3088
_IA64_REG_AR_BSP 3089
_IA64_REG_AR_BSPSTORE 3090
_IA64_REG_AR_RNAT 3091
_IA64_REG_AR_FCR 3093
_IA64_REG_AR_EFLAG 3096
_IA64_REG_AR_CSD 3097
_IA64_REG_AR_SSD 3098
_IA64_REG_AR_CFLAG 3099
_IA64_REG_AR_FSR 3100
_IA64_REG_AR_FIR 3101
_IA64_REG_AR_FDR 3102
_IA64_REG_AR_CCV 3104
_IA64_REG_AR_UNAT 3108
_IA64_REG_AR_FPSR 3112
_IA64_REG_AR_ITC 3116
_IA64_REG_AR_PFS 3136
_IA64_REG_AR_LC 3137
_IA64_REG_AR_EC 3138
Control Registers
Name whichReg
_IA64_REG_CR_DCR 4096
_IA64_REG_CR_ITM 4097
_IA64_REG_CR_IVA 4098
_IA64_REG_CR_PTA 4104
_IA64_REG_CR_IPSR 4112
_IA64_REG_CR_ISR 4113
_IA64_REG_CR_IIP 4115
_IA64_REG_CR_IFA 4116
_IA64_REG_CR_ITIR 4117
_IA64_REG_CR_IIPA 4118
_IA64_REG_CR_IFS 4119
_IA64_REG_CR_IIM 4120
_IA64_REG_CR_IHA 4121
_IA64_REG_CR_LID 4160
_IA64_REG_CR_IVR 4161 ^
_IA64_REG_CR_TPR 4162
_IA64_REG_CR_EOI 4163
_IA64_REG_CR_IRR0 4164 ^
_IA64_REG_CR_IRR1 4165 ^
_IA64_REG_CR_IRR2 4166 ^
_IA64_REG_CR_IRR3 4167 ^
_IA64_REG_CR_ITV 4168
_IA64_REG_CR_PMV 4169
_IA64_REG_CR_CMCV 4170
_IA64_REG_CR_LRR0 4176
_IA64_REG_CR_LRR1 4177
^ getReg only
_IA64_REG_INDR_PKR 9003
_IA64_REG_INDR_PMC 9004
_IA64_REG_INDR_PMD 9005
_IA64_REG_INDR_RR 9006
_IA64_REG_INDR_RESERVED 9007
^ getIndReg only
Multimedia Additions
The prototypes for these intrinsics are in the ia64intrin.h header file.
__int64 _m64_czx1l(__m64 a)
The 64-bit value a is scanned for a zero element from the most significant element to the
least significant element, and the index of the first zero element is returned. The element
width is 8 bits, so the range of the result is from 0 - 7. If no zero element is found, the
default result is 8.
__int64 _m64_czx1r(__m64 a)
The 64-bit value a is scanned for a zero element from the least significant element to the
most significant element, and the index of the first zero element is returned. The element
width is 8 bits, so the range of the result is from 0 - 7. If no zero element is found, the
default result is 8.
__int64 _m64_czx2l(__m64 a)
The 64-bit value a is scanned for a zero element from the most significant element to the
least significant element, and the index of the first zero element is returned. The element
width is 16 bits, so the range of the result is from 0 - 3. If no zero element is found, the
default result is 4.
__int64 _m64_czx2r(__m64 a)
The 64-bit value a is scanned for a zero element from the least significant element to the
most significant element, and the index of the first zero element is returned. The element
width is 16 bits, so the range of the result is from 0 - 3. If no zero element is found, the
default result is 4.
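The four scans above can be modeled in portable C. This is a sketch that follows only the descriptions given here, not the compiler's implementation. The model_* names are hypothetical, plain uint64_t stands in for __m64, and the sketch assumes that index 0 is the first element visited in the scan direction.

```c
#include <stdint.h>

/* Return the index of the first all-zero element of a, scanning from the
 * most significant element when from_left is non-zero, otherwise from the
 * least significant element. Returns the element count if none is zero. */
static int first_zero_element(uint64_t a, int width_bits, int from_left)
{
    int count = 64 / width_bits;
    uint64_t mask = (width_bits == 8) ? 0xFFu : 0xFFFFu;

    for (int i = 0; i < count; i++) {
        int shift = from_left ? (64 - width_bits * (i + 1)) : (width_bits * i);
        if (((a >> shift) & mask) == 0)
            return i;
    }
    return count;
}

/* Hypothetical portable stand-ins for the four intrinsics. */
int model_czx1l(uint64_t a) { return first_zero_element(a, 8, 1); }
int model_czx1r(uint64_t a) { return first_zero_element(a, 8, 0); }
int model_czx2l(uint64_t a) { return first_zero_element(a, 16, 1); }
int model_czx2r(uint64_t a) { return first_zero_element(a, 16, 0); }
```

A model like this can be handy for unit tests that run on a development host without Itanium hardware.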
Interleave 64-bit quantities a and b in 1-byte groups, starting from the left, as shown in
Figure 1, and return the result.
Interleave 64-bit quantities a and b in 1-byte groups, starting from the right, as shown in
Figure 2, and return the result.
Interleave 64-bit quantities a and b in 2-byte groups, starting from the left, as shown in
Figure 3, and return the result.
Interleave 64-bit quantities a and b in 2-byte groups, starting from the right, as shown in
Figure 4, and return the result.
Interleave 64-bit quantities a and b in 4-byte groups, starting from the left, as shown in
Figure 5, and return the result.
Interleave 64-bit quantities a and b in 4-byte groups, starting from the right, as shown in
Figure 6, and return the result.
The unsigned data elements (bytes) of b are subtracted from the unsigned data
elements (bytes) of a and the results of the subtraction are then each independently
shifted to the right by one position. The high-order bits of each element are filled with the
borrow bits of the subtraction.
The unsigned data elements (double bytes) of b are subtracted from the unsigned data
elements (double bytes) of a and the results of the subtraction are then each
independently shifted to the right by one position. The high-order bits of each element
are filled with the borrow bits of the subtraction.
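The byte-wide subtract-and-shift described above can be sketched in portable C as follows. The function name is hypothetical and uint64_t stands in for __m64. The sketch follows the wording of the description only: it forms a 9-bit difference whose top bit is the borrow, then shifts right by one. Any rounding the hardware instruction may apply is not modeled.

```c
#include <stdint.h>

/* Hypothetical model: per byte, compute a - b as a 9-bit two's complement
 * value (bit 8 is the borrow), then shift right by one so that the borrow
 * becomes the high-order bit of the 8-bit result. */
uint64_t model_avgsub_u8(uint64_t a, uint64_t b)
{
    uint64_t result = 0;

    for (int i = 0; i < 8; i++) {
        unsigned int ai = (unsigned int)((a >> (8 * i)) & 0xFFu);
        unsigned int bi = (unsigned int)((b >> (8 * i)) & 0xFFu);
        unsigned int diff9 = (ai - bi) & 0x1FFu;   /* 9 bits incl. borrow */

        result |= (uint64_t)(diff9 >> 1) << (8 * i);
    }
    return result;
}
```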
Two signed 16-bit data elements of a, starting with the most significant data element, are
multiplied by the corresponding two signed 16-bit data elements of b, and the two 32-bit
results are returned as shown in Figure 9.
Two signed 16-bit data elements of a, starting with the least significant data element, are
multiplied by the corresponding two signed 16-bit data elements of b, and the two 32-bit
results are returned as shown in Figure 10.
The four signed 16-bit data elements of a are multiplied by the corresponding signed
16-bit data elements of b, yielding four 32-bit products. Each product is then shifted to
the right by count bits, and the least significant 16 bits of each shifted product form
four 16-bit results, which are returned as one 64-bit word.
The four unsigned 16-bit data elements of a are multiplied by the corresponding
unsigned 16-bit data elements of b, yielding four 32-bit products. Each product is then
shifted to the right by count bits, and the least significant 16 bits of each shifted
product form four 16-bit results, which are returned as one 64-bit word.
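The signed and unsigned multiply-and-shift operations described above can be modeled in portable C. This is a sketch with hypothetical function names, with uint64_t standing in for __m64. It assumes count is at most 16, in which case the low 16 bits of the shifted 32-bit product are the same for a logical and an arithmetic shift.

```c
#include <stdint.h>

/* Hypothetical model of the signed form. Assumes count <= 16. */
uint64_t model_mpyshr2_signed(uint64_t a, uint64_t b, unsigned int count)
{
    uint64_t result = 0;

    for (int i = 0; i < 4; i++) {
        int32_t ai = (int16_t)(uint16_t)(a >> (16 * i));
        int32_t bi = (int16_t)(uint16_t)(b >> (16 * i));
        uint32_t bits = (uint32_t)(ai * bi);       /* 32-bit product */

        result |= (uint64_t)(uint16_t)(bits >> count) << (16 * i);
    }
    return result;
}

/* Hypothetical model of the unsigned form. Assumes count <= 16. */
uint64_t model_mpyshr2_unsigned(uint64_t a, uint64_t b, unsigned int count)
{
    uint64_t result = 0;

    for (int i = 0; i < 4; i++) {
        uint32_t ai = (uint16_t)(a >> (16 * i));
        uint32_t bi = (uint16_t)(b >> (16 * i));
        uint32_t bits = ai * bi;                   /* fits in 32 bits */

        result |= (uint64_t)(uint16_t)(bits >> count) << (16 * i);
    }
    return result;
}
```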
a is shifted to the left by count bits and then is added to b. The upper 32 bits of the result
are forced to 0, and then bits [31:30] of b are copied to bits [62:61] of the result. The
result is returned.
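The shift-and-add operation described above can be written out step by step in portable C. The function name is hypothetical, and the sketch implements only the three steps stated in the description (shift and add, clear the upper 32 bits, copy bits [31:30] of b to bits [62:61]). It assumes count is smaller than 64.

```c
#include <stdint.h>

/* Hypothetical model of the shift-left-and-add operation described above. */
uint64_t model_shladd_32(uint64_t a, unsigned int count, uint64_t b)
{
    /* Step 1 and 2: shift, add, and force the upper 32 bits to 0. */
    uint64_t sum = ((a << count) + b) & UINT64_C(0xFFFFFFFF);

    /* Step 3: copy bits [31:30] of b to bits [62:61] of the result. */
    uint64_t top_bits = (b >> 30) & 0x3u;

    return sum | (top_bits << 61);
}
```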
The four signed 16-bit data elements of a are each independently shifted to the right by
count bits (the high order bits of each element are filled with the initial value of the sign
bits of the data elements in a); they are then added to the four signed 16-bit data
elements of b. The result is returned.
a is added to b as four separate 16-bit wide elements. The elements of a are treated as
unsigned, while the elements of b are treated as signed. The results are treated as
unsigned and are returned as one 64-bit word.
a is subtracted from b as four separate 16-bit wide elements. The elements of a are
treated as unsigned, while the elements of b are treated as signed. The results are
treated as unsigned and are returned as one 64-bit word.
The unsigned byte-wide data elements of a are added to the unsigned byte-wide data
elements of b and the results of each add are then independently shifted to the right by
one position. The high-order bits of each element are filled with the carry bits of the
sums.
The unsigned 16-bit wide data elements of a are added to the unsigned 16-bit wide data
elements of b and the results of each add are then independently shifted to the right by
one position. The high-order bits of each element are filled with the carry bits of the
sums.
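The 16-bit add-and-shift described above can be sketched in portable C. The function name is hypothetical and uint64_t stands in for __m64. The sketch follows the wording here: a 17-bit sum whose top bit is the carry is shifted right by one. Any rounding the hardware instruction may apply is not modeled.

```c
#include <stdint.h>

/* Hypothetical model: per 16-bit element, form the 17-bit sum (bit 16 is
 * the carry) and shift right by one, so the carry becomes the high-order
 * bit of the 16-bit result. */
uint64_t model_avg_u16(uint64_t a, uint64_t b)
{
    uint64_t result = 0;

    for (int i = 0; i < 4; i++) {
        uint32_t ai = (uint16_t)(a >> (16 * i));
        uint32_t bi = (uint16_t)(b >> (16 * i));
        uint32_t sum17 = ai + bi;

        result |= (uint64_t)(sum17 >> 1) << (16 * i);
    }
    return result;
}
```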
Synchronization Primitives
The synchronization primitive intrinsics provide a variety of operations. Besides
performing these operations, each intrinsic has two key properties:
Miscellaneous Intrinsics
void* __get_return_address(unsigned int level);
This intrinsic yields the return address of the current function. The level argument must
be a constant value. A value of 0 yields the return address of the current function. Any
other value yields a zero return address. On Linux systems, this intrinsic is synonymous
with __builtin_return_address. The name and the argument are provided for
compatibility with gcc*.
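Because the text above states that this intrinsic is synonymous with __builtin_return_address on Linux, the behavior can be tried with the gcc builtin directly. The following is a small sketch assuming a gcc-compatible compiler. The function name caller_address is hypothetical, and the noinline attribute is used so that the function keeps its own return address.

```c
#include <stddef.h>

/* Returns the address this function will return to, that is, an address
 * inside its caller. The level argument is the constant 0, as required. */
__attribute__((noinline)) void *caller_address(void)
{
    return __builtin_return_address(0);
}
```

A typical use is logging or tracing, where the returned address is printed with the %p format and resolved to a symbol later.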
This intrinsic overwrites the default return address of the current function with the
address indicated by its argument. On return from the current invocation, program
execution continues at the address provided.
This intrinsic returns the frame address of the current function. The level argument
must be a constant value. A value of 0 yields the frame address of the current function.
Any other value yields a zero return value. On Linux systems, this intrinsic is
synonymous with __builtin_frame_address. The name and the argument are
provided for compatibility with gcc.
The Dual-Core Intel® Itanium® 2 Processor 9000 Sequence processor supports the
intrinsics listed in the table below.
These intrinsics each generate Itanium instructions. The first alpha-numerical chain in
the intrinsic name represents the return type, and the second alpha-numerical chain in
the intrinsic name represents the instruction the intrinsic generates. For example, the
intrinsic _int64_cmp8xchg generates the _int64 return type and the cmp8xchg Itanium
instruction.
For more information about the instructions these intrinsics generate, please see the
Intel Itanium Architecture Software Developer Manual, Volume 3: Instruction Set
Reference in the documentation area of the Itanium 2 processor website.
Note
Calling these intrinsics on any previous Itanium® processor causes an illegal instruction
fault.
Generates the 16-byte form of the Itanium compare and exchange instruction.
Returns the original 64-bit value read from memory at the specified address.
The following table describes each implicit argument for this intrinsic.
xchg_hi
Highest 8 bytes of the exchange value. Use the __setReg intrinsic to set the <xchg_hi>
value in the register AR[CSD]. [__setReg(_IA64_REG_AR_CSD, <xchg_hi>);]
cmpnd
The 64-bit compare value. Use the __setReg intrinsic to set the <cmpnd> value in the
register AR[CCV]. [__setReg(_IA64_REG_AR_CCV, <cmpnd>);]
Generates the Itanium instruction that loads 16 bytes from the given address.
Returns the lower 8 bytes of the quantity loaded from <addr>. The higher 8 bytes are
loaded in register AR[CSD].
Generates implicit return of the higher 8 bytes to the register AR[CSD]. You can use the
__getReg intrinsic to copy the value into a user variable. [foo =
__getReg(_IA64_REG_AR_CSD);]
Generates the Itanium instruction that flushes the cache line associated with the
specified address and ensures coherency between instruction cache and data cache.
cache_line
An address associated with the cache line you want to flush
Generates the Itanium instruction that provides performance hints about the program
being executed.
hint_value
A literal value that specifies the hint. Currently, zero is the only legal value. __hint(0)
generates the Itanium hint@pause instruction.
The following table describes the implicit argument for this intrinsic.
src_hi
The highest 8 bytes of the 16-byte value to store. Use the __setReg intrinsic to set the
<src_hi> value in the register AR[CSD]. [__setReg(_IA64_REG_AR_CSD, <src_hi>);]
Examples
The following examples show how to use the intrinsics listed above to generate the
corresponding instructions. In all cases, use the __setReg (resp. __getReg) intrinsic to
set up implicit arguments (resp. retrieve implicit return values).
// file foo.c
//
#include <ia64intrin.h>
/**/
// The following two calls load the 16-byte value at the given address
// The call to __getReg moves that value into a user variable (hi).
// ld16 Ra,ar.csd=[Rb]
*hi = __getReg(_IA64_REG_AR_CSD);
/**/
/**/
// This is the same as the previous example, except that it uses the
// ld16.acq Ra,ar.csd=[Rb]
//
*hi = __getReg(_IA64_REG_AR_CSD);
/**/
/**/
// first set the highest 64-bits into CSD register. Then call
//
__setReg(_IA64_REG_AR_CSD, hi);
/**/
__int64 old_value;
/**/
// set the highest bits of the exchange value and the comparand value
//
__setReg(_IA64_REG_AR_CSD, xchg_hi);
__setReg(_IA64_REG_AR_CCV, cmpnd);
/**/
return old_value;
// end foo.c
Generates the Itanium instruction that atomically reads 128 bits from the memory
location.
Source
Pointer to the 128-bit Source value
DestinationHigh
Pointer to the location in memory that stores the highest 64 bits of the 128-bit loaded
value
Generates the Itanium instruction that atomically reads 128 bits from the memory
location. Same as __load128, but this intrinsic uses acquire semantics.
Source
Pointer to the 128-bit Source value
DestinationHigh
Pointer to the location in memory that stores the highest 64 bits of the 128-bit loaded
value
Generates the Itanium instruction that atomically stores 128 bits at the destination
memory location.
No returns.
Generates the Itanium instruction that atomically stores 128 bits at the destination
memory location. Same as __store128, but this intrinsic uses release semantics.
No returns.
• Alignment Support
• Allocating and Freeing Aligned Memory Blocks
• Inline Assembly
Alignment Support
Aligning data improves the performance of intrinsics. When using the Streaming SIMD
Extensions, you should align data to 16 bytes in memory operations. Specifically, you
must align __m128 objects as addresses passed to the _mm_load and _mm_store
intrinsics. If you want to declare arrays of floats and treat them as __m128 objects by
casting, you need to ensure that the float arrays are properly aligned.
Use __declspec(align) to direct the compiler to align data more strictly than it
otherwise would. For example, a data object of type int is allocated at a byte address
which is a multiple of 4 by default. However, by using __declspec(align), you can
direct the compiler to instead use an address which is a multiple of 8, 16, or 32 with the
following restriction on IA-32:
You can use this data alignment support as an advantage in optimizing cache line
usage. By clustering small objects that are commonly used together into a struct, and
forcing the struct to be allocated at the beginning of a cache line, you can effectively
guarantee that each object is loaded into the cache as soon as any one is accessed,
resulting in a significant performance benefit.
align(n)
Caution
Note
If a value is specified that is less than the alignment of the affected data type, it has no
effect. In other words, data is aligned to the maximum of its own alignment or the
alignment specified with __declspec(align).
You can request alignments for individual variables, whether of static or automatic
storage duration. (Global and static variables have static storage duration; local
variables have automatic storage duration by default.) You cannot adjust the alignment
of a parameter, nor a field of a struct or class. You can, however, increase the
alignment of a struct (or union or class), in which case every object of that type is
affected.
As an example, suppose that a function uses local variables i and j as subscripts into a
2-dimensional array. They might be declared as follows:
int i, j;
These variables are commonly used together. But they can fall in different cache lines,
which could be detrimental to performance. You can instead declare them as follows:
The compiler now ensures that they are allocated in the same cache line. In C++, you
can omit the struct variable name (written as sub in the previous example). In C,
however, it is required, and you must write references to i and j as sub.i and sub.j.
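The idea of keeping i and j together in one aligned struct can be sketched as follows. This sketch uses the standard C11 _Alignas specifier as a portable stand-in for __declspec(align(16)), which is what the text describes. The alignment value 16, the type name sub_pair, and the helper is_aligned16 are assumptions chosen for illustration.

```c
#include <stdint.h>

/* A single object: the unnamed struct variable sub is 16-byte aligned,
 * so sub.i and sub.j share one 16-byte block. */
_Alignas(16) struct { int i, j; } sub;

/* A reusable type: aligning the first member raises the alignment of the
 * whole struct, so every object of type struct sub_pair is affected. */
struct sub_pair {
    _Alignas(16) int i;
    int j;
};

/* Hypothetical helper: non-zero if p is on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}
```

With either form, both subscripts fall within the same 16-byte block, and therefore within the same cache line on any processor whose cache lines are at least 16 bytes long.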
If you use many functions with such subscript pairs, it is more convenient to declare and
use a struct type for them, as in the following example:
By placing the __declspec(align) after the keyword struct, you are requesting the
appropriate alignment for all objects of that type. Note that allocation of parameters is
The _mm_malloc routine takes an extra parameter, which is the alignment constraint.
This constraint must be a power of two. The pointer that is returned from _mm_malloc is
guaranteed to be aligned on the specified boundary.
Note
Memory that is allocated using _mm_malloc must be freed using _mm_free. Calling
free on memory allocated with _mm_malloc or calling _mm_free on memory allocated
with malloc will cause unpredictable behavior.
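A short sketch of the pairing rule above, assuming an IA-32 or Intel 64 target and a compiler whose xmmintrin.h declares _mm_malloc and _mm_free. The helper names alloc_floats_aligned16 and sum4_aligned are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>   /* _mm_malloc, _mm_free, _mm_load_ps */

/* Allocate count floats on a 16-byte boundary. The caller must release
 * the block with _mm_free, never with free. */
float *alloc_floats_aligned16(size_t count)
{
    return (float *)_mm_malloc(count * sizeof(float), 16);
}

/* Sum four floats with an aligned load. p must be 16-byte aligned. */
float sum4_aligned(const float *p)
{
    float out[4];
    __m128 v = _mm_load_ps(p);       /* requires 16-byte alignment */

    _mm_storeu_ps(out, v);
    return out[0] + out[1] + out[2] + out[3];
}
```

A caller would allocate with alloc_floats_aligned16(4), fill the four elements, pass the pointer to sum4_aligned, and finally call _mm_free on the same pointer.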
Inline Assembly
By default, the compiler inlines a number of standard C, C++, and math library functions.
This usually results in faster execution of your program.
Sometimes inline expansion of library functions can cause unexpected results. The
inlined library functions do not set the errno variable. So, in code that relies upon the
setting of the errno variable, you should use the -nolib_inline option, which turns off
inline expansion of library functions. Also, if one of your functions has the same name as
one of the compiler's supplied library functions, the compiler assumes that it is one of the
latter and replaces the call with the inlined version. Consequently, if the program defines
a function with the same name as one of the known library routines, you must use the
-nolib_inline option to ensure that the program's function is the one used.
Note
Automatic inline expansion of library functions is not related to the inline expansion that
the compiler does during interprocedural optimizations. For example, the following
command compiles the program sum.c without expanding the library functions, but with
inline expansion from interprocedural optimizations (IPO):
Caution
The Intel C++ Compiler does not support mixing UNIX and MASM style asms.
Syntax Element Description
asm-keyword
asm statements begin with the keyword asm. Alternatively, either __asm or __asm__ may be
used for compatibility. See Caution statement.
volatile-keyword
If the optional keyword volatile is given, the asm is volatile. Two volatile asm
statements will never be moved past each other, and a reference to a volatile variable
will not be moved relative to a volatile asm. Alternate keywords __volatile and
__volatile__ may be used for compatibility.
asm-template
The asm-template is a C language ASCII string which specifies how to output the assembly
code for an instruction. Most of the template is a fixed string; everything but the
substitution-directives, if any, is passed through to the assembler. The syntax for a
substitution directive is a % followed by one or two characters. The supported
substitution directives are specified in a subsequent section.
asm-interface
The asm-interface consists of three parts:
1. an optional output-list
2. an optional input-list
3. an optional clobber-list
These are separated by colon (:) characters. If the output-list is missing, but an
input-list is given, the input list may be preceded by two colons (::) to take the place
of the missing output-list. If the asm-interface is omitted altogether, the asm statement
is considered volatile regardless of whether a volatile-keyword was specified.
output-list
An output-list consists of one or more output-specs separated by commas. For the
purposes of substitution in the asm-template, each output-spec is numbered. The first
operand in the output-list is numbered 0, the second is 1, and so on. Numbering is
continuous through the output-list and into the input-list. The total number of operands
is limited to 30 (i.e. 0-29).
input-list
Similar to an output-list, an input-list consists of one or more input-specs separated by
commas. For the purposes of substitution in the asm-template, each input-spec is
numbered, with the numbers continuing from those in the output-list.
clobber-list
A clobber-list tells the compiler that the asm uses or changes a specific machine
register that is either coded directly into the asm or is changed implicitly by the
assembly instruction. The clobber-list is a comma-separated list of clobber-specs.
input-spec
The input-specs tell the compiler about expressions whose values may be needed by the
inserted assembly instruction. In order to describe fully the input requirements of the
asm, you can list input-specs that are not actually referenced in the asm-template.
clobber-spec
Each clobber-spec specifies the name of a single machine register that is clobbered. The
register name may optionally be preceded by a %. You can specify any valid machine
register name. It is also legal to specify "memory" in a clobber-spec. This prevents the
compiler from keeping data cached in registers across the asm statement.
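The pieces of the asm-interface can be seen together in a small sketch. It assumes a gcc-compatible compiler and GNU-style constraint strings ("=r", "r", and the matching constraint "0"), which are not defined in the table above. The templates are deliberately empty so that no machine instruction is emitted and the sketch does not depend on a particular processor. All function names are hypothetical.

```c
/* Output-list and input-list: operand 0 is the output. The input uses the
 * matching constraint "0", so it is placed in the same location as operand
 * 0, and the empty template leaves the value unchanged. */
long pass_through_asm(long in)
{
    long out;

    __asm__ __volatile__("" : "=r"(out) : "0"(in));
    return out;
}

/* Missing output-list: two colons take its place before the input-list. */
void consume_value(long value)
{
    __asm__ __volatile__("" :: "r"(value));
}

/* Clobber-list only: "memory" keeps the compiler from holding data cached
 * in registers across the asm statement. */
void compiler_barrier(void)
{
    __asm__ __volatile__("" : : : "memory");
}
```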
• Intrinsics may generate code that does not run on all IA processors. You should
therefore use CPUID to detect the processor and generate the appropriate code.
• Implement intrinsics by processor family, not by specific processor. The guiding
principle for which family -- IA-32 or Itanium® processors -- the intrinsic is
implemented on is performance, not compatibility. Where there is added
performance on both families, the intrinsic will be identical.
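The CPUID check mentioned in the first bullet can be sketched as follows. This sketch assumes a gcc-compatible compiler that provides the cpuid.h helper header, which is not part of this reference. The function names are hypothetical. The feature bit positions (MMX 23, SSE 25, SSE2 26 in EDX of CPUID leaf 1) follow the IA-32 architecture definition. On targets other than IA-32 and Intel 64 the functions report 0.

```c
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* gcc/clang helper header providing __get_cpuid */
#endif

/* Returns 1 if CPUID leaf 1 reports the given feature bit in EDX, else 0. */
int cpu_has_edx_feature(unsigned int bit)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((edx >> bit) & 1u);
#else
    (void)bit;        /* CPUID is not available on this target */
    return 0;
#endif
}

int cpu_has_mmx(void)  { return cpu_has_edx_feature(23); }
int cpu_has_sse(void)  { return cpu_has_edx_feature(25); }
int cpu_has_sse2(void) { return cpu_has_edx_feature(26); }
```

A program would call one of these functions once at startup and select, for example, between a routine written with Streaming SIMD Extensions 2 intrinsics and a plain C fallback.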
int abs(int)
long labs(long)
unsigned long __lrotl(unsigned long value, int shift)
unsigned long __lrotr(unsigned long value, int shift)
unsigned int __rotl(unsigned int value, int shift)
unsigned int __rotr(unsigned int value, int shift)
__int64 __i64_rotl(__int64 value, int shift)
__int64 __i64_rotr(__int64 value, int shift)
double fabs(double)
double log(double)
float logf(float)
double log10(double)
float log10f(float)
double exp(double)
float expf(float)
double pow(double, double)
float powf(float, float)
double sin(double)
float sinf(float)
double cos(double)
float cosf(float)
double tan(double)
float tanf(float)
double acos(double)
float acosf(float)
double acosh(double)
float acoshf(float)
double asin(double)
float asinf(float)
double asinh(double)
float asinhf(float)
double atan(double)
float atanf(float)
double atanh(double)
float atanhf(float)
float cabs(double)*
double ceil(double)
float ceilf(float)
double cosh(double)
float coshf(float)
float fabsf(float)
double floor(double)
float floorf(float)
double fmod(double, double)
float fmodf(float, float)
double hypot(double, double)
float hypotf(float, float)
double rint(double)
float rintf(float)
double sinh(double)
float sinhf(float)
float sqrtf(float)
double tanh(double)
float tanhf(float)
char *_strset(char *, _int32)
int memcmp(const void *cs, const void *ct, size_t n)
void *memcpy(void *s, const void *ct, size_t n)
void *memset(void * s, int c, size_t n)
char *strcat(char * s, const char * ct)
int strcmp(const char *, const char *)
char *strcpy(char * s, const char * ct)
size_t strlen(const char * cs)
int strncmp(char *, char *, int)
char *strncpy(char *, char *, int)
void *__alloca(int)
int _setjmp(jmp_buf)
_exception_code(void)
_exception_info(void)
_abnormal_termination(void)
void _enable()
void _disable()
int _bswap(int)
int _in_byte(int)
int _in_dword(int)
int _in_word(int)
int _inp(int)
int _inpd(int)
int _inpw(int)
int _out_byte(int, int)
int _out_dword(int, int)
int _out_word(int, int)
int _outp(int, int)
int _outpd(int, int)
int _outpw(int, int)
Streaming SIMD Extensions
Streaming SIMD Extensions 2
_mm_empty A B
_mm_cvtsi32_si64 A A
_mm_cvtsi64_si32 A A
_mm_packs_pi16 A A
_mm_packs_pi32 A A
_mm_packs_pu16 A A
_mm_unpackhi_pi8 A A
_mm_unpackhi_pi16 A A
_mm_unpackhi_pi32 A A
_mm_unpacklo_pi8 A A
_mm_unpacklo_pi16 A A
_mm_unpacklo_pi32 A A
_mm_add_pi8 A A
_mm_add_pi16 A A
_mm_add_pi32 A A
_mm_adds_pi8 A A
_mm_adds_pi16 A A
_mm_adds_pu8 A A
_mm_adds_pu16 A A
_mm_sub_pi8 A A
_mm_sub_pi16 A A
_mm_sub_pi32 A A
_mm_subs_pi8 A A
_mm_subs_pi16 A A
_mm_subs_pu8 A A
_mm_subs_pu16 A A
_mm_madd_pi16 A C
_mm_mulhi_pi16 A A
_mm_mullo_pi16 A A
_mm_sll_pi16 A A
_mm_slli_pi16 A A
_mm_sll_pi32 A A
_mm_slli_pi32 A A
_mm_sll_pi64 A A
_mm_slli_pi64 A A
_mm_sra_pi16 A A
_mm_srai_pi16 A A
_mm_sra_pi32 A A
_mm_srai_pi32 A A
_mm_srl_pi16 A A
_mm_srli_pi16 A A
_mm_srl_pi32 A A
_mm_srli_pi32 A A
_mm_srl_si64 A A
_mm_srli_si64 A A
_mm_and_si64 A A
_mm_andnot_si64 A A
_mm_or_si64 A A
_mm_xor_si64 A A
_mm_cmpeq_pi8 A A
_mm_cmpeq_pi16 A A
_mm_cmpeq_pi32 A A
_mm_cmpgt_pi8 A A
_mm_cmpgt_pi16 A A
_mm_cmpgt_pi32 A A
_mm_setzero_si64 A A
_mm_set_pi32 A A
_mm_set_pi16 A C
_mm_set_pi8 A C
_mm_set1_pi32 A A
_mm_set1_pi16 A A
_mm_set1_pi8 A A
_mm_setr_pi32 A A
_mm_setr_pi16 A C
_mm_setr_pi8 A C
Streaming SIMD Extensions 2
_mm_add_ss N/A B B
_mm_add_ps N/A A A
_mm_sub_ss N/A B B
_mm_sub_ps N/A A A
_mm_mul_ss N/A B B
_mm_mul_ps N/A A A
_mm_div_ss N/A B B
_mm_div_ps N/A A A
_mm_sqrt_ss N/A B B
_mm_sqrt_ps N/A A A
_mm_rcp_ss N/A B B
_mm_rcp_ps N/A A A
_mm_rsqrt_ss N/A B B
_mm_rsqrt_ps N/A A A
_mm_min_ss N/A B B
_mm_min_ps N/A A A
_mm_max_ss N/A B B
_mm_max_ps N/A A A
_mm_and_ps N/A A A
_mm_andnot_ps N/A A A
_mm_or_ps N/A A A
_mm_xor_ps N/A A A
_mm_cmpeq_ss N/A B B
_mm_cmpeq_ps N/A A A
_mm_cmplt_ss N/A B B
_mm_cmplt_ps N/A A A
_mm_cmple_ss N/A B B
_mm_cmple_ps N/A A A
_mm_cmpgt_ss N/A B B
_mm_cmpgt_ps N/A A A
_mm_cmpge_ss N/A B B
_mm_cmpge_ps N/A A A
_mm_cmpneq_ss N/A B B
_mm_cmpneq_ps N/A A A
_mm_cmpnlt_ss N/A B B
_mm_cmpnlt_ps N/A A A
_mm_cmpnle_ss N/A B B
_mm_cmpnle_ps N/A A A
_mm_cmpngt_ss N/A B B
_mm_cmpngt_ps N/A A A
_mm_cmpnge_ss N/A B B
_mm_cmpnge_ps N/A A A
_mm_cmpord_ss N/A B B
_mm_cmpord_ps N/A A A
_mm_cmpunord_ss N/A B B
_mm_cmpunord_ps N/A A A
_mm_comieq_ss N/A B B
_mm_comilt_ss N/A B B
_mm_comile_ss N/A B B
_mm_comigt_ss N/A B B
_mm_comige_ss N/A B B
_mm_comineq_ss N/A B B
_mm_ucomieq_ss N/A B B
_mm_ucomilt_ss N/A B B
_mm_ucomile_ss N/A B B
_mm_ucomigt_ss N/A B B
_mm_ucomige_ss N/A B B
_mm_ucomineq_ss N/A B B
_mm_cvtss_si32 N/A A B
_mm_cvtps_pi32 N/A A A
_mm_cvttss_si32 N/A A B
_mm_cvttps_pi32 N/A A A
_mm_cvtsi32_ss N/A A B
_mm_cvtpi32_ps N/A A C
_mm_cvtpi16_ps N/A A C
_mm_cvtpu16_ps N/A A C
_mm_cvtpi8_ps N/A A C
_mm_cvtpu8_ps N/A A C
_mm_cvtpi32x2_ps N/A A C
_mm_cvtps_pi16 N/A A C
_mm_cvtps_pi8 N/A A C
_mm_move_ss N/A A A
_mm_shuffle_ps N/A A A
_mm_unpackhi_ps N/A A A
_mm_unpacklo_ps N/A A A
_mm_movehl_ps N/A A A
_mm_movelh_ps N/A A A
_mm_movemask_ps N/A A C
_mm_getcsr N/A A A
_mm_setcsr N/A A A
_mm_loadh_pi N/A A A
_mm_loadl_pi N/A A A
_mm_load_ss N/A A B
_mm_load1_ps N/A A A
_mm_load_ps N/A A A
_mm_loadu_ps N/A A A
_mm_loadr_ps N/A A A
_mm_storeh_pi N/A A A
_mm_storel_pi N/A A A
_mm_store_ss N/A A A
_mm_store_ps N/A A A
_mm_store1_ps N/A A A
_mm_storeu_ps N/A A A
_mm_storer_ps N/A A A
_mm_set_ss N/A A A
_mm_set1_ps N/A A A
_mm_set_ps N/A A A
_mm_setr_ps N/A A A
_mm_setzero_ps N/A A A
_mm_prefetch N/A A A
_mm_stream_pi N/A A A
_mm_stream_ps N/A A A
_mm_sfence N/A A A
_mm_extract_pi16 N/A A A
_mm_insert_pi16 N/A A A
_mm_max_pi16 N/A A A
_mm_max_pu8 N/A A A
_mm_min_pi16 N/A A A
_mm_min_pu8 N/A A A
_mm_movemask_pi8 N/A A C
_mm_mulhi_pu16 N/A A A
_mm_shuffle_pi16 N/A A A
_mm_maskmove_si64 N/A A C
_mm_avg_pu8 N/A A A
_mm_avg_pu16 N/A A A
_mm_sad_pu8 N/A A A
On processors that do not support SSE2 instructions but do support MMX Technology,
you can use the sse2mmx.h emulation pack to enable support for SSE2 instructions.
You can use the sse2mmx.h header file for the following processors:
• Itanium® Processor
• Pentium® III Processor
• Pentium® II Processor
• Pentium® with MMX™ Technology