Introduction
I recently had to use Intel’s SIMD intrinsics to program SIMD parallelism on the CPU for some performance benchmarking. I compared the time to compute a mathematical algorithm in serial on the CPU, on the CPU using SIMD extensions, and on the GPU using OpenCL and OpenGL. Surprisingly, this was the first time I had used SIMD intrinsics, even though I have spent a lot of time studying and working on parallelism and high-performance programs. I was surprised at the lack of accessible introductory guides and tutorials to get started on this.
There are a couple of alternative methods for programming SIMD-capable CPU hardware, classified as implicit and explicit methods. Implicit methods are abstracted from the SIMD process itself. An example would be Intel’s Math Kernel Library, which provides C functions for maths that you simply call, with the SIMD manipulations taking place inside the library. The limitation is that you are restricted to the functionality the library makes available, and more specific or complex things will be missing. Explicit methods require the programmer to understand the SIMD architecture. An example would be OpenMP’s SIMD capabilities, shown in the sketch below: although you are not expected to instruct the compiler how to vectorize directly, you still need to understand how the data structures you provide will be vectorized and processed separately. The most explicit, low-level power comes from using the compiler’s intrinsics directly. This is what I’ll be covering, but for a bit more describing the different methods, see here.
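To give a flavour of the OpenMP approach, here is a minimal sketch (the array b and length n are just placeholders); the pragma asks the compiler to generate the vector instructions itself:
// Compile with OpenMP support, e.g. gcc -fopenmp
// The compiler produces the SIMD loads, multiplies and stores for us.
#pragma omp simd
for (int i = 0; i < n; i++)
    b[i] *= 2.0f;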
There are a collection of different SIMD technologies which are commonly found on CPUs:
- SSE/2/3/4 - An earlier instruction set designed by Intel, with support found in the majority of CPUs on the market. Uses 128-bit registers.
- AVX/AVX-512 - A newer instruction set, introduced by Intel. Provides larger register sizes of 256 bits and up to 512 bits. Typically considered faster than SSE.
Whichever of the SIMD extension technologies you want to target, the CPU must support it.
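As a quick sanity check, GCC and Clang expose builtins for querying CPU feature support at run time. A minimal sketch, using the common feature strings for these extensions:
#include <stdio.h>

int main(void)
{
    __builtin_cpu_init(); // initialise the CPU feature detection data
    printf("SSE4.2:  %d\n", __builtin_cpu_supports("sse4.2"));
    printf("AVX:     %d\n", __builtin_cpu_supports("avx"));
    printf("AVX512F: %d\n", __builtin_cpu_supports("avx512f"));
    return 0;
}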
Intrinsics are functions available for use in a programming language that are handled specially by the compiler; typically each one maps more or less directly to a specific machine instruction.
Vectorising manually does require a little more thought and closer consideration of the SIMD process; I would say even more so than using GPGPU frameworks, and certainly more than implicit SIMD methods.
Vectorization
An important part of satisfying SIMD requirements is that the data to be processed must be structured correctly. You cannot hand a SIMD processor an arbitrary array of floats and expect it to work out how to partition the array into portions the same width as the SIMD registers. The programmer must explicitly partition the array into sections that fit across the SIMD registers, which can then be processed in parallel and written back out to the same partitioned array.
Take for example a situation where a list of floats will be processed using a SIMD extension device with a width of 128 bits. Typically a float is defined as 32 bits on a system, so this 128-bit SIMD processor can accept 4 floats at a time. This means that the standard array of floats within the program must be loaded into vectors of 4 floats at a time. The SIMD processor can then apply a vector operation on all 4 floats at once. The resulting vector can then be stored back into an array or used in a following calculation.
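As a concrete picture, a 128-bit SSE register holds 4 packed 32-bit floats. A minimal sketch (note that _mm_set_ps takes its arguments from the highest lane down to the lowest):
#include <xmmintrin.h>

// One 128-bit register viewed as 4 packed floats:
// [ 4.0f | 3.0f | 2.0f | 1.0f ]
__m128 v = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);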
Example Program
Here we’ll cover a simple program which utilizes Intel’s SIMD technology. The full program can be found here.
There are much better ways to write this example with better performance, but I felt that the way I wrote it breaks the process up for learning. I might come back and add the ideal version of this example later.
Intel’s Intrinsics Guide is the best reference for all the SIMD technology available. You can filter the functions you’re looking for, and honestly the descriptions are pretty self-explanatory. So far it is all I’ve needed to work out how to use SIMD intrinsics. In more advanced cases, you might be required to search for other documents or examples elsewhere.
Constant Values
#include <xmmintrin.h> // SSE intrinsics

// Load the value 2 into every element of a vector
const float two = 2.0f;
__m128 scalarTwo = _mm_load_ps1(&two);
To use a constant value within vector calculations, a vector filled with that value must be created. The code above uses the conveniently defined _mm_load_ps1(), which takes the address of a float and loads it into all four elements of the vector it returns. This is stored and can then be used in later calculations.
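An equivalent way to do this, which skips the intermediate variable, is the SSE broadcast intrinsic, so this is interchangeable with the version above:
// Broadcast the constant 2.0f into all 4 lanes directly.
__m128 scalarTwo = _mm_set1_ps(2.0f);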
Increment over array
for (int i = 0; i + 4 <= size; i += 4) // step by one full vector of 4 floats
{
...
}
Here the calculations will take place on the vectors. Because the SIMD process handles 4 floats at a time, the index increments by 4 each iteration. The condition i + 4 <= size ensures the final vector stays within the array; any remainder of fewer than 4 elements would need separate scalar handling (more on this under Handling Memory below).
// Load 4 floats from the array into a vector
__m128 bVector = _mm_load_ps(b + i);
Here b is a pointer to the beginning of the array to be processed. Adding i to it gives the address of the current position in the array. This address is passed into _mm_load_ps(), which loads the float at that address and the next 3 floats into a special vector of 4 floats that can be used in the SIMD operator functions. Note that _mm_load_ps() requires the address to be 16-byte aligned; _mm_loadu_ps() is the unaligned variant.
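Since alignment matters here, one way to guarantee it is the aligned allocator that ships with the intrinsics headers. A minimal sketch; the variable names are just for illustration:
#include <xmmintrin.h>

// Allocate 'size' floats on a 16-byte boundary so _mm_load_ps/_mm_store_ps are safe.
float *b = _mm_malloc(size * sizeof(float), 16);
...
_mm_free(b); // must be freed with _mm_free, not free()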
// Vector multiply the array values with the scalar of value 2
bVector = _mm_mul_ps(bVector, scalarTwo);
Here _mm_mul_ps() takes two vectors and multiplies them element-wise. The result is a vector of 4 floats which is stored back into bVector.
// Store the result from the vector back into the array
_mm_store_ps(b + i, bVector);
With the result of the multiplication (which simply doubles every element) held in bVector, _mm_store_ps() takes bVector and stores the 4 values within it into memory, beginning at the starting address passed to the function. Like _mm_load_ps(), it requires a 16-byte-aligned address.
After all iterations, every value in the array has been loaded out into vectors, multiplied by 2, then stored back into the array, which is now doubled.
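Putting the pieces together, here is a minimal sketch of the whole doubling loop, assuming b points to a 16-byte-aligned array whose length is a multiple of 4:
#include <xmmintrin.h>

// Double every element of b in place.
void double_array(float *b, int size)
{
    const float two = 2.0f;
    const __m128 scalarTwo = _mm_load_ps1(&two);  // broadcast 2.0f into all 4 lanes
    for (int i = 0; i + 4 <= size; i += 4)
    {
        __m128 bVector = _mm_load_ps(b + i);      // load 4 floats
        bVector = _mm_mul_ps(bVector, scalarTwo); // multiply each by 2
        _mm_store_ps(b + i, bVector);             // store 4 floats back
    }
}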
Intrinsic Function Pattern
You should be able to start to see the pattern the function names follow. The first part, _mm, describes the type of SIMD extension instructions being used; in this case it refers to basic 128-bit SSE. The _mul is the type of operation being performed, in this case multiplication. The final part, _ps, describes that the function operates on packed single-precision floating-point values.
So to make this clear, another completely different function, _mm256_add_pd(), is an AVX instruction which adds together two 256-bit vectors of packed double-precision floating-point values.
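To show the pattern in action, here is a minimal sketch using that AVX function; a 256-bit vector holds 4 doubles:
#include <immintrin.h> // AVX intrinsics

__m256d a = _mm256_set1_pd(1.5);  // [1.5, 1.5, 1.5, 1.5]
__m256d b = _mm256_set1_pd(2.5);  // [2.5, 2.5, 2.5, 2.5]
__m256d c = _mm256_add_pd(a, b);  // each of the 4 lanes now holds 4.0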
Handling Memory
An important thing to be aware of is that you are handling pointers to memory with full control. You need to be aware of what memory you are accessing, and most importantly what you write to. For example, if you write a loop which doesn’t take into account that the vector operation will write to areas of memory beyond the array, crucial data could be overwritten. This can easily cause strange behaviour and even a system crash! (As I experienced myself.)
const int size = 22;
float array[size];
for (int i = 0; i < size; i += 4)
{
_mm_store_ps(array + i, scalarTwo); // (size % 4) != 0 - Therefore undefined behaviour on the last iteration!!!
}
So here, an array of 22 floats is created. On the last iteration, i is 20, so _mm_store_ps() stores 4 floats starting at index 20: values are written to positions 20, 21, 22 and 23. The last two positions written to are not included in the array and could hold important data in the system, which is now overwritten.
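One common way to avoid this is to process only the full 4-float vectors with SIMD and handle the remainder with an ordinary scalar loop. A minimal sketch:
int i = 0;
for (; i + 4 <= size; i += 4)   // full 4-float vectors only (here: i = 0, 4, 8, 12, 16)
    _mm_store_ps(array + i, scalarTwo);
for (; i < size; i++)           // scalar tail for the remaining (size % 4) elements
    array[i] = 2.0f;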
Conclusion
I quickly ran the program and recorded the following compute times for the ‘problem’:
Array size = 400000000
Number of times doubled = 8
Serial: 9.74226s
SSE: 4.74564s
SIMD intrinsics can be powerful. Running this very simple program, it can be seen that the SSE version processing vectors of 4 floats is roughly twice as fast as a purely serial method. For this kind of problem, once you are aware of the intrinsics they are actually really easy (but potentially hazardous) to use. For more complex problems, you will need a better understanding of processing vectors and how to do it correctly to increase performance and avoid errors.