Author: | Wojciech Muła |
---|---|

Added on: | 2016-10-08 |

Presented method allows to sort a whole AVX512 register or its subrange, it is a variant of counting sort. The time complexity is linear, moreover method works entirely on registers, no extra memory operations are done. It may also be easily extended to sorting more than one register.

The method is suitable for sorting 32- and 64-bit integers, and also floating point numbers, both single and double precision.

A single step of the algorithm consist:

- Broadcast i-th element of the input.
- Compare for less the broadcasted vector with the input. This yields the number of items less than i-th item.
- Likewise compare for equality. This yields number of item's repetitions
- Having these two numbers we can form a mask which can be used to merge the broadacted vector with a result vector, putting i-th number at the correct position.

This is repeated for **every item** of input. When sorting 32-bit numbers
16 iterations have to be done.

result = [ 0, 0, 0, 0, 0, 0, 0, 0] input = [ 10, 1, 5, 20, 10, 8, 60, 99] brds = [ 10, 10, 10, 10, 10, 10, 10, 10] - broadcast 10 less = [ 0, 1, 1, 0, 0, 1, 0, 0] -> 3 items less than 10 equal = [ 1, 0, 0, 0, 1, 0, 0, 0] -> 10 repeats 2 times mask = (1 << (3 + 2)) - (1 << 3) = 0b00011000 result = merge(brds, mask) [ 0, 0, 0, 10, 10, 0, 0, 0]

Sample implementation with loop. It is slower than fully unrolled code.

__m512i avx512_sort_loop_epi32(const __m512i v) { __m512i result = _mm512_setzero_si512(); __m512i index = _mm512_setzero_si512(); __m512i incr = _mm512_set1_epi32(1); for (int i=0; i < 16; i++) { const __m512i b = _mm512_permutexvar_epi32(index, v); const uint16_t lt = _mm_popcnt_u32(_mm512_cmplt_epi32_mask(v, b)); const uint16_t eq = _mm_popcnt_u32(_mm512_cmpeq_epi32_mask(v, b)); const uint16_t mask = (uint32_t(1) << (lt + eq)) - (uint32_t(1) << lt); result = _mm512_mask_mov_epi32(result, mask, b); index = _mm512_add_epi32(index, incr); } return result; }

Compiler: GCC 5.3.0 CPU: Knights Landing 7120

Sorting 16 x 32-bit numbers (one AVX512 register)

algorithm | time [s] |
---|---|

std::sort | 1.53 |

insertion sort | 5.55 |

AVX512F unrolled | 1.77 |

AVX512F (for loop) | 2.04 |

Sorting 32 x 32-bit numbers (two AVX512 registers)

algorithm | time [s] |
---|---|

std::sort | 5.24 |

insertion sort | 12.99 |

AVX512F unrolled | 5.78 |

Speed of vectorized algorithms is comparable to std::sort for C++ library. However, the algorithms are not meant to be used as a replacement of library function, but as a part of other algorithms. For instance the unrolled vectorized algorithm was used in quicksort implementation to sort short ranges, making the whole sorting substantially faster.

Github repository contains various variants of sorting and a test program.