Author: | Wojciech Muła |
---|---|

Added on: | 2016-12-16 |

AMD XOP defines instruction `VPTERNB` which does lookup in a pair of SSE
registers. The instruction is similar to `PSHUFB`, but apart of wider,
5-bit index, it allows to perform several extra operations based on the
higher 3-bits.

I showed that `PSHUFB` can be used to implement population count
procedure. With 4-bit indices such procedure is straightforward. We split
a byte vector into two halves, and invoke the instruction twice, getting
popcount for both nibbles. Next these popcounts are added together, forming
8-bit counters, which are added in the end.

Similar procedure can be build around `VPTERNB`. However, to fully utilize
5-bit indices a slightly different strategy is needed. We process two vectors
in one step treating byte pairs as 16-bit words. We call `VPTERNB` to
calculate popcount of three 5-bit fields, one remaining bit of the 16-bit word
is counted separately.

We process two SSE registers (`vec0` and `vec1`) in one step.

vec0 = packed_byte(abcd_efgh) // a .. p are bits */ vec1 = packed_byte(ijkl_mnop)

- First we calculate population count of lower five bits of both registers.

// t0 = packed_byte(000d_efgh) // t1 = packed_byte(000l_mnop) const __m128i t0 = _mm_and_si128(vec0, packed_byte(0x1f)); const __m128i t1 = _mm_and_si128(vec1, packed_byte(0x1f)); // popcnt for bits d..h const __m128i popcnt1 = _mm_perm_epi8(lookup_lo, lookup_hi, t0); // popcnt for bits l..p const __m128i popcnt2 = _mm_perm_epi8(lookup_lo, lookup_hi, t1);

- Then we build a 5-bit word using three remaining bits from
`vec0`(abc) and`vec1`(jk). We merge these bit-fields using XOP-instruction`VCMOV`, a condition move based on the bit-mask.

// t2 = packed_byte(??ij_klmn) const __m128i t2 = _mm_srli_epi16(vec1, 2); // t3 = packed_byte(...._.abc) const __m128i t3 = _mm_and_si128(_mm_srli_epi16(vec0, 5), _mm_set1_epi8(0x07)); // t4 = packed_byte(...j_kabc) const __m128i t4 = _mm_cmov_si128(t2, t3, _mm_set1_epi8(0x18)); // popcnt3 = popcount of bits jkabc const __m128i popcnt3 = _mm_perm_epi8(lookup_lo, lookup_hi, t4);

- At this point vectors
`popcnt1`,`popcnt2`and`popcnt3`have got population count for all bits except the bit i. This bit is present in vector`t4`at position 4th. We separately accumulate popcounts from vectors`popcnt{1,2,3}`and the bit i. The bit i is not shifted, we simply mask it and keep the counter shifted right, that limits its capacity to 7.

// update counter local_a = _mm_add_epi8(local_a, popcnt1); local_a = _mm_add_epi8(local_a, popcnt2); local_a = _mm_add_epi8(local_a, popcnt3); // update counter for bits i local_b = _mm_add_epi8(local_b, _mm_and_si128(t2, _mm_set1_epi8(0x20)));

We repeat the procedure six times, and then add counters `local_a` and
`local_b`. The resulting vector `local` contains in each byte population
count of two bytes from the input vectors.

local_b = _mm_srli_epi16(local_b, 5); const __m128i local = _mm_add_epi8(local_a, local_b);

Sample implementation is available in file `src/xop_hamming_weight.cpp` on
github repository accompanying our paper Faster Population Counts using
AVX2 Instructions.

Following timings were obtained on AMD FX(tm)-8150 Eight-Core Processor with
program `basic_benchmark` from the repository. Comparison includes just
three procedures:

`PSHUFB`-based (`sse_bitset64_weight`),- SSE Harley-Seal (
`sse_harley_seal_bitset64_weight`), - and
`VPTERNB`-based (`xop_bitset64_weight`).

The Harley-Seal method is the fastest. However, when compare parallel-lookup
methods, we see that one presented in this article can be 1.3 times faster
than `PSHUFB`-based.

64-bit words | cycles per 64-bit word | ||
---|---|---|---|

PSHUFB-based |
VPTERNB-based |
Harley-Seal | |

128 | 0.35 | 0.26 | 0.40 |

192 | 0.71 | 0.59 | 0.68 |

256 | 0.96 | 0.83 | 0.81 |

384 | 1.11 | 0.99 | 0.95 |

512 | 1.19 | 1.12 | 1.07 |

768 | 1.26 | 1.16 | 1.13 |

1024 | 1.30 | 1.24 | 1.15 |

2048 | 1.52 | 1.34 | 1.26 |

4096 | 1.54 | 1.34 | 1.26 |

8192 | 1.56 | 1.34 | 1.26 |

12288 | 1.56 | 1.35 | 1.26 |