Author: Wojciech Muła

Added on: 24.05.2008

Updates: 04.03.2015 (link to github repo), 22.10.2011 (link to paper, unrolled loops), 28.06.2010 (link to Peter's site), 27.03.2010 (SSE2 procedures)

2015-03-04: The github repository contains the original code from 2008 and
also a new C++11, intrinsics-based implementation. BTW, delaying the byte-wise
sum (the trick with `PSADBW`) applied to the bit-parallel method gives a
50% speedup over the plain 64-bit SWAR procedure.

2011-10-11: The paper Anatomy of High-Performance 2D Similarity
Calculations contains an interesting comparison of similarity calculations
which heavily use the popcount operation. The authors compared 4 basic methods: 1)
hardware-based, i.e. the `popcnt` instruction, 2) a simple LUT, 3) the bit-level
parallel method (SSE2 procedure), and 4) the method described here. The most
important observation, from my point of view of course, is that the speed of
the SSSE3 code is **comparable** to hardware `popcnt`, just a bit
slower. And while Peter's tests were synthetic, here we have results from a
real-world application.

The authors also published the full source code, and I noticed they manually unrolled the inner loop. I did the same in my code and the speedup increased from 3 to 4-5 times; the sources and article have been updated.

2010-06-28: **Peter Kankowski** compared the speed of the SSE4.2
instructions `crc32` and `popcnt` against software implementations.
Hardware CRC32 is significantly faster, but the population count is
slightly **slower** than the algorithm presented here!

Population count is the procedure of counting the number of ones in a bit
string. Intel announced the instruction `popcount` (or `popcnt`)
to appear with the SSE4.2 instruction set; this instruction
operates on 16, 32 and 64 bits.

However, SSSE3 brought the powerful instruction `PSHUFB`. This
instruction can be used to perform a **parallel** 16-way lookup: the LUT has
16 entries and is stored in an XMM register, and the indexes are the 4 lower bits of
each byte stored in another XMM register.

With the help of `PSHUFB` we can get a vector that contains the population count
for 16 nibbles. To get a vector of population counts for each of 16 bytes,
`PSHUFB` has to be called twice, on the vectors of lower and higher
nibbles, and the results finally added together.

The following code shows the idea:

```
; xmm0 - input (16 bytes)
; xmm7 - POPCOUNT_4bit -- lookup table
; xmm6 - MASK_bits03 = packed_byte(0x0f) -- mask 4 lower bits

movdqa  %%xmm0, %%xmm1
psrlw       $4, %%xmm1
pand    %%xmm6, %%xmm0      ; xmm0 - lower nibbles
pand    %%xmm6, %%xmm1      ; xmm1 - higher nibbles

movdqa  %%xmm7, %%xmm2      ; since the instruction pshufb modifies the LUT,
movdqa  %%xmm7, %%xmm3      ; it must be saved for further use

pshufb  %%xmm0, %%xmm2      ; xmm2 = vector of popcount for lower nibbles
pshufb  %%xmm1, %%xmm3      ; xmm3 = vector of popcount for higher nibbles

paddb   %%xmm3, %%xmm2      ; xmm2 += xmm3 -- vector of popcount for bytes
```

The last step is adding all the bytes from the vector.

The instruction `PSADBW` calculates the sum of absolute differences of
unsigned bytes — if the first argument is all zeros, then the result is the
sum of the bytes from the second argument. Unfortunately, `PSADBW` invoked
with 128-bit arguments calculates separate sums for bytes 0..7 and
8..15, and stores them in the lower and the higher quadwords.
Because of that, a few additional instructions are needed:

```
pxor    %%xmm0, %%xmm0      ; xmm0 = packed_byte(0x00)
psadbw  %%xmm0, %%xmm3      ; xmm3 = [popcount of bytes 0..7 | popcount of bytes 8..15]
movhlps %%xmm3, %%xmm0      ; xmm0 = [ 0 | popcount of bytes 0..7 ]
paddd   %%xmm3, %%xmm0      ; xmm0 = [ not needed | popcount of bytes 0..15]
```

`PSADBW` has a latency of 3 or 4 cycles, and the additional instructions
also need some time to execute (I guess around 2 cycles).

`PSADBW` doesn't have to be called in every iteration — since the maximum
popcount of a single byte is 8, we can perform up to
`floor(255/8)=31` parallel additions (`PADDB`) without overflow!
Moreover, the partial sums returned by `PSADBW` can be added together at
the end.

Pseudocode:

```
pxor %%xmm5, %%xmm5                  // global accumulator

while (bytes to end > 0) {
    pxor %%xmm4, %%xmm4              // local accumulator (for inner loop)
    n = min(bytes to end/16, 31)     // up to 31 blocks

    for (i=0; i < n; i++) {
        // calculate xmm3, a vector of popcount for bytes
        paddb %%xmm3, %%xmm4         // xmm4 += xmm3 -- update local acc.
    }

    pxor   %%xmm0, %%xmm0
    psadbw %%xmm0, %%xmm4            // xmm4 -- calculate two popcounts

    paddd %%xmm4, %%xmm5             // update global acc.
}

// add halves of the global accumulator
movhlps %%xmm5, %%xmm0
paddd   %%xmm5, %%xmm0
movd    %%xmm0, %%eax                // eax = population count for all bytes
```

ssse3_popcount.c is a test program that contains implementations of the following procedures:

  * `lookup` - popcount based on a LUT with 256 entries; I tested GCC's `__builtin_popcount`, however it was much slower than my implementation
  * `ssse3-1` - straightforward SSSE3 implementation
  * `ssse3-2` - improved SSSE3 implementation
  * `ssse3-unrl` - `ssse3-2` with the inner loop unrolled 4 times
  * `sse2-1` - SSE2 bit-level parallel implementation
  * `sse2-2` - improved SSE2 implementation (using the same tricks as the SSSE3 version)
  * `sse2-unrl` - `sse2-2` with the inner loop unrolled 4 times

The first argument of the program is a function name, the second is the number of 16-byte chunks processed by the selected procedure in one iteration, and the third is the number of iterations.

The table below shows results for different chunk counts; the test script I've used is available. The program was compiled with the following options:

gcc -O2 -DALIGN_DATA ssse3_popcount.c -o ssse3_popcount

Tests were run on my Linux box, with a Core 2 Duo E8200.

The results clearly show that the method presented above brings a significant speedup, which depends on the data size.

The straightforward SSSE3 implementation is 2-2.8 times faster than the LUT procedure, the improved version around 3 times, and the unrolled version 4-5 times.

procedure | number of 16-byte chunks | iterations | time [s] | speedup
---|---|---|---|---
lookup | 1 | 20,000,000 | 0.22 | 100%
sse2-1 | | | 0.19 | 115%
sse2-2 | | | 0.20 | 110%
ssse3-1 | | | 0.14 | 157%
ssse3-2 | | | 0.16 | 137%
lookup | 8 | 20,000,000 | 1.42 | 100%
sse2-1 | | | 0.92 | 154%
sse2-2 | | | 0.79 | 179%
ssse3-1 | | | 0.61 | 232%
ssse3-2 | | | 0.52 | 273%
lookup | 32 | 2,000,000 | 0.55 | 100%
sse2-1 | | | 0.34 | 161%
sse2-2 | | | 0.30 | 183%
sse2-unrl | | | 0.22 | 250%
ssse3-1 | | | 0.22 | 250%
ssse3-2 | | | 0.19 | 289%
ssse3-unrl | | | 0.12 | 458%
lookup | 128 | 200,000 | 0.21 | 100%
sse2-1 | | | 0.13 | 161%
sse2-2 | | | 0.11 | 190%
sse2-unrl | | | 0.08 | 262%
ssse3-1 | | | 0.08 | 262%
ssse3-2 | | | 0.07 | 299%
ssse3-unrl | | | 0.04 | 525%
lookup | 512 | 200,000 | 0.86 | 100%
sse2-1 | | | 0.53 | 162%
sse2-2 | | | 0.45 | 191%
sse2-unrl | | | 0.34 | 252%
ssse3-1 | | | 0.34 | 252%
ssse3-2 | | | 0.26 | 330%
ssse3-unrl | | | 0.18 | 477%
lookup | 1024 | 200,000 | 1.73 | 100%
sse2-1 | | | 1.07 | 161%
sse2-2 | | | 0.90 | 192%
sse2-unrl | | | 0.68 | 254%
ssse3-1 | | | 0.69 | 250%
ssse3-2 | | | 0.52 | 332%
ssse3-unrl | | | 0.38 | 455%
lookup | 2048 | 200,000 | 3.47 | 100%
sse2-1 | | | 2.14 | 162%
sse2-2 | | | 1.80 | 192%
sse2-unrl | | | 1.37 | 253%
ssse3-1 | | | 1.38 | 251%
ssse3-2 | | | 1.06 | 327%
ssse3-unrl | | | 0.76 | 456%