SSSE3: fast popcount

Author: Wojciech Muła
Added on: 24.05.2008
Updates: 28.06.2010 (link to Peter's site), 27.03.2010 (SSE2 procedures), 22.10.2011 (link to paper, unrolled loops)

2011-10-11: The paper Anatomy of High-Performance 2D Similarity Calculations contains an interesting comparison of similarity calculations that heavily use the popcount operation. The authors compared four basic methods: 1) hardware-based, i.e. the popcnt instruction, 2) a simple LUT, 3) a bit-level parallel method (the SSE2 procedure), and 4) the method described here. The most important observation, from my point of view of course, is that the speed of the SSSE3 code is comparable to hardware popcnt; it is just a bit slower. And while Peter's tests were synthetic, here we have results from a real-world application.

The authors also published the full source code, and I noticed that they manually unrolled the inner loop. I did the same in my code and the speedup increased from 3 to 4-5 times; the sources and this article have been updated.

2010-06-28: Peter Kankowski compared the speed of the SSE4.2 instructions crc32 and popcnt against software implementations. Hardware CRC32 is significantly faster, but the hardware population count is slightly slower than the algorithm presented here!

Population count is the procedure of counting the number of ones in a bit string. Intel announced the instruction popcount (or popcnt) to appear with the SSE4.2 instruction set; the instruction will operate on 8, 16, 32 and 64 bits.
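
For comparison, the hardware instruction can be used from C through compiler intrinsics. The sketch below is not part of ssse3_popcount.c; it is only a possible scalar baseline, assuming GCC or Clang on a 64-bit target compiled with -msse4.2 (the helper name count_bits_hw is my own):

#include <stddef.h>
#include <stdint.h>
#include <nmmintrin.h>  /* SSE4.2: _mm_popcnt_u64 */

/* Hypothetical baseline: popcount of a buffer of 64-bit words
   using the hardware popcnt instruction. */
static uint64_t count_bits_hw(const uint64_t *data, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)_mm_popcnt_u64(data[i]);
    return total;
}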

However, SSSE3 brought the powerful instruction PSHUFB. This instruction can be used to perform a parallel 16-way lookup: the LUT has 16 entries and is stored in an XMM register, and the indexes are the 4 lower bits of each byte of another XMM register.

With the help of PSHUFB we can get a vector that contains the population counts of 16 nibbles. To get a vector of population counts for all 16 bytes, PSHUFB has to be called twice, on the vector of lower nibbles and on the vector of higher nibbles, and the two results added together.

The following code shows the idea:

; xmm0 - input (16 bytes)
; xmm7 - POPCOUNT_4bit  -- lookup table
; xmm6 - MASK_bits03 = packed_byte(0x0f) -- mask 4 lower bits

movdqa  %%xmm0, %%xmm1
psrlw       $4, %%xmm1

pand    %%xmm6, %%xmm0  ; xmm0 - lower nibbles
pand    %%xmm6, %%xmm1  ; xmm1 - higher nibbles

movdqa  %%xmm7, %%xmm2  ; since instruction pshufb modifies the LUT,
movdqa  %%xmm7, %%xmm3  ; it must be saved for further use

pshufb  %%xmm0, %%xmm2  ; xmm2 = vector of popcount for lower nibbles
pshufb  %%xmm1, %%xmm3  ; xmm3 = vector of popcount for higher nibbles

paddb   %%xmm3, %%xmm2  ; xmm2 += xmm3 -- vector of popcount for bytes
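
The same computation can also be written with compiler intrinsics instead of inline assembly. The sketch below is my own illustration rather than code from ssse3_popcount.c, and assumes a compiler with SSSE3 support (-mssse3):

#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

/* Returns a vector holding the popcount of each of the 16 input bytes. */
static __m128i popcount_bytes(__m128i v)
{
    const __m128i lut  = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
                                       1, 2, 2, 3, 2, 3, 3, 4);   /* POPCOUNT_4bit */
    const __m128i mask = _mm_set1_epi8(0x0f);                     /* MASK_bits03   */

    __m128i lo = _mm_and_si128(v, mask);                     /* lower nibbles  */
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), mask);  /* higher nibbles */

    /* PSHUFB: sixteen parallel 4-bit lookups into the LUT */
    __m128i cnt_lo = _mm_shuffle_epi8(lut, lo);
    __m128i cnt_hi = _mm_shuffle_epi8(lut, hi);

    return _mm_add_epi8(cnt_lo, cnt_hi);                     /* per-byte popcounts */
}

Note that with intrinsics the explicit copies of the LUT register (the movdqa instructions above) are not needed, as the compiler handles register allocation itself.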

The last step is to add all the bytes of this vector.

The instruction PSADBW calculates the sum of absolute differences of unsigned bytes; if the first argument is all zeros, the result is simply the sum of the bytes of the second argument. Unfortunately, PSADBW invoked with 128-bit arguments calculates separate sums for bytes 0..7 and bytes 8..15 and stores them in the lower and higher quadwords. Because of that a few additional instructions are needed (register contents are annotated as [high quadword | low quadword]):

pxor    %%xmm0, %%xmm0  ; xmm0 = packed_byte(0x00)
psadbw  %%xmm0, %%xmm3  ; xmm3 = [popcount of bytes 8..15 | popcount of bytes 0..7 ]
movhlps %%xmm3, %%xmm0  ; xmm0 = [           0            | popcount of bytes 8..15]
paddd   %%xmm3, %%xmm0  ; xmm0 = [       not needed       | popcount of bytes 0..15]
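
The same reduction expressed with intrinsics could look as follows (again only an illustrative sketch with my own naming; _mm_sad_epu8 is the SSE2 intrinsic for PSADBW):

#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_sad_epu8 */

/* Reduces a vector of per-byte popcounts to a single 32-bit value. */
static uint32_t sum_bytes(__m128i cnt)
{
    /* PSADBW against zero: sums of bytes 0..7 and 8..15 land in the two quadwords */
    __m128i sums = _mm_sad_epu8(cnt, _mm_setzero_si128());
    /* add the high quadword to the low one and extract the result */
    __m128i high = _mm_unpackhi_epi64(sums, sums);
    return (uint32_t)_mm_cvtsi128_si32(_mm_add_epi32(sums, high));
}

Combined with the popcount_bytes sketch above, the popcount of a single 16-byte block is sum_bytes(popcount_bytes(v)).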

Further improvements

PSADBW has a latency of 3 or 4 cycles, and the additional instructions also need some time to execute (I guess around 2 cycles).

PSADBW doesn't have to be called in every iteration: since the maximum popcount of a single byte is 8, we can perform up to floor(255/8) = 31 parallel byte additions (PADDB) without overflow! Moreover, the partial sums returned by PSADBW can be added together at the end.

Pseudocode:

pxor %%xmm5, %%xmm5             // global accumulator

while (bytes to end > 0) {
        pxor %%xmm4, %%xmm4     // local accumulator (for inner loop)

        n = min(bytes to end/16, 31)    // up to 31 blocks
        for (i=0; i < n; i++) {
                // calculate xmm3, a vector of popcount for bytes

                paddb %%xmm3, %%xmm4    // xmm4 += xmm3 -- update local acc.
        }

        pxor   %%xmm0, %%xmm0
        psadbw %%xmm0, %%xmm4   // xmm4 -- two quadword partial popcounts

        // update global acc.
        paddd  %%xmm4, %%xmm5
}

// add halves of the global accumulator
movhlps %%xmm5, %%xmm0
paddd   %%xmm5, %%xmm0
movd    %%xmm0, %%eax   // eax = population count for all bytes
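
For completeness, the whole scheme rendered in C with intrinsics might look as below. This is only a sketch of the idea under simplifying assumptions (16-byte aligned input whose length is a multiple of 16) and uses my own naming; it is not the code from ssse3_popcount.c:

#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3 */

/* Sketch of the improved loop: PSADBW runs once per group of up to 31 blocks. */
static uint32_t ssse3_popcount_sketch(const uint8_t *data, size_t n)
{
    const __m128i lut  = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
                                       1, 2, 2, 3, 2, 3, 3, 4);
    const __m128i mask = _mm_set1_epi8(0x0f);
    const uint8_t *end = data + n;
    __m128i global = _mm_setzero_si128();           /* global accumulator (xmm5) */

    while (data < end) {
        __m128i local = _mm_setzero_si128();        /* local byte accumulator (xmm4) */
        size_t blocks = (size_t)(end - data) / 16;
        if (blocks > 31) blocks = 31;               /* floor(255/8) = 31: no byte overflow */

        for (size_t i = 0; i < blocks; i++, data += 16) {
            __m128i v   = _mm_load_si128((const __m128i *)data);
            __m128i lo  = _mm_and_si128(v, mask);
            __m128i hi  = _mm_and_si128(_mm_srli_epi16(v, 4), mask);
            __m128i cnt = _mm_add_epi8(_mm_shuffle_epi8(lut, lo),
                                       _mm_shuffle_epi8(lut, hi));
            local = _mm_add_epi8(local, cnt);       /* PADDB into the local accumulator */
        }
        /* PSADBW: two quadword partial sums, accumulated into the global counter */
        global = _mm_add_epi64(global, _mm_sad_epu8(local, _mm_setzero_si128()));
    }

    /* add the two halves of the global accumulator */
    __m128i high = _mm_unpackhi_epi64(global, global);
    return (uint32_t)_mm_cvtsi128_si32(_mm_add_epi32(global, high));
}

A real implementation also needs a scalar tail for the bytes left over when the length is not a multiple of 16.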

Test

ssse3_popcount.c is a test program that contains implementations of the compared procedures: lookup, sse2-1, sse2-2, sse2-unrl, ssse3-1, ssse3-2 and ssse3-unrl (the names match the results table below).

The first argument of the program is a procedure name, the second is the number of 16-byte chunks processed by the selected procedure in one iteration, and the third is the number of iterations.

The table shows results for different chunk counts; the test script I've used is available. The program was compiled with the following options:

gcc -O2 -DALIGN_DATA ssse3_popcount.c -o ssse3_popcount

Tests were run on my Linux box with a Core 2 Duo E8200.

[chart]

The results clearly show that the method presented above brings a significant speedup, which depends on the data size.

The straightforward SSSE3 implementation is 2-2.8 times faster than the LUT procedure, the improved version around 3 times, and the unrolled version 4-5 times.

procedure  | number of 16-byte chunks | iterations | time [s] | speedup
-----------+--------------------------+------------+----------+--------
lookup     |                        1 | 20,000,000 |     0.22 |    100%
sse2-1     |                          |            |     0.19 |    115%
sse2-2     |                          |            |     0.20 |    110%
ssse3-1    |                          |            |     0.14 |    157%
ssse3-2    |                          |            |     0.16 |    137%
lookup     |                        8 | 20,000,000 |     1.42 |    100%
sse2-1     |                          |            |     0.92 |    154%
sse2-2     |                          |            |     0.79 |    179%
ssse3-1    |                          |            |     0.61 |    232%
ssse3-2    |                          |            |     0.52 |    273%
lookup     |                       32 |  2,000,000 |     0.55 |    100%
sse2-1     |                          |            |     0.34 |    161%
sse2-2     |                          |            |     0.30 |    183%
sse2-unrl  |                          |            |     0.22 |    250%
ssse3-1    |                          |            |     0.22 |    250%
ssse3-2    |                          |            |     0.19 |    289%
ssse3-unrl |                          |            |     0.12 |    458%
lookup     |                      128 |    200,000 |     0.21 |    100%
sse2-1     |                          |            |     0.13 |    161%
sse2-2     |                          |            |     0.11 |    190%
sse2-unrl  |                          |            |     0.08 |    262%
ssse3-1    |                          |            |     0.08 |    262%
ssse3-2    |                          |            |     0.07 |    299%
ssse3-unrl |                          |            |     0.04 |    525%
lookup     |                      512 |    200,000 |     0.86 |    100%
sse2-1     |                          |            |     0.53 |    162%
sse2-2     |                          |            |     0.45 |    191%
sse2-unrl  |                          |            |     0.34 |    252%
ssse3-1    |                          |            |     0.34 |    252%
ssse3-2    |                          |            |     0.26 |    330%
ssse3-unrl |                          |            |     0.18 |    477%
lookup     |                     1024 |    200,000 |     1.73 |    100%
sse2-1     |                          |            |     1.07 |    161%
sse2-2     |                          |            |     0.90 |    192%
sse2-unrl  |                          |            |     0.68 |    254%
ssse3-1    |                          |            |     0.69 |    250%
ssse3-2    |                          |            |     0.52 |    332%
ssse3-unrl |                          |            |     0.38 |    455%
lookup     |                     2048 |    200,000 |     3.47 |    100%
sse2-1     |                          |            |     2.14 |    162%
sse2-2     |                          |            |     1.80 |    192%
sse2-unrl  |                          |            |     1.37 |    253%
ssse3-1    |                          |            |     1.38 |    251%
ssse3-2    |                          |            |     1.06 |    327%
ssse3-unrl |                          |            |     0.76 |    456%