Notes

Author:Wojciech Muła

Main page

Table of contents

Branchless set mask if value greater or how to print hex values [9.06.2010]

Suppose we need to get mask when nonnegative argument is greater then some constant value; in other words, we want to evaluate following expression:

if x > const_n then
   mask := 0xffffffff;
else
   mask := 0x00000000;

Portable branchless solution:

The key to understand this trick is binary form of M: 0111..1111zzzz, where z is 0 or 1 depending on n value. When x is greater then n, then x + M has form 1000..000zzzz, because carry bit propagate through series of ones to k-th position of result.

Real world example — branchless converting hex digit to ASCII (M=0x7ffffff6 for k=31 and n=9).

; input:    eax - hex digit
; output:   eax - ASCII letter (0-9, A-F or a-f)
; destroys: ebx

        andl 0xf, %eax
        leal 0x7ffffff6(%eax), %ebx     ; MSB(ebx)=1 when eax >= 10
        sarl $31, %ebx                  ; ebx - mask
        andl  $7, %ebx                  ; ebx = 7 when eax >= 10 (for A-F letters)
        ;andl $39, %ebx                 ; ebx = 39 when eax >= 10 (for a-f letters)
        leal '0'(%eax, %ebx), %eax      ; eax = '0' + eax + ebx => ASCII letter

It is also possible to convert 4 hex digits in parallel using similar algorithm, but input data have to be correctly prepared. Moreover generating mask requires 3 instructon and one extra register (in scalar version just one arithmetic shift). I guess it wont be fast on x86, maybe this approach would be good for SIMD code, where similar code transforms more bytes at once.

; input: eax - four hex digits in form [0a0b0c0d]
; output: eax - four ascii letters
; destroys: ebx, ecx

        leal 0x76767676(%eax), %ebx        ; MSB of each byte is set when corresponding eax byte is >= 10
                                           ; (here: 0x7f - 9 = 0x76)
        andl $0x80808080, %ebx
        movl %ebx, %ecx
        shrl    $7, %ebx
        subl %ebx, %ecx                    ; ecx - byte-wise mask
        ;andl $0x07070707, %ecx            ; for ASCII letters A-F
        andl $0x27272727, %ecx             ; for ASCII letters a-f
        leal 0x30303030(%eax, %ecx), %eax  ; ecx - four ascii letters

See also: SSSE3: printing hex values (weird use of PSHUFB instruction)

Speedup reversing table of bytes [1.05.2010]

In-place reversing is performed by following algorithm:

// n - table size

int i = 0;      // index1
int j = n - 1;  // index2
int c = n/2;    // counter
int x, y;
while (c--)
        x = table[i];
        y = table[j];
        table[i] = y;
        table[j] = x;
        i++;
        j--;
        c--;
}

We can use some CPU-specific instruction to speedup this algorithm.

  1. 486 processors have BSWAP instruction that swaps four bytes (quick conversion between big-/little-endian);
  2. SSE2 extension have PSHUFD that calculate any combination of dwords in a xmm register, likwise PSHUFLW and PSHUFHW calculate combinations of words;
  3. SSSE3 extension have PSHUFB that calculate any combination of 16 bytes from a xmm register.

Program can load part of table into registers, then swap bytes in register using one or more instructions and then store registers content at new position. If size of table is not multiply of register size, then few bytes in middle of a table have to be swapped using default algorithm.

Modified algorithm outline:

// n - table size
// k - register size - k=4 for BSWAP and k=16 for PSHUFB

int i = 0;              // index1
int j = n - k - 1;      // index2  !!!
int c = n/(2*k);        // counter !!!
register char x[k], y[k];       // hardware registers
while (c--)
        x = table[i];           // load k bytes
        y = table[j];           // ...
        hardware_swap(x);       // reverse these bytes
        hardware_swap(y);       // ...
        table[i] = y;           // store k bytes
        table[j] = x;           // ..
        i += k;
        j -= k;
        c--;
}

if (n % (2*k) != 0)
        swap rest byte-by-byte

Function hardware_swap(x) is equivalent to single instruction: BSWAP reg or PSHUFB xmm; SSE2 code requires more instructions:

movaps  %xmm0, %xmm1
psrlw      $8, %xmm0
psllw      $8, %xmm1
por     %xmm1, %xmm0            ; swap adjacent bytes

pshufd  $0x4e, %xmm0, %xmm0     ; swap order of dwords
pshuflw $0x1b, %xmm0, %xmm0     ; swap order of words in lower
pshufhw $0x1b, %xmm0, %xmm0     ; .. and higher half of register

Test program contains several procedures that implement modified algorithm:

  1. swap_tab_asm (alias x86) — asm implementation of default algorithm (reference)
  2. swap_tab_bswap (bswap) — modified algorithm: one bswap (8 bytes per iteration)
  3. swap_tab_bswap_unrolled (bswap2) — modified algorithm: two bswap (16 bytes per iteration)
  4. swap_tab_sse (sse) — modified algorithm: one SSE swap (32 bytes per iteration)
  5. swap_tab_sse_unrolled (sse2) — modified algorithm: two SSE swaps (64 bytes per iteration)
  6. swap_tab_pshufb (pshuf) — modified algorithm: one PSHUFB (32 bytes per iteration)
  7. swap_tab_pshufb_unrolled (pshuf2) — modified algorithm: two PSHUFB (64 bytes per iteration)

Important note: SSE implementations use safe MOVUPS instructions for transfers data from/to memory. This instruction is much slower then MOVAPS even if accessing aligned data!

Tests

Two scenarios of test were considered:

  1. table size is hardware friendly, i.e. is multiply of implementation base step; also address of table is aligned
    • all procedures use MOVUPS
    • all procedures use MOVAPS
  2. size is not hardware friendly and address is not aligned.

Procedures were tested on following computers:

  • recent Core 2 Due E8200,
  • quite old Pentium M (instruction PSHUFB not available).

Quick results discussion:

  • As always speedup depends on table size — for larger tables speedup is also larger. Max speedup:
    • Core 2:
      • 15.5 timesPSHUFB unrolled & MOVAPS
      • 3.5 times — PSHUFB & MOVUPS
    • Pentium M:
      • 4 times — BSWAP unrolled
  • Unaligned memory access kills performance - results clearly shows this behaviour.

Results for Core 2

Results

Core 2 - tables aligned, use MOVUPS for memory transfers

Results

Core 2 - tables aligned, use MOVAPS for memory transfers

Results

Core 2 - tables unaligned

Results for Pentium M

Results

Pentium M - tables aligned, use MOVUPS for memory transfers

Results

Pentium M - tables aligned, use MOVAPS for memory transfers

Results

Pentium M - tables unaligned

Determining if an integer is a power of 2 [11.04.2010]

Method from Bit Twiddling Hacks: (x != 0) && (x & (x-1) == 0). GCC compiles this to following code:

; input/ouput: eax
; destroys: ebx

        test    %eax,  %eax     ; x == 0?
        jz      1f

        leal -1(%eax), %ebx     ; ebx := x-1
        test    %eax,  %ebx     ; ZF  := (eax & ebx == 0)

        setz     %al
        movzx    %al, %eax       ; eax := ZF
1:

We can use also BSF and BSR instructions, that determines position of first and last bit=1. If number is power of 2, then just one bit is set, and thus these positions are equal. BSx sets also ZF flag if input value is zero.

; input/output: eax
; destroys: ebx, edx

        bsf     %eax, %ebx      ; ebx := LSB's position if eax != 0, ZF = 1 if eax = 0
        jz      1f
        bsr     %eax, %edx      ; edx := MSB's position

        cmp     %ebx, %edx      ; ZF  := (ebx = edx)

        setz    %al
        movzx   %al, %eax       ; eax := ZF
1:

Brenchless conditional exchange [8.04.2010]

Suppose we have to exchange (or just move) two registers A and B:

  1. C := A xor B
  2. C := 0 if condition is not true
  3. A := A xor C
  4. B := B xor C

If C is 0, then A and B left unchanged, else A and B are swapped. If only conditional move from B to A is needed, then step 4th have to be skipped.

Here is a sample x86 code, where condition is value of CF:

sbb edx, edx ; part of step 2. - edx = 0xffffff if CF=1, 0x000000 otherwise
mov ecx, eax
xor ecx, ebx ; step 1
and ecx, edx ; completed step 2. - now C is 0 or (A xor B)
xor eax, ecx ; step 3
xor ebx, ecx ; step 4

Branchless moves are possible in Pentium Pro and higher with instructions cmovcc.

See also XOR linked list.

STL: map with string as key - access speedup [3.04.2010]

The idea is quite simple: we do not have a single stl::map<string, something>, but vector of maps, indexed with O(1) time — each map store keys sharing certain properties. Drawback: additional memory.

I've tested following grouping schemes:

  1. length of string,
  2. first letter of string (one level trie),
  3. both length and first letter.

Third is the fastest — around 60% faster then plain std::map from gcc (red-black tree).

Tests: my program read plain text (I've used The Illiad from http://gutenberg.org), text is splitted into words (~190000) and then each words is inserted into dictonary (~28000 distinct words); then the same words are searched in dictionaries. Table below summarizes results on my computer (gcc 4.3.4 from cygwin).

data struct running time [ms] speedup [%]
min avg max min avg max
inserting
std::map 269 287 355 100 100 100
first char 218 241 395 81 84 111
length 218 240 345 81 84 97
len./char 165 172 207 61 60 58
finding
std::map 295 322 483 100 100 100
first char 243 263 460 82 82 95
length 238 248 292 80 77 60
len./char 184 190 241 62 60 50

Download test program.

Branchless signum [1.04.2010]

Problem: calculate value of sign(x):

My solution do not involve any hardaware specific things like ALU flags nor special instructions — just plain AND, OR, shifts.

; input: eax = X

movl %eax, %ebx
sarl $31, %eax  // eax = -1 if X less then zero, 0 otherwise

andl $0x7fffffff, %ebx
addl $0x7fffffff, %ebx // MSB is set if any lower bits were set
shrl $31, $ebx  // eax = +1 if X greater then zero, 0 otherwise

orl %ebx, %eax  // eax = result

C99 implementation:

int32_t sign(int32_t x) {
        int32_t y;
        y = (x & 0x7fffffff) + 0x7fffffff;
        return (x >> 31) | ((uint32_t)y >> 31);
}

Fill word with selected bit [1.04.2010]

This is continuation of subproblem from previous post: we have a word (byte, dword, whatever) and want to fill it with selected bit.

The most general algorithm

  1. mask bit
    [10111010] => [00010000]
  2. clone word
    a=[00010000], b=[00010000]
  3. shift bit in first word to MSB, and to LSB in second word
    a=[10000000], b=[00000001]
  4. subtract c = a - b
    c=[01111111]
  5. add missing MSB c = c OR a
    c=[11111111]
uint32_t fill1(uint32_t a, int bit) {
        uint32_t b;

        b = a = a & (1 << bit);

        a <<= 31 - bit;
        b >>= bit;

        return (a - b) | a;
}

If arithmetic shifts are supported

  1. shift bit to MSB
    a=[10000000]
  2. arithmetic shift right
    a=[11111111]
uint32_t fill2(uint32_t a, int bit) {
        a <<= 31 - bit;
        return (int32_t)(a) >> 31;
}

386 processor specific

On processor 386+ we can clone carry flag (CF) with sbb reg, reg and with bt reg, reg copy selected bit from reg to CF.

uint32_t fill386(uint32_t a, int bit) {
        uint32_t result;
        __asm__ __volatile__ (
                "bt  %1, %0\n"
                "sbb %0, %0\n"
                : "=r" (result)
                : "r" (bit), "0" (a)
        );
        return result;
}

Transpose bits in byte using SIMD instructions [31.03.2010]

Method presented here allows to get any bit permutation, transposition is just one of possible operations. Lookup-based approach would be faster, but algorithm is worth to (re)show.

Algorithm outline for 8-byte vector (with SSE instruction it is possible to get 2 operations in parallel):

  1. fill vector with given byte
    [11010001] => [11010001|11010001|11010001|11010001|11010001|11010001|11010001|11010001]
  2. leave one bit per byte
    [10000000|01000000|00000000|00010000|00000000|00000000|00000000|00000001]
  3. perform desired transposition ("move" bits around)
    [00000001|000000010|00000000|00001000|00000000|00000000|00000000|10000000]
  4. perform horizontal OR of all bytes
    [10001011]

Here is my old MMX code (polish text); below SSE/SSE5 implementation details.

Ad 1. Series of punpcklbw/punpcklwb/shufps or pshufb if CPU supports SSSE3.

# 1.1
movd       %eax, %xmm0
shufps    $0x00, %xmm0, %xmm0
punpcklbw %xmm0, %xmm0
punpcklwd %xmm0, %xmm0

# 1.2
pxor      %xmm1, %xmm1
movd       %eax, %xmm0
pshufb    %xmm1, %xmm0

Ad 2. Simple pand with mask packed_qword(0x8040201008040201).

pand  MASK1, %xmm0

Ad 3. If plain SSE instructions are supported this step require some work. First each bit is populated to fill whole byte (using pcmpeq — we get negated result), then mask bits on desired positons.

SSE5 has powerful instruction protb that can do perform rotation of each byte with independent amount — so in this case just one instruction is needed.

# SSE
pcmpeq  %xmm1, %xmm0
pandn   MASK2, %xmm0    # pandn - to negate

# SSE5
protb    ROT, %xmm0, %xmm0

Ad 4. Since bits are placed on distinct positions, we can use instruction psadbw, that calculate horizontal sums of bytewide differences from two registers (separately for low and high registers half). If one register is full of zeros, we get sum of bytes from other register.

psadbw  %xmm1, %xmm0
movd    %xmm0, %eax

Depending on instruction set three (SSE) or two (SSE5) additional tables are needed.

PostgreSQL: get selected rows with given order [30.03.2010]

Suppose that database stores some kind of dictionary and user picks some items, but wants to keep order. For example dictionary has entries with id=0..10, and user picked 9, 2, 4 and 0. This simple query does the job (query splitted):

foo = SELECT (ARRAY[9,2,4,0])[i] AS index, i AS ord FROM generate_series(1, 4) AS i
SELECT * FROM dictionary INNER JOIN (foo) ON dictionary.id=foo.index ORDER BY foo.ord

Latency of PUSHA and POPA on Core2 [25.06.2008]

With simple program following times were measured:

Linux scheduler: int sched_getcpu() [21.06.2008]

Function defined in scched.h returns number of CPU that runs calling thread or process.

SSE: conversion uint32 to float [18.06.2008]

There is no such instruction — CVTDQ2PS converts signed 32-bit ints. Solution: first zero MSB, such number is never negative in U2, so mentioned instruction could be used. Then add 232 if MSB is set.

Main procedure from sample program:


float    CONST[4]     SIMD_ALIGN = packed_float((float)((uint32_t)(1 << 31))); /* 2^31 */
uint32_t MASK_0_30[4] SIMD_ALIGN = packed_dword(0x7fffffff);
uint32_t MASK_31[4]   SIMD_ALIGN = packed_dword(0x80000000);


void convert_uint32_float(uint32_t in[4], float out[4]) {
        __asm__ volatile (
        "movdqu   (%%eax), %%xmm0  \n"
        "movdqa    %%xmm0, %%xmm1  \n"

        "pand   MASK_0_30, %%xmm0  \n" // xmm0 - mask MSB bit - never less then zero in U2
        "cvtdq2ps  %%xmm0, %%xmm0  \n" // convert this value to float

        "psrad        $32, %%xmm1  \n" // populate MSB in higher word (enough to mask CONST)
        "pand       CONST, %%xmm1  \n" // xmm1 = MSB set ? float(2^31) : float(0)

        "addps     %%xmm1, %%xmm0  \n" // add 2^31 if MSB set

        "movdqu    %%xmm0, (%%ebx) \n"

        : /* no output */
        : "a" (in),
          "b" (out)
        );
}

Download: sse-uint32-float.c, sse-aux.c

PABSQ — absolute value of two singed 64-bit numbers [8.06.2008]

Branch-less x86 code:

movl  %eax, %ebx
sarl   $32, %ebx        ; fill ebx with sign bit
xorl  %ebx, %eax        ; negate eax (if negative)
subl  %ebx, %eax        ; increment eax by 1 (if negative)

SSE2:

pshufd $0b11110101, %xmm0, %xmm1        ; populate dwords 3 and 1
psrad   $32, %xmm1      ; fill quad words with sign bit
pxor  %xmm1, %xmm0      ; negate (if negative)
psubq %xmm1, %xmm0      ; increment (if negative)

RDTSC on Core2 [8.06.2008]

RDTSC is incremented with bus-clock cycles, and then multiplied by core-clock/bus-clock ratio. From programmer view, RDTSC counter is incremented by value greater then 1, for example on C2D E8200 it is 8.

Latency of RDTSC in Pentium4 is about 60-120 cycles, on AMD CPU around 6 cycles.

GCC asm constraints [7.06.2008]

Read-write variables

asm(
        "..."
        : "+a" (var)
);

Read-only variables, registers are clobbered

This won't work, GCC complains:

asm(
        "..."
        : /* no output */
        : "a" (var)
        : "eax"
);

We can declare temporary variable, and treat as read-write one:

int tmp_var = var;
asm(
        "..."
        : "+a" (tmp_var)
);

If there are more registers, or var shouldn't be changed we can declare common dummy variable:

int dummy __attribute__((unused));
asm(
        "..."
        : "=a" (dummy),
          "=b" (dummy),
          "=c" (dummy)
        : "a" (var_or_value1),
          "b" (var_or_value2),
          "c" (var_or_value2)
);