| Author: | Wojciech Muła |
|---|
Table of contents
Suppose we need to get mask when nonnegative argument is greater then some constant value; in other words, we want to evaluate following expression:
if x > const_n then mask := 0xffffffff; else mask := 0x00000000;
Portable branchless solution:
The key to understand this trick is binary form of M: 0111..1111zzzz, where z is 0 or 1 depending on n value. When x is greater then n, then x + M has form 1000..000zzzz, because carry bit propagate through series of ones to k-th position of result.
Real world example — branchless converting hex digit to ASCII (M=0x7ffffff6 for k=31 and n=9).
; input: eax - hex digit
; output: eax - ASCII letter (0-9, A-F or a-f)
; destroys: ebx
andl 0xf, %eax
leal 0x7ffffff6(%eax), %ebx ; MSB(ebx)=1 when eax >= 10
sarl $31, %ebx ; ebx - mask
andl $7, %ebx ; ebx = 7 when eax >= 10 (for A-F letters)
;andl $39, %ebx ; ebx = 39 when eax >= 10 (for a-f letters)
leal '0'(%eax, %ebx), %eax ; eax = '0' + eax + ebx => ASCII letter
It is also possible to convert 4 hex digits in parallel using similar algorithm, but input data have to be correctly prepared. Moreover generating mask requires 3 instructon and one extra register (in scalar version just one arithmetic shift). I guess it wont be fast on x86, maybe this approach would be good for SIMD code, where similar code transforms more bytes at once.
; input: eax - four hex digits in form [0a0b0c0d]
; output: eax - four ascii letters
; destroys: ebx, ecx
leal 0x76767676(%eax), %ebx ; MSB of each byte is set when corresponding eax byte is >= 10
; (here: 0x7f - 9 = 0x76)
andl $0x80808080, %ebx
movl %ebx, %ecx
shrl $7, %ebx
subl %ebx, %ecx ; ecx - byte-wise mask
;andl $0x07070707, %ecx ; for ASCII letters A-F
andl $0x27272727, %ecx ; for ASCII letters a-f
leal 0x30303030(%eax, %ecx), %eax ; ecx - four ascii letters
See also: SSSE3: printing hex values (weird use of PSHUFB instruction)
In-place reversing is performed by following algorithm:
// n - table size
int i = 0; // index1
int j = n - 1; // index2
int c = n/2; // counter
int x, y;
while (c--)
x = table[i];
y = table[j];
table[i] = y;
table[j] = x;
i++;
j--;
c--;
}
We can use some CPU-specific instruction to speedup this algorithm.
Program can load part of table into registers, then swap bytes in register using one or more instructions and then store registers content at new position. If size of table is not multiply of register size, then few bytes in middle of a table have to be swapped using default algorithm.
Modified algorithm outline:
// n - table size
// k - register size - k=4 for BSWAP and k=16 for PSHUFB
int i = 0; // index1
int j = n - k - 1; // index2 !!!
int c = n/(2*k); // counter !!!
register char x[k], y[k]; // hardware registers
while (c--)
x = table[i]; // load k bytes
y = table[j]; // ...
hardware_swap(x); // reverse these bytes
hardware_swap(y); // ...
table[i] = y; // store k bytes
table[j] = x; // ..
i += k;
j -= k;
c--;
}
if (n % (2*k) != 0)
swap rest byte-by-byte
Function hardware_swap(x) is equivalent to single instruction: BSWAP reg or PSHUFB xmm; SSE2 code requires more instructions:
movaps %xmm0, %xmm1 psrlw $8, %xmm0 psllw $8, %xmm1 por %xmm1, %xmm0 ; swap adjacent bytes pshufd $0x4e, %xmm0, %xmm0 ; swap order of dwords pshuflw $0x1b, %xmm0, %xmm0 ; swap order of words in lower pshufhw $0x1b, %xmm0, %xmm0 ; .. and higher half of register
Test program contains several procedures that implement modified algorithm:
Important note: SSE implementations use safe MOVUPS instructions for transfers data from/to memory. This instruction is much slower then MOVAPS even if accessing aligned data!
Two scenarios of test were considered:
Procedures were tested on following computers:
Quick results discussion:
Method from Bit Twiddling Hacks: (x != 0) && (x & (x-1) == 0). GCC compiles this to following code:
; input/ouput: eax
; destroys: ebx
test %eax, %eax ; x == 0?
jz 1f
leal -1(%eax), %ebx ; ebx := x-1
test %eax, %ebx ; ZF := (eax & ebx == 0)
setz %al
movzx %al, %eax ; eax := ZF
1:
We can use also BSF and BSR instructions, that determines position of first and last bit=1. If number is power of 2, then just one bit is set, and thus these positions are equal. BSx sets also ZF flag if input value is zero.
; input/output: eax
; destroys: ebx, edx
bsf %eax, %ebx ; ebx := LSB's position if eax != 0, ZF = 1 if eax = 0
jz 1f
bsr %eax, %edx ; edx := MSB's position
cmp %ebx, %edx ; ZF := (ebx = edx)
setz %al
movzx %al, %eax ; eax := ZF
1:
Suppose we have to exchange (or just move) two registers A and B:
If C is 0, then A and B left unchanged, else A and B are swapped. If only conditional move from B to A is needed, then step 4th have to be skipped.
Here is a sample x86 code, where condition is value of CF:
sbb edx, edx ; part of step 2. - edx = 0xffffff if CF=1, 0x000000 otherwise mov ecx, eax xor ecx, ebx ; step 1 and ecx, edx ; completed step 2. - now C is 0 or (A xor B) xor eax, ecx ; step 3 xor ebx, ecx ; step 4
Branchless moves are possible in Pentium Pro and higher with instructions cmovcc.
See also XOR linked list.
The idea is quite simple: we do not have a single stl::map<string, something>, but vector of maps, indexed with O(1) time — each map store keys sharing certain properties. Drawback: additional memory.
I've tested following grouping schemes:
Third is the fastest — around 60% faster then plain std::map from gcc (red-black tree).
Tests: my program read plain text (I've used The Illiad from http://gutenberg.org), text is splitted into words (~190000) and then each words is inserted into dictonary (~28000 distinct words); then the same words are searched in dictionaries. Table below summarizes results on my computer (gcc 4.3.4 from cygwin).
| data struct | running time [ms] | speedup [%] | ||||
|---|---|---|---|---|---|---|
| min | avg | max | min | avg | max | |
| inserting | ||||||
| std::map | 269 | 287 | 355 | 100 | 100 | 100 |
| first char | 218 | 241 | 395 | 81 | 84 | 111 |
| length | 218 | 240 | 345 | 81 | 84 | 97 |
| len./char | 165 | 172 | 207 | 61 | 60 | 58 |
| finding | ||||||
| std::map | 295 | 322 | 483 | 100 | 100 | 100 |
| first char | 243 | 263 | 460 | 82 | 82 | 95 |
| length | 238 | 248 | 292 | 80 | 77 | 60 |
| len./char | 184 | 190 | 241 | 62 | 60 | 50 |
Download test program.
Problem: calculate value of sign(x):
My solution do not involve any hardaware specific things like ALU flags nor special instructions — just plain AND, OR, shifts.
; input: eax = X movl %eax, %ebx sarl $31, %eax // eax = -1 if X less then zero, 0 otherwise andl $0x7fffffff, %ebx addl $0x7fffffff, %ebx // MSB is set if any lower bits were set shrl $31, $ebx // eax = +1 if X greater then zero, 0 otherwise orl %ebx, %eax // eax = result
C99 implementation:
int32_t sign(int32_t x) {
int32_t y;
y = (x & 0x7fffffff) + 0x7fffffff;
return (x >> 31) | ((uint32_t)y >> 31);
}
This is continuation of subproblem from previous post: we have a word (byte, dword, whatever) and want to fill it with selected bit.
uint32_t fill1(uint32_t a, int bit) {
uint32_t b;
b = a = a & (1 << bit);
a <<= 31 - bit;
b >>= bit;
return (a - b) | a;
}
uint32_t fill2(uint32_t a, int bit) {
a <<= 31 - bit;
return (int32_t)(a) >> 31;
}
On processor 386+ we can clone carry flag (CF) with sbb reg, reg and with bt reg, reg copy selected bit from reg to CF.
uint32_t fill386(uint32_t a, int bit) {
uint32_t result;
__asm__ __volatile__ (
"bt %1, %0\n"
"sbb %0, %0\n"
: "=r" (result)
: "r" (bit), "0" (a)
);
return result;
}
Method presented here allows to get any bit permutation, transposition is just one of possible operations. Lookup-based approach would be faster, but algorithm is worth to (re)show.
Algorithm outline for 8-byte vector (with SSE instruction it is possible to get 2 operations in parallel):
Here is my old MMX code (polish text); below SSE/SSE5 implementation details.
Ad 1. Series of punpcklbw/punpcklwb/shufps or pshufb if CPU supports SSSE3.
# 1.1 movd %eax, %xmm0 shufps $0x00, %xmm0, %xmm0 punpcklbw %xmm0, %xmm0 punpcklwd %xmm0, %xmm0 # 1.2 pxor %xmm1, %xmm1 movd %eax, %xmm0 pshufb %xmm1, %xmm0
Ad 2. Simple pand with mask packed_qword(0x8040201008040201).
pand MASK1, %xmm0
Ad 3. If plain SSE instructions are supported this step require some work. First each bit is populated to fill whole byte (using pcmpeq — we get negated result), then mask bits on desired positons.
SSE5 has powerful instruction protb that can do perform rotation of each byte with independent amount — so in this case just one instruction is needed.
# SSE pcmpeq %xmm1, %xmm0 pandn MASK2, %xmm0 # pandn - to negate # SSE5 protb ROT, %xmm0, %xmm0
Ad 4. Since bits are placed on distinct positions, we can use instruction psadbw, that calculate horizontal sums of bytewide differences from two registers (separately for low and high registers half). If one register is full of zeros, we get sum of bytes from other register.
psadbw %xmm1, %xmm0 movd %xmm0, %eax
Depending on instruction set three (SSE) or two (SSE5) additional tables are needed.
Suppose that database stores some kind of dictionary and user picks some items, but wants to keep order. For example dictionary has entries with id=0..10, and user picked 9, 2, 4 and 0. This simple query does the job (query splitted):
foo = SELECT (ARRAY[9,2,4,0])[i] AS index, i AS ord FROM generate_series(1, 4) AS i SELECT * FROM dictionary INNER JOIN (foo) ON dictionary.id=foo.index ORDER BY foo.ord
With simple program following times were measured:
Function defined in scched.h returns number of CPU that runs calling thread or process.
There is no such instruction — CVTDQ2PS converts signed 32-bit ints. Solution: first zero MSB, such number is never negative in U2, so mentioned instruction could be used. Then add 232 if MSB is set.
Main procedure from sample program:
float CONST[4] SIMD_ALIGN = packed_float((float)((uint32_t)(1 << 31))); /* 2^31 */
uint32_t MASK_0_30[4] SIMD_ALIGN = packed_dword(0x7fffffff);
uint32_t MASK_31[4] SIMD_ALIGN = packed_dword(0x80000000);
void convert_uint32_float(uint32_t in[4], float out[4]) {
__asm__ volatile (
"movdqu (%%eax), %%xmm0 \n"
"movdqa %%xmm0, %%xmm1 \n"
"pand MASK_0_30, %%xmm0 \n" // xmm0 - mask MSB bit - never less then zero in U2
"cvtdq2ps %%xmm0, %%xmm0 \n" // convert this value to float
"psrad $32, %%xmm1 \n" // populate MSB in higher word (enough to mask CONST)
"pand CONST, %%xmm1 \n" // xmm1 = MSB set ? float(2^31) : float(0)
"addps %%xmm1, %%xmm0 \n" // add 2^31 if MSB set
"movdqu %%xmm0, (%%ebx) \n"
: /* no output */
: "a" (in),
"b" (out)
);
}
Download: sse-uint32-float.c, sse-aux.c
Branch-less x86 code:
movl %eax, %ebx sarl $32, %ebx ; fill ebx with sign bit xorl %ebx, %eax ; negate eax (if negative) subl %ebx, %eax ; increment eax by 1 (if negative)
SSE2:
pshufd $0b11110101, %xmm0, %xmm1 ; populate dwords 3 and 1 psrad $32, %xmm1 ; fill quad words with sign bit pxor %xmm1, %xmm0 ; negate (if negative) psubq %xmm1, %xmm0 ; increment (if negative)
RDTSC is incremented with bus-clock cycles, and then multiplied by core-clock/bus-clock ratio. From programmer view, RDTSC counter is incremented by value greater then 1, for example on C2D E8200 it is 8.
Latency of RDTSC in Pentium4 is about 60-120 cycles, on AMD CPU around 6 cycles.
asm(
"..."
: "+a" (var)
);
This won't work, GCC complains:
asm(
"..."
: /* no output */
: "a" (var)
: "eax"
);
We can declare temporary variable, and treat as read-write one:
int tmp_var = var;
asm(
"..."
: "+a" (tmp_var)
);
If there are more registers, or var shouldn't be changed we can declare common dummy variable:
int dummy __attribute__((unused));
asm(
"..."
: "=a" (dummy),
"=b" (dummy),
"=c" (dummy)
: "a" (var_or_value1),
"b" (var_or_value2),
"c" (var_or_value2)
);
Found on Wikipedia:
gcc -dM -E - < /dev/null