Author: | Wojciech Muła |
---|---|

Added on: | 2014-10-12 |

Contents

Naive parsing of unsigned decimal numbers can be coded as:

uint32_t parse(char* s) { uint32_t result = 0; while (*s) { const uint8_t digit = *s++; result = result * 10 + (digit - '0'); } return result; }

The procedure `parse` does not check validity of a string nor
check for overflow — these problems won't be discussed in
this text.

Processing single letter require just 3 operations:

- subtract,
- addition,
- multiplication by 10.

On x86 multiplication by 10 is cheap; multiplication
`x * 10` is equivalent to `x << 3 + x << 1`. Shift
by 8 is coded with `LEA` (it's executed by addressing
unit, not ALU), and shift by 1 is simple addition
or another `LEA`.

In SWAR version we'll process 4 letters at once. For shorter strings naive fallback have to be used or proper mask applied on dword.

The common phase of all SWAR versions is converting ASCII to digit values, this is simple subtraction.

input — 4 ASCII letters:

x = packed_byte(| '1' | '2' | '7' | '0' |)

subtract vector of '0':

x' = packed_byte(| '1' | '2' | '7' | '0' |) - packed_byte(| '0' | '0' | '0' | '0' |) = packed_byte(| 1 | 2 | 7 | 0 |)

Suppose we have just two digits `x` and `y` stored in separate
bytes:

w = packed_byte(| 0 | 0 | x | y |)

Value of `w` is `256*x + y`, but we want to get `10*y + x`.
By multiplication `w` by 2561 we get this value in the higher
byte. Magic? No, 2561 = 256 * 10 + 1, and:

- 1 — keeps
`x`on the original position; - 2560 — moves
`y`**multiplied by 10**on the second byte; - adds them together.

Proof:
(256 ⋅ *x* + *y*) ⋅ (1 + 10256) = (256 ⋅ *x* + *y*) + (655360 ⋅ *x* + 256 ⋅ 10 ⋅ *y*).
Now reject components less than 256 and greater than 65535:
(*x* ⋅ 256) + (*y* ⋅ 2560) = 256 ⋅ (10 ⋅ *y* + *x*).

Full expression:

v = ((w * 2561) >> 8) & 0xff

The same trick could be used to join 2-digit values `ab` & `cd`
stored on 1st and 3rd byte:

Z = packed_byte(| 0 | ab | 0 | cd |)

The "magic" multiplier is 6553601, result is `100*cd + ab`
saved in the higher word.

input — 4 digits:

t1 = packed_byte(| d | c | b | a |)

form two words suitable to make 2-digits values:

t2a = packed_byte(| 0 | 0 | b | a |) t2b = packed_byte(| 0 | 0 | d | c |)

form 2-digits values by multiplication

`t2{a,b}`by 2561:t3a = packed_byte(| ? | ? | 10*a + b | ? |) t3b = packed_byte(| ? | ? | 10*c + d | ? |)

prepare dword from

`t3{a,b}`to get 4-digits value:t4 = packed_byte(| 0 | 10*a + b | 0 | 10*c + d |)

form 4-digit value by multiplication

`t3`by 6553601:t5 = packed_word(| 1000*c + 100* d + 10*a + b | ? |)

extract higher word from

`t5`.

#define packed32(b) (uint8_t(b) * 0x01010101) uint32_t parse1(char* s) { const uint32_t input = *reinterpret_cast<uint32_t*>(s); const uint32_t t1 = input - packed32('0'); const uint32_t t2a = t & 0xffff; const uint32_t t2b = t >> 16; const uint32_t t3a = ((t2a * (2561)) >> 8) & 0xff; const uint32_t t3b = ((t2b * (2561)) >> 8) & 0xff; const uint32_t t4 = t3a | (t3b << 16); return (word * (6553601) >> 16); }

Another approach to making 2-digit values:

input — 4 digits:

t1 = | d | c | b | a |

mask 1st and 3rd digit:

t2 = | d | 0 | b | 0 |

move one digit left and multiply by 10:

t3 = | 0 | d*10 | 0 | b*10 |

This can be done without multiplication. We calculate

`(x * 10) >> 8`, multiplication is replaced by shifts`((x << 3) + (x << 1)) >> 8`, after simplify shifts:`(x >> (8-3))`+`(x >> (8-1))`.add

`t1`and`t3`:t4 = | d | c + d*10 | b | a + b*10 |

mask 1st & 3rd digits again:

t5 = | 0 | c + d*10 | 0 | a + b*10 |

convert

`t5`to 4-digits value.

uint32_t parse2(char* s) { const uint32_t input = bswap(*reinterpret_cast<uint32_t*>(s)); const uint32_t t1 = input - packed32('0'); const uint32_t t2 = t1 & 0xff00ff00; const uint32_t t3 = (t2 >> (8-1)) + (t2 >> (8-3)); const uint32_t t4 = t1 + t3; const uint32_t t5 = t4 & 0x00ff00ff; // reversed order, so multiplier is diffrent return (t5 * (100 + 65536)) >> 16; }

Another approach to making 2-digit values:

input — 4 digits:

t1 = | d | c | b | a |

multiply by 10:

t2 = | d*10 | c*10 | b*10 | a*10 |

move digits one position right (shift dword right by 8 bits):

t3 = | 0 | d*10 | c*10 | b*10 |

add

`t2`and`t3`:t4 = | d*10 | c + d*10 | b + c*10 | a + b*10 |

mask unneeded bytes 1 and 3:

t5 = | 0 | c + d*10 | 0 | a + b*10 |

Now word `t5` could be converted as in the version 2.

uint32_t parse3(char* s) { const uint32_t input = bswap(*reinterpret_cast<uint32_t*>(s)); const uint32_t t1 = input - packed32('0'); const uint32_t t2 = (t1 * 10) >> 8; const uint32_t t3 = t1 + t2; const uint32_t t4 = t3 & 0x00ff00ff; return (t4 * (100 + 65536)) >> 16; }

Number of operations required to convert 4 ASCII digits.

operation | naive | version 1 | version 2 | version 3 |
---|---|---|---|---|

bswap | 1 | 1 | ||

bit-or | 1 | |||

bit-and | 3 | 2 | 1 | |

shift | 5 | 3 | 2 | |

add | 4 | 2 | 1 | |

sub | 1 | 1 | 1 | 1 |

mult | 3 | 1 | 1 | |

mult-by-10 | 4 | 1 | ||

total |
9 | 13 | 10 | 6 |

Sample code is available on gitub.