Author: | Wojciech Muła |
---|---|

Added on: | 2014-10-15 |

Contents

SSE procedure is able to convert two 8-digit numbers. The main
instruction used in converting to numeric value is `PMADDWD`.

This instruction is specialized and I guess isn't often used.
`PMADDWD` operates on **packed words** and performs following
algorithm:

temp = array of 8 signed dwords -- 1. multiply 16-bit signed numbers vertically src = packed_word(...) dst = packed_word(...) for i in 0 .. 7 loop -- temp is 32-bit signed number temp[i] := signed_multiplication(src[i], dst[i]) end loop -- 2. add adjacent 32-bit words of temp and save result to src -- now src has type packed_dword for i in 0 .. 3 loop src[i] = temp[2 * i] + temp[2 * i + 1] end loop

Load 16 ASCII digits in order abcdefghijklmnop, and convert to bytes. Each byte a..p is in range 0..9:

t1 = packed_byte(| p | o | n | m | l | k | j | i | h | g | f | e | d | c | b | a |)) digit weight 10^0 . . . . . . . . . . . . . . . . . . . . . . . . . . . 10^16

Split even and odd digits:

t2_even = packed_byte(| 0 | o | 0 | m | 0 | k | 0 | i | 0 | g | 0 | e | 0 | c | 0 | a |)) t2_odd = packed_byte(| 0 | p | 0 | n | 0 | l | 0 | j | 0 | h | 0 | f | 0 | d | 0 | b |))

Calculate partial values of even digits:

t3_even = pmaddwd(t2_even, {1000, 10, 1000, 10, 1000, 10, 1000, 10}) = packed_dword(| 1000*m + 10*o | 1009*i + 10*k | 1000*e + 10*g | 1000*a + 10*c |)

as 4-digit decimals:

t3_even = packed_dword(| m0o0 | i0k0 | e0g0 | a0c0 |)

Similarly calculate partial values of odd digit:

t4_odd = pmaddwd(t2_odd, {100, 1, 100, 1, 100, 1, 100, 1}) = packed_dword(| 100*n + p | 100*j + l | 100*f + h | 100*b + d |)

as 4-digit decimals:

t4_odd = packed_dword(| 0n0p | 0j0l | 0f0h | 0b0d |)

Add partial results from step 3 and 4 (shown as decimals):

t5 = t3_even + t4_odd = = packed_dword(| m0o0 | i0k0 | e0g0 | a0c0 |) + packed_dword(| 0n0p | 0j0l | 0f0h | 0b0d |) = packed_dword(| mnop | ijkl | efgh | abcd |)

Last step is to form 8-digits partial results. Since numbers are less than 10000, higher words of

`t5`are zero:t5 = packed_dword(| 0000 | mnop | 0000 | ijkl | 0000 | efgh | 0000 | abcd |) t6 = t5 | (t5 >> 16) = packed_dword(| 0000 | mnop | mnop | ijkl | ijkl | efgh | efgh | abcd |) ^^^^^^^^^^^ ^^^^^^^^^^^ these dwords are processed later

Calculate results:

t7 = pmaddwd(t6, {10000, 1, 0, 0, 10000, 1, 0, 0}) = packed_dword(| 0 | ijklmnop | 0 | abcdefgh |)

64-bit multiplication is available in SSE starting from SSE4.1, so portable way to get final 64-bit value requires a CPU multiplication.

In SSE4.1 conversion of **32-digit** numbers is very easy. String is split to
16-digit substrings and each substring is processed in steps 1 .. 5 of the
algorithm. After 5th step we have two values:

t5_lo = packed_dword(| 0000 | mnop | 0000 | ijkl | 0000 | efgh | 0000 | abcd |) t5_hi = packed_dword(| 0000 | MNOP | 0000 | IJKL | 0000 | EFGH | 0000 | ABCD |)

Steps 6 and 7 are different:

SSE4.1 supports

`PACKUSDW`, thanks to that`t5_lo`and`t5_hi`could be joined with single instruction:t6 = packusdw(t5_hi, t5_lo) = packed_word(| MNOP | IJKL | EFGH | ABCD | mnop | ijkl | efgh | abcd |)

Calculate results:

t7 = pmaddwd(t6, {10000, 1, 10000, 1, 10000, 1, 10000, 1}) = packed_dword(| IJKLMNOP | ABCDEFGH | ijklmnop | abcdefgh |)

SSE4.1 supports also

`PMULDQ`, so producing 64-bit values in SSE registers is possible:m1 = pmuldq(t7, {100000000, 100000000} = packed_qword(| ABCDEFGH00000000 | abcdefgh00000000 |) m2 = packed_dword(| 0 | t7[3] | 0 | t7[1] |) = packed_qword(| 0 | IJKLMNOP | 0 | ijklmnop |) t8 = m1 + m2 = packed_qword(| ABCDEFGHIJKLMNOP | abcdefgijklmnop |)

However calculating 128-bit value require CPU instructions.

Sample code is available on gitub. Only plain SSE implementation
is available (see file `parse.sse.c`).