| Author: | Wojciech Muła |
|---|---|
| Added on: | 2014-01-26 |

SSE provides a little-known control register called **MXCSR**. This
register plays three roles:

- Controls calculations:
    - the "flush to zero" flag (described later);
    - the "denormals are zeros" flag (described later);
    - rounding mode (not covered in this text).
- Allows masking and unmasking of floating-point exceptions.
- Records information about floating-point errors. These flags are sticky, i.e. the programmer is responsible for clearing them.
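
The three roles map onto separate bit fields of the register. The sketch below decodes them; it assumes a compiler that ships `<xmmintrin.h>` (GCC, Clang, MSVC). The `MXCSR_*` constant names and the helper `describe_mxcsr` are made up for this illustration, while the bit positions follow the Intel manual.

```c
#include <stdio.h>
#include <xmmintrin.h>  /* _mm_getcsr */

/* MXCSR bit layout; the constant names are local to this sketch. */
#define MXCSR_FLAGS     0x003Fu  /* bits 0-5:   sticky error flags     */
#define MXCSR_DAZ       0x0040u  /* bit 6:      denormals are zeros    */
#define MXCSR_EXC_MASKS 0x1F80u  /* bits 7-12:  exception masks        */
#define MXCSR_ROUNDING  0x6000u  /* bits 13-14: rounding mode          */
#define MXCSR_FTZ       0x8000u  /* bit 15:     flush to zero          */

/* Hypothetical helper: print the individual fields of an MXCSR value. */
void describe_mxcsr(unsigned int csr) {
    printf("MXCSR = 0x%04x\n", csr);
    printf("  flush to zero:       %s\n", (csr & MXCSR_FTZ) ? "on" : "off");
    printf("  denormals are zeros: %s\n", (csr & MXCSR_DAZ) ? "on" : "off");
    printf("  rounding mode:       %u\n", (csr & MXCSR_ROUNDING) >> 13);
    printf("  exception masks:     0x%02x\n", (csr & MXCSR_EXC_MASKS) >> 7);
    printf("  sticky error flags:  0x%02x\n", csr & MXCSR_FLAGS);
}
```

Calling `describe_mxcsr(_mm_getcsr())` prints the current state. The power-on default of the register is `0x1f80`: all exceptions masked, round to nearest, both "to zero" flags off.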

Possible errors in SSE floating point calculations are:

- division by zero,
- underflow,
- overflow,
- operations on denormalized values,
- invalid operations (such as the square root of a negative number or zero divided by zero).

By default all floating-point exceptions in SSE are masked, i.e. errors are
not converted into hardware exceptions. When an exception is unmasked and the
corresponding error occurs, the standard `SIGFPE` signal is raised.

**Important**: even when errors are masked, an erroneous situation slows
calculations down significantly. So if a program slows down for no apparent
reason, the cause may be an error in SSE-related code, for example loading
"random" values into XMM registers.

Error flags in the MXCSR are always updated, regardless of which exceptions are reported.
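
The sticky behaviour can be observed from C. This is a minimal sketch, assuming that scalar `float` arithmetic is compiled to SSE instructions (the default on x86-64) and that `<xmmintrin.h>` is available. The function name is hypothetical.

```c
#include <xmmintrin.h>  /* _MM_GET_EXCEPTION_STATE, _MM_SET_EXCEPTION_STATE */

/* Returns 1 if dividing 1.0f by 0.0f set the sticky divide-by-zero
   flag in MXCSR, 0 otherwise. */
int division_by_zero_sets_flag(void) {
    /* volatile keeps the compiler from folding the division at build time */
    volatile float one = 1.0f, zero = 0.0f;
    volatile float result;

    _MM_SET_EXCEPTION_STATE(0);   /* clear all sticky error flags */
    result = one / zero;          /* exception is masked: +inf, no signal */
    (void)result;

    return (_MM_GET_EXCEPTION_STATE() & _MM_EXCEPT_DIV_ZERO) != 0;
}
```

The flag stays set after the function returns, until something clears it explicitly, for example another `_MM_SET_EXCEPTION_STATE(0)`.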

The "flush to zero" flag forces a result of 0 on **underflow** or **denormal**
errors and, more importantly, these errors then have **no impact** on
calculation speed.

For example, underflow occurs in the sample loop, because we multiply
`FLT_MIN` by `FLT_MIN` (`FLT_MIN` = 2^{−126}); the result, 2^{−252}, is too
small to be represented in single-precision floating point.

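    // packed_float(x) is a helper macro from the test programs
    // (presumably it expands to {x, x, x, x})
    float min_floats[4] = packed_float(FLT_MIN);

    void mulps_in_loop() {
        const int32_t iterations = 10000000;
        uint32_t dummy;

        __asm__ __volatile__(
            "movups min_floats, %%xmm0\n"
            "1:\n"
            "movaps %%xmm0, %%xmm1\n"
            "mulps  %%xmm1, %%xmm1\n"   // FLT_MIN * FLT_MIN => underflow
            "loop   1b\n"               // decrement ecx, repeat until zero
            : "=c" (dummy)
            : "c" (iterations)
            : "xmm0", "xmm1"            // registers modified by the loop
        );
    }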

On a Core2 the loop takes 0.796s to execute. With the "flush to zero" flag
set, the same loop completes in 0.023s, about **35 times faster**.
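
The flag can also be switched from C without inline assembly. The sketch below uses the `_MM_SET_FLUSH_ZERO_MODE` macro from `<xmmintrin.h>` and multiplies `FLT_MIN` by 0.5, which gives a result too small for a normalized `float`. It assumes scalar `float` arithmetic is compiled to SSE instructions and that the underflow exception is masked (the default). The function name is hypothetical.

```c
#include <float.h>      /* FLT_MIN */
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE, _mm_getcsr, _mm_setcsr */

/* Computes FLT_MIN * 0.5f with "flush to zero" switched on or off,
   then restores the caller's MXCSR. */
float half_of_flt_min(int flush_to_zero) {
    const unsigned int saved = _mm_getcsr();
    /* volatile forces the multiplication to happen at run time,
       between the two MXCSR writes */
    volatile float tiny = FLT_MIN, half = 0.5f;
    volatile float result;

    _MM_SET_FLUSH_ZERO_MODE(flush_to_zero ? _MM_FLUSH_ZERO_ON
                                          : _MM_FLUSH_ZERO_OFF);
    result = tiny * half;
    _mm_setcsr(saved);

    return result;
}
```

`half_of_flt_min(0)` returns a tiny non-zero value, while `half_of_flt_min(1)` returns exactly zero. Saving and restoring the register keeps the setting from leaking into unrelated code.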

A denormalized floating-point number is a very small number with the value
(0 + *fraction*)⋅2^{−126}, i.e. its implicit leading bit is 0 rather than 1.
Such a value appears, for example, when we divide `FLT_MIN` by 2.

There is a catch: if the result of an operation on normalized numbers is a
denormalized value, **it is not an SSE error**. The error is reported only
when one of the operands is **already denormalized**.

So, where is the problem? If a result is denormalized, speed is noticeably degraded, but we can't detect the point where the denormalization occurred. Detection is possible only when the denormalized value is used in a subsequent calculation.
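
Since the hardware does not flag a freshly produced denormalized result, one workaround when hunting for such values is to inspect results in software. The sketch below is a hypothetical debugging helper, not part of the original test programs. It checks the bit pattern directly: a denormalized `float` has an all-zero exponent field and a non-zero fraction.

```c
#include <stddef.h>  /* size_t */
#include <stdint.h>  /* uint32_t */
#include <string.h>  /* memcpy */

/* Returns 1 if the value is denormalized: exponent field is zero,
   fraction field is not. Uses no floating-point comparisons, so the
   answer does not depend on the MXCSR settings. */
int is_denormal(float value) {
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return (bits & 0x7F800000u) == 0 && (bits & 0x007FFFFFu) != 0;
}

/* Counts denormalized values in a buffer of results. */
size_t count_denormals(const float *values, size_t count) {
    size_t found = 0;
    for (size_t i = 0; i < count; i++) {
        found += (size_t)is_denormal(values[i]);
    }
    return found;
}
```

Running `count_denormals` over intermediate buffers narrows down which stage of a computation first produces denormalized values.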

**MXCSR** has the "denormals are zeros" flag, which makes the processor treat
denormalized operands as 0, but it **does not prevent** an operation on
normalized values from producing a denormalized result.
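
Both halves of that statement can be checked from C. The sketch below sets bit 6 of MXCSR by hand. The name `MXCSR_DAZ` and the helper functions are made up for this illustration. It assumes a CPU that supports the flag (every x86-64 processor does) and scalar `float` arithmetic compiled to SSE instructions. Results are compared by bit pattern, because a floating-point comparison would itself be subject to the flag.

```c
#include <float.h>      /* FLT_MIN */
#include <stdint.h>     /* uint32_t */
#include <string.h>     /* memcpy */
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, _MM_FLUSH_ZERO_MASK */

#define MXCSR_DAZ 0x0040u  /* bit 6 of MXCSR; name local to this sketch */

static uint32_t float_bits(float value) {
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* With "denormals are zeros" on and "flush to zero" off:
   out[0] = bit pattern of FLT_MIN * 0.5f
   out[1] = bit pattern of (FLT_MIN * 0.5f) + 0.0f */
void daz_demo(uint32_t out[2]) {
    const unsigned int saved = _mm_getcsr();
    volatile float tiny = FLT_MIN, half = 0.5f, zero = 0.0f;
    volatile float product, sum;

    _mm_setcsr((saved | MXCSR_DAZ) & ~_MM_FLUSH_ZERO_MASK);
    product = tiny * half;  /* normalized operands: result is denormalized */
    sum = product + zero;   /* denormalized operand is read as 0 */
    _mm_setcsr(saved);

    out[0] = float_bits(product);
    out[1] = float_bits(sum);
}
```

After the call, `out[0]` is `0x00400000` (the denormalized value 2^{−127}) and `out[1]` is `0`.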

Let us summarize this with the following program:

- first `FLT_MIN` is multiplied by 0.5, resulting in a denormalized value;
- then this value is added to 0.

    float tiny_value[4]    = packed_float(FLT_MIN);
    float large_divisor[4] = packed_float(0.5);
    float final_value[4];

    void test_loop() {
        const int32_t iterations = 10000000;
        uint32_t dummy;

        __asm__ __volatile__(
            "1:\n"
            "movups tiny_value, %%xmm0\n"
            "movups large_divisor, %%xmm1\n"
            "pxor   %%xmm2, %%xmm2\n"
            "mulps  %%xmm1, %%xmm0\n"   // FLT_MIN * 0.5 => denormalized number
            "addps  %%xmm2, %%xmm0\n"   // denormalized + 0.0 => denormal exception
            "loop   1b\n"
            "movups %%xmm0, final_value\n"
            : "=c" (dummy)
            : "c" (iterations)
            : "xmm0", "xmm1", "xmm2", "memory"  // registers and memory modified
        );
    }

- With default settings, the execution time is 1.841s and the final value is denormalized (5.877472e-39).
- With the "denormals are zeros" flag set, the execution time drops to 0.858s (about 53% shorter) and, as we would expect, the result is zero.
- With the "flush to zero" flag set, the execution time drops to 0.121s (about 93% shorter) and the final value is also zero.

The test programs are available.