Author: | Wojciech Muła |
---|---|

Added on: | 2008-06-21 |

Updated on: | 2016-05-03 (Dmitry Petrov noticed that alpha can have better resolution; new SWAR code; updated results for Core i5)
2016-03-02 (results from Core i5),
2015-12-26 (instruction pmaddubsw was introduced by
SSSE3, not SSE4; thanks a lot Harold!) |

Contents

Image crossfading is a kind of alpha blending where a final pixel is the result of linear interpolation of pixels from two images:

result_pixel = pixel1 * alpha + pixel2 * (1 - alpha)

where alpha lie in range [0, 1]. Of course when operating on "pixels" color components are considered; components are unsigned bytes.

SSSE3 introduced instruction `PMADDUBSW`. This instruction multiply
a destination vector of **unsigned** **bytes** by a source vector of
**signed** **bytes** — the result is a vector of signed words. Then
adjacent words are added with **signed** saturation (the same operation
as `PHADDSW`).

This is exactly what crossafading needs.

The obvious drawback is that instruction operates on signed values.
Because `alpha` must be positive, this reduces resolution of alpha from 8
to 7 bits. (was: *Because multiplication results are signed and then added,
the sum must not be greater than 32767 — this requirement reduces
resolution by another bit. Finally alpha must lie in range [0..63].*)
Dmitry Petrov pointed out that alpha can be a 7-bit value, as such
value never cause an overflow. Let's assume that both `pixel1` and
`pixel2` have maximum value, and check if following inequality is true:

(1) 255 * alpha + 255 * (127 - alpha) < 2^15 - 1 (2) 255 * 127 < 2^15 - 1 (3) 32385 < 32767

Obviously the inequality is true.

Prepare constant vector of

`127*alpha`/`127*(1-alpha)`:xmm6 = packed_byte(alpha, 127-alpha, alpha, 127-alpha, ..., alpha, 127-alpha)

Load 16 components from images X and Y:

movdqa (%eax), %xmm0 // xmm0 = packed_byte(rX1, gX1, bX1, _, rX2, gX2, bX2, _, ...) movdqa (%ebx), %xmm1 // xmm1 = packed_byte(rY1, gY1, bY1, _, rY2, gY2, bY2, _, ...)

Interleave components:

movdqa %xmm0, %xmm2 punpcklbw %xmm1, %xmm0 // xmm0 = packed_byte(rX1, rY1, gX1, gY1, bX1, bY2, ...) punpcklbw %xmm1, %xmm2 // xmm2 = packed_byte(rX8, rY8, gX10, gY10, bX11, bY11, ...)

Interpolate components with

`PMADDUBSW`:pmaddubsw %xmm6, %xmm0 // xmm0 = packed_byte(127*((rX1 * alpha) + rY1*(1 - alpha)), ...) pmaddubsw %xmm6, %xmm2 // xmm2 = packed_byte(127*((rX8 * alpha) + rY8*(1 - alpha)), ...)

Divide by 64 — now all words lie in range

`[0..255]`:psrlw $16, %xmm0 // xmm0 = packed_byte((rX1 * alpha) + rY1*(1 - alpha), ...) psrlw $16, %xmm2 // xmm2 = packed_byte((rX8 * alpha) + rY8*(1 - alpha), ...)

Pack words to bytes and save result:

packuswb %xmm2, %xmm0 movdqa %xmm0, (%ecx)

goto 2

Crossfading using SWAR approach is possible and on 64-bit machines is pretty fast. The core SWAR idea is to perform 4 multiplications using single multiplication instruction. When a register has following byte layout:

[00aa|00bb|00cc|00dd]

then multiplication by value in range [0..255] yields a vector of four 16-bit values.

Load four pixels:

A = imageA[i] [a1|b1|g1|r1|a0|b0|g0|r0] B = imageB[i] [a3|b3|g3|r3|a2|b2|g2|r2]

Isolate odd and even components (_ denotes 0):

A_lo = A & 0x00ff00ff00ff00ff [__|b1|__|r1|__|b0|__|r0] A_hi = (A >> 8) & 0x00ff00ff00ff00ff [__|a1|__|g1|__|a0|__|g0] B_lo = B & 0x00ff00ff00ff00ff [__|b3|__|r3|__|b2|__|r2] B_hi = (B >> 8) & 0x00ff00ff00ff00ff [__|a3|__|g3|__|a2|__|g2]

Multiply by

`alpha`and`255 - alpha`(values in range [0..255]):A_lo *= alpha [b1*a |r1*a |b0*a |r0*a ] A_hi *= alpha [a1*a |g1*a |a0*a |g0*a ] B_lo *= 255 - alpha [b3*a'|r3*a'|b2*a'|r2*a'] B_hi *= 255 - alpha [a3*a'|g3*a'|a2*a'|g2*a']

Add components, no overflow is possible. Now each components is premultiplied by 256:

R0 = A_lo + B_lo [b1+b3|r1+r3|b0+b2|r0+r2] R1 = A_hi + B_hi [a1+a3|g1+bg|a0+a2|g0+g2]

Divide by 256 and set proper byte position:

R0 = (R0 >> 8) & 0x00ff00ff00ff00ff [__|b1|__|r1|__|b0|__|r0] R1 = R1 & 0xff00ff00ff00ff00 [a1|__|g1|__|a0|__|g0|__]

Save the result:

data[i] = R0 | R1 [a1|b1|g1|r1|a0|b0|g0|r0]

Number of operations per two pixels:

- bit-and: 6,
- bit-or: 1,
- add: 2,
- mul: 4,
- shift: 3.

Sample program contains following procedures:

`x86`— C implementation (`blend`);`SSE4`— SSE approach: first components are extended from byte to word, then multiply with`PMULUHW`and finally packed back (`SSE4_blend`); I called it SSE4 because for extending instruction`PMOVZXBW`is used, however this task could be accomplished with an old good`PUNPCKxBW`;`SSE4-2`— implementation of above algorithm (`SSE42_blend`);`swar`— implementation of SWAR;`SSE(1)`— SSE translation of the SWAR algorithm;`SSE(2)`— SSE translation of the SWAR algorithm, with reduced numbers of shifts.

Program was compiled with following options:

gcc -O3 -Wall -pedantic -std=c99 mix_32bpp.c -o mix

Image 1024 x 768 pixels were crossfaded 100 times, total time is given

Results from Core2 Duo E8200 @ 2.6GHz.

procedure | time [us] | speedup |
---|---|---|

x86 | 745,702 | 1.00 |

SSE4 | 309,393 | 2.41 |

SSE4-2 | 309,167 | 2.41 |

- Speedup over the x86 code is around 2.4 times. However comparison shows that speed of both SSE procedures is equal.

Results from Core i5 M 540 @ 2.53GHz

procedure | time [us] | speedup |
---|---|---|

SSE(2) | 64,221 | 1.00 |

SSE(1) | 69,228 | 0.93 |

SSE4 | 55,530 | 1.16 |

SSE4-2 | 44,315 | 1.45 |

swar | 143,059 | 0.45 |

x86 | 206,044 | 0.31 |

- The SSE4-2 algorithm is almost 1.5 faster than good implementation of the naive algorithm (SSE(2)).
- The SWAR algorithm is almost 40% faster than the scalar version.

Results from Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

procedure | time [us] | speedup |
---|---|---|

SSE(2) | 31,727 | 1.00 |

SSE(1) | 33,684 | 0.94 |

SSE4 | 47,566 | 0.67 |

SSE4-2 | 20,939 | 1.52 |

swar | 71,826 | 0.44 |

x86 | 70,728 | 0.45 |

- As for Core i5, the SSE4-2 algorithm is almost 1.5 faster than SSE(2).
- But the SWAR algorithm is not faster than the scalar.

- And what is worth to note that
`gcc`invoked with the`-O3`switch produced quite fast x86 code. Without any optimization x86 code was almost 10 times slower! I am surprised, and in my private ranking of the best open source application`GCC`has gone up.