| Author: | Wojciech Muła |
|---|---|
| Added on: | 21.06.2008 |
Image crossfading is a kind of alpha blending where the final pixel is the result of linear interpolation between pixels from two images:
result_pixel = pixel1 * alpha + pixel2 * (1 - alpha)
where alpha lies in the range [0, 1]. Of course, "pixels" here means individual color components, which are unsigned bytes.
SSSE3 introduced the PMADDUBSW instruction. It multiplies the destination vector of unsigned bytes by the source vector of signed bytes, producing a vector of signed words. Adjacent words are then added with signed saturation (the same operation as PHADDSW).
This is exactly what crossfading needs.
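To make the instruction's arithmetic concrete, here is a minimal scalar sketch of a single 16-bit output lane. The function names `saturate_i16` and `pmaddubsw_lane` are hypothetical and exist only for this illustration; the real instruction computes eight such lanes at once.

```c
#include <stdint.h>

/* Clamp a 32-bit value to the signed 16-bit range. */
static int16_t saturate_i16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Scalar model of one output word of PMADDUBSW (hypothetical helper):
   u0, u1 are adjacent unsigned bytes of the destination operand,
   s0, s1 are the matching signed bytes of the source operand. */
static int16_t pmaddubsw_lane(uint8_t u0, uint8_t u1, int8_t s0, int8_t s1) {
    int32_t sum = (int32_t)u0 * s0 + (int32_t)u1 * s1;
    return saturate_i16(sum);   /* signed saturation of the pair sum */
}
```

With `u0`, `u1` taken from the two images and `s0`, `s1` holding the two weights, one lane is already an interpolated component scaled by the sum of the weights.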
The obvious drawback is that the instruction operates on signed values. Alpha must therefore be non-negative, which reduces its resolution from 8 to 7 bits. Because the signed products are then added, the sum must not be greater than 32767; this requirement costs one more bit. In the end alpha must lie in the range [0..63].
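The range argument can be checked with a small fixed-point sketch. The name `blend_component` and the parameter `alpha64` (alpha scaled by 64) are assumptions made for this example. The sketch also accepts the value 64, which still fits in a signed byte; that goes slightly beyond the range stated above and is likewise an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Blend two color components with integer weights that sum to 64.
   alpha64 is alpha scaled to an integer (hypothetical parameter name). */
static uint8_t blend_component(uint8_t p1, uint8_t p2, unsigned alpha64) {
    assert(alpha64 <= 64);
    /* Worst case: 255 * 64 = 16320, safely below 32767. */
    unsigned sum = p1 * alpha64 + p2 * (64 - alpha64);
    return (uint8_t)(sum >> 6);   /* divide by 64 */
}
```

Because the two weights always sum to 64, the weighted sum never exceeds 255 * 64, so the signed saturation of the instruction is never triggered.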
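Prepare a constant vector of interleaved weights 64*alpha and 64*(1-alpha), written below as alpha and 64-alpha with alpha already scaled to an integer: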
xmm6 = packed_byte(alpha, 64-alpha, alpha, 64-alpha, ..., alpha, 64-alpha)
Load 16 components from images X and Y:
movdqa (%eax), %xmm0 // xmm0 = packed_byte(rX1, gX1, bX1, _, rX2, gX2, bX2, _, ...)
movdqa (%ebx), %xmm1 // xmm1 = packed_byte(rY1, gY1, bY1, _, rY2, gY2, bY2, _, ...)
Interleave components:
movdqa %xmm0, %xmm2
punpcklbw %xmm1, %xmm0 // xmm0 = packed_byte(rX1, rY1, gX1, gY1, bX1, bY1, ...)
punpckhbw %xmm1, %xmm2 // xmm2 = packed_byte(rX3, rY3, gX3, gY3, bX3, bY3, ...)
Interpolate components with PMADDUBSW:
pmaddubsw %xmm6, %xmm0 // xmm0 = packed_word(64*(rX1*alpha + rY1*(1 - alpha)), ...)
pmaddubsw %xmm6, %xmm2 // xmm2 = packed_word(64*(rX3*alpha + rY3*(1 - alpha)), ...)
Divide by 64 — now all words lie in range [0..255]:
psrlw $6, %xmm0 // xmm0 = packed_word(rX1*alpha + rY1*(1 - alpha), ...)
psrlw $6, %xmm2 // xmm2 = packed_word(rX3*alpha + rY3*(1 - alpha), ...)
Pack words to bytes and save result:
packuswb %xmm2, %xmm0
movdqa %xmm0, (%ecx)
Advance the pointers by 16 bytes and repeat from the load step.
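The whole loop can be modelled in portable scalar C, which is handy for checking a vectorized version against. This is only a sketch of the steps above, not the code from mix_32bpp.c; the names `crossfade_block`, `crossfade`, `VEC_BYTES` and `alpha64` are hypothetical, and the buffer length is assumed to be a multiple of 16 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { VEC_BYTES = 16 };   /* one XMM register worth of components */

/* Scalar model of one loop iteration: blends 16 components of X and Y. */
static void crossfade_block(const uint8_t *x, const uint8_t *y,
                            uint8_t *out, int8_t alpha64) {
    uint8_t inter[2 * VEC_BYTES];  /* low half models xmm0, high half xmm2 */
    int8_t weights[2] = { alpha64, (int8_t)(64 - alpha64) };

    /* punpcklbw / punpckhbw: interleave bytes of X and Y */
    for (int i = 0; i < VEC_BYTES; i++) {
        inter[2 * i]     = x[i];
        inter[2 * i + 1] = y[i];
    }
    for (int i = 0; i < VEC_BYTES; i++) {
        /* pmaddubsw: unsigned byte times signed byte, adjacent products added;
           saturation is omitted because the sum is at most 255 * 64 */
        int word = inter[2 * i] * weights[0] + inter[2 * i + 1] * weights[1];
        word >>= 6;                                   /* psrlw $6 */
        out[i] = (uint8_t)(word > 255 ? 255 : word);  /* packuswb */
    }
}

/* Process a buffer whose length n is a multiple of 16 bytes. */
static void crossfade(const uint8_t *x, const uint8_t *y, uint8_t *out,
                      size_t n, int8_t alpha64) {
    assert(n % VEC_BYTES == 0);
    assert(alpha64 >= 0 && alpha64 <= 64);
    for (size_t i = 0; i < n; i += VEC_BYTES)
        crossfade_block(x + i, y + i, out + i, alpha64);
}
```

The clamp in the pack step never fires here, since every word is already at most 255 after the shift; it is kept only to mirror what PACKUSWB does.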
Sample program mix_32bpp.c contains three procedures, which appear as x86, SSE4 and SSE4-2 in the runs below.
The program was compiled with the following options:
gcc -O3 -Wall -pedantic -std=c99 mix_32bpp.c -o mix
and run on a Core2 Duo E8200 @ 2.6GHz under Linux:
$ ./mix measure x86 100
function x86, called 100 times; image 1024 x 768
time = 745702 us
$ ./mix measure sse4 100
function SSE4, called 100 times; image 1024 x 768
time = 309393 us
$ ./mix measure sse4-2 100
function SSE4-2, called 100 times; image 1024 x 768
time = 309167 us
The speedup over the x86 code is around 2.4 times. However, the comparison shows that both SSE procedures run at the same speed.
It is also worth noting that gcc invoked with the -O3 switch produced quite fast x86 code. Without any optimization the x86 code was almost 10 times slower! I am surprised, and GCC has gone up in my private ranking of the best open source applications.