| Author: | Wojciech Muła |
|---|---|
| Added on: | 1.06.2008 |
Basically, this kind of conversion needs the following steps; each field is moved to the top bits of its own byte:

    R = (pixel16 and 0x001f) shl 3
    G = (pixel16 and 0x07e0) shr 3
    B = (pixel16 and 0xf800) shr 8
    pixel32 = R or (G shl 8) or (B shl 16)

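The steps above can be sketched in C roughly as follows. This is a minimal sketch, not the code from the sample program: the names `convert_pixel` and `convert_naive` are hypothetical, and it assumes that every 5- or 6-bit field should end up in the top bits of its 8-bit channel with the low bits left at zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Expand one 16bpp pixel (5-6-5 bit fields) to 32bpp.
   Assumption: each field goes to the top bits of its own byte. */
static uint32_t convert_pixel(uint16_t pixel16)
{
    uint32_t r = (uint32_t)(pixel16 & 0x001f) << 3;  /* bits 0..4   -> byte 0 */
    uint32_t g = (uint32_t)(pixel16 & 0x07e0) >> 3;  /* bits 5..10  -> byte 1 */
    uint32_t b = (uint32_t)(pixel16 & 0xf800) >> 8;  /* bits 11..15 -> byte 2 */
    return r | (g << 8) | (b << 16);
}

/* Convert a whole buffer, one pixel per iteration. */
static void convert_naive(const uint16_t *src, uint32_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = convert_pixel(src[i]);
}
```

Under these assumptions a white pixel `0xffff` becomes `0x00f8fcf8`.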
Since a 16bpp pixel can take only 65536 distinct values (32768 for 15bpp), lookup tables can be used. The first approach is one big table indexed by the pixel value treated as an unsigned number; this table takes 65536 * 4 bytes = 262144 bytes. Just one memory access is needed to get a 32bpp pixel, but the table is large, and even if it fits in the L2 cache, memory latency kills performance.
pixel32 = LUT[pixel16]
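A possible C sketch of the single-table variant is shown below. The names `lut16`, `lut16_init` and `convert_lookup16` are hypothetical, and the table contents assume that every field is moved to the top bits of its 8-bit channel.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per possible 16bpp pixel: 65536 * 4 bytes = 256 KiB. */
static uint32_t lut16[65536];

/* Fill the table once, before any conversion. */
static void lut16_init(void)
{
    for (uint32_t p = 0; p < 65536; p++)
        lut16[p] = ((p & 0x001f) << 3)    /* bits 0..4   -> byte 0 */
                 | ((p & 0x07e0) << 5)    /* bits 5..10  -> byte 1 */
                 | ((p & 0xf800) << 8);   /* bits 11..15 -> byte 2 */
}

/* One table access per pixel. */
static void convert_lookup16(const uint16_t *src, uint32_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = lut16[src[i]];
}
```

The loop body is as small as it can get, so the cost is dominated by where the table entry comes from: L1, L2 or main memory.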
Another approach uses two tables indexed by the lower and the higher byte of the pixel; the final pixel is the bitwise or of the two entries. These tables take 2 * 256 * 4 bytes = 2048 bytes and fit comfortably in the L1 cache.
pixel32 = LUT_hi[pixel16 shr 8] or LUT_lo[pixel16 and 0xff]
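The two-table variant works because the conversion only moves bit fields around, so the result for a whole pixel is the bitwise or of the results for its two bytes. A minimal C sketch, with hypothetical names (`expand_pixel`, `lookup8_init`, `convert_lookup8`) and the same assumption about field placement as before:

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t lut_lo[256];  /* contribution of the lower byte  */
static uint32_t lut_hi[256];  /* contribution of the higher byte */

/* Reference conversion, used only to fill the tables. */
static uint32_t expand_pixel(uint32_t p)
{
    return ((p & 0x001f) << 3) | ((p & 0x07e0) << 5) | ((p & 0xf800) << 8);
}

/* 2 * 256 * 4 bytes = 2048 bytes in total. */
static void lookup8_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        lut_lo[i] = expand_pixel(i);       /* byte placed in bits 0..7  */
        lut_hi[i] = expand_pixel(i << 8);  /* byte placed in bits 8..15 */
    }
}

/* Two table accesses and one bitwise or per pixel. */
static void convert_lookup8(const uint16_t *src, uint32_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = lut_hi[src[i] >> 8] | lut_lo[src[i] & 0xff];
}
```

Note that the green field straddles both bytes: its low 3 bits come from `lut_lo` and its high 3 bits from `lut_hi`.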
The sample program pixconv16bpp-32bpp.c includes several procedures; the ones measured below are naive, lookup8, lookup16, lookup8(2), MMX, SSE2 and SSE2(2).
The program was compiled with the following flags:

    gcc -O3 pixconv16bpp-32bpp.c -o test
    gcc -O3 -DNONTEMPORAL pixconv16bpp-32bpp.c -o testnta

Here are timings from my Core 2 Duo E8200 @ 2.6GHz. Each procedure was called 10 times; the results are averages.
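The measurement method is not shown here, so the following is only a sketch of how such an average could be taken. It assumes the standard `clock()` function as the timer; the names `convert_fn`, `average_time_us` and `convert_sample` are hypothetical and do not come from the sample program.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Common signature for all conversion procedures (an assumption). */
typedef void (*convert_fn)(const uint16_t *src, uint32_t *dst, size_t count);

/* Sample procedure to measure: the plain shift-and-mask conversion. */
static void convert_sample(const uint16_t *src, uint32_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint32_t p = src[i];
        dst[i] = ((p & 0x001f) << 3) | ((p & 0x07e0) << 5) | ((p & 0xf800) << 8);
    }
}

/* Call fn `repeats` times and return the average CPU time of one call
   in microseconds, measured with clock(). */
static double average_time_us(convert_fn fn, const uint16_t *src,
                              uint32_t *dst, size_t count, int repeats)
{
    clock_t start = clock();
    for (int i = 0; i < repeats; i++)
        fn(src, dst, count);
    clock_t stop = clock();
    return (double)(stop - start) * 1e6 / CLOCKS_PER_SEC / repeats;
}
```

With `repeats` set to 10 this mirrors the "called 10 times, averaged" scheme; `clock()` has coarse resolution, so very small buffers may report 0.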
As we can see, lookup16 gives the worst results. Single lookup8 is a bit faster than the naive implementation; lookup8(2), which reads 2 pixels in one iteration, is about 2.2 times faster than lookup8 (2.7 times faster than naive). I think memory latencies help to explain these results: L1 latency is small, 3 cycles or less, while L2 cache latency is around 10-15 cycles.
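One way to read 2 pixels in one iteration is to load them as a single 32-bit word and split it into four bytes. The sketch below illustrates that idea and is not necessarily what the sample program does; the names `lut2_lo`, `lut2_hi`, `lut2_init` and `convert_lookup8_x2` are hypothetical, and the byte extraction assumes a little-endian machine such as x86.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t lut2_lo[256];  /* contribution of a pixel's lower byte  */
static uint32_t lut2_hi[256];  /* contribution of a pixel's higher byte */

static void lut2_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        lut2_lo[i] = ((i & 0x1f) << 3) | ((i & 0xe0) << 5);
        lut2_hi[i] = ((i & 0x07) << 13) | ((i & 0xf8) << 16);
    }
}

/* Assumes little-endian byte order: pair = src[i] | (src[i+1] << 16). */
static void convert_lookup8_x2(const uint16_t *src, uint32_t *dst, size_t count)
{
    size_t i = 0;
    for (; i + 2 <= count; i += 2) {
        uint32_t pair;
        memcpy(&pair, src + i, sizeof pair);  /* one 32-bit load, 2 pixels */
        dst[i]     = lut2_lo[pair & 0xff]         | lut2_hi[(pair >> 8) & 0xff];
        dst[i + 1] = lut2_lo[(pair >> 16) & 0xff] | lut2_hi[pair >> 24];
    }
    if (i < count)  /* odd pixel count: convert the last pixel alone */
        dst[i] = lut2_lo[src[i] & 0xff] | lut2_hi[src[i] >> 8];
}
```

The number of table accesses per pixel stays the same; what is halved is the number of source loads and loop iterations.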
The SSE2 versions are the fastest, although SSE2(2) is a bit slower than basic SSE2. MMX is about twice as fast as naive, but slower than lookup8(2).
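For illustration, here is one way the conversion can be written with SSE2 intrinsics, 8 pixels per iteration. It is a sketch under the same field-placement assumption as the scalar formula, not the SSE2 or SSE2(2) procedure from the sample program; the name `convert_sse2` is hypothetical and an x86 target is assumed.

```c
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

static void convert_sse2(const uint16_t *src, uint32_t *dst, size_t count)
{
    const __m128i mask_r = _mm_set1_epi16(0x001f);
    const __m128i mask_g = _mm_set1_epi16(0x07e0);
    const __m128i mask_b = _mm_set1_epi16((short)0xf800);
    size_t i = 0;

    for (; i + 8 <= count; i += 8) {
        __m128i p = _mm_loadu_si128((const __m128i *)(src + i));
        /* Per 16-bit lane: byte 0 = R, byte 1 = G. */
        __m128i r  = _mm_slli_epi16(_mm_and_si128(p, mask_r), 3);
        __m128i g  = _mm_slli_epi16(_mm_and_si128(p, mask_g), 5);
        __m128i rg = _mm_or_si128(r, g);
        /* Per 16-bit lane: byte 0 = B, byte 1 = 0. */
        __m128i b  = _mm_srli_epi16(_mm_and_si128(p, mask_b), 8);
        /* Interleave 16-bit lanes: each 32-bit result is rg | (b << 16). */
        _mm_storeu_si128((__m128i *)(dst + i),     _mm_unpacklo_epi16(rg, b));
        _mm_storeu_si128((__m128i *)(dst + i + 4), _mm_unpackhi_epi16(rg, b));
    }
    for (; i < count; i++) {  /* scalar tail for the remaining pixels */
        uint32_t p = src[i];
        dst[i] = ((p & 0x001f) << 3) | ((p & 0x07e0) << 5) | ((p & 0xf800) << 8);
    }
}
```

Unaligned loads and stores are used so the sketch works for any buffer; a tuned version would likely require aligned buffers.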
Image size 320x200:

| procedure | time [us] | speedup | |
|---|---|---|---|
| naive | 16625 | 100% | ========== |
| lookup8 | 13584 | 120% | ============ |
| lookup16 | 24438 | 65% | ====== |
| lookup8(2) | 6175 | 270% | =========================== |
| MMX | 7862 | 210% | ===================== |
| SSE2 | 4103 | 405% | ======================================== |
| SSE2(2) | 4604 | 360% | ==================================== |

Image size 640x480:

| procedure | time [us] | speedup | |
|---|---|---|---|
| naive | 49371 | 100% | ========== |
| lookup8 | 40177 | 120% | ============ |
| lookup16 | 73051 | 65% | ====== |
| lookup8(2) | 18574 | 265% | ========================== |
| MMX | 23483 | 210% | ===================== |
| SSE2 | 12703 | 390% | ======================================= |
| SSE2(2) | 13716 | 360% | ==================================== |

Image size 800x600:

| procedure | time [us] | speedup | |
|---|---|---|---|
| naive | 77634 | 100% | ========== |
| lookup8 | 62893 | 120% | ============ |
| lookup16 | 115156 | 65% | ====== |
| lookup8(2) | 28830 | 270% | =========================== |
| MMX | 36452 | 210% | ===================== |
| SSE2 | 19217 | 400% | ======================================== |
| SSE2(2) | 21696 | 360% | ==================================== |

Image size 1024x768:

| procedure | time [us] | speedup | |
|---|---|---|---|
| naive | 130867 | 100% | ========== |
| lookup8 | 106543 | 120% | ============ |
| lookup16 | 205421 | 60% | ====== |
| lookup8(2) | 48503 | 270% | =========================== |
| MMX | 62737 | 210% | ===================== |
| SSE2 | 37881 | 345% | ================================== |
| SSE2(2) | 44162 | 295% | ============================= |

Speedup for each image size:

| procedure | 320x200 | 640x480 | 800x600 | 1024x768 |
|---|---|---|---|---|
| naive | 100% | 100% | 100% | 100% |
| lookup8 | 120% | 120% | 120% | 120% |
| lookup16 | 65% | 65% | 65% | 60% |
| lookup8(2) | 270% | 265% | 270% | 270% |
| MMX | 210% | 210% | 210% | 210% |
| SSE2 | 405% | 390% | 400% | 345% |
| SSE2(2) | 360% | 360% | 360% | 295% |