16bpp/15bpp to 32bpp pixel conversions — different methods

Author: Wojciech Muła
Added on: 1.06.2008

Basically, this kind of conversion needs the following steps:

R = (pixel16 and 0x001f) shl 3
G = (pixel16 and 0x07e0) shr 3
B = (pixel16 and 0xf800) shr 8

pixel32 = R or (G shl 8) or (B shl 16)
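
The steps above can be sketched in C roughly as follows. This is a minimal sketch for the 16bpp (5-6-5) layout only, assuming every channel is scaled so that it occupies the most significant bits of its output byte; the names convert_pixel and convert_naive are illustrative and not taken from the sample program.

```c
#include <stddef.h>
#include <stdint.h>

/* Expand one 16bpp (5-6-5) pixel to 32 bits. Each channel ends up in the
   upper bits of its own byte; the remaining low bits stay zero. */
static inline uint32_t convert_pixel(uint16_t pixel16)
{
    uint32_t r = (uint32_t)(pixel16 & 0x001f) << 3;  /* bits 0..4   -> 0x00..0xf8 */
    uint32_t g = (uint32_t)(pixel16 & 0x07e0) >> 3;  /* bits 5..10  -> 0x00..0xfc */
    uint32_t b = (uint32_t)(pixel16 & 0xf800) >> 8;  /* bits 11..15 -> 0x00..0xf8 */

    return r | (g << 8) | (b << 16);
}

/* Convert a whole buffer, one pixel per iteration. */
void convert_naive(const uint16_t *src, uint32_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = convert_pixel(src[i]);
}
```

For example, the all-ones pixel 0xffff maps to 0x00f8fcf8.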

Since there aren't many distinct pixel values (32768 for 15bpp, 65536 for 16bpp), lookup tables can be used. The first approach is to use one big table indexed by the pixel value treated as a natural number: this table has size 65536 * 4 bytes = 262144 bytes. Just one memory access is needed to get a 32bpp pixel; however, the table is large, and even if it fits in the L2 cache, memory latency kills performance.

pixel32 = LUT[pixel16]
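
A possible way to build and use such a table is sketched below. It assumes the 16bpp (5-6-5) layout with channels scaled to the top bits of each output byte; init_lut16 and convert_lookup16 are hypothetical names, not the ones used in the sample program.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t LUT[65536];  /* 65536 * 4 bytes = 262144 bytes */

/* Fill the table once: entry p holds the 32bpp form of 16bpp pixel p. */
void init_lut16(void)
{
    for (uint32_t p = 0; p < 65536; p++) {
        uint32_t r = (p & 0x001f) << 3;
        uint32_t g = (p & 0x07e0) >> 3;
        uint32_t b = (p & 0xf800) >> 8;
        LUT[p] = r | (g << 8) | (b << 16);
    }
}

/* One table access per pixel. */
void convert_lookup16(const uint16_t *src, uint32_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = LUT[src[i]];
}
```

init_lut16 has to be called once before the first conversion.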

Another approach needs two tables indexed by the lower and higher byte of the pixel; the final pixel is the bitwise or of the two entries. These tables have size 2 * 256 * 4 bytes = 2048 bytes, so they fit perfectly in the L1 cache.

pixel32 = LUT_hi[pixel16 shr 8] or LUT_lo[pixel16 and 0xff]
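
The split works because the conversion consists only of masks and shifts, so the result for a whole pixel is the bitwise or of the results for its two bytes taken separately. A minimal sketch under the same 5-6-5 assumption (expand, init_lut8 and convert_lookup8 are illustrative names):

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t LUT_lo[256];  /* contribution of the lower byte  */
static uint32_t LUT_hi[256];  /* contribution of the higher byte */

/* Plain mask-and-shift conversion, used only to fill the tables. */
static uint32_t expand(uint32_t p)
{
    uint32_t r = (p & 0x001f) << 3;
    uint32_t g = (p & 0x07e0) >> 3;
    uint32_t b = (p & 0xf800) >> 8;
    return r | (g << 8) | (b << 16);
}

void init_lut8(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        LUT_lo[i] = expand(i);       /* byte placed in bits 0..7  */
        LUT_hi[i] = expand(i << 8);  /* byte placed in bits 8..15 */
    }
}

/* Two table accesses and one bitwise or per pixel. */
void convert_lookup8(const uint16_t *src, uint32_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = LUT_hi[src[i] >> 8] | LUT_lo[src[i] & 0xff];
}
```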

Sample program pixconv16bpp-32bpp.c includes the different procedures compared in the tables below: naive, lookup8, lookup16, lookup8(2), MMX, SSE2 and SSE2(2).

Test results

The program was compiled with the following flags:

gcc -O3 pixconv16bpp-32bpp.c -o test

gcc -O3 -DNONTEMPORAL pixconv16bpp-32bpp.c -o testnta

Here are timings from my Core 2 Duo E8200 @ 2.6GHz. Each procedure was called 10 times; the results are averages.
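
A measurement of this kind could be sketched as below. This is only an assumption about the method: it uses the standard clock() function, which reports CPU time, while the sample program may use a different timer. The names convert_fn, convert_naive and average_time_us are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef void (*convert_fn)(const uint16_t *src, uint32_t *dst, size_t count);

/* Simple 5-6-5 conversion, present only so there is something to measure. */
void convert_naive(const uint16_t *src, uint32_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint32_t p = src[i];
        dst[i] = ((p & 0x001f) << 3) | ((p & 0x07e0) << 5) | ((p & 0xf800) << 8);
    }
}

/* Call fn `runs` times and return the average time of one call
   in microseconds. */
double average_time_us(convert_fn fn, const uint16_t *src, uint32_t *dst,
                       size_t count, int runs)
{
    clock_t total = 0;

    for (int i = 0; i < runs; i++) {
        clock_t start = clock();
        fn(src, dst, count);
        total += clock() - start;
    }

    return (double)total * 1e6 / CLOCKS_PER_SEC / runs;
}
```

With runs set to 10, the return value corresponds to one entry of the time column.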

As we can see, lookup16 gives the worst results. A single lookup8 is a bit faster than the naive implementation, while lookup8(2), which reads 2 pixels in one iteration, is more than 2 times faster than lookup8 (about 2.7 times faster than naive). I think memory latencies help to explain these results: L1 latency is small, 3 cycles or less, while L2 cache latency is around 10-15 cycles.
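
One way to read 2 pixels in one iteration is to load a single 32-bit word and use its four bytes as table indices. The sketch below is an assumption about how such a variant could look, not the code of the sample program; it assumes a little-endian machine and the 5-6-5 layout, and the names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t LUT_lo[256], LUT_hi[256];

static uint32_t expand(uint32_t p)
{
    return ((p & 0x001f) << 3) | ((p & 0x07e0) << 5) | ((p & 0xf800) << 8);
}

void init_lut8(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        LUT_lo[i] = expand(i);
        LUT_hi[i] = expand(i << 8);
    }
}

void convert_lookup8_x2(const uint16_t *src, uint32_t *dst, size_t count)
{
    size_t i = 0;

    for (; i + 2 <= count; i += 2) {
        uint32_t pair;
        memcpy(&pair, src + i, sizeof pair);  /* one load fetches two pixels */

        /* Little-endian assumption: src[i] sits in the low half of pair. */
        dst[i]     = LUT_hi[(pair >> 8) & 0xff] | LUT_lo[pair & 0xff];
        dst[i + 1] = LUT_hi[pair >> 24]         | LUT_lo[(pair >> 16) & 0xff];
    }

    if (i < count)  /* odd pixel count: convert the last pixel alone */
        dst[i] = LUT_hi[src[i] >> 8] | LUT_lo[src[i] & 0xff];
}
```

The loop halves the number of loads from the source buffer and gives the CPU four independent table accesses per iteration.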

SIMD versions are naturally much faster than the naive one; however, SSE2(2) is a bit slower than basic SSE2.
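
To illustrate the SIMD idea, here is a minimal SSE2 sketch that converts 8 pixels per iteration using intrinsics from emmintrin.h. It is an independent illustration under the 5-6-5 assumption and is not the SSE2 or SSE2(2) procedure from the sample program; convert_sse2 is a hypothetical name.

```c
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

static inline uint32_t convert_pixel(uint16_t p)
{
    return ((uint32_t)(p & 0x001f) << 3) | ((uint32_t)(p & 0x07e0) << 5)
         | ((uint32_t)(p & 0xf800) << 8);
}

void convert_sse2(const uint16_t *src, uint32_t *dst, size_t count)
{
    const __m128i mask_r = _mm_set1_epi16(0x001f);
    const __m128i mask_g = _mm_set1_epi16(0x07e0);
    const __m128i mask_b = _mm_set1_epi16((short)-2048);  /* bit pattern 0xf800 */
    size_t i = 0;

    for (; i + 8 <= count; i += 8) {
        __m128i p = _mm_loadu_si128((const __m128i *)(src + i));

        /* Per 16-bit lane: R goes to bits 0..7, G to bits 8..15. */
        __m128i r  = _mm_slli_epi16(_mm_and_si128(p, mask_r), 3);
        __m128i g  = _mm_slli_epi16(_mm_and_si128(p, mask_g), 5);
        __m128i rg = _mm_or_si128(r, g);

        /* B goes to bits 0..7 of its own 16-bit lane. */
        __m128i b  = _mm_srli_epi16(_mm_and_si128(p, mask_b), 8);

        /* Interleave lanes: each 32-bit result is rg | (b << 16). */
        _mm_storeu_si128((__m128i *)(dst + i),     _mm_unpacklo_epi16(rg, b));
        _mm_storeu_si128((__m128i *)(dst + i + 4), _mm_unpackhi_epi16(rg, b));
    }

    for (; i < count; i++)  /* remaining pixels, one at a time */
        dst[i] = convert_pixel(src[i]);
}
```

Unaligned loads and stores are used so that the sketch works for any buffer; aligned variants would require 16-byte aligned pointers.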

Image 320x200

procedure   time [us]  speedup
naive           16625     100%  ==========
lookup8         13584     120%  ============
lookup16        24438      65%  ======
lookup8(2)       6175     270%  ===========================
MMX              7862     210%  =====================
SSE2             4103     405%  ========================================
SSE2(2)          4604     360%  ====================================

Image 640x480

procedure   time [us]  speedup
naive           49371     100%  ==========
lookup8         40177     120%  ============
lookup16        73051      65%  ======
lookup8(2)      18574     265%  ==========================
MMX             23483     210%  =====================
SSE2            12703     390%  =======================================
SSE2(2)         13716     360%  ====================================

Image 800x600

procedure   time [us]  speedup
naive           77634     100%  ==========
lookup8         62893     120%  ============
lookup16       115156      65%  ======
lookup8(2)      28830     270%  ===========================
MMX             36452     210%  =====================
SSE2            19217     400%  ========================================
SSE2(2)         21696     360%  ====================================

Image 1024x768

procedure   time [us]  speedup
naive          130867     100%  ==========
lookup8        106543     120%  ============
lookup16       205421      60%  ======
lookup8(2)      48503     270%  ===========================
MMX             62737     210%  =====================
SSE2            37881     345%  ==================================
SSE2(2)         44162     295%  =============================

Comparison

            speedup
procedure   320x200  640x480  800x600  1024x768
naive          100%     100%     100%      100%
lookup8        120%     120%     120%      120%
lookup16        65%      65%      65%       60%
lookup8(2)     270%     265%     270%      270%
MMX            210%     210%     210%      210%
SSE2           405%     390%     400%      345%
SSE2(2)        360%     360%     360%      295%