| Author: | Wojciech Muła |
|---|---|
| Added on: | 29.04.2008 |
| Updated: | 24.05.2008 |
The PSHUFB instruction (SSSE3) performs a parallel lookup in a 16-byte array stored in an XMM register, which is exactly what binary-to-hex conversion needs.
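To make the lookup semantics concrete, here is a minimal scalar model of the 128-bit form of PSHUFB. The function name `pshufb_model` is made up for this illustration; it is a readable reference, not something to use for speed.

```c
#include <stdint.h>

/* Scalar model of PSHUFB on 16-byte operands (illustrative only).
 * For each byte position i:
 *   - if bit 7 of idx[i] is set, the result byte is 0,
 *   - otherwise the result byte is table[idx[i] & 0x0f].
 * `out` must not overlap `table` in this simple model. */
void pshufb_model(const uint8_t table[16], const uint8_t idx[16],
                  uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        out[i] = (idx[i] & 0x80) ? 0 : table[idx[i] & 0x0f];
    }
}
```

With `table` holding the characters '0'..'9', 'a'..'f' and `idx` holding one nibble per byte, a single call translates 16 nibbles into 16 hex digits.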
The following snippet shows the idea:

    movdqa    (%eax), %xmm0      # xmm0 = {0xba, 0xdc, 0xaf, 0xe8, ...}
    movdqa    %xmm0, %xmm1
    psrlw     $4, %xmm1          # xmm1 = {0xcb, 0x0d, 0x8a, 0x0e, ...}: bits 4..7 of each byte moved to bits 0..3
    punpcklbw %xmm0, %xmm1       # xmm1 = {0xcb, 0xba, 0x0d, 0xdc, 0x8a, 0xaf, 0x0e, 0xe8, ...}
    # MASK = packed_byte(0x0f)
    pand      MASK, %xmm1        # xmm1 = {0x0b, 0x0a, 0x0d, 0x0c, 0x0a, 0x0f, 0x0e, 0x08, ...}: one nibble per byte
    movdqa    HEXDIGITS, %xmm0   # HEXDIGITS = {'0', '1', '2', '3', ..., 'a', 'b', 'c', 'd', 'e', 'f'}
    pshufb    %xmm1, %xmm0       # xmm0 = {'b', 'a', 'd', 'c', 'a', 'f', 'e', '8', ...}
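The same sequence can be sketched in C with compiler intrinsics. Note that `punpcklbw` consumes only the low 8 input bytes, so this sketch adds the `punpckhbw` counterpart to cover all 16. The function name `hex16_ssse3` is hypothetical, and the `target("ssse3")` attribute assumes GCC or Clang on x86; with other setups, compile with SSSE3 enabled instead (for example `-mssse3`).

```c
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3 intrinsics, including _mm_shuffle_epi8 */

/* Hypothetical helper: converts 16 input bytes into 32 lowercase hex
 * digits (no terminating NUL is written). */
__attribute__((target("ssse3")))
void hex16_ssse3(const uint8_t in[16], char out[32])
{
    const __m128i mask = _mm_set1_epi8(0x0f);
    const __m128i hexdigits = _mm_setr_epi8('0', '1', '2', '3', '4', '5', '6', '7',
                                            '8', '9', 'a', 'b', 'c', 'd', 'e', 'f');

    __m128i lo = _mm_loadu_si128((const __m128i *)in);  /* movdqa (unaligned load here) */
    __m128i hi = _mm_srli_epi16(lo, 4);                 /* psrlw $4 */

    /* Interleave so that each high nibble precedes its low nibble,
     * then clear the garbage left in the upper 4 bits of every byte. */
    __m128i first  = _mm_and_si128(_mm_unpacklo_epi8(hi, lo), mask);  /* input bytes 0..7  */
    __m128i second = _mm_and_si128(_mm_unpackhi_epi8(hi, lo), mask);  /* input bytes 8..15 */

    /* pshufb: every nibble selects one character from the digit table. */
    _mm_storeu_si128((__m128i *)out,        _mm_shuffle_epi8(hexdigits, first));
    _mm_storeu_si128((__m128i *)(out + 16), _mm_shuffle_epi8(hexdigits, second));
}
```

Unaligned loads and stores are used so the caller does not have to align its buffers; the assembly above uses `movdqa`, which requires 16-byte alignment.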
hexprint.c is a test program that compares the speed of the presented method with three other lookup-based methods (std1, std2 and std3 in the results below).
In a single iteration 100 x 16 bytes are decoded, and the number of iterations is 100000.
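For orientation, a scalar lookup-based conversion and the per-iteration workload might look like the sketch below. This is an assumption about the general shape only: it is not necessarily any of the three methods in hexprint.c, and the names `hex_encode_lookup`, `run_iteration` and the constants are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical baseline: one 16-entry table, two lookups per input byte.
 * Writes 2*n characters to `out`, without a terminating NUL. */
void hex_encode_lookup(const uint8_t *in, size_t n, char *out)
{
    static const char digits[] = "0123456789abcdef";
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = digits[in[i] >> 4];    /* bits 4..7 */
        out[2 * i + 1] = digits[in[i] & 0x0f];  /* bits 0..3 */
    }
}

/* Workload shape taken from the text: 100 blocks of 16 bytes per
 * iteration, repeated 100000 times. */
enum { BLOCK = 16, BLOCKS_PER_ITER = 100, ITERATIONS = 100000 };

/* One iteration: `data` holds BLOCK * BLOCKS_PER_ITER bytes and
 * `text` has room for twice as many characters. */
void run_iteration(const uint8_t *data, char *text)
{
    for (size_t b = 0; b < BLOCKS_PER_ITER; b++) {
        hex_encode_lookup(data + b * BLOCK, BLOCK, text + 2 * b * BLOCK);
    }
}
```

A benchmark driver would call `run_iteration` `ITERATIONS` times for each method under test.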
Here are the times measured on my Linux box with a Core 2 Duo E8200:
    $ gcc -O3 hexprint.c -o hexprint
    $ time ./hexprint std1 > /dev/null
    real    0m0.785s
    user    0m0.780s
    sys     0m0.008s
    $ time ./hexprint std2 > /dev/null
    real    0m0.643s
    user    0m0.640s
    sys     0m0.004s
    $ time ./hexprint std3 > /dev/null
    real    0m0.642s
    user    0m0.640s
    sys     0m0.004s
    $ time ./hexprint ssse3 > /dev/null
    real    0m0.597s
    user    0m0.580s
    sys     0m0.016s
| method | user time [ms] | speedup | |
|---|---|---|---|
| std1 | 780 | 100% | ========== |
| std2 | 640 | 122% | ============ |
| std3 | 640 | 122% | ============ |
| ssse3 | 580 | 134% | ============= |
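The speedup column can be recomputed from the user times, taking std1 as the 100% baseline; the same times give a rough throughput figure for the stated workload of 100 x 16 x 100000 = 160,000,000 bytes. The helper names below are made up for this sketch, and the throughput covers the whole program run (including output), not the conversion alone.

```c
/* Relative speed in percent, rounded to the nearest integer;
 * the baseline itself maps to 100. */
unsigned relative_speed_percent(unsigned baseline_ms, unsigned method_ms)
{
    return (100u * baseline_ms + method_ms / 2u) / method_ms;
}

/* Throughput in MB/s (1 MB = 10^6 bytes) for `bytes` processed
 * in `time_ms` milliseconds. */
double throughput_mb_per_s(unsigned long long bytes, unsigned time_ms)
{
    return ((double)bytes / 1e6) / ((double)time_ms / 1e3);
}
```

For example, `relative_speed_percent(780, 640)` corresponds to the std2 row, and `throughput_mb_per_s(160000000ULL, 580)` estimates the data rate of the PSHUFB-based variant.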