Base64 encoding & decoding using AVX512BW instructions

Author: Wojciech Muła
Added on: 2016-04-03
Updated on: 2017-05-01 (add info about VPMULTISHIFTQB from AVX512VBMI), 2016-12-21 (fix pack and unpack procedures), 2016-11-21 (change the title, simplify "faster procedure", add link to the AVX512F article)

Introduction

The SIMD versions of base64 conversion algorithms are described in Base64 encoding with SIMD instructions and Base64 decoding with SIMD instructions. I have also described how both encoding and decoding can be realized with AVX512F (Foundation) instructions.

AVX512BW (Byte & Word) will come with a great number of new instructions; the ones used throughout this article — vpermb, vpermi2b, vpsllvw and vpsrlvw — seem perfectly suited to base64-related problems.

Encoding

SIMD encoding consists of the following steps (a sketch of the complete loop follows the list):

  1. In each iteration 48 bytes (16 * 24-bit words) are loaded.
  2. Split each 24-bit word into a separate 32-bit lane.
  3. In each 32-bit lane move the 6-bit words to separate bytes.
  4. Convert the 6-bit indices to ASCII according to the base64 lookup table.
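
Putting the four steps together, the whole encoding loop might look like the sketch below. This is not the author's reference implementation; shuffle_input, lookup and unpack are assumed to be the constants and the helper defined in the following sections, and tail handling is omitted.

#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// assumed to be defined as in the sections below
extern const __m512i shuffle_input;   // vpermb constant from steps 1 & 2
extern const __m512i lookup;          // 64-byte encoding table from step 4
__m512i unpack(const __m512i in);     // step 3

void base64_encode(const uint8_t* input, size_t size, uint8_t* output) {
    // each iteration consumes 48 input bytes and produces 64 output bytes;
    // the load reads 64 bytes, so the tail must be handled separately (omitted here)
    for (size_t i = 0; i + 64 <= size; i += 48) {
        // steps 1 & 2: load and split 24-bit words into 32-bit lanes
        const __m512i v  = _mm512_loadu_si512(input + i);
        const __m512i in = _mm512_permutexvar_epi8(shuffle_input, v);

        // step 3: move the 6-bit words into separate bytes
        const __m512i indices = unpack(in);

        // step 4: translate the 6-bit indices into ASCII
        const __m512i result = _mm512_permutexvar_epi8(indices, lookup);

        _mm512_storeu_si512(output, result);
        output += 64;
    }
}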

Steps 1 & 2 — Loading data and splitting bytes

In the SSE code, loading data and splitting 24-bit words into 32-bit lanes is done by the pshufb instruction. In AVX2 this instruction operates on 128-bit lanes, which makes the step more complicated. But the AVX512BW vpermb operates on the whole ZMM register, so the loading step is as simple as in the SSE version.

// load the next 48 bytes (the 64-byte load reads 16 bytes more than needed)
// v = [...|DDDC|CCBB|BAAA]
const __m512i v = _mm512_loadu_si512(reinterpret_cast<const __m512i*>(input + i));

// split 24-bit words to 32-bit lanes
// in = [...|0DDD|0CCC|0BBB|0AAA]
const __m512i in = _mm512_permutexvar_epi8(shuffle_input, v);

Step 3 — moving 6-bit words to separate bytes

The input order of fields is as follows:

[????????|ccdddddd|bbbbcccc|aaaaaabb]

and the expected output is:

[00dddddd|00cccccc|00bbbbbb|00aaaaaa]

Standard SIMD procedure

Below is the AVX512F version from the article linked above. The lack of byte-level instructions makes the procedure fairly complicated.

template <int shift, uint32_t mask>
__m512i merge(__m512i target, __m512i src) {
    __m512i shifted;
    if (shift > 0) {
        shifted = _mm512_srli_epi32(src, shift);
    } else {
        shifted = _mm512_slli_epi32(src, -shift);
    }

    return _mm512_ternarylogic_epi32(_mm512_set1_epi32(mask), shifted, target, 0xca);
}

__m512i unpack(const __m512i in) {
    // [00000000|00000000|00000000|00aaaaaa]
    __m512i indices = _mm512_and_si512(_mm512_srli_epi32(in, 2), packed_dword(0x0000003f));

    // [00000000|00000000|00BB0000|00aaaaaa]
    indices = merge<-12, 0x00003000>(indices, in);

    // [00000000|00000000|00BBbbbb|00aaaaaa]
    indices = merge<  4, 0x00000f00>(indices, in);

    // [00000000|00CCCC00|00BBbbbb|00aaaaaa]
    indices = merge<-10, 0x003c0000>(indices, in);

    // [00000000|00CCCCcc|00BBbbbb|00aaaaaa]
    indices = merge<  6, 0x00030000>(indices, in);

    // [00dddddd|00CCCCcc|00BBbbbb|00aaaaaa]
    indices = merge< -8, 0x3f000000>(indices, in);

    return indices;
}
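
The snippets use packed_dword (and, later, packed_qword) helpers that are not defined in this article; presumably they are thin wrappers over the broadcast intrinsics, along these lines:

#include <cstdint>
#include <immintrin.h>

// assumed definitions of the broadcast helpers used throughout the snippets
static inline __m512i packed_dword(uint32_t x) {
    return _mm512_set1_epi32(static_cast<int32_t>(x));
}

static inline __m512i packed_qword(uint64_t x) {
    return _mm512_set1_epi64(static_cast<int64_t>(x));
}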

Faster procedure (AVX512BW)

Unpacking can be performed faster with the help of the vpsrlvw and vpsllvw instructions. This requires changing the 32-bit lane format from:

[????????|ccdddddd|bbbbcccc|aaaaaabb]

to:

[bbbbcccc|ccdddddd|aaaaaabb|bbbbcccc]
 ^^^^                           ^^^^
 unused bits             unused bits

It only requires a different index vector to be passed to the vpermb used for splitting bytes.
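
For illustration, that index vector could be generated as in the sketch below: every 32-bit lane i takes the input bytes {3i+1, 3i, 3i+2, 3i+1} (least significant byte first). The helper name is mine; in practice the constant would simply be hard-coded.

#include <cstdint>
#include <immintrin.h>

// hypothetical helper: builds the byte permutation that produces
// lanes of the form [bbbbcccc|ccdddddd|aaaaaabb|bbbbcccc]
static __m512i build_shuffle_input() {
    uint8_t idx[64];
    for (int lane = 0; lane < 16; lane++) {
        idx[4*lane + 0] = static_cast<uint8_t>(3*lane + 1);
        idx[4*lane + 1] = static_cast<uint8_t>(3*lane + 0);
        idx[4*lane + 2] = static_cast<uint8_t>(3*lane + 2);
        idx[4*lane + 3] = static_cast<uint8_t>(3*lane + 1);
    }
    return _mm512_loadu_si512(idx);
}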

Algorithm:

  1. Isolate the fields a and c.

// t0    = [0000cccc|cc000000|aaaaaa00|00000000]
const __m512i t0 = _mm512_and_si512(in, _mm512_set1_epi32(0x0fc0fc00));

  2. Shift the field a right by 10 bits and the field c right by 6 bits.

// t1    = [00000000|00cccccc|00000000|00aaaaaa]
const __m512i t1 = _mm512_srlv_epi16(t0, _mm512_set1_epi32(0x0006000a));

  3. Shift the field b left by 4 bits and the field d left by 8 bits (note that no masking is done).

// t2    = [ccdddddd|00000000|aabbbbbb|cccc0000]
const __m512i t2 = _mm512_sllv_epi16(in, _mm512_set1_epi32(0x00080004));

  4. Finally write the selected bits from the vector t2 into t1.

//         = [00dddddd|00cccccc|00bbbbbb|00aaaaaa]
const __m512i indices = _mm512_ternarylogic_epi32(_mm512_set1_epi32(0x3f003f00), t2, t1, 0xca);

The procedure costs just two shifts and two bitwise operations.

Faster procedure (AVX512VBMI)

AVX512VBMI defines the instruction vpmultishiftqb, which can replace all the variable-shift instructions from the previous point. Please note that the layout of 32-bit lanes requires the same modification as described in the previous point.

The instruction builds a vector of bytes from octets located at arbitrary bit positions within a quadword. The following pseudocode shows the algorithm:

for i in 0 .. 7 loop
    qword := input.qword[i];

    for j in 0 .. 7 loop
        index := indices.byte[i * 8 + j];
        output.byte[i * 8 + j] := rotate_right(qword, index) and 0xff;
    end loop
end loop
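
To make the semantics concrete, here is a scalar C++ model of a single quadword (an illustration only; only the low six bits of each control byte select the bit offset):

#include <cstdint>

// scalar model of vpmultishiftqb for one 64-bit element: byte j of the
// result is the 8-bit field of 'data' starting at bit (control byte j mod 64),
// read with wrap-around, i.e. a rotate right followed by a mask
uint64_t multishift_qword(uint64_t indices, uint64_t data) {
    uint64_t result = 0;
    for (int j = 0; j < 8; j++) {
        const unsigned shift   = (indices >> (8*j)) & 63;
        const uint64_t rotated = (data >> shift) | (data << ((64 - shift) & 63));
        result |= (rotated & 0xff) << (8*j);
    }
    return result;
}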

Although vpmultishiftqb produces a vector of full bytes while the encoder needs just the 6 lower bits of each, no masking is needed: the instruction vpermb (used in the next step) looks only at the 6 lowest bits of each index, so it does the masking internally.

The code snippet below shows the proper parameters for vpmultishiftqb.

// after the multishift a single 32-bit lane has the following layout:
// [ccdddddd|bbcccccc|aabbbbbb|ddaaaaaa],
// i.e. the significant bits come from: a = [10:15], b = [4:9], c = [22:27], d = [16:21]

const __m512i shifts  = packed_qword(0x3036242a1016040alu); // 48, 54, 36, 42, 16, 22, 4, 10
const __m512i indices = _mm512_multishift_epi64_epi8(shifts, in);

Step 4 — converting to ASCII

The last part of the algorithm converts, in parallel, all 6-bit indices into ASCII codes. The lookup table for this conversion has 64 entries, which is exactly the size of an AVX512 register, so the conversion is a single invocation of the already introduced instruction vpermb.

const __m512i result = _mm512_permutexvar_epi8(indices, lookup);
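
The lookup constant itself is just the base64 alphabet ('A'..'Z', 'a'..'z', '0'..'9', '+', '/') laid out in index order. A possible way to prepare it (the helper name is mine; the constant would normally be hard-coded):

#include <cstdint>
#include <immintrin.h>

// hypothetical helper: builds the 64-byte encoding table used by vpermb
static __m512i build_encode_lookup() {
    uint8_t table[64];
    for (int i = 0; i < 26; i++) table[i]      = static_cast<uint8_t>('A' + i);
    for (int i = 0; i < 26; i++) table[26 + i] = static_cast<uint8_t>('a' + i);
    for (int i = 0; i < 10; i++) table[52 + i] = static_cast<uint8_t>('0' + i);
    table[62] = '+';
    table[63] = '/';
    return _mm512_loadu_si512(table);
}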

Decoding

SIMD decoding consists of the following steps (a sketch of the complete loop follows the list):

  1. Translate in parallel from ASCII into 6-bit values stored in separate bytes. Error detection is also performed at this stage.
  2. Pack the 6-bit data into a contiguous bit stream; the result has 48 bytes.
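
Combined, the decoding loop might look like the sketch below; lookup() is the function from step 1, pack() is a hypothetical wrapper around step 2 (madd + vpermb), and tail handling is omitted.

#include <cstddef>
#include <cstdint>
#include <immintrin.h>

__m512i lookup(const __m512i input);   // step 1, defined below
__m512i pack(const __m512i values);    // step 2, hypothetical wrapper (madd + vpermb)

void base64_decode(const uint8_t* input, size_t size, uint8_t* output) {
    for (size_t i = 0; i + 64 <= size; i += 64) {
        const __m512i in     = _mm512_loadu_si512(input + i);
        const __m512i values = lookup(in);    // ASCII -> 6-bit data, with error detection
        const __m512i packed = pack(values);  // 64 x 6 bits -> 48 bytes

        // store only the lower 48 bytes of the result
        _mm512_mask_storeu_epi8(output, 0x0000ffffffffffffull, packed);
        output += 48;
    }
}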

Step 1 — translation from ASCII

This step is a perfect place to utilize the instruction vpermi2b, which requires three registers:

  • the indices,
  • the lower and higher halves of a 128-entry lookup table (two registers).

Valid base64 input characters are always standard ASCII, so they never have the most significant bit (MSB) set. Thanks to that, the seven lowest bits of the input can be used directly as indices for vpermi2b. (The instruction simply ignores the MSB, so no masking is required.)

The lookup table has to be precalculated. It translates an ASCII code either into 6-bit data or into an invalid-character marker. The marker value is 0x80; thanks to that, both extended ASCII and invalid characters can be identified in a single step.
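
A possible way to precalculate the two 64-byte halves (the helper name is mine; the split matches the vpermi2b usage below, with lookup_0 covering ASCII codes 0..63 and lookup_1 covering 64..127):

#include <cstdint>
#include <immintrin.h>

// builds the 128-entry ASCII -> 6-bit table; 0x80 marks invalid characters
static void build_decode_lookup(__m512i& lookup_0, __m512i& lookup_1) {
    uint8_t table[128];
    for (int i = 0; i < 128; i++) table[i] = 0x80;
    for (int c = 'A'; c <= 'Z'; c++) table[c] = static_cast<uint8_t>(c - 'A');
    for (int c = 'a'; c <= 'z'; c++) table[c] = static_cast<uint8_t>(c - 'a' + 26);
    for (int c = '0'; c <= '9'; c++) table[c] = static_cast<uint8_t>(c - '0' + 52);
    table['+'] = 62;
    table['/'] = 63;

    lookup_0 = _mm512_loadu_si512(table);       // ASCII codes 0..63
    lookup_1 = _mm512_loadu_si512(table + 64);  // ASCII codes 64..127
}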

__m512i lookup(const __m512i input) {

    const __m512i lookup_0 = precalc::lookup_0;
    const __m512i lookup_1 = precalc::lookup_1;

    const __m512i translated = _mm512_permutex2var_epi8(lookup_0, input, lookup_1);

    // an MSB set either in the input (non-ASCII byte) or in the translated
    // value (the 0x80 marker) flags an invalid character
    const uint64_t mask = _mm512_movepi8_mask(_mm512_or_si512(translated, input));
    if (mask) {
        // report error
    }

    return translated;
}

Step 2 — Packing bit-fields

The final stage of decoding packs all 64 six-bit fields into 48 contiguous bytes. It is done in two steps:

  1. Pack the four fields within each 32-bit word into a 24-bit word.
  2. Move these 3-byte words into a contiguous array.

The first step is a direct translation of the SSE code. It uses multiply-add instructions that perform two shifts and a bitwise or.

// input:  [00dddddd|00cccccc|00bbbbbb|00aaaaaa]

// merge:  [0000cccc|ccdddddd|0000aaaa|aabbbbbb]
const __m512i merge_ab_and_bc = _mm512_maddubs_epi16(values, packed_dword(0x01400140));

// result: [00000000|aaaaaabb|bbbbcccc|ccdddddd]
const __m512i packed = _mm512_madd_epi16(merge_ab_and_bc, packed_dword(0x00011000));

The second step uses just one vpermb.
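
For completeness, a sketch of that step; the pack_shuffle constant is my own and follows from the lane layout above: output bytes 3i, 3i+1, 3i+2 come from bytes 4i+2, 4i+1, 4i of the packed vector, and the upper 16 output bytes are don't-care.

#include <cstdint>
#include <immintrin.h>

// hypothetical helper: builds the byte permutation that gathers the
// 16 x 3 meaningful bytes into the first 48 bytes of the register
static __m512i build_pack_shuffle() {
    uint8_t idx[64] = {0};
    for (int i = 0; i < 16; i++) {
        idx[3*i + 0] = static_cast<uint8_t>(4*i + 2);
        idx[3*i + 1] = static_cast<uint8_t>(4*i + 1);
        idx[3*i + 2] = static_cast<uint8_t>(4*i + 0);
    }
    return _mm512_loadu_si512(idx);
}

// usage:
// const __m512i bytes = _mm512_permutexvar_epi8(build_pack_shuffle(), packed);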

Summary

Sample code

The repository contains implementations of both the encoding and decoding procedures.