AVX512F base64 coding and decoding

Author: Wojciech Muła
Added on: 2016-09-17
Updated on: 2017-01-25 (spelling), 2016-12-18 (fixed order of bytes according to the base64 standard)

Introduction

Both base64 encoding and decoding algorithms can be vectorized, i.e. SIMD instructions can be utilized, gaining a significant speed-up over plain, scalar versions. I've shown different vectorization approaches in a series of articles:

AVX512 is the recent extension to Intel's ISA; unfortunately, the extension is split into several subextensions. In the last article from the list I described the usage of the AVX512BW (Byte-Word) subextension. But since AVX512BW is announced for 2018, there are no performance results yet.

However, in 2016 processors having the AVX512F (Foundation) subextension appeared on the market. Among the many advantages of AVX512F there is one serious problem: the lack of instructions working at the byte and word level. The minimum size of a vector element is 32 bits.

This article is a study of implementing the base64 algorithms with the foundation instructions of AVX512F; the major contributions are:

The text is split into four parts:

As I don't want to repeat myself too much, please refer to the linked articles for the other SSE and AVX2 algorithms and their analysis.

2016-12-18 note: in the initial version of this text I wrongly assumed the order of input words; Alfred Klomp noted that the standard imposes a specific order. Today's change fixes this error.


Before we start: SWAR within an AVX512 register

The lack of byte-level instructions forces us to use SWAR techniques on the 32-bit elements of a vector. The following operations are needed in both encoding and decoding:

  1. unsigned compare for greater or equal, where one argument is a vector of bytes and the other one is a vector of constants;
  2. build a mask based on comparison's result;
  3. add two vectors with wrap-around arithmetic (modulo 256).

In the case of SSE, AVX2 and AVX512BW, points 1 and 2 are performed by a single instruction like pcmpgtb (SSE intrinsic _mm_cmpgt_epi8). Likewise, the equivalent of the 3rd operation is a single instruction, paddb (SSE intrinsic _mm_add_epi8).

Important note: input vectors for the core part of the base64 algorithms contain bytes less than 128, i.e. the most significant bit is never set.

Unsigned compare with constant

The SWAR trick of unsigned comparison for greater or equal is based on the following equation:

x + 0x80 - constant = 0x80 + (x - constant)

The assumption is that both x and constant are less than 0x80. If x >= constant then the result of the expression is greater than or equal to 0x80, otherwise it is less than 0x80. It means that the most significant bit of each byte indicates the result of the comparison.

Let's see sample AVX512F code which compares the input vector with constant = 26.

const uint8_t biased = 0x80 - 26; // 102 = 0x66

const __m512i x      = load data somehow
const __m512i packed = _mm512_set1_epi32(0x66666666);
const __m512i result = _mm512_add_epi32(x, packed);

Such code requires the following instructions:

  • build the constant vector (move the constant to a GP register, vpbroadcastd the value to an AVX512 register);
  • add this vector and the input vector.
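The same trick can be checked with plain scalar code operating on four bytes packed in a uint32_t. This is an illustrative sketch only; cmpge_msb is a hypothetical helper, not part of the repository:

```cpp
#include <cstdint>

// SWAR greater-or-equal: for each byte x[i] < 0x80, the result byte has its
// MSB set iff x[i] >= constant. No carry crosses a byte boundary because
// every per-byte sum is at most 0x7f + 0x80 = 0xff.
uint32_t cmpge_msb(uint32_t x, uint8_t constant) {
    const uint32_t biased = 0x01010101u * uint8_t(0x80 - constant);
    return (x + biased) & 0x80808080u;
}
```

For example, with bytes [0x7f, 0x00, 0x1a, 0x19] and constant = 26, only the bytes 0x7f and 0x1a (both >= 26) produce a set MSB.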

Building a mask

Once we have the result of a comparison we can easily build:

  • 1-bit value,
  • 7-bit mask (_mm512_cmpge_mask7bit in sample codes),
  • 8-bit mask (_mm512_cmpge_mask8bit).

In the case of base64, masks are used to conditionally set some constants; thus, depending on the constant's value, masks of different widths are needed.

A 1-bit value requires just one right shift.

// keep only the MSBs
const __m512i MSB = _mm512_and_epi32(result, _mm512_set1_epi32(0x80808080));

// put the MSB at the desired position, for example the 3rd bit
const __m512i mask = _mm512_srli_epi32(MSB, 8 - 3);

Building a 7-bit mask also requires a shift and a subtraction.

// keep only the MSBs
const __m512i MSB = _mm512_and_epi32(result, _mm512_set1_epi32(0x80808080));

// mask = MSB - (MSB >> 7)

//        0x80 - 0x01 = 0x7f
//        0x00 - 0x00 = 0x00
const __m512i mask = _mm512_sub_epi32(MSB, _mm512_srli_epi32(MSB, 7));

An 8-bit mask is simply a 7-bit mask merged with the MSB.

// keep only the MSBs
const __m512i MSB = _mm512_and_epi32(result, _mm512_set1_epi32(0x80808080));

// mask = MSB | (MSB - (MSB >> 7))

//        0x80 | (0x80 - 0x01) = 0x80 | 0x7f = 0xff
//        0x00 | (0x00 - 0x00) = 0x00
const __m512i tmp  = _mm512_sub_epi32(MSB, _mm512_srli_epi32(MSB, 7));
const __m512i mask = _mm512_or_epi32(MSB, tmp);

Summing things up: building a mask always requires isolating the MSB and a shift. Depending on the mask's width it may also require an additional subtraction and a bit-or.
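The three mask widths can be modelled with scalar code on four packed bytes. These are hypothetical helpers mirroring the intrinsic sequences above, not repository code:

```cpp
#include <cstdint>

// Input: per-byte MSB flags (each byte is 0x80 or 0x00), as produced by the
// SWAR compare. Each helper derives a mask of a different width.

uint32_t mask_1bit(uint32_t msb, int bit) {   // MSB moved down to position `bit`
    return msb >> (7 - bit);
}

uint32_t mask_7bit(uint32_t msb) {            // 0x80 -> 0x7f, 0x00 -> 0x00
    return msb - (msb >> 7);
}

uint32_t mask_8bit(uint32_t msb) {            // 0x80 -> 0xff, 0x00 -> 0x00
    return msb | (msb - (msb >> 7));
}
```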

Adding modulo 256

The operation for a single 8-bit number is defined as follows:

(a + b) & 0xff

While adding two vectors of 8-bit numbers we must prevent carry bits from propagating across byte boundaries. This requires two major steps:

  • add the inputs with the MSB of each byte cleared; this never causes carry propagation (0x7f + 0x7f = 0xfe, and obviously 0xfe <= 0xff);
  • combine the MSBs of the inputs with the MSB of the result to set the final MSB to the correct value.

Pseudocode for 8-bit numbers.

uint8_t add(uint8_t a, uint8_t b) {

    const uint8_t a_06 = a & 0x7f;
    const uint8_t b_06 = b & 0x7f;
    const uint8_t a_7  = a & 0x80;
    const uint8_t b_7  = b & 0x80;

    const uint8_t tmp = a_06 + b_06;

    const uint8_t sum = tmp ^ a_7 ^ b_7;

    return sum;
}

As was said above, the inputs for the algorithms are smaller than 128; thanks to that, some masking is not needed and calculating the final sum is simpler. The following C++ procedure adds a vector of 7-bit numbers and a vector of 8-bit numbers.

// returns packed (a[i] + b[i]) & 0xff
// assertion a[i] < 0x80, b[i] can be any value
__m512i _mm512_add_epu8(const __m512i a, const __m512i b) {

    const __m512i b06 = _mm512_and_si512(b, packed_byte(0x7f));
    const __m512i sum = _mm512_add_epi32(a, b06);

    // merge the 7th bit of b with sum
    return _mm512_ternarylogic_epi32(packed_byte(0x80), b, sum, 0x6a);
}

The procedure requires two logic instructions and one addition, and it uses two constants.
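A scalar model of this procedure (an illustrative sketch, not the repository code) makes the byte-wise behaviour easy to check:

```cpp
#include <cstdint>

// Scalar equivalent of _mm512_add_epu8: per-byte addition modulo 256,
// assuming every byte of `a` is < 0x80 (as the base64 kernels guarantee).
uint32_t add_epu8_swar(uint32_t a, uint32_t b) {
    const uint32_t b06 = b & 0x7f7f7f7fu;  // clear b's per-byte MSBs
    const uint32_t sum = a + b06;          // per-byte sums stay below 0x100,
                                           // so no carry crosses a byte boundary
    return sum ^ (b & 0x80808080u);        // merge b's MSBs back via xor
}
```

For example, adding 0x41 and 0xbb in every byte gives (0x41 + 0xbb) & 0xff = 0xfc, with no interference between neighbouring bytes.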

C++ notes

Sample C++ code uses two macros: packed_dword and packed_byte. The first macro fills all 32-bit elements of a vector with a given value, the second one fills all bytes with a value. The macros are used just for better readability; both use the intrinsic _mm512_set1_epi32.

Base64 encoding

Base64 encoding consists of the following steps:

  1. Load data: put 3-byte sequences into 4-byte words.
  2. Unpack 6-bit words into bytes in each 4-byte word.
  3. Translate all 6-bit words into ASCII codes according to the base64 table. The table contains capital letters, small letters, digits, the plus and the slash character ('A'-'Z', 'a'-'z', '0'-'9', '+' and '/').
  4. Save the result.
[figure: img/base64/base64-avx512-encoding.png]

Loading data

Gather-based

Since AVX512F lacks byte-level shuffling, loading data can be accomplished with a gather instruction. A gather does a parallel lookup in a given table; the indices are stored in a vector's elements. Gather is significantly faster than individual loads, especially when it touches a small number of cache lines. In this case a single gather touches at most two cache lines.

The input is a sequence of 48 bytes, or twelve 32-bit lanes. The layout of the input:

[PPPO|OONN|NMMM|LLLK|KKJJ|JIII|HHHG|GGFF|FEEE|DDDC|CCBB|BAAA]
  11   10    9    8    7    6    5    4    3    2    1    0

The result vector's layout:

[.PPP|.OOO|.NNN|.MMM|.LLL|.KKK|.JJJ|.III|.HHH|.GGG|.FFF|.EEE|.DDD|.CCC|.BBB|.AAA]
  15   14   13   12   11   10    9    8    7    6    5    4    3    2    1    0

The 4th byte of each lane is not important.

The vector of indices contains a sequence of multiples of 3: [0, 3, 6, 9, 12, and so on]. Loading data is then performed by a single instruction; please note that the last argument of _mm512_i32gather_epi32, scale, is 1.

static const uint32_t input_offsets[16] = {
     0*3,  1*3,  2*3,  3*3,
     4*3,  5*3,  6*3,  7*3,
     8*3,  9*3, 10*3, 11*3,
    12*3, 13*3, 14*3, 15*3
};

const __m512i input_gather = _mm512_loadu_si512(reinterpret_cast<const __m512i*>(input_offsets));

for (...) {
        const __m512i in = _mm512_i32gather_epi32(input_gather, (const int*)(input), 1);

        // ...
}

Vectorized approach

Although gather-based loading is pretty simple and straightforward, the instruction's latency is really huge: 15 cycles.

A vectorized approach loads 64 bytes, of which only 48 bytes are used, and performs all shifting and shuffling within registers. AVX512F offers the shuffling instruction vpermd (_mm512_permutexvar_epi32), which works at the 32-bit word level and can move words across lanes.

The algorithm:

  1. Load 64 bytes:

    [PPPO|OONN|NMMM|LLLK|KKJJ|JIII|HHHG|GGFF|FEEE|DDDC|CCBB|BAAA]
      11   10    9    8    7    6    5    4    3    2    1    0
    

    The 48-byte array could be split into four 12-byte subarrays:

    t1 = [D2 D1 D0 C2|C1 C0 B2 B1|B0 A2 A1 A0] x 4
            word 2      word 1      word 0
    
                12-byte subarray
    
  2. Each 12-byte array is then placed in a separate 128-bit lane. This is done by _mm512_permutexvar_epi32. Note the order of words: 0, 1, 1, 2.

    t2 = [D2 D1 D0 C2|C1 C0 B2 B1|C1 C0 B2 B1|B0 A2 A1 A0] x 4
    
  3. Shift bytes C right by 16 bits (_mm512_mask_srli_epi64). Now bytes A and C are at correct places.

    t3 = [.. .. D2 D1|D0 C2 C1 C0|C1 C0 B2 B1|B0 A2 A1 A0] x 4
    
  4. Shift bytes D and B left by 8 bits (_mm512_slli_epi64). And now these bytes are also at the correct positions.

    t4 = [.. D2 D1 D0|C2 C1 C0 ..|C0 B2 B1 B0|A2 A1 A0 ..] x 4
    
  5. Merge results from 3rd and 4th steps (_mm512_ternarylogic_epi32). The 3rd byte of each word will be omitted by a lookup procedure, thus no masking is needed.

    t5 = [.. D2 D1 D0|.. C2 C1 C0|C0 B2 B1 B0|.. A2 A1 A0] x 4
    

An excerpt from sample code.

const uint32_t BIT_MERGE = 0xca; // for each bit: a ? b : c

const __m512i tmp1 = _mm512_loadu_si512(input + i);
const __m512i tmp2 = _mm512_permutexvar_epi32(
    _mm512_set_epi32(11, 10, 10, 9, 8, 7, 7, 6, 5, 4, 4, 3, 2, 1, 1, 0),
    tmp1
);
const __m512i tmp3 = _mm512_mask_srli_epi64(tmp2, 0xaa, tmp2, 16);
const __m512i tmp4 = _mm512_slli_epi64(tmp3, 8);
const __m512i tmp5 = _mm512_ternarylogic_epi64(_mm512_set1_epi64(0x00ffffff), tmp4, tmp3, BIT_MERGE);

Unpacking 6-bit words into bytes

A single 32-bit word's layout:

[????????|ccdddddd|bbbbcccc|aaaaaabb]

An expected output is:

[00dddddd|00cccccc|00bbbbbb|00aaaaaa]

Where a, b, c and d are bits of the individual 6-bit words. The question marks denote random, unimportant data.

[figure: img/base64/base64unpack.png]

Unfortunately this step can't use any fancy AVX512F instruction; a series of shifts followed by bit merges is the only solution. The bit merge is done by the ternary logic instruction.

template <int shift, uint32_t mask>
__m512i merge(__m512i target, __m512i src) {
    __m512i shifted;
    if (shift > 0) {
        shifted = _mm512_srli_epi32(src, shift);
    } else {
        shifted = _mm512_slli_epi32(src, -shift);
    }

    return _mm512_ternarylogic_epi32(_mm512_set1_epi32(mask), shifted, target, 0xca);
}

__m512i unpack(const __m512i in) {
    // [????????|ccdddddd|bbbbCCCC|aaaaaaBB]
    //           ^^       ^^^^^^^^       ^^
    //           lo        lo  hi        hi

    // [00000000|00000000|00000000|00aaaaaa]
    __m512i indices = _mm512_and_si512(_mm512_srli_epi32(in, 2), packed_dword(0x0000003f));

    // [00000000|00000000|00BB0000|00aaaaaa]
    indices = merge<-12, 0x00003000>(indices, in);

    // [00000000|00000000|00BBbbbb|00aaaaaa]
    indices = merge<  4, 0x00000f00>(indices, in);

    // [00000000|00CCCC00|00BBbbbb|00aaaaaa]
    indices = merge<-10, 0x003c0000>(indices, in);

    // [00000000|00CCCCcc|00BBbbbb|00aaaaaa]
    indices = merge<  6, 0x00030000>(indices, in);

    // [00dddddd|00CCCCcc|00BBbbbb|00aaaaaa]
    indices = merge< -8, 0x3f000000>(indices, in);

    return indices;
}

Number of constants: 6.

Number of instructions:

  • bit-and: 1,
  • ternary logic: 5,
  • shift: 6.

Total: 12 instructions.
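The same shift/mask pairs can be verified with a scalar model working on a single 32-bit word. This is an illustrative sketch; the masks and shift counts are copied from unpack above, and each merge line mirrors the ternary-logic bit merge:

```cpp
#include <cstdint>

// Scalar model of unpack(): expands [????????|ccdddddd|bbbbcccc|aaaaaabb]
// into [00dddddd|00cccccc|00bbbbbb|00aaaaaa].
uint32_t unpack_scalar(uint32_t in) {
    uint32_t idx = (in >> 2) & 0x0000003fu;                   // aaaaaa
    idx = (idx & ~0x00003000u) | ((in << 12) & 0x00003000u);  // BB
    idx = (idx & ~0x00000f00u) | ((in >>  4) & 0x00000f00u);  // bbbb
    idx = (idx & ~0x003c0000u) | ((in << 10) & 0x003c0000u);  // CCCC
    idx = (idx & ~0x00030000u) | ((in >>  6) & 0x00030000u);  // cc
    idx = (idx & ~0x3f000000u) | ((in <<  8) & 0x3f000000u);  // dddddd
    return idx;
}
```

For instance, packing the 6-bit values a=1, b=2, c=3, d=4 into the input layout gives the word 0x00C42004, and unpack_scalar recovers 0x04030201.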

Vectorized lookup — incremental

Introduction

The lookup procedure considers five ranges of characters: 'A'-'Z', 'a'-'z', '0'-'9', '+' and '/'. The procedure associates each range with a shift value, a constant that, added to the input value, gives the proper ASCII code.

The simplest algorithm calculates the shift incrementally; the pseudocode shows the idea.

char lookup(uint8_t i) {

    uint8_t shift = 'A';

    shift ^= mask_7bit(i >= 26) & 6;
    shift ^= mask_8bit(i >= 52) & 187;
    shift ^= mask_7bit(i >= 62) & 17;
    shift ^= mask_7bit(i >= 63) & 29;

    return (shift + i) & 0xff; // add modulo 256
}
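This pseudocode is easy to validate: a direct scalar transcription (a sketch for testing only, not repository code) must reproduce the standard base64 alphabet for all inputs 0..63:

```cpp
#include <cstdint>

// Scalar version of the incremental lookup. The xor constants 6, 187, 17
// and 29 step the shift from 'A' through the alphabet's range boundaries.
uint8_t lookup_scalar(uint8_t i) {
    uint8_t shift = 'A';
    if (i >= 26) shift ^= 6;
    if (i >= 52) shift ^= 187;
    if (i >= 62) shift ^= 17;
    if (i >= 63) shift ^= 29;
    return uint8_t(shift + i);  // uint8_t arithmetic is already modulo 256
}
```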

SWAR notes

The elementary operation is building a mask according to a comparison result; the methods to do this were described earlier.

As we can see, the masks for all but one constant can be 7-bit. However, we can split the constant 187 into its MSB and lower part. The lower part contributes to shift, while the MSB is used to adjust the sum shift + i. This solution has one important advantage: the sum never wraps around, so a plain addition can be used.

Pseudocode

Rewritten algorithm.

char lookup(uint8_t i) {

    uint8_t shift = 'A';

    shift  ^= mask_7bit(i >= 26) & 6;
    shift  ^= mask_7bit(i >= 62) & 17;
    shift  ^= mask_7bit(i >= 63) & 29;

    cmp_MSB = MSB(i >= 52);
    mask    = cmp_MSB - (cmp_MSB >> 7); // inlined mask_7bit

    // Note: order of updating shift isn't important
    shift  ^= mask & 59; // 187 & 0x7f

    uint8_t sum = shift + i;

    return sum ^ cmp_MSB;
}
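Again, a scalar transcription (illustrative only) confirms that splitting 187 into its low 7 bits (59) and the MSB gives the same alphabet. Note that all xor operands are below 128, so shift stays 7-bit and shift + i never reaches 256:

```cpp
#include <cstdint>

// Scalar check of the rewritten algorithm: 187's low bits are xor-ed into
// shift, its MSB is applied to the sum afterwards. Since shift <= 127 and
// i <= 63, `shift + i` cannot wrap around 256.
uint8_t lookup_scalar2(uint8_t i) {
    uint8_t shift = 'A';
    if (i >= 26) shift ^= 6;
    if (i >= 62) shift ^= 17;
    if (i >= 63) shift ^= 29;

    const uint8_t cmp_MSB = (i >= 52) ? 0x80 : 0x00;
    shift ^= (i >= 52) ? 59 : 0;       // 59 = 187 & 0x7f

    return uint8_t((shift + i) ^ cmp_MSB);
}
```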

C++ implementation

const uint8_t XOR_ALL = 0x96;

__m512i lookup_incremental_logic(const __m512i in) {

    __m512i shift = packed_byte('A');
    __m512i c0, c1, c2, c3;
    const __m512i MSB = packed_byte(0x80);

    // shift ^= cmp(i >= 26) & 6;
    c0 = _mm512_and_si512(_mm512_add_epi32(in, packed_byte(0x80 - 26)), MSB);
    c0 = _mm512_sub_epi32(c0, _mm512_srli_epi32(c0, 7));
    c0 = _mm512_and_si512(c0, packed_byte(6));

    // shift ^= cmp(i >= 52) & 187;
    c1 = _mm512_and_si512(_mm512_add_epi32(in, packed_byte(0x80 - 52)), MSB);
    const __m512i c1msb = c1;
    c1 = _mm512_sub_epi32(c1, _mm512_srli_epi32(c1, 7));
    c1 = _mm512_and_si512(c1, packed_byte(187 & 0x7f));

    // shift ^= cmp(i >= 62) & 17;
    c2 = _mm512_and_si512(_mm512_add_epi32(in, packed_byte(0x80 - 62)), MSB);
    c2 = _mm512_sub_epi32(c2, _mm512_srli_epi32(c2, 7));
    c2 = _mm512_and_si512(c2, packed_byte(17));

    // shift ^= cmp(i >= 63) & 29;
    c3 = _mm512_and_si512(_mm512_add_epi32(in, packed_byte(0x80 - 63)), MSB);
    c3 = _mm512_sub_epi32(c3, _mm512_srli_epi32(c3, 7));
    c3 = _mm512_and_si512(c3, packed_byte(29));

    shift = _mm512_ternarylogic_epi32(shift, c0, c1, XOR_ALL);
    shift = _mm512_ternarylogic_epi32(shift, c2, c3, XOR_ALL);

    // produce the result
    return _mm512_xor_si512(_mm512_add_epi32(in, shift), c1msb);
}

Number of constants: 10.

Number of instructions:

  • bit-and: 8,
  • bit-xor: 1,
  • ternary logic: 2,
  • add/sub: 9,
  • shift: 4.

Total: 24 instructions.

Improved C++ implementation

An expression like shift ^= cmp(...) & const can be expressed with a single ternary logic instruction. This transformation makes the above procedure slightly shorter: the separate bit-and and bit-xor instructions are folded into ternary logic instructions.

__m512i lookup_incremental_logic_improved(const __m512i in) {

    using namespace avx512f_swar;

    __m512i shift;
    __m512i c0, c1, c2, c3;
    const __m512i MSB = packed_byte(0x80);

    /*
        s m c | shift ^ (mask & constant)
        ------+--------------------------
        0 0 0 | 0
        0 0 1 | 0
        0 1 0 | 0
        0 1 1 | 1
        1 0 0 | 1
        1 0 1 | 1
        1 1 0 | 1
        1 1 1 | 0 -> 0x78
    */
    const uint8_t XOR_AND = 0x78;

    // shift ^= cmp(i >= 26) & 6;
    // shift ^= cmp(i >= 52) & 187;
    // shift ^= cmp(i >= 62) & 17;
    // shift ^= cmp(i >= 63) & 29;

    c0 = _mm512_cmpge_mask7bit(in, packed_byte(0x80 - 26));
    c1 = _mm512_and_si512(_mm512_add_epi32(in, packed_byte(0x80 - 52)), MSB);
    const __m512i c1msb = c1;
    c1 = _mm512_sub_epi32(c1, _mm512_srli_epi32(c1, 7));
    c2 = _mm512_cmpge_mask7bit(in, packed_byte(0x80 - 62));
    c3 = _mm512_cmpge_mask7bit(in, packed_byte(0x80 - 63));

    shift = _mm512_ternarylogic_epi32(packed_byte('A'), c0, packed_byte(6), XOR_AND);
    shift = _mm512_ternarylogic_epi32(shift, c1, packed_byte(187 & 0x7f), XOR_AND);
    shift = _mm512_ternarylogic_epi32(shift, c2, packed_byte(17), XOR_AND);
    shift = _mm512_ternarylogic_epi32(shift, c3, packed_byte(29), XOR_AND);

    // produce the result
    return _mm512_xor_si512(_mm512_add_epi32(in, shift), c1msb);
}

Storing data

Fortunately the result vector doesn't require any post-processing and can be stored directly.

_mm512_storeu_si512(ptr, result);

Base64 decoding

Base64 decoding consists of the following steps:

  1. Load data, a 64-byte vector.
  2. Translate each byte of the vector from base64 ASCII chars back into 6-bit values. At this stage input validation is performed.
  3. Pack 6-bit sequences into continuous bit string, forming a 48-byte array.
  4. Save the result.
[figure: img/base64/base64-avx512-decoding.png]

Loading data

Symmetrically to storing encoded data, loading the data doesn't require any effort; it is a simple load.

__m512i in = _mm512_loadu_si512(input);

Vectorized lookup

Introduction

Unlike encoding, decoding has to care about the input's validity. However, the scheme of decoding is similar to encoding: depending on the byte's range (A-Z, a-z, 0-9, + or /) a shift value is selected, and the shift is added to the byte, forming a 6-bit value. If a byte doesn't match any range, then the shift value is zero, denoting invalid input.

A very important observation is that a valid input can't contain any extended ASCII characters, i.e. all bytes should be less than 0x80. Before the lookup procedure the input bytes are reduced to 7-bit values. This leads to simpler code, as the SWAR procedures don't have to deal with full 8-bit numbers. Information about invalid, extended ASCII chars is used during the validation step.

Pseudocode for decoding, summing up the above observations.

uint8_t decode_byte(uint8_t byte) {

    uint8_t shift = 0;          // invalid value

    input = byte & 0x7f;        // reducing range to 0..0x7f

    if (input >= 'A' && input <= 'Z')
        shift = -65;

    if (input >= 'a' && input <= 'z')
        shift = -71;

    if (input >= '0' && input <= '9')
        shift = 4;

    if (input == '+')
        shift = 19;

    if (input == '/')
        shift = 16;

    if (shift == 0 || (byte & 0x80))
        // an invalid byte or an extended ASCII char
        report error

    // sum modulo 256, as shift can be less than 0
    return (byte + shift) & 0xff;
}

SWAR notes

One of the most important parts of the procedure is range checking. Since we know how to express the relation >=, the range check can be written as:

if (byte >= lo && !(byte >= hi))
    // ...

This involves two additions and one and-not operation, followed by mask forming. And since building the mask requires the MSB value, we can use ternary logic to do all the needed binary operations in one instruction:

const __m512i L = _mm512_add_epi32(a, packed_byte(lo));
const __m512i H = _mm512_add_epi32(a, packed_byte(hi));

// MSB = L & ~H & 0x80
const __m512i MSB = _mm512_ternarylogic_epi32(packed_byte(0x80), H, L, 0x20);

Equality comparison (required for the chars '+' and '/') has, perhaps surprisingly, the same complexity as range checking, so the latter operation is used.

Adding the shift to the input uses arithmetic modulo 256.

Pseudocode

uint8_t decode_byte(uint8_t byte) {

    uint8_t shift = 0;

    input    = byte & 0x7f;        // reducing range to 0..0x7f

    // determine shift
    mask1    = mask_8bit(input >= 'A' && !input >= 'Z' + 1)
    range_AZ = mask1 & -65;

    mask2    = mask_8bit(input >= 'a' && !input >= 'z' + 1)
    range_az = mask2 & -71;

    MSB      = mask_1bit(input >= '0' && !input >= '9' + 1)
    range_09 = MSB >> 5; // value 4

    mask3    = mask_7bit(input >= '+' && !input >= '+' + 1)
    plus     = mask3 & 19;

    mask4    = mask_7bit(input >= '/' && !input >= '/' + 1)
    slash    = mask4 & 16;

    shift   = range_AZ | range_az | range_09 | plus | slash;

    // addition modulo 256
    shift06 = shift & 0x7f;
    shift7  = shift & 0x80;
    result  = (byte + shift06) ^ shift7;

    // valid iff: shift is non-zero and byte's MSB is zero

    non_zero_shift_7lower = shift06 + 0x7f;
    non_zero_shift_MSB    = shift7;

    valid = (non_zero_shift_7lower | non_zero_shift_MSB) & ~byte

    if ((valid & 0x80) != 0x80) {
        // invalid byte or extended ASCII char
        report error
    }

    return result;
}
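A scalar transcription of the pseudocode (an illustrative sketch; decode_scalar and its ok flag are not part of the repository) shows the expected values and both failure modes:

```cpp
#include <cstdint>

// Scalar model of decode_byte: returns the 6-bit value and reports validity
// via *ok. A byte is invalid if no range matched or its MSB is set.
uint8_t decode_scalar(uint8_t byte, bool* ok) {
    const uint8_t input = byte & 0x7f;   // reduce to 0..0x7f
    int shift = 0;                       // 0 marks an invalid byte

    if (input >= 'A' && input <= 'Z') shift = -65;
    if (input >= 'a' && input <= 'z') shift = -71;
    if (input >= '0' && input <= '9') shift = 4;
    if (input == '+')                 shift = 19;
    if (input == '/')                 shift = 16;

    *ok = (shift != 0) && (byte < 0x80);
    return uint8_t(byte + shift);        // addition modulo 256
}
```

For example, 'A' decodes to 0 and '/' to 63, while '=' and any byte with the MSB set are flagged as invalid.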

C++ implementation

The vectorized version of the pseudocode; most binary operations are expressed with ternary logic instructions:

__m512i lookup_comparisons(const __m512i input) {

    const __m512i in = _mm512_and_si512(input, packed_byte(0x7f));

    #define GET_RANGE_SHIFT_7BIT(shift, lo, hi) \
    _mm512_and_si512(packed_byte(uint8_t(shift)), \
                     _mm512_range_mask_7bit(in, packed_byte(0x80 - (lo)), packed_byte(0x80 - (hi))))

    #define GET_RANGE_SHIFT_8BIT(shift, lo, hi) \
    _mm512_and_si512(packed_byte(uint8_t(shift)), \
                     _mm512_range_mask_8bit(in, packed_byte(0x80 - (lo)), packed_byte(0x80 - (hi))))

    #define GET_RANGE_SHIFT_3RD_BIT(lo, hi) \
    _mm512_range_3rd_bit(in, packed_byte(0x80 - (lo)), packed_byte(0x80 - (hi)))


    const __m512i range_AZ   = GET_RANGE_SHIFT_8BIT   (-65, 'A', 'Z' + 1);
    const __m512i range_az   = GET_RANGE_SHIFT_8BIT   (-71, 'a', 'z' + 1);
    const __m512i range_09   = GET_RANGE_SHIFT_3RD_BIT('0', '9' + 1);
    const __m512i char_plus  = GET_RANGE_SHIFT_7BIT   (19, '+', '+' + 1);
    const __m512i char_slash = GET_RANGE_SHIFT_7BIT   (16, '/', '/' + 1);

    // shift = range_AZ | range_az | range_09 | char_plus | char_slash
    const uint8_t OR_ALL  = 0xfe;
    const __m512i tmp     = _mm512_ternarylogic_epi32(range_AZ, range_az, range_09, OR_ALL);
    const __m512i shift   = _mm512_ternarylogic_epi32(char_plus, char_slash, tmp,   OR_ALL);

    // (shift + input) & 0xff
    const __m512i shift06 = _mm512_and_si512(shift, packed_byte(0x7f));
    const __m512i shift7  = _mm512_and_si512(shift, packed_byte(0x80));
    const __m512i result  = _mm512_xor_si512(_mm512_add_epi32(in, shift06), shift7);

    // validation
    const __m512i non_zero_7lower = _mm512_add_epi32(shift06, packed_byte(0x7f));
    const __m512i MSB = packed_byte(0x80);
    const __m512i valid = _mm512_ternarylogic_epi32(non_zero_7lower, shift, input, 0x54);

    const auto mask = _mm512_cmpneq_epi32_mask(MSB, _mm512_and_si512(valid, MSB));
    if (mask) {
        report_exception(input);
    }

    return result;
}

And the SWAR procedures for the mask calculations:

// returns packed (a[i] >= lo and a[i] <= hi) ? 0x7f : 0x00;
__m512i _mm512_range_mask_7bit(const __m512i a, const __m512i lo, const __m512i hi) {

    const __m512i L = _mm512_add_epi32(a, lo);
    const __m512i H = _mm512_add_epi32(a, hi);

    // MSB = L & ~H & 0x80
    const __m512i MSB = _mm512_ternarylogic_epi32(packed_byte(0x80), H, L, 0x20);

    // MSB - (MSB >> 7)
    return _mm512_sub_epi32(MSB, _mm512_srli_epi32(MSB, 7));
}

// returns packed (a[i] >= lo and a[i] <= hi) ? 0xff : 0x00;
__m512i _mm512_range_mask_8bit(const __m512i a, const __m512i lo, const __m512i hi) {

    const __m512i L = _mm512_add_epi32(a, lo);
    const __m512i H = _mm512_add_epi32(a, hi);

    const __m512i MSB = _mm512_ternarylogic_epi32(packed_byte(0x80), H, L, 0x20);

    // MSB | (MSB - (MSB >> 7))
    return _mm512_or_si512(MSB, _mm512_sub_epi32(MSB, _mm512_srli_epi32(MSB, 7)));
}


// returns packed (a[i] >= lo and a[i] <= hi) ? 0x04 : 0x00;
__m512i _mm512_range_3rd_bit(const __m512i a, const __m512i lo, const __m512i hi) {

    const __m512i L = _mm512_add_epi32(a, lo);
    const __m512i H = _mm512_add_epi32(a, hi);

    const __m512i MSB = _mm512_ternarylogic_epi32(packed_byte(0x80), H, L, 0x20);

    return _mm512_srli_epi32(MSB, 5);
}
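The 7-bit variant can be modelled byte-wise in scalar code. This is a hypothetical helper for illustration; unlike the intrinsic version, it takes the unbiased bounds and folds the 0x80 biasing in itself:

```cpp
#include <cstdint>

// Scalar model of _mm512_range_mask_7bit: each result byte is 0x7f iff
// lo <= a[i] < hi, otherwise 0x00. Assumes every byte of `a` is < 0x80,
// so the per-byte additions cannot carry across byte boundaries.
uint32_t range_mask_7bit(uint32_t a, uint8_t lo, uint8_t hi) {
    const uint32_t L = a + 0x01010101u * uint8_t(0x80 - lo);
    const uint32_t H = a + 0x01010101u * uint8_t(0x80 - hi);

    // MSB = L & ~H & 0x80: set iff a[i] >= lo and not a[i] >= hi
    const uint32_t MSB = L & ~H & 0x80808080u;

    return MSB - (MSB >> 7);
}
```

For the bytes ['[', '@', 'Z', 'A'] tested against 'A'..'Z', only the two letters produce 0x7f.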

Packing data

The result of the lookup procedure is a vector of 32-bit lanes. Each lane has the following layout:

[00dddddd|00cccccc|00bbbbbb|00aaaaaa]

Intermediate result:

[00000000|aaaaaabb|bbbbcccc|ccdddddd]

And finally, after byte swapping it is:

[00000000|ccdddddd|bbbbcccc|aaaaaabb]
[figure: img/base64/base64pack.png]

Packing has exactly the same problems as the unpacking step in encoding: only shifts and bit-merges can be used. The code uses the same merge template as encoding.

__m512i pack_improved(const __m512i in) {

    // in = |00dddddd|00cccccc|00bbbbbb|00aaaaaa|
    // masked(x, m) is a helper equal to _mm512_and_si512(x, packed_dword(m))

    // t0 = |00000000|00000000|00000000|aaaaaa00|
    const __m512i t0 = _mm512_slli_epi32(masked(in, 0x0000003f), 2);

    // t1 = |00000000|00000000|00000000|aaaaaabb|
    const __m512i t1 = merge<12, 0x00000003>(t0, in);

    // t2 = |00000000|00000000|bbbb0000|aaaaaabb|
    const __m512i t2 = merge<-4, 0x0000f000>(t1, in);

    // t3 = |00000000|00000000|bbbbcccc|aaaaaabb|
    const __m512i t3 = merge<10, 0x00000f00>(t2, in);

    // t4 = |00000000|cc000000|bbbbcccc|aaaaaabb|
    const __m512i t4 = merge<-6, 0x00c00000>(t3, in);

    // t5 = |00000000|ccdddddd|bbbbcccc|aaaaaabb|
    const __m512i t5 = merge< 8, 0x003f0000>(t4, in);

    return t5;
}
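As with unpacking, a scalar model (illustrative sketch only) verifies the shift/mask pairs of pack_improved:

```cpp
#include <cstdint>

// Scalar model of pack_improved: compresses [00dddddd|00cccccc|00bbbbbb|00aaaaaa]
// into [00000000|ccdddddd|bbbbcccc|aaaaaabb]. The target bits are disjoint,
// so plain or-ing stands in for the bit merges.
uint32_t pack_scalar(uint32_t in) {
    uint32_t t = (in & 0x0000003fu) << 2;  // aaaaaa00
    t |= (in >> 12) & 0x00000003u;         // ......bb
    t |= (in <<  4) & 0x0000f000u;         // bbbb....
    t |= (in >> 10) & 0x00000f00u;         // ....cccc
    t |= (in <<  6) & 0x00c00000u;         // cc......
    t |= (in >>  8) & 0x003f0000u;         // ..dddddd
    return t;
}
```

It is the exact inverse of the unpack step: packing the unpacked values a=1, b=2, c=3, d=4 (0x04030201) yields the original word 0x00C42004.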

Storing data

Scatter-based

We can store 48 bytes using the scatter instruction, which saves 32-bit lanes at certain offsets [0, 3, 6, 9, and so on]; the vector of offsets is exactly the same as the one used by encoding.

According to the documentation, scatter stores items sequentially. Since the indices are in ascending order, the 4th byte of each lane will be overwritten by data from the next lane.

_mm512_i32scatter_epi32(out, scatter_offsets, result, 1);

Vectorized approach

It is possible to do all the shifting within registers; unfortunately there is no instruction allowing a shift across a 64-bit boundary. This makes the decoding procedure more complicated than its encoding counterpart.

The algorithm:

  1. Input is treated as four 128-bit lanes. The layout of a lane:

    [.. D2 D1 D0|.. C2 C1 C0|.. B2 B1 B0|.. A2 A1 A0] x 4
      3           2           1           0
    
  2. Shift fields C and A left by 8 bits (_mm512_mask_slli_epi32):

    [.. D2 D1 D0|C2 C1 C0 ..|.. B2 B1 B0|A2 A1 A0 ..] x 4
    
  3. Shift fields D, B and A right by 8 and 24 bits (_mm512_srlv_epi64):

    [.. .. .. ..|D2 D1 D0 C2|.. .. B2 B1|B0 A2 A1 A0] x 4
    

    Now only bytes C2 and C1 have to be moved to word 1, and this is an expensive procedure.

  4. Shift the input vector right by a word (_mm512_alignr_epi32):

    [E2 E1 E0 ..|.. D2 D1 D0|C2 C1 C0 ..|.. B2 B1 B0] x 4
                             ^^^^^
                    only these bytes are needed
    
  5. Shift and mask bytes C2 and C1 (_mm512_maskz_slli_epi32):

    [.. .. .. ..|.. .. .. ..|C1 C0 .. ..|.. .. .. ..] x 4
    
  6. Merge vectors from 3rd and 5th steps (_mm512_ternarylogic_epi32):

    [.. .. .. ..|D2 D1 D0 C2|C1 C0 B2 B1|B0 A2 A1 A0] x 4
    
  7. Shuffle words from 128-bit lanes to form 48-byte result (_mm512_permutexvar_epi32).

An excerpt from sample code.

const uint32_t BIT_MERGE = 0xca;

const __m512i t1 = _mm512_mask_slli_epi32(packed, 0x5555, input, 8);

const __m512i s2 = _mm512_setr_epi64(8, 24, 8, 24, 8, 24, 8, 24);
const __m512i t2 = _mm512_srlv_epi64(t1, s2);

const __m512i t3 = _mm512_alignr_epi32(t1, t1, 1);

const __m512i t4 = _mm512_maskz_slli_epi32(0x2222, t3, 8);

const __m512i m5 = _mm512_setr_epi32(
    0x00000000, 0xffff0000, 0x00000000, 0x00000000,
    0x00000000, 0xffff0000, 0x00000000, 0x00000000,
    0x00000000, 0xffff0000, 0x00000000, 0x00000000,
    0x00000000, 0xffff0000, 0x00000000, 0x00000000);

const __m512i t5 = _mm512_ternarylogic_epi32(m5, t4, t2, BIT_MERGE);

const __m512i s6 = _mm512_setr_epi32(
     0,  1,  2,
     4,  5,  6,
     8,  9, 10,
    12, 13, 14,
    // unused
     0,  0,  0, 0);

const __m512i t6 = _mm512_permutexvar_epi32(s6, t5);

Base64 coding and decoding using gather

The repository also contains lookup procedures that use gather instructions. The procedures are pretty short and clean (no SWAR tricks); however, they use quite large tables: 4 * 4 * 64 bytes (1 kB) for encoding and 4 * 4 * 256 bytes (4 kB) for decoding.

And the most important fact is that the performance of both procedures is worse than their vectorized counterparts.

Experiment results

Sample programs from the repository test dozens of different variants of scalar, SSE and AVX2 implementations. I picked the best methods from the scalar, SSE and AVX2 groups.

The test programs were compiled with GCC 5.3.0 and run on a Knights Landing processor, model 7210.

Note that tests perform whole encoding/decoding procedures, thus memory transfers are included in results.

Encoding

See all measurements. The best non-AVX512F procedures:

  • scalar — a 64 bit variant;
  • SSE — fully unrolled, with an improved lookup;
  • AVX2 — unrolled with an improved lookup.
procedure   AVX512-specific               time [s]   speed-up relative to ...
            load         lookup                      AVX2   SSE    scalar
---------   ----------   -------------   --------   ----   ----   ------
scalar      N/A          N/A             0.58455    0.52   0.50   1.00
SSE         N/A          N/A             0.29187    1.04   1.00   2.00
AVX2        N/A          N/A             0.30332    1.00   0.96   1.93
AVX512      vectorized   vectorized      0.12472    2.43   2.34   4.69
AVX512      vectorized   improved        0.11348    2.67   2.57   5.15
AVX512      gather       improved        0.12745    2.38   2.29   4.59
AVX512      vectorized   binary search   0.12521    2.24   2.33   4.67
AVX512      gather                       0.25674    1.18   1.14   2.28

Decoding

See all measurements. The best non-AVX512F procedures:

  • scalar — an improved version;
  • SSE — base lookup, pack naive;
  • AVX2 — base lookup, pack naive.
procedure   AVX512-specific            time [s]   speed-up relative to ...
            lookup       store                    AVX2   SSE    scalar
---------   ----------   ----------   --------   ----   ----   ------
scalar      N/A          N/A          0.2289     0.36   0.54   1.00
SSE         N/A          N/A          0.1234     0.67   1.00   1.85
AVX2        N/A          N/A          0.0832     1.00   1.48   2.75
AVX512      vectorized   scatter      0.0519     1.72   2.55   4.74
AVX512      vectorized                0.0441     1.89   2.80   5.19
AVX512      gather                    0.0519     1.60   2.38   4.41

Conclusions

  • The AVX512F implementations are always faster.
  • Surprisingly, even gather-based decoding is faster than the plain vector AVX2 procedure.
  • Although the gather and scatter instructions are pretty fast, the vectorized versions of load and store outperform them.
  • On Knights Landing, AVX2 code is merely 1.2-1.3 times faster than SSE code.

Further work

I've deliberately left two ideas aside:

Acknowledgments

This work wouldn't be possible without Daniel Lemire, who gave me access to a brand new machine with a Knights Landing CPU, so I could do all the experiments on a real processor. Daniel also shared some ideas about an early version of this text. Thank you!

Source code

The GitHub repository contains all the procedures and sample programs.