AVX512: ternary functions evaluation

Author:Wojciech Muła
Added on:2015-03-22
Updates:2016-11-07

Contents

Introduction

Intel's version of SIMD offers following 2-argument (binary) boolean functions: and, or, xor, and not. There isn't a single argument not, this function can be expressed with xor reg, ones, however it requires additional, pre-set register.

AVX512F will come with a very interesting instruction called vpternlog. There are two variants of the instruction operating on a packed 32-bit (vpternlogd) or a 64-bit vector (vpternlogq), however they do exactly the same thing — evaluate a 3-argument (ternary) boolean function on each bit of arguments, the function is given as a truth table.

The pattern of a truth table:

inputs result
A B C
0 0 0 a
0 0 1 b
0 1 0 c
0 1 1 d
1 0 0 e
1 0 1 f
1 1 0 g
1 1 1 h

A programmer supplies only the result column, i.e. defines values of bits a through h, this is a single 8-bit value.

Depending on function complexity, a single vpternlog instruction can replace from one up to eight SIMD instructions.

So far there are no processors supporting AVX512, so it's a question how fast this instruction will be. I guess it couldn't be as fast as existing boolean instructions (1 cycle latency, 0.33 cycles throughput), but who knows.

Bit select function new

A ternary function may be seen as a select function, i.e. the most significant bit A select one of binary functions of B and C. The function for A=0 is described by bits a..d, the function for A=1 by bits e..h.

As @solardiz noted this function is available in OpenCL as bitselect(), AMD XOP has instruction VPCMOV, AltiVec has VSEL, NEON has VBSL, also several GPUs support such instruction.

Example A ? B : C.

inputs result
A B C
0 0 0 0
0 0 1 0
0 1 0 1
0 1 1 1
1 0 0 0
1 0 1 1
1 1 0 0
1 1 1 1
constant 0xac

The assembly code using binary functions:

; A - xmm0 and result
; B - xmm1
; C - xmm2

pand     xmm1, xmm0
pandn    xmm0, xmm2
por      xmm0, xmm1

Other examples

Artifical function

Let see for example function (A or not B) and C. The truth table:

inputs result
A B C
0 0 0 0
0 0 1 1
0 1 0 0
0 1 1 0
1 0 0 0
1 0 1 1
1 1 0 0
1 1 1 1

To express the function using standard SSE instructions we need three operations:

; A - xmm0 and result
; B - xmm1
; C - xmm2

pcmpeqb  xmm4, xmm4     // constant, could exist earlier
pxor     xmm0, xmm4     // A1 := not A
por      xmm0, xmm1     // AB := A or B1
pand     xmm0, xmm2     // result := AB or C

With AVX512 it would be very simple:

; a - zmm0 and result
; b - zmm1
; c - zmm2

vpternlog zmm0, zmm1, zmm2, 0xa2 // 0xa2 = 0b10100010

OR/AND/XOR all

inputs OR AND XOR
A B C
0 0 0 0 0 0
0 0 1 1 0 1
0 1 0 1 0 1
0 1 1 1 0 0
1 0 0 1 0 1
1 0 1 1 0 0
1 1 0 1 0 0
1 1 1 1 1 1
constant 0xfe 0x80 0x96

The assembly code of or all using binary functions:

; A - xmm0 and result
; B - xmm1
; C - xmm2

por      xmm0, xmm1
por      xmm0, xmm2

Exactly one bit is set

Function is true when only one bit is set.

inputs result
C B A
0 0 0 0
0 0 1 1
0 1 0 1
0 1 1 0
1 0 0 1
1 0 1 0
1 1 0 0
1 1 1 0
constant 0x16

The C code using binary functions:

// t1 = ~(A | B) & C
__m512i t1 = _mm512i_andnot_si512(_mm512_or_si512(A, B), C);

// t2 = ~C & (A ^ B)
__m512i t2 = _mm512i_andnot_si512(C, _mm512_xor_si512(A, B));

__m512i result = _mm512_or_si512(t1, t2);

The procedure uses five instructions.

Exactly two bits are set

Function is true exactly two bits are set.

inputs result
C B A
0 0 0 0
0 0 1 0
0 1 0 0
0 1 1 1
1 0 0 0
1 0 1 1
1 1 0 1
1 1 1 0
constant 0x68

The C code using binary functions:

// t1 = ~C & (A & B)
__m512i t1 = _mm512i_andnot_si512(C, _mm512_and_si512(A, B));

// t2 = C & (A ^ B)
__m512i t2 = _mm512i_and_si512(C, _mm512_xor_si512(A, B));

__m512i result = _mm512_or_si512(t1, t2);

The procedure uses five instructions.

Real-world example

The crucial function of Harley-Seal population count algorithm is carry-save adder. That function calculates two bits of 3-argument sum, i.e. the digit and the carry flag.

inputs outputs
C B A digit carry
0 0 0 0 0
0 0 1 1 0
0 1 0 1 0
0 1 1 0 1
1 0 0 1 0
1 0 1 0 1
1 1 0 0 1
1 1 1 1 0

The fastest implementation of above functions contains five logic instructions:

tmp   = a ^ b
carry = (a & b) | (tmp & c)
digit = tmp ^ c;

With help of vpternlogd (intrinsic _mm512_ternarylogic_epi32) this can be calculated as:

l = _mm512_ternarylogic_epi32(c, b, a, 0x96); // 1001_0110
h = _mm512_ternarylogic_epi32(c, b, a, 0xe8); // 0110_1000

See also

Changes