Author: | Wojciech Muła |
---|---|

Date: | 2016-11-28 |

Updated on: | 2017-04-29 (ARMv8 results) 2017-01-30 (better timings from Raspberry Pi 3), 2017-01-26 (ARM Neon & results from Raspberry Pi 3), 2017-01-25 (spelling), 2016-12-22 (added results from Bulldozer), 2016-11-30 (spelling) |

Popular programming languages provide methods or functions which locate a
substring in a given string. In C it is function `strstr`, C++
class `std::string` has method `find`, Python `string` has methods
`pos` and `index`, and so on, so forth. All these APIs were designed for
**one-shot searches**. During past decades several algorithms to solve this
problem were designed, an excellent page by **Christian Charras** and
**Thierry Lecroq** lists most of them (if not all). Basically these
algorithms could be split into two major categories: (1) based on
Deterministic Finite Automaton, like Knuth-Morris-Pratt, Boyer Moore, etc.,
and (2) based on simple comparison, like Karp-Rabin algorithm.

The main problem with these standard algorithms is a silent assumption that comparing a pair of characters, looking up in an extra table and conditions are cheap, while comparing two substrings is expansive.

But current desktop CPUs do not meet this assumption, in particular:

- There is no difference in comparing one, two, four or 8 bytes on a 64-bit CPU. When a processor supports SIMD instructions, then comparing vectors (it means 16, 32 or even 64 bytes) is as cheap as comparing a single byte.
- Thus comparing short sequences of chars can be faster than fancy algorithms which avoids such comparison.
- Looking up in a table costs one memory fetch, so at least a L1 cache round (~3 cycles). Reading char-by-char also cost as much cycles.
- Mispredicted jumps cost several cycles of penalty (~10-20 cycles).
- There is a short chain of dependencies: read char, compare it, conditionally jump, which make hard to utilize out-of-order execution capabilities present in a CPU.

This article shows two approaches utilizing SIMD instructions which I've already described in SIMD-friendly Rabin-Karp modification and SSE4 string search — modification of Karp-Rabin algorithm. I merged the articles, compare these two methods and extended material. Article shows also performance results for various implementations, ranging from SWAR to AVX512F.

The Karp-Rabin algorithm does the exact substring comparison whenever **a weak
hashes** are equal. One hash is calculated just once for searched substring,
and another is calculated for string's portion; in every iteration the second
hash is updated at small cost. Following code shows the idea:

k := substring length h1 := hash(substring) h2 := hash(string[i .. i + k]) for i in 0 .. n - k loop if h1 == h2 then if substring == string[i .. i + k] then return i end if end if h = next_hash(h, ...) # this meant to be cheap end loop

SIMD solutions replace the hash predicate with **a vector predicate**, which
is calculated in parallel and, hopefully, is calculated fast. For each
"true" element of predicate vector an exact comparison of substrings is
performed.

This is one source of improvement, another is a careful implementation.
A generic implementation calls a function like `memcmp` to compare
substrings. But while we know the length of searched substring, we may
provide specialisations for certain lengths, where a subprocedure call
is replaced by a few CPU instructions, even just one. Thanks to that
cost of calling the procedure and all internal `memcmp` costs are
simply ridden off.

This algorithm is suitable for all SIMD instruction sets and also SWAR. It
uses equality of **the first** and **the last** characters from substring as a
predicate.

These two characters are populated in two registers, **F** and **L**
respectively. Then in each iteration two chunks of strings are loaded. The
first chunk (**A**) is read from offset `i` (where `i` is the current
offset) and the second chunk (**B**) is read from offset `i + k - 1`, where
`k` is substring's length.

Then we compute a vector expression `F == A and B == L`. This step yields a
byte vector (or a bit mask), where "true" values denote position of potential
substring occurrences. Finally just at these positions an exact comparisons of
substrings are done.

Let's assume 8-byte registers. We're searching for word "cat", thus:

F = [ c | c | c | c | c | c | c | c ] L = [ t | t | t | t | t | t | t | t ]

We're searching in string "a_cat_tries". In the first iteration register
**A** gets data from offset 0, **B** from offset 2:

A = [ a | _ | c | a | t | _ | t | r ] B = [ c | a | t | _ | t | r | i | e ]

Now we compare:

AF = (A == F) = [ 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 ] BL = (B == L) = [ 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 ]

After merging comparison results, i.e. `AF & BL`, we get mask:

mask = [ 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 ]

Since mask is non-zero, it means there are possible substring occurrences. As we see there is only one non-zero element at index 2, thus only one substring comparison must be performed.

Choosing the first and the last character from a substring is not always a wise decision. Consider following scenario: a string contains mostly 'A' characters, and a user wants to find "AjohndoeA" — in such situation the number of char-wise would be large.

In order to prevent such situations an implementation can pick "last" character as the farthest character not equal to the first one. If there is no such character, it means that all characters in substring are the same (for example "AAAAA"). A specialised procedure may be used to handle such patterns.

Both SSE and AVX2 versions are practically the same, and both use the minimum number of instruction. Below is a generic AVX2 version.

It's worth to note that since we already know that the first and the last
characters match, we don't need to compare them again with `memcmp`.

size_t sse2_strstr_anysize(const char* s, size_t n, const char* needle, size_t k) { const __m256i first = _mm256_set1_epi8(needle[0]); const __m256i last = _mm256_set1_epi8(needle[k - 1]); for (size_t i = 0; i < n; i += 32) { const __m256i block_first = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(s + i)); const __m256i block_last = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(s + i + k - 1)); const __m256i eq_first = _mm256_cmpeq_epi8(first, block_first); const __m256i eq_last = _mm256_cmpeq_epi8(last, block_last); uint32_t mask = _mm256_movemask_epi8(_mm256_and_si256(eq_first, eq_last)); while (mask != 0) { const auto bitpos = bits::get_first_bit_set(mask); if (memcmp(s + i + bitpos + 1, needle + 1, k - 2) == 0) { return i + bitpos; } mask = bits::clear_leftmost_set(mask); } } return std::string::npos; }

SWAR comparison for equality uses bit xor operation, which yields zero when two bytes are equal. Therefore instead of anding partial results, a bitwise or is used. Clearly this part of algorithm has the same complexity as the SSE/AVX2 code.

However, SWAR requires more effort to locate zero bytes. Following procedure
calculates **an exact byte mask**, where MSBs of `zeros` are set when the
corresponding byte in `x` is zero.

// 7th bit set if lower 7 bits are zero const uint64_t t0 = (~x & 0x7f7f7f7f7f7f7f7fllu) + 0x0101010101010101llu; // 7th bit set if 7th bit is zero const uint64_t t1 = (~x & 0x8080808080808080llu); uint64_t zeros = t0 & t1;

C++ implementation for 64-bit vectors. The while loop contains an additional condition which might look not optimal. But searching the first set bit and later clearing it (as the SSE version does) is slower.

size_t swar64_strstr_anysize(const char* s, size_t n, const char* needle, size_t k) { const uint64_t first = 0x0101010101010101llu * static_cast<uint8_t>(needle[0]); const uint64_t last = 0x0101010101010101llu * static_cast<uint8_t>(needle[k - 1]); uint64_t* block_first = reinterpret_cast<uint64_t*>(const_cast<char*>(s)); uint64_t* block_last = reinterpret_cast<uint64_t*>(const_cast<char*>(s + k - 1)); for (auto i=0u; i < n; i+=8, block_first++, block_last++) { const uint64_t eq = (*block_first ^ first) | (*block_last ^ last); const uint64_t t0 = (~eq & 0x7f7f7f7f7f7f7f7fllu) + 0x0101010101010101llu; const uint64_t t1 = (~eq & 0x8080808080808080llu); uint64_t zeros = t0 & t1; size_t j = 0; while (zeros) { if (zeros & 0x80) { const char* substr = reinterpret_cast<char*>(block_first) + j + 1; if (memcmp(substr, needle + 1, k - 2) == 0) { return i + j; } } zeros >>= 8; j += 1; } } return std::string::npos; }

AVX512F lacks of operations on bytes, the smallest vector item is a 32-bit word. The limitation forces us to use SWAR techniques.

- Using AVX512F instructions we compare two vectors, like in SWAR version, i.e. two xors joined with bitwise or. There is only one difference, a single ternary logic instruction expresses one xor and bitwise or.
- Using AVX512F instructions we locate which 32-bit elements contain any zero byte.
- Then for such 32-bit element check four substrings for equality.

Unlike the SWAR procedure, where we need a precise mask for zero bytes, an AVX512F procedure requires just information "a word has zero byte". A simpler algorithm, described in Bit Twiddling Hacks is used; below is its C++ implementation.

__mmask16 zero_byte_mask(const __m512i v) { const __m512i v01 = _mm512_set1_epi32(0x01010101u); const __m512i v80 = _mm512_set1_epi32(0x80808080u); const __m512i v1 = _mm512_sub_epi32(v, v01); // tmp1 = (v - 0x01010101) & ~v & 0x80808080 const __m512i tmp1 = _mm512_ternarylogic_epi32(v1, v, v80, 0x20); return _mm512_test_epi32_mask(tmp1, tmp1); }

Generic C++ implementation.

#define _mm512_set1_epu8(c) _mm512_set1_epi32(uint32_t(c) * 0x01010101u) size_t avx512f_strstr_v2_anysize(const char* string, size_t n, const char* needle, size_t k) { assert(n > 0); assert(k > 0); const __m512i first = _mm512_set1_epu8(needle[0]); const __m512i last = _mm512_set1_epu8(needle[k - 1]); char* haystack = const_cast<char*>(string); char* end = haystack + n; for (/**/; haystack < end; haystack += 64) { const __m512i block_first = _mm512_loadu_si512(haystack + 0); const __m512i block_last = _mm512_loadu_si512(haystack + k - 1); const __m512i first_zeros = _mm512_xor_si512(block_first, first); // zeros = first_zeros | (block_last ^ last) const __m512i zeros = _mm512_ternarylogic_epi32(first_zeros, block_last, last, 0xf6); uint32_t mask = zero_byte_mask(zeros); while (mask) { const uint64_t p = __builtin_ctz(mask); if (memcmp(haystack + 4*p + 0, needle, k) == 0) { return (haystack - string) + 4*p + 0; } if (memcmp(haystack + 4*p + 1, needle, k) == 0) { return (haystack - string) + 4*p + 1; } if (memcmp(haystack + 4*p + 2, needle, k) == 0) { return (haystack - string) + 4*p + 2; } if (memcmp(haystack + 4*p + 3, needle, k) == 0) { return (haystack - string) + 4*p + 3; } mask = bits::clear_leftmost_set(mask); } } return size_t(-1); }

The algorithm can be also easily realised using ARM Neon instructions, having 128-bit SIMD registers. The only problem is caused by long round trip from the Neon unit back to the CPU.

It was solved by saving back the comparison result in a 64 bit word in memory: lower nibbles come from the lower half of a SIMD register, likewise higher nibbles come from the higher half of the register.

Then comparison is done in two loops, separately for lower and higher nibbles, to detect substring occurrences in the correct order.

Below is a sample implementation.

size_t FORCE_INLINE neon_strstr_anysize(const char* s, size_t n, const char* needle, size_t k) { assert(k > 0); assert(n > 0); const uint8x16_t first = vdupq_n_u8(needle[0]); const uint8x16_t last = vdupq_n_u8(needle[k - 1]); const uint8x8_t half = vdup_n_u8(0x0f); const uint8_t* ptr = reinterpret_cast<const uint8_t*>(s); union { uint8_t tmp[8]; uint32_t word[2]; }; for (size_t i = 0; i < n; i += 16) { const uint8x16_t block_first = vld1q_u8(ptr + i); const uint8x16_t block_last = vld1q_u8(ptr + i + k - 1); const uint8x16_t eq_first = vceqq_u8(first, block_first); const uint8x16_t eq_last = vceqq_u8(last, block_last); const uint8x16_t pred_16 = vandq_u8(eq_first, eq_last); const uint8x8_t pred_8 = vbsl_u8(half, vget_low_u8(pred_16), vget_high_u8(pred_16)); vst1_u8(tmp, pred_8); if ((word[0] | word[1]) == 0) { continue; } for (int j=0; j < 8; j++) { if (tmp[j] & 0x0f) { if (memcmp(s + i + j + 1, needle + 1, k - 2) == 0) { return i + j; } } } for (int j=0; j < 8; j++) { if (tmp[j] & 0xf0) { if (memcmp(s + i + j + 1 + 8, needle + 1, k - 2) == 0) { return i + j + 8; } } } } return std::string::npos; }

It appeared that unrolling the two inner loops brought about 1.2 speedup.

AArch64 code is almost the exact copy of the above ARM Neon procedure. The only exception is direct reading of SIMD registers lanes, as the architecture made this operation fast.

SSE4.1 and AVX2 provide instruction `MPSADBW`, which calculates eight
Manhattan distances (L1) between given 4-byte sub-vector from one register
and eight subsequent 4-byte sub-vector from second register. The instruction
returns vector of eight words (16-bit values).

When two sub-vectors are equal, then the L1 distance is 0, and we may use this
property to locate possible substring locations. In other words **equality of
four leading characters** is used as a predicate.

Albeit it seems to be a stronger predicate than matching the first and the last characters, a quadratic complexity is unavoidable. For example, when the searched string contains one letter "a", and we're looking for "aaaabcde", then the predicate obviously will be true for all input characters.

If it isn't enough, there are following problems:

- This method handles substring not shorter than four characters. Handling three-char substrings is viable, but require additional code.
- SSE variant of
`MPSADBW`processes only 8 bytes at once, while the generic SIMD variant uses the whole register. - AVX2 variant of
`MPSADBW`works on lanes, i.e. 128-bit helves of a register rather than the whole 256-bit register. This imposes additional code to properly load data. - Latency of the instruction is pretty hight — 5 or 7 cycles, depending on CPU architecture. Luckily throughput is 1 or 2 cycles, thus unrolling a loop can hide latency.

The generic, simplest implementation.

size_t sse4_strstr_anysize(const char* s, size_t n, const char* needle, size_t needle_size) { const __m128i prefix = _mm_loadu_si128(reinterpret_cast<const __m128i*>(needle)); const __m128i zeros = _mm_setzero_si128(); for (size_t i = 0; i < n; i += 8) { const __m128i data = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + i)); const __m128i result = _mm_mpsadbw_epu8(data, prefix, 0); const __m128i cmp = _mm_cmpeq_epi16(result, zeros); unsigned mask = _mm_movemask_epi8(cmp) & 0x5555; while (mask != 0) { const auto bitpos = bits::get_first_bit_set(mask)/2; if (memcmp(s + i + bitpos + 4, needle + 4, needle_size - 4) == 0) { return i + bitpos; } mask = bits::clear_leftmost_set(mask); } } return std::string::npos; }

Although AVX512F doesn't support `MPSADBW` (AVX512BW defines it) we still
can use 4-byte prefix equality as a predicate, utilizing fact that 32-bit
elements are natively supported.

In each iteration we generate four AVX512 vectors containing all possible 4-byte prefixes. Example:

string = "the-cat-tries-to-eat..." vec0 = [ t | h | e | - ][ c | a | t | - ][ t | r | i | e ][ s | - | t | o ][ ... ] vec1 = [ h | e | - | c ][ a | t | - | t ][ r | i | e | s ][ - | t | o | - ][ ... ] vec2 = [ e | - | c | a ][ t | - | t | r ][ i | e | s | - ][ t | o | - | e ][ ... ] vec3 = [ - | c | a | t ][ - | t | r | i ][ e | s | - | t ][ o | - ] e | a ][ ... ]

Vector `vec0` contains prefixes for position 0, 4, 8, 12, ...;
`vec1` — 1, 5, 9, 13, ..., `vec2` — 2, 6, 10, 14, ...;
`vec3` — 3, 7, 11, 15, etc.

Building each vector require two shifts and one bitwise or. In each iteration four vector comparison are performed and then four bitmasks are examined. This make a loop, which compares substrings, quite complicated.

Moreover, to properly fill the last elements of vectors we need four bytes
beyond vector. This is accomplished by having two adjacent vectors per
iterations (one load per iteration is needed, though). Finally, instruction
`VPALIGNR` is used to extract required data.

size_t avx512f_strstr_long(const char* string, size_t n, const char* needle, size_t k) { __m512i curr; __m512i next; __m512i v0, v1, v2, v3; char* haystack = const_cast<char*>(string); char* last = haystack + n; const uint32_t prf = *(uint32_t*)needle; // the first 4 bytes of needle const __m512i prefix = _mm512_set1_epi32(prf); next = _mm512_loadu_si512(haystack); for (/**/; haystack < last; haystack += 64) { curr = next; next = _mm512_loadu_si512(haystack + 64); const __m512i shft = _mm512_alignr_epi32(next, curr, 1); v0 = curr; { const __m512i t1 = _mm512_srli_epi32(curr, 8); const __m512i t2 = _mm512_slli_epi32(shft, 24); v1 = _mm512_or_si512(t1, t2); } { const __m512i t1 = _mm512_srli_epi32(curr, 16); const __m512i t2 = _mm512_slli_epi32(shft, 16); v2 = _mm512_or_si512(t1, t2); } { const __m512i t1 = _mm512_srli_epi32(curr, 24); const __m512i t2 = _mm512_slli_epi32(shft, 8); v3 = _mm512_or_si512(t1, t2); } uint16_t m0 = _mm512_cmpeq_epi32_mask(v0, prefix); uint16_t m1 = _mm512_cmpeq_epi32_mask(v1, prefix); uint16_t m2 = _mm512_cmpeq_epi32_mask(v2, prefix); uint16_t m3 = _mm512_cmpeq_epi32_mask(v3, prefix); int index = 64; while (m0 | m1 | m2 | m3) { if (m0) { int pos = __builtin_ctz(m0) * 4 + 0; m0 = m0 & (m0 - 1); if (pos < index && memcmp(haystack + pos + 4, needle + 4, k - 4) == 0) { index = pos; } } if (m1) { int pos = __builtin_ctz(m1) * 4 + 1; m1 = m1 & (m1 - 1); if (pos < index && memcmp(haystack + pos + 4, needle + 4, k - 4) == 0) { index = pos; } } if (m2) { int pos = __builtin_ctz(m2) * 4 + 2; m2 = m2 & (m2 - 1); if (pos < index && memcmp(haystack + pos + 4, needle + 4, k - 4) == 0) { index = pos; } } if (m3) { int pos = __builtin_ctz(m3) * 4 + 3; m3 = m3 & (m3 - 1); if (pos < index && memcmp(haystack + pos + 4, needle + 4, k - 4) == 0) { index = pos; } } } if (index < 64) { return (haystack - string) + index; } } return size_t(-1); }

SSE4.2 introduced String and Text New Instructions (STNI), a set of very complex instructions that were meant to be a building block for string operations. Unfortunately, Intel practically discontinued STNI in newer processors, hasn't introduced AVX2 versions of STNI and make them extremely slow (11 cycles latency is unacceptable).

Basically `PCMPxSTRx` instruction exists in four variants, which differs
only in:

- A way to determine string length: the length might be given explicitly or the first zero byte marks string end, as in traditional C strings.
- How the result is saved, it might be either a bit-mask/byte-mask or the number of first/last bit set in the bit-mask.

Additional instruction's argument (immediate constant) defines several aspects
of execution, specifically the algorithm of comparison. There are four
different algorithms available, one we're using is called **range ordered**.
Despite the name, this algorithm locates substring, or its prefix if a
substring goes beyond register width.

For example, when we're searching "ABCD" in "ABCD_ABC_ABCD_AB" the instruction
returns bitmask 0b0100001000000001, treating suffix "AB" as a match. Thus we
can safely assume that only **the first character matches**, as tail might or
might not be present in a register. (Of course it can be determined, but
require additional calculations which is not very handy.)

Below is a code snippet which does the above operation.

const char* s1 = "ABCD_ABC_ABCD_AB"; const char* s2 = "ABCD____________"; const int mode = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED | _SIDD_BIT_MASK; const __m128i N = _mm_loadu_si128((__m128i*)(s2)); const __m128i D = _mm_loadu_si128((__m128i*)(s1)); const __m128i res = _mm_cmpestrm(N, 4, D, 16, mode); uint64_t mask = _mm_cvtsi128_si64(res); // = 0b0100001000000001

size_t sse42_strstr_anysize(const char* s, size_t n, const char* needle, size_t k) { const __m128i N = _mm_loadu_si128((__m128i*)needle); for (size_t i = 0; i < n; i += 16) { const int mode = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED | _SIDD_BIT_MASK; const __m128i D = _mm_loadu_si128((__m128i*)(s + i)); const __m128i res = _mm_cmpestrm(N, k, D, n - i, mode); uint64_t mask = _mm_cvtsi128_si64(res); while (mask != 0) { const auto bitpos = bits::get_first_bit_set(mask); // we know that at least the first character of needle matches if (memcmp(s + i + bitpos + 1, needle + 1, k - 1) == 0) { return i + bitpos; } mask = bits::clear_leftmost_set(mask); } } return std::string::npos; }

Performance of various SIMD implementations were measured. Test programs also
have got specialisation for short substrings, that are selected at run time.
Performance of a C `strstr` is included for comparison. I omitted C++
`string::find` due to performance bug in GNU libc which makes the method
10 times slower than `strstr`.

Test programs were run three times. Following computers, running either Debian or Ubuntu, were tested:

- Westmere i5 M540, GCC 6.2.0,
- Bulldozer FX-8150 CPU, GCC 4.8.4
- Haswell i7 4470, GCC 5.4.1,
- Skylake i7 6700, GCC 5.4.1,
- Knights Landing (KNL) 7210, GCC 5.3.0.
- ARMv7 (Raspberry Pi 3, 32-bit code), GCC 4.9.2
- ARMv8 (ARM Cortex A57 - AMD Opteron A1100, 64-bit code), Clang 3.8.0

procedure | time in seconds | ||||
---|---|---|---|---|---|

Westemere | Bulldozer | Haswell | Skylake | KNL | |

std::strstr | 0.82246 | 9.37792 | 0.52786 | 0.66148 | 4.94606 |

std::string::find | --- | --- | --- | --- | --- |

SWAR 64-bit (generic) | 2.49859 | 2.93836 | 1.57715 | 1.40404 | 8.17075 |

SSE2 (generic) | 0.74589 |
0.78871 |
0.55435 | 0.48863 |
6.10786 |

SSE4.1 (MPSADBW) | 1.45040 | 1.98863 | 0.89775 | 0.63875 |
18.71666 |

SSE4.1 (MPSADBW unrolled) | 1.23849 | 2.06008 | 0.99647 | 0.87919 | 13.72486 |

SSE4.2 (PCMPESTRM) | 1.69968 | 2.00681 | 1.55992 | 1.39063 | 6.28869 |

AVX2 (MPSADBW) | 0.61578 | 0.56981 |
13.15136 | ||

AVX2 (generic) | 0.38653 |
0.36309 |
4.09478 |
||

AVX512F (MPSADBW-like) | 2.32616 |
||||

AVX512F (generic) | 1.14057 |
||||

biggest speed-up to strstr |
1.10 | --- | 1.37 | 1.82 | 4.33 |

- Performance of
`strstr`on the machine with Bulldozer is terrible, it'd pointless to use it as a reference.

procedure | ||
---|---|---|

ARMv7 | ARMv8 | |

std::strstr | 7.30405 | 3.37546 |

std::string::find | 4.17131 | 1.81368 |

SWAR 64-bit (generic) | 36.65012 | 0.46269 |

SWAR 32-bit (generic) | 2.45058 | 0.81075 |

ARM Neon (32-bit, generic) | 1.29861 | 0.40699 |

AArch64 (64-bit, generic) | --- | 0.27897 |

biggest speed-up to std::string::find |
3.1 | 6.5 |

- The generic SIMD algorithm outperforms C
`strstr`on all platforms. An implementation should use the highest SIMD version available on a certain CPU. `MPSADBW`performs pretty bad, with an exception for Skylake. Knights Landing performance is terrible.`PCMPESTRM`performs worse than`MPSADBW`.- ARM Neon performance is pretty good even for SWAR implementation.
The SWAR version is 1.7 times faster than
`string::find`, SIMD version is 3.1 times faster. - AArch64 performance of scalar SWAR64 is almost as good as 32-bit SIMD procedure.
- Comparison with
`strstr`might be considered unfair, as the procedure deals with string of unknown length, while my implementations get lengths and take advantage of this. I fully agree. - Procedures I implemented are also unsafe, because might read data off the input string. This may lead to access violation if strings are located just before unmapped memory. And for sure address sanitizers will complain. Making the procedures safe is feasible, but it wasn't my goal.

Daniel Lemire has gave me access to Haswell, Skylake, KNL, Bulldozer and ARMv8 machines, where I compiled and run test programs. Thank you!

All implementations and tests programs are available at github.