SIMD-friendly algorithms for substring searching

Author:Wojciech Muła
Date:2016-11-28
Updated on:2017-04-29 (ARMv8 results)

Introduction

Popular programming languages provide methods or functions which locate a substring in a given string. In C it is function strstr, C++ class std::string has method find, Python string has methods pos and index, and so on, so forth. All these APIs were designed for one-shot searches. During past decades several algorithms to solve this problem were designed, an excellent page by Christian Charras and Thierry Lecroq lists most of them (if not all). Basically these algorithms could be split into two major categories: (1) based on Deterministic Finite Automaton, like Knuth-Morris-Pratt, Boyer Moore, etc., and (2) based on simple comparison, like Karp-Rabin algorithm.

The main problem with these standard algorithms is a silent assumption that comparing a pair of characters, looking up in an extra table and conditions are cheap, while comparing two substrings is expansive.

But current desktop CPUs do not meet this assumption, in particular:

Solution

This article shows two approaches utilizing SIMD instructions which I've already described in SIMD-friendly Rabin-Karp modification and SSE4 string search — modification of Karp-Rabin algorithm. I merged the articles, compare these two methods and extended material. Article shows also performance results for various implementations, ranging from SWAR to AVX512F.

The Karp-Rabin algorithm does the exact substring comparison whenever a weak hashes are equal. One hash is calculated just once for searched substring, and another is calculated for string's portion; in every iteration the second hash is updated at small cost. Following code shows the idea:

k  := substring length
h1 := hash(substring)
h2 := hash(string[i .. i + k])
for i in 0 .. n - k loop
    if h1 == h2 then
        if substring == string[i .. i + k] then
            return i
        end if
    end if

    h = next_hash(h, ...) # this meant to be cheap
end loop

SIMD solutions replace the hash predicate with a vector predicate, which is calculated in parallel and, hopefully, is calculated fast. For each "true" element of predicate vector an exact comparison of substrings is performed.

This is one source of improvement, another is a careful implementation. A generic implementation calls a function like memcmp to compare substrings. But while we know the length of searched substring, we may provide specialisations for certain lengths, where a subprocedure call is replaced by a few CPU instructions, even just one. Thanks to that cost of calling the procedure and all internal memcmp costs are simply ridden off.

Algorithm 1: Generic SIMD

Algorithm

This algorithm is suitable for all SIMD instruction sets and also SWAR. It uses equality of the first and the last characters from substring as a predicate.

These two characters are populated in two registers, F and L respectively. Then in each iteration two chunks of strings are loaded. The first chunk (A) is read from offset i (where i is the current offset) and the second chunk (B) is read from offset i + k - 1, where k is substring's length.

Then we compute a vector expression F == A and B == L. This step yields a byte vector (or a bit mask), where "true" values denote position of potential substring occurrences. Finally just at these positions an exact comparisons of substrings are done.

Example

Let's assume 8-byte registers. We're searching for word "cat", thus:

F    = [ c | c | c | c | c | c | c | c ]
L    = [ t | t | t | t | t | t | t | t ]

We're searching in string "a_cat_tries". In the first iteration register A gets data from offset 0, B from offset 2:

A    = [ a | _ | c | a | t | _ | t | r ]
B    = [ c | a | t | _ | t | r | i | e ]

Now we compare:

AF   = (A == F)
     = [ 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 ]

BL   = (B == L)
     = [ 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 ]

After merging comparison results, i.e. AF & BL, we get mask:

mask = [ 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 ]

Since mask is non-zero, it means there are possible substring occurrences. As we see there is only one non-zero element at index 2, thus only one substring comparison must be performed.

First and last?

Choosing the first and the last character from a substring is not always a wise decision. Consider following scenario: a string contains mostly 'A' characters, and a user wants to find "AjohndoeA" — in such situation the number of char-wise would be large.

In order to prevent such situations an implementation can pick "last" character as the farthest character not equal to the first one. If there is no such character, it means that all characters in substring are the same (for example "AAAAA"). A specialised procedure may be used to handle such patterns.

Implementation

SSE & AVX2

Both SSE and AVX2 versions are practically the same, and both use the minimum number of instruction. Below is a generic AVX2 version.

It's worth to note that since we already know that the first and the last characters match, we don't need to compare them again with memcmp.

size_t avx2_strstr_anysize(const char* s, size_t n, const char* needle, size_t k) {

    const __m256i first = _mm256_set1_epi8(needle[0]);
    const __m256i last  = _mm256_set1_epi8(needle[k - 1]);

    for (size_t i = 0; i < n; i += 32) {

        const __m256i block_first = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(s + i));
        const __m256i block_last  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(s + i + k - 1));

        const __m256i eq_first = _mm256_cmpeq_epi8(first, block_first);
        const __m256i eq_last  = _mm256_cmpeq_epi8(last, block_last);

        uint32_t mask = _mm256_movemask_epi8(_mm256_and_si256(eq_first, eq_last));

        while (mask != 0) {

            const auto bitpos = bits::get_first_bit_set(mask);

            if (memcmp(s + i + bitpos + 1, needle + 1, k - 2) == 0) {
                return i + bitpos;
            }

            mask = bits::clear_leftmost_set(mask);
        }
    }

    return std::string::npos;
}

SWAR

SWAR comparison for equality uses bit xor operation, which yields zero when two bytes are equal. Therefore instead of anding partial results, a bitwise or is used. Clearly this part of algorithm has the same complexity as the SSE/AVX2 code.

However, SWAR requires more effort to locate zero bytes. Following procedure calculates an exact byte mask, where MSBs of zeros are set when the corresponding byte in x is zero.

// 7th bit set if lower 7 bits are zero
const uint64_t t0 = (~x & 0x7f7f7f7f7f7f7f7fllu) + 0x0101010101010101llu;
// 7th bit set if 7th bit is zero
const uint64_t t1 = (~x & 0x8080808080808080llu);
uint64_t zeros = t0 & t1;

C++ implementation for 64-bit vectors. The while loop contains an additional condition which might look not optimal. But searching the first set bit and later clearing it (as the SSE version does) is slower.

size_t swar64_strstr_anysize(const char* s, size_t n, const char* needle, size_t k) {

    const uint64_t first = 0x0101010101010101llu * static_cast<uint8_t>(needle[0]);
    const uint64_t last  = 0x0101010101010101llu * static_cast<uint8_t>(needle[k - 1]);

    uint64_t* block_first = reinterpret_cast<uint64_t*>(const_cast<char*>(s));
    uint64_t* block_last  = reinterpret_cast<uint64_t*>(const_cast<char*>(s + k - 1));

    for (auto i=0u; i < n; i+=8, block_first++, block_last++) {
        const uint64_t eq = (*block_first ^ first) | (*block_last ^ last);

        const uint64_t t0 = (~eq & 0x7f7f7f7f7f7f7f7fllu) + 0x0101010101010101llu;
        const uint64_t t1 = (~eq & 0x8080808080808080llu);
        uint64_t zeros = t0 & t1;
        size_t j = 0;

        while (zeros) {
            if (zeros & 0x80) {
                const char* substr = reinterpret_cast<char*>(block_first) + j + 1;
                if (memcmp(substr, needle + 1, k - 2) == 0) {
                    return i + j;
                }
            }

            zeros >>= 8;
            j += 1;
        }
    }

    return std::string::npos;
}

AVX512F

AVX512F lacks of operations on bytes, the smallest vector item is a 32-bit word. The limitation forces us to use SWAR techniques.

  1. Using AVX512F instructions we compare two vectors, like in SWAR version, i.e. two xors joined with bitwise or. There is only one difference, a single ternary logic instruction expresses one xor and bitwise or.
  2. Using AVX512F instructions we locate which 32-bit elements contain any zero byte.
  3. Then for such 32-bit element check four substrings for equality.

Unlike the SWAR procedure, where we need a precise mask for zero bytes, an AVX512F procedure requires just information "a word has zero byte". A simpler algorithm, described in Bit Twiddling Hacks is used; below is its C++ implementation.

__mmask16 zero_byte_mask(const __m512i v) {

    const __m512i v01  = _mm512_set1_epi32(0x01010101u);
    const __m512i v80  = _mm512_set1_epi32(0x80808080u);

    const __m512i v1   = _mm512_sub_epi32(v, v01);
    // tmp1 = (v - 0x01010101) & ~v & 0x80808080
    const __m512i tmp1 = _mm512_ternarylogic_epi32(v1, v, v80, 0x20);

    return _mm512_test_epi32_mask(tmp1, tmp1);
}

Generic C++ implementation.

#define _mm512_set1_epu8(c) _mm512_set1_epi32(uint32_t(c) * 0x01010101u)

size_t avx512f_strstr_v2_anysize(const char* string, size_t n, const char* needle, size_t k) {

    assert(n > 0);
    assert(k > 0);

    const __m512i first = _mm512_set1_epu8(needle[0]);
    const __m512i last  = _mm512_set1_epu8(needle[k - 1]);

    char* haystack = const_cast<char*>(string);
    char* end      = haystack + n;

    for (/**/; haystack < end; haystack += 64) {

        const __m512i block_first = _mm512_loadu_si512(haystack + 0);
        const __m512i block_last  = _mm512_loadu_si512(haystack + k - 1);

        const __m512i first_zeros = _mm512_xor_si512(block_first, first);
        // zeros = first_zeros | (block_last ^ last)
        const __m512i zeros = _mm512_ternarylogic_epi32(first_zeros, block_last, last, 0xf6);

        uint32_t mask = zero_byte_mask(zeros);
        while (mask) {

            const uint64_t p = __builtin_ctz(mask);

            if (memcmp(haystack + 4*p + 0, needle, k) == 0) {
                return (haystack - string) + 4*p + 0;
            }

            if (memcmp(haystack + 4*p + 1, needle, k) == 0) {
                return (haystack - string) + 4*p + 1;
            }

            if (memcmp(haystack + 4*p + 2, needle, k) == 0) {
                return (haystack - string) + 4*p + 2;
            }

            if (memcmp(haystack + 4*p + 3, needle, k) == 0) {
                return (haystack - string) + 4*p + 3;
            }

            mask = bits::clear_leftmost_set(mask);
        }
    }

    return size_t(-1);
}

ARM Neon (32 bit code)

The algorithm can be also easily realised using ARM Neon instructions, having 128-bit SIMD registers. The only problem is caused by long round trip from the Neon unit back to the CPU.

It was solved by saving back the comparison result in a 64 bit word in memory: lower nibbles come from the lower half of a SIMD register, likewise higher nibbles come from the higher half of the register.

Then comparison is done in two loops, separately for lower and higher nibbles, to detect substring occurrences in the correct order.

Below is a sample implementation.

size_t FORCE_INLINE neon_strstr_anysize(const char* s, size_t n, const char* needle, size_t k) {

    assert(k > 0);
    assert(n > 0);

    const uint8x16_t first = vdupq_n_u8(needle[0]);
    const uint8x16_t last  = vdupq_n_u8(needle[k - 1]);
    const uint8x8_t  half  = vdup_n_u8(0x0f);

    const uint8_t* ptr = reinterpret_cast<const uint8_t*>(s);

    union {
        uint8_t  tmp[8];
        uint32_t word[2];
    };

    for (size_t i = 0; i < n; i += 16) {

        const uint8x16_t block_first = vld1q_u8(ptr + i);
        const uint8x16_t block_last  = vld1q_u8(ptr + i + k - 1);

        const uint8x16_t eq_first = vceqq_u8(first, block_first);
        const uint8x16_t eq_last  = vceqq_u8(last, block_last);
        const uint8x16_t pred_16  = vandq_u8(eq_first, eq_last);
        const uint8x8_t pred_8    = vbsl_u8(half, vget_low_u8(pred_16), vget_high_u8(pred_16));

        vst1_u8(tmp, pred_8);

        if ((word[0] | word[1]) == 0) {
            continue;
        }

        for (int j=0; j < 8; j++) {
            if (tmp[j] & 0x0f) {
                if (memcmp(s + i + j + 1, needle + 1, k - 2) == 0) {
                    return i + j;
                }
            }
        }

        for (int j=0; j < 8; j++) {
            if (tmp[j] & 0xf0) {
                if (memcmp(s + i + j + 1 + 8, needle + 1, k - 2) == 0) {
                    return i + j + 8;
                }
            }
        }
    }

    return std::string::npos;
}

It appeared that unrolling the two inner loops brought about 1.2 speedup.

AArch64 (64 bit code) new

AArch64 code is almost the exact copy of the above ARM Neon procedure. The only exception is direct reading of SIMD registers lanes, as the architecture made this operation fast.

Algorithm 2: SSE-specific (MPSADBW)

Algorithm

SSE4.1 and AVX2 provide instruction MPSADBW, which calculates eight Manhattan distances (L1) between given 4-byte sub-vector from one register and eight subsequent 4-byte sub-vector from second register. The instruction returns vector of eight words (16-bit values).

When two sub-vectors are equal, then the L1 distance is 0, and we may use this property to locate possible substring locations. In other words equality of four leading characters is used as a predicate.

Albeit it seems to be a stronger predicate than matching the first and the last characters, a quadratic complexity is unavoidable. For example, when the searched string contains one letter "a", and we're looking for "aaaabcde", then the predicate obviously will be true for all input characters.

If it isn't enough, there are following problems:

  • This method handles substring not shorter than four characters. Handling three-char substrings is viable, but require additional code.
  • SSE variant of MPSADBW processes only 8 bytes at once, while the generic SIMD variant uses the whole register.
  • AVX2 variant of MPSADBW works on lanes, i.e. 128-bit helves of a register rather than the whole 256-bit register. This imposes additional code to properly load data.
  • Latency of the instruction is pretty hight — 5 or 7 cycles, depending on CPU architecture. Luckily throughput is 1 or 2 cycles, thus unrolling a loop can hide latency.

Implementation

SSE

The generic, simplest implementation.

size_t sse4_strstr_anysize(const char* s, size_t n, const char* needle, size_t needle_size) {

    const __m128i prefix = _mm_loadu_si128(reinterpret_cast<const __m128i*>(needle));
    const __m128i zeros  = _mm_setzero_si128();

    for (size_t i = 0; i < n; i += 8) {

        const __m128i data   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + i));
        const __m128i result = _mm_mpsadbw_epu8(data, prefix, 0);

        const __m128i cmp    = _mm_cmpeq_epi16(result, zeros);

        unsigned mask = _mm_movemask_epi8(cmp) & 0x5555;

        while (mask != 0) {

            const auto bitpos = bits::get_first_bit_set(mask)/2;

            if (memcmp(s + i + bitpos + 4, needle + 4, needle_size - 4) == 0) {
                return i + bitpos;
            }

            mask = bits::clear_leftmost_set(mask);
        }
    }

    return std::string::npos;
}

AVX512F

Although AVX512F doesn't support MPSADBW (AVX512BW defines it) we still can use 4-byte prefix equality as a predicate, utilizing fact that 32-bit elements are natively supported.

In each iteration we generate four AVX512 vectors containing all possible 4-byte prefixes. Example:

string = "the-cat-tries-to-eat..."

vec0  = [ t | h | e | - ][ c | a | t | - ][ t | r | i | e ][ s | - | t | o ][ ... ]
vec1  = [ h | e | - | c ][ a | t | - | t ][ r | i | e | s ][ - | t | o | - ][ ... ]
vec2  = [ e | - | c | a ][ t | - | t | r ][ i | e | s | - ][ t | o | - | e ][ ... ]
vec3  = [ - | c | a | t ][ - | t | r | i ][ e | s | - | t ][ o | - ] e | a ][ ... ]

Vector vec0 contains prefixes for position 0, 4, 8, 12, ...; vec1 — 1, 5, 9, 13, ..., vec2 — 2, 6, 10, 14, ...; vec3 — 3, 7, 11, 15, etc.

Building each vector require two shifts and one bitwise or. In each iteration four vector comparison are performed and then four bitmasks are examined. This make a loop, which compares substrings, quite complicated.

Moreover, to properly fill the last elements of vectors we need four bytes beyond vector. This is accomplished by having two adjacent vectors per iterations (one load per iteration is needed, though). Finally, instruction VPALIGNR is used to extract required data.

size_t avx512f_strstr_long(const char* string, size_t n, const char* needle, size_t k) {

    __m512i curr;
    __m512i next;
    __m512i v0, v1, v2, v3;

    char* haystack = const_cast<char*>(string);
    char* last     = haystack + n;

    const uint32_t prf   = *(uint32_t*)needle; // the first 4 bytes of needle
    const __m512i prefix = _mm512_set1_epi32(prf);

    next = _mm512_loadu_si512(haystack);

    for (/**/; haystack < last; haystack += 64) {

        curr = next;
        next = _mm512_loadu_si512(haystack + 64);
        const __m512i shft = _mm512_alignr_epi32(next, curr, 1);

        v0 = curr;

        {
            const __m512i t1 = _mm512_srli_epi32(curr, 8);
            const __m512i t2 = _mm512_slli_epi32(shft, 24);
            v1 = _mm512_or_si512(t1, t2);
        }
        {
            const __m512i t1 = _mm512_srli_epi32(curr, 16);
            const __m512i t2 = _mm512_slli_epi32(shft, 16);
            v2 = _mm512_or_si512(t1, t2);
        }
        {
            const __m512i t1 = _mm512_srli_epi32(curr, 24);
            const __m512i t2 = _mm512_slli_epi32(shft, 8);
            v3 = _mm512_or_si512(t1, t2);
        }

        uint16_t m0 = _mm512_cmpeq_epi32_mask(v0, prefix);
        uint16_t m1 = _mm512_cmpeq_epi32_mask(v1, prefix);
        uint16_t m2 = _mm512_cmpeq_epi32_mask(v2, prefix);
        uint16_t m3 = _mm512_cmpeq_epi32_mask(v3, prefix);

        int index = 64;
        while (m0 | m1 | m2 | m3) {
            if (m0) {
                int pos = __builtin_ctz(m0) * 4 + 0;
                m0 = m0 & (m0 - 1);

                if (pos < index && memcmp(haystack + pos + 4, needle + 4, k - 4) == 0) {
                    index = pos;
                }
            }

            if (m1) {
                int pos = __builtin_ctz(m1) * 4 + 1;
                m1 = m1 & (m1 - 1);

                if (pos < index && memcmp(haystack + pos + 4, needle + 4, k - 4) == 0) {
                    index = pos;
                }
            }

            if (m2) {
                int pos = __builtin_ctz(m2) * 4 + 2;
                m2 = m2 & (m2 - 1);

                if (pos < index && memcmp(haystack + pos + 4, needle + 4, k - 4) == 0) {
                    index = pos;
                }
            }

            if (m3) {
                int pos = __builtin_ctz(m3) * 4 + 3;
                m3 = m3 & (m3 - 1);

                if (pos < index && memcmp(haystack + pos + 4, needle + 4, k - 4) == 0) {
                    index = pos;
                }
            }
        }

        if (index < 64) {
            return (haystack - string) + index;
        }
    }

    return size_t(-1);
}

Algorithm 3: SSE4.2-specific (PCMPESTRM)

Algorithm

SSE4.2 introduced String and Text New Instructions (STNI), a set of very complex instructions that were meant to be a building block for string operations. Unfortunately, Intel practically discontinued STNI in newer processors, hasn't introduced AVX2 versions of STNI and make them extremely slow (11 cycles latency is unacceptable).

Basically PCMPxSTRx instruction exists in four variants, which differs only in:

  1. A way to determine string length: the length might be given explicitly or the first zero byte marks string end, as in traditional C strings.
  2. How the result is saved, it might be either a bit-mask/byte-mask or the number of first/last bit set in the bit-mask.

Additional instruction's argument (immediate constant) defines several aspects of execution, specifically the algorithm of comparison. There are four different algorithms available, one we're using is called range ordered. Despite the name, this algorithm locates substring, or its prefix if a substring goes beyond register width.

For example, when we're searching "ABCD" in "ABCD_ABC_ABCD_AB" the instruction returns bitmask 0b0100001000000001, treating suffix "AB" as a match. Thus we can safely assume that only the first character matches, as tail might or might not be present in a register. (Of course it can be determined, but require additional calculations which is not very handy.)

Below is a code snippet which does the above operation.

const char* s1 = "ABCD_ABC_ABCD_AB";
const char* s2 = "ABCD____________";

const int mode = _SIDD_UBYTE_OPS
               | _SIDD_CMP_EQUAL_ORDERED
               | _SIDD_BIT_MASK;

const __m128i N   = _mm_loadu_si128((__m128i*)(s2));
const __m128i D   = _mm_loadu_si128((__m128i*)(s1));
const __m128i res = _mm_cmpestrm(N, 4, D, 16, mode);
uint64_t mask = _mm_cvtsi128_si64(res); // = 0b0100001000000001

Implementation

SSE

size_t sse42_strstr_anysize(const char* s, size_t n, const char* needle, size_t k) {

    const __m128i N = _mm_loadu_si128((__m128i*)needle);

    for (size_t i = 0; i < n; i += 16) {

        const int mode = _SIDD_UBYTE_OPS
                       | _SIDD_CMP_EQUAL_ORDERED
                       | _SIDD_BIT_MASK;

        const __m128i D   = _mm_loadu_si128((__m128i*)(s + i));
        const __m128i res = _mm_cmpestrm(N, k, D, n - i, mode);
        uint64_t mask = _mm_cvtsi128_si64(res);

        while (mask != 0) {

            const auto bitpos = bits::get_first_bit_set(mask);

            // we know that at least the first character of needle matches
            if (memcmp(s + i + bitpos + 1, needle + 1, k - 1) == 0) {
                return i + bitpos;
            }

            mask = bits::clear_leftmost_set(mask);
        }
    }

    return std::string::npos;
}

Performance results update

Performance of various SIMD implementations were measured. Test programs also have got specialisation for short substrings, that are selected at run time. Performance of a C strstr is included for comparison. I omitted C++ string::find due to performance bug in GNU libc which makes the method 10 times slower than strstr.

Test programs were run three times. Following computers, running either Debian or Ubuntu, were tested:

  1. Westmere i5 M540, GCC 6.2.0,
  2. Bulldozer FX-8150 CPU, GCC 4.8.4
  3. Haswell i7 4470, GCC 5.4.1,
  4. Skylake i7 6700, GCC 5.4.1,
  5. Knights Landing (KNL) 7210, GCC 5.3.0.
  6. ARMv7 (Raspberry Pi 3, 32-bit code), GCC 4.9.2
  7. ARMv8 (ARM Cortex A57 - AMD Opteron A1100, 64-bit code), Clang 3.8.0

x64 computers

procedure time in seconds
Westemere Bulldozer Haswell Skylake KNL
std::strstr 0.82246 9.37792 0.52786 0.66148 4.94606
std::string::find --- --- --- --- ---
SWAR 64-bit (generic) 2.49859 2.93836 1.57715 1.40404 8.17075
SSE2 (generic) 0.74589 0.78871 0.55435 0.48863 6.10786
SSE4.1 (MPSADBW) 1.45040 1.98863 0.89775 0.63875 18.71666
SSE4.1 (MPSADBW unrolled) 1.23849 2.06008 0.99647 0.87919 13.72486
SSE4.2 (PCMPESTRM) 1.69968 2.00681 1.55992 1.39063 6.28869
AVX2 (MPSADBW)     0.61578 0.56981 13.15136
AVX2 (generic)     0.38653 0.36309 4.09478
AVX512F (MPSADBW-like)         2.32616
AVX512F (generic)         1.14057
biggest speed-up to strstr 1.10 --- 1.37 1.82 4.33
  • Performance of strstr on the machine with Bulldozer is terrible, it'd pointless to use it as a reference.

ARM computers

procedure time in seconds
ARMv7 ARMv8
std::strstr 7.30405 3.37546
std::string::find 4.17131 1.81368
SWAR 64-bit (generic) 36.65012 0.46269
SWAR 32-bit (generic) 2.45058 0.81075
ARM Neon (32-bit, generic) 1.29861 0.40699
AArch64 (64-bit, generic) --- 0.27897
biggest speed-up to std::string::find 3.1 6.5

Conclusions and remarks

  • The generic SIMD algorithm outperforms C strstr on all platforms. An implementation should use the highest SIMD version available on a certain CPU.
  • MPSADBW performs pretty bad, with an exception for Skylake. Knights Landing performance is terrible.
  • PCMPESTRM performs worse than MPSADBW.
  • ARM Neon performance is pretty good even for SWAR implementation. The SWAR version is 1.7 times faster than string::find, SIMD version is 3.1 times faster.
  • AArch64 performance of scalar SWAR64 is almost as good as 32-bit SIMD procedure.
  • Comparison with strstr might be considered unfair, as the procedure deals with string of unknown length, while my implementations get lengths and take advantage of this. I fully agree.
  • Procedures I implemented are also unsafe, because might read data off the input string. This may lead to access violation if strings are located just before unmapped memory. And for sure address sanitizers will complain. Making the procedures safe is feasible, but it wasn't my goal.

Acknowledgments

Daniel Lemire has gave me access to Haswell, Skylake, KNL, Bulldozer and ARMv8 machines, where I compiled and run test programs. Thank you!

Source code

All implementations and tests programs are available at github.

History