GNU std::string::find is very slow

Author:Wojciech Muła
Added on:2016-10-08
Updated on:2016-11-27 (tidy), 2016-10-17 (link to GCC's bugtrack)

Today's weather was a really good excuse for staying at home and hacking. I come back to my old idea of speeding up substring searching exploiting SSE4.1 instruction MPSADBW. Finally I translated my old SSE code into AVX2. Then I transplanted my recent AVX512F code which employs similar techniques. In the end I coded AVX512F and SSE2 versions of SWAR-friendly technique which I developed some time ago. (Github repository contains all the stuff, if you wish to take a look.)

But let's put aside vector algorithms performance. Experiments have revelead that method find from C++ std::string is incredibly slow. The method can be slower an order of magnitude than strstr, and it doesn't get better with a newer stdlib — GCC versions varies from 4.9.2 (Debian) to 5.4.0 (Ubuntu).

Here is the table I took from the repository's README. The slowdown factors, when compared with strstr, are: 12.9, 12.6, 9.9, 7.4.

architecture procedure time in seconds
strstr string::find
Westemere 0.83 10.71
Haswell 0.48 6.06
Skylake 0.66 6.54
KNL 4.96 36.94

I looked into headers, and I suppose located the culprit in bits/basic_string.tcc. Maybe I'm wrong, but for me it's a linear search, isn't it?

find(const _CharT* __s, size_type __pos, size_type __n) const
  __glibcxx_requires_string_len(__s, __n);
  const size_type __size = this->size();
  const _CharT* __data = _M_data();

  if (__n == 0)
    return __pos <= __size ? __pos : npos;

  if (__n <= __size)
      for (; __pos <= __size - __n; ++__pos)
        if (traits_type::eq(__data[__pos], __s[0])
            && traits_type::compare(__data + __pos + 1,
                                    __s + 1, __n - 1) == 0)
          return __pos;
  return npos;

The assembly dump taken from perf annotate:

std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::find

78:┌─→add    $0x1,%rbx
   │  add    $0x1,%rbp
   │  cmp    %r15,%rax
   │↓ ja     b8
85:│  cmp    -0x1(%rbp),%r14b
   │  lea    -0x1(%rbx),%r13
   │  mov    %rbx,%rax
   │↑ jne    78
   │  test   %r12,%r12
   │↑ je     25
   │  mov    0x8(%rsp),%rsi
   │  mov    %r12,%rdx
   │  mov    %rbp,%rdi
   │→ callq  memcmp@plt
   │  test   %eax,%eax
   │↑ je     25
   │  mov    %rbx,%rax
   └──jmp    78

It must be slow. And I think C++ users should be worried a little bit.

See also

The problem has been already reported (bug 66414).