Author: Wojciech Muła

Added on: 2015-04-08

Updates: 2016-05-03 (binary search modifications)


The problem: there is an **ordered dictionary** containing only
**unique** keys. The dictionary is read-only, and the keys are 32-bit (SSE) or
64-bit (AVX2).

The obvious solution is to use binary search. Keys can be stored in contiguous memory, so there is no internal fragmentation and the data has good cache locality. And of course indexing the keys is done in constant time (in terms of computational complexity), i.e. a single memory fetch (in hardware).

The time complexity of binary search is O(log_{2}(*n*)), i.e. for one
million elements a single lookup takes up to 20 operations. A single
operation fetches a value and compares it with the given key.

Another algorithm is linear search, which seems to be suitable
for **small dictionaries**. Linear search can easily be SIMD-ized.
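For illustration, here is a sketch of such a SIMD-ized linear search over 32-bit keys, using SSE2 intrinsics and GCC/Clang builtins (the function name is mine, not from the article):

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Sketch of a SIMD-ized linear search: compare four 32-bit keys at a time.
// Returns the index of the first match, or -1 when the key is absent.
int linear_search_sse(const uint32_t* data, int n, uint32_t key) {
    const __m128i key_vec = _mm_set1_epi32((int32_t)key);   // broadcast key
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        const __m128i chunk = _mm_loadu_si128((const __m128i*)(data + i));
        const __m128i eq    = _mm_cmpeq_epi32(chunk, key_vec);
        const int mask = _mm_movemask_ps(_mm_castsi128_ps(eq));
        if (mask != 0)
            return i + __builtin_ctz(mask);  // lowest set bit = first match
    }
    for (; i < n; i++)  // scalar tail when n is not divisible by 4
        if (data[i] == key)
            return i;
    return -1;
}
```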

I propose the following modification: the key space is split into **five**
independent subspaces and the lookup starts with choosing the proper subspace,
then a standard binary search is used within that subspace. The computational complexity
remains O(log_{2}(*n*)), but the exact number of operations is smaller.

```
+----------+----+----------+----+----------+----+----------+----+----------+
|          | k1 |          | k2 |          | k3 |          | k4 |          |
+----------+----+----------+----+----------+----+----------+----+----------+
 subspace 0 |  subspace 1  |  subspace 2  |  subspace 3  |  subspace 4
```

The trick is to use SIMD operations to choose a subspace. We need
**four** keys `k1`, `k2`, `k3` and `k4` which define subspace
boundaries, then a **single** vector compare for greater yields a mask that
defines a subspace.

The following pseudocode shows the idea:

```
function search(key) is
begin
    -- SIMD part
    key_vector = {key, key, key, key};              -- 1
    boundaries = {k1, k2, k3, k4};
    dword_mask = PCMPGT(key_vector, boundaries);    -- 2
    bit_mask   = MOVMSKPS(dword_mask);              -- 3

    -- plain binsearch
    if bit_mask == 0b0000 then      -- key <= k1
        search in subspace 0
    elif bit_mask == 0b0001 then    -- key > k1 and key <= k2
        search in subspace 1
    elif bit_mask == 0b0011 then    -- key > k2 and key <= k3
        search in subspace 2
    elif bit_mask == 0b0111 then    -- key > k3 and key <= k4
        search in subspace 3
    elif bit_mask == 0b1111 then    -- key > k4
        search in subspace 4
    end if
end
```

The SIMD part requires just three instructions:

- broadcast `key` to a vector;
- compare with the `boundaries` vector;
- get the mask.
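In C++ with SSE2 intrinsics these three steps could look as follows. This is a sketch; `select_subspace` is my name, and since `_mm_cmpgt_epi32` performs a *signed* comparison, it assumes all keys fit in 0 .. 2^31 - 1:

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Sketch: pick one of the five subspaces (0..4) for the given key.
// Note: _mm_cmpgt_epi32 compares signed 32-bit integers, so this
// version assumes all keys are below 2^31.
int select_subspace(uint32_t key, const uint32_t boundaries[4]) {
    const __m128i key_vec = _mm_set1_epi32((int32_t)key);             // 1. broadcast
    const __m128i bnd     = _mm_loadu_si128((const __m128i*)boundaries);
    const __m128i gt      = _mm_cmpgt_epi32(key_vec, bnd);            // 2. compare
    const int mask = _mm_movemask_ps(_mm_castsi128_ps(gt));           // 3. get mask
    // mask is 0b0000, 0b0001, 0b0011, 0b0111 or 0b1111; the number
    // of set bits is exactly the subspace index.
    return __builtin_popcount(mask);
}
```

Since the mask is always a run of ones, counting its set bits yields the subspace index directly, avoiding the if/elif chain from the pseudocode.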

This is done in **constant time**, and the binary search is then executed on
a subarray whose size is 20% of the input data.

Modification of binary search: when the search range is narrowed, try
to find the key **near** the pivot point. A SIMD equality comparison
is used for that.

The following pseudocode shows the idea (check out the implementation):

```
key_vector = {key, key, key, key}

while (a <= b) {
    const int c = (a + b)/2;

    if (data[c] == key) {
        return c;
    }

    if (key < data[c]) {
        b = c - 1;
        if (b >= 4) {
            SIMD_equal(data[b - 4 .. b], key_vector) => index
            return index if found
        }
    } else {
        a = c + 1;
        if (a + 4 < limit) {
            SIMD_equal(data[a .. a + 4], key_vector) => index
            return index if found
        }
    }
}
```
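The `SIMD_equal` step could be implemented with SSE2 like this (a sketch; the helper name and signature are mine — it checks four consecutive keys starting at `data[i]`):

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Sketch of SIMD_equal: compare data[i .. i+3] with key in one shot.
// The caller must guarantee that data[i .. i+3] is a valid range.
// Returns the index of the first matching key, or -1 when none of
// the four keys is equal.
int simd_equal4(const uint32_t* data, int i, uint32_t key) {
    const __m128i key_vec = _mm_set1_epi32((int32_t)key);
    const __m128i chunk   = _mm_loadu_si128((const __m128i*)(data + i));
    const __m128i eq      = _mm_cmpeq_epi32(chunk, key_vec);
    const int mask = _mm_movemask_ps(_mm_castsi128_ps(eq));
    if (mask == 0)
        return -1;
    return i + __builtin_ctz(mask);  // lowest set bit = first matching lane
}
```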

For short arrays linear search is faster than binary search. In this modification the binary search algorithm ends when the search range becomes small, and then linear search is performed on this range.

The threshold size shouldn't be too large. I would say that the size of one cache line (or two lines) is a good starting point.

A full C++ implementation:

```cpp
int binsearch(uint32_t key, int a, int b) const {

    // 64/4 = 16 keys, i.e. one 64-byte cache line of uint32_t values
    while (b - a > 64/4) {
        const int c = (a + b)/2;

        if (data[c] == key) {
            return c;
        }

        if (key < data[c]) {
            b = c - 1;
        } else {
            a = c + 1;
        }
    }

    // linear search over the remaining short range
    for (int i = a; i <= b; i++) {
        if (data[i] == key) {
            return i;
        }
    }

    return -1;
}
```

Results are from a Core i5 (Westmere). The unit is the number of queries per second.

| size [bytes] | (1) binary: default | (2) binary: linear fallback | (3) binary: SIMD subrange select | (4) binary: SIMD search around pivot | (5) linear: default | (6) linear: SIMD boosted |
|---|---|---|---|---|---|---|
| 4 | 44,277,175 | 46,899,696 | 33,078,081 | 40,488,781 | 64,679,288 | 40,058,806 |
| 8 | 16,391,723 | 18,863,227 | 16,241,043 | 15,295,572 | 22,813,654 | 27,207,701 |
| 16 | 3,617,123 | 4,131,163 | 6,905,724 | 6,622,157 | 3,266,917 | 10,469,538 |
| 32 | 1,444,317 | 1,784,175 | 1,926,374 | 2,437,853 | 1,268,942 | 4,130,405 |
| 64 | 586,026 | 669,683 | 677,376 | 822,112 | 456,542 | 1,504,841 |
| 128 | 259,413 | 279,583 | 269,133 | 337,367 | 139,831 | 442,554 |
| 256 | 114,956 | 123,168 | 117,368 | 134,691 | 40,075 | 132,586 |
| 512 | 49,428 | 56,470 | 53,368 | 59,880 | 10,767 | 38,690 |
| 1024 | 23,527 | 25,907 | 25,402 | 27,442 | 2,788 | 10,141 |
| 2048 | 10,873 | 12,048 | 11,193 | 12,640 | 707 | 2,631 |
| 4096 | 5,520 | 5,632 | 5,091 | 5,999 | 178 | 692 |
| 8192 | 2,738 | 2,656 | 2,531 | 2,852 | 45 | 174 |

- Linear search is suitable for small dictionaries (up to 64-128 elements).
- The SIMD-ized versions of binary search (3) and (4) are faster than the plain version (1) for all but the smallest dictionaries.
- Binary search with linear search fallback (2) is almost as fast as the fastest binary SIMD search (4).

All sources are available on GitHub.