Author: | Wojciech Muła |
---|---|

Last update: | 2013-09-30 |

Trie is a multiway tree where each edge is labelled by a letter;
such trees are used as dictionaries. Lookup takes O(*k*) time,
where *k* is a string length.

Trie node is defined in C:

typedef struct TrieNode { bool eow; // end of word marker int n; // children count char* letter; // list of edge labels TrieNode** children; // pointers to children nodes }

Lookup procedure is simple:

bool trie_lookup(TrieNode* root, const char* word, const int size) { TrieNode* node = root; for (int i=0; i < size; i++) { // go through edge labelled by i-th letter node = trie_next(node, word[i]); if (node == NULL) { return false; } } // we visited 'size' nodes, check if last // visited node is located at end-of-word return node ? node->eow : false; }

Function `trie_next` returns a child node labelled with given letter.

TrieNode trie_next(TrieNode node, char letter);

The implementation of this procedure determines overall searching and inserting time.

Following algorithms used by `trie_next` has been studied:

- linear search
- linear search with move-to-front strategy
- linear search with incremental move-to-front strategy
- linear search with minimized memory reads
- linear search optimized in SSE
- binary search

Implementation of `linear`:

TrieNode* trie_next(TrieNode* node, const char letter) { size_t i; for (i = 0; i < node->n; i++) { if (node->chars[i] == letter) { return node->next[i]; } } return NULL; }

This procedure performs at most `n` reads from `letters` array,
and one from `children` if letters has been found. Since reading
`letters` is done sequentially CPU caching helps.

Implementation of `linear-mtf`:

TrieNode* trie_next(TrieNode* node, const char letter) { for (size_t i = 0; i < node->n; i++) { if (node->chars[i] == letter) { if (i == 0) { return node->next[0]; } char c0 = node->chars[0]; TrieNode* n0 = node->next[0]; node->chars[0] = letter; node->next[0] = node->next[i]; node->chars[i] = c0; node->next[i] = n0; return node->next[0]; } } return NULL; }

Thanks to MTF item located at index 0 is at the fast path of procedure, hit rate was about 50% in my tests.

This procedure does 3 additional reads 4 additional writes on hit item with index greater than 0. These writes gives some penalty, but are amorized by fast-path execution.

Implementation of `linear-mtf-incr`:

TrieNode* trie_next(TrieNode* node, const char letter) { for (size_t i = 0; i < node->n; i++) { if (node->chars[i] == letter) { if (i == 0) { return node->next[0]; } const int prev = i-1; char c0 = node->chars[prev]; TrieNode* n0 = node->next[prev]; node->chars[prev] = letter; node->next[prev] = node->next[i]; node->chars[i] = c0; node->next[i] = n0; return node->next[prev]; } } return NULL; }

First variant of MTF allows to random fluctuating element located at the
fast-path, this variant — by moving item one position backward ---
keep at index 0 **most frequently** used items.

Implementation of `linear-unrolled`:

#define ROUND_UP(size) (4*(((size) + 3)/4)) TrieNode* trie_next(TrieNode* node, const char letter) { const uint32_t n = node->n; if (n == 0) { return NULL; } const uint32_t packed_byte = (letter & 0xff) * 0x01010101; for (size_t i = 0; i < n; i+=4) { const uint32_t dword = *((uint32_t*)&node->chars[i]); const uint32_t v = dword ^ packed_byte; if ((v & 0x000000ff) == 0) { return node->next[i + 0]; } if ((v & 0x0000ff00) == 0) { return node->next[i + 1]; } if ((v & 0x00ff0000) == 0) { return node->next[i + 2]; } if ((v & 0xff000000) == 0) { return node->next[i + 3]; } } return NULL; }

This procedure reads 4 bytes in one iteration (when run on 32-bit machine), and also some parallelism is involved — comparision is done for 4 bytes at once.

However, this procedure require array of length rounded up to 4 bytes, that costs additional memory.

Implementation of `sse`:

TrieNode* trie_next(TrieNode* node, const char letter) { const int n = node->n; if (n == 0) { return NULL; } if (n == 1) { if (node->chars[0] == letter) { return node->next[0]; } else { return NULL; } } int dummy; __asm__ __volatile__ ( "movzbl %%al, %%eax \n" "imul $0x01010101, %%eax, %%eax \n" "movd %%eax, %%xmm0 \n" "pshufd $0x00, %%xmm0, %%xmm0 \n" : "=a" (dummy) : "a" (letter) ); size_t i, j; for (i = 0; i < n; i+=16) { __asm__ __volatile__ ( "movdqa (%%eax), %%xmm1 \n" "pcmpeqb %%xmm0, %%xmm1 \n" "pmovmskb %%xmm1, %%eax \n" "bsf %%eax, %%eax \n" "setz %%ah \n" : "=a" (j) : "a" (&(node->chars[i])) ); if (j < 256) { return node->next[i+j]; } } return NULL; }

This is quite naive implementation, that uses obvious SSE instruction
like `pcmpeq` and `pmovmskb`. In one iteration 16 bytes are processed
in parallel.

The downside of this method is high setup time, that is really visible when arrays are short.

Implementation of `binary`:

TrieNode* trie_next(TrieNode* node, const char letter) { int a = 0; int b = node->n - 1; while (a <= b) { const int c = (a + b)/2; if (node->chars[c] == letter) { return node->next[c]; } if (letter < (int)node->chars[c]) { b = c - 1; } else { a = c + 1; } } return NULL; }

Procedure use straightforward binary search implementation (yes, it has integer overflow bug).

Test programs use two lists:

- dictionary — list of 99171 English words,
`/usr/share/dict/american-english`from Debian; - input words — list of 133861 English words, splitted text of "The Oddysey" from Project Gutenberg.

Each test program reads dictionary and build trie, then loads input
words into simple array. Finally for each input word call `trie_lookup`,
and this is repeated 100 times.

Tests were run on my old laptop with Penium M. Tests on recent CPU are coming.

There are 227979 nodes in trie.

len | count | 0 | 0 | 1 | 52 | 2 | 182 | 3 | 845 | === 4 | 3346 | ============== 5 | 6788 | ============================== 6 | 11278 | ================================================== 7 | 14787 | ================================================================== 8 | 15674 | ====================================================================== 9 | 14262 | =============================================================== 10 | 11546 | =================================================== 11 | 8415 | ===================================== 12 | 5508 | ======================== 13 | 3236 | ============== 14 | 1679 | ======= 15 | 893 | === 16 | 382 | = 17 | 176 | 18 | 72 | 19 | 31 | 20 | 10 | 21 | 3 | 22 | 5 | 23 | 1 |

len | count | 0 | 67480 | ====================================== 1 | 123210 | ====================================================================== 2 | 24418 | ============= 3 | 6432 | === 4 | 3104 | = 5 | 1407 | 6 | 646 | 7 | 359 | 8 | 235 | 9 | 145 | 10 | 124 | 11 | 83 | 12 | 58 | 13 | 48 | 14 | 48 | 15 | 29 | 16 | 25 | 17 | 24 | 18 | 19 | 19 | 22 | 20 | 16 | 21 | 14 | 22 | 11 | 23 | 3 | 24 | 5 | 25 | 4 | 26 | 2 | 27 | 0 | 28 | 1 | 29 | 1 | 30 | 3 | 31 | 1 | 32 | 0 | 33 | 0 | 34 | 0 | 35 | 1 | 36 | 0 | 37 | 0 | 38 | 0 | 39 | 0 | 40 | 0 | 41 | 0 | 42 | 0 | 43 | 0 | 44 | 0 | 45 | 0 | 46 | 0 | 47 | 0 | 48 | 0 | 49 | 0 | 50 | 0 | 51 | 0 | 52 | 0 | 53 | 1 |

procedure | min [ms] | avg [ms] | max [ms] |
---|---|---|---|

linear | 4501 | 4517 | 4543 |

linear-mtf | 4386 | 4408 | 4445 |

linear-mtf-incr | 3923 |
3941 |
3963 |

linear-unrolled | 4111 | 4127 | 4151 |

sse | 3951 | 3970 | 4005 |

binary | 4400 | 4424 | 4441 |

As we see linear search is the worst, and binary search is not better. Unrolling loop as always helps, SSE instructions helps too.

But the most surprising is that simple linear search with "incremental" move-to-front strategy beats all methods. As Rob Pike said "Fancy algorithms are slow when n is small, and n is usually small." Proved!