| Author: | Wojciech Muła, wojciech_mula@poczta.onet.pl |
|---|---|
| Last update: | 2011-04-13 |
Contents
Module ahocorasick contains several constants and class Automaton.
Automaton class is pickable (implements __reduce__()).
One of values:
Kind is maintained internally by Automaton object. Some methods are not available when automaton kind is EMPTY or isn't an AHOCORASICK. When called then exception is raised, however testing this property could be better (faster, more elegant).
Constructor accepts just one argument, a type of values, one of constants:
Returns iterator that iterate through words.
If prefix (a string) is given, then only words sharing this prefix are yielded.
If wildchar (single character) is given, then prefix is treated as a simple pattern with selected wildchar. Optional parameter how controls which strings are matched:
See Example 2.
Add new word, a key, to dictionary and associate with value. Returns True if word didn't exists earlier in dictionary.
If store == STORE_LENGTH then value is not allowed --- len(word) is saved.
If store == STORE_INTS then value is optional. If present, then have to be an integer, otherwise defaults to len(automaton).
If store == STORE_ANY then value is required and could be any object.
This method invalidates all iterators only if new word was added (i.e. method returned True).
Removes all words from dictionary.
This method invalidates all iterators.
Creates Aho-Corasick automaton based on trie. This doesn't require additional memory. After successful creation kind become AHOCORASICK.
This method invalidates all iterators.
Perform Aho-Corsick on string; start/end can be used to reduce string range. Callback is called with two arguments:
(Method called with start/end does similar job as find_all(string[start:end], callback), except index values).
Returns iterator (object of class AutomatonSearchIter) that does the same thing as find_all, yielding tuples instead of calling a user function.
find_all method could be expressed as:
def find_all(self, string, callback):
for index, value in self.iter(string):
callback(index, value)
Returns 3 lists describing a graph:
ID is a unique number and a label is a single byte.
Module package contains also program dump2dot.py that shows how to convert dump results to input file for graphviz tools.
Returns dictionary containing some statistics about underlaying trie:
Class isn't available directly, object of this class is returned by iter method of Automaton. Iterator has additional method.
Sets new string to process. When reset is False (default), then processing is continued, i.e internal state of automaton and index aren't touched. This allow to process larger strings in chunks, for example:
it = automaton.iter(b"")
while True:
buffer = receive(server_address, 4096)
if not buffer:
break
it.set(buffer)
for index, value in it:
print(index, '=>', value)
When reset is True then processing is restarted. For example this code:
for string in set:
for index, value in automaton.iter(string)
print(index, '=>', value)
Does the same job as:
it = automaton.iter(b"")
for string in set:
it.set(it, True)
for index, value in it:
print(index, '=>', value)
>>> import ahocorasick
>>> A = ahocorasick.Automaton()
# add some words to trie
>>> for index, word in enumerate("he her hers she".split()):
... A.add_word(word, (index, word))
# test is word exists in set
>>> "he" in A
True
>>> "HER" in A
False
>>> A.get("he")
(0, 'he')
>>> A.get("she")
(3, 'she')
>>> A.get("cat", "<not exists>")
'<not exists>'
>>> A.get("dog")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError
>>>
# convert trie to Aho-Corasick automaton
A.make_automaton()
# then find all occurrences in string
for item in A.iter("_hershe_"):
... print(item)
...
(2, (0, 'he'))
(3, (1, 'her'))
(4, (2, 'hers'))
(6, (3, 'she'))
(6, (0, 'he'))

Demonstration of keys behaviour.
>>> import ahocorasick
>>> A = ahocorasick.Automaton()
# add some words to trie
>>> for index, word in enumerate("cat catastropha rat rate bat".split()):
... A.add_word(word, (index, word))
# prefix
>>> list(A.keys("cat"))
["cat", "catastropha"]
# pattern
>>> list(A.keys("?at", "?", ahocorasick.MATCH_EXACT_LENGTH))
["bat", "cat", "rat"]
>>> list(A.keys("?at?", "?", ahocorasick.MATCH_AT_MOST_PREFIX))
["bat", "cat", "rat", "rate"]
>>> list(A.keys("?at?", "?", ahocorasick.MATCH_AT_LEAST_PREFIX))
["rate"]

Type of strings accepted and returned by Automaton methods can be either unicode or bytes, depending on compile time settings (preprocessor definition AHOCORASICK_UNICODE). Value of module member unicode informs about chosen type.
Ostrzeżenie
If unicode is selected, then trie stores 2 or even 4 bytes per letter, depending on Python settings. If bytes are selected, then just one byte per letter is needed.
Library is licensed under very liberal two-clauses BSD license. Some portions has been realased into public domain.
Full text of license is available in LICENSE file.