| Author: | Wojciech Muła, wojciech_mula@poczta.onet.pl |
|---|---|
| Date: | 2011-04-13 |
PyDAWG is a python module implements DAWG graph structure, which allow to store set of strings and check existence of a string in linear time (in terms of a string length).
DAWG is constructed by incremental algorithm described in Incremental algorithm for sorted data, Jan Daciuk, Stoyan Mihov, Bruce Watson, and Richard Watson, Computational Linguistics, 26(1), March 2000. Prof. Jan Daciuk offers also some useful documentation, presentations and even sample code on his site.
The algorithm asserts that input words are sorted in lexicographic order; default Python sort() orders strings correctly.
Module pydawg provides class DAWG and following members:
DAWG class is picklable, and also provide independent way of marshaling with methods binload() and bindump().
Following values are possible:
Returns iterator that match words depending on word argument.
Yields words that match a pattern with given wildchar (wildchar matches any char). Parameter how controls which words are matched:
Don't allow to add any new words, state value become pydawg.CLOSED. Also free memory occupied by a hash table used to perform incremental algorithm (see also get_hash_stats()).
Can be reverted only by clear().
Class supports iter protocol, i.e. iter(DAWGobject) returns iterator, a lazy version of words() method.
Minimal perfect hashing (MPH) allows to find unique number representing any word from DAWG, and also find word with given number. Numbers are in always in range 1 ... len(DAWG).
Finally, this feature makes possible to perform fast lookups as in a regular dictionary.
Algorithm used for MPH is described in Applications of Finite Automata Representing Large Vocabularies, Claudio Lucchiesi and Tomasz Kowaltowski, Software Practice and Experience, 23(1), pp. 15--30, Jan. 1993.
MPH feature is enabled during compilation time if preprocessor definition DAWG_PERFECT_HASHING exists. Module member perfect_hashing reflects this setting.
Ostrzeżenie
Words numbering is done for the whole DAWG. If new words are added with add_word or add_word_unchecked, then current numbering is lost and when method word2index or index2word is called, then DAWG is renumbered.
Because of that frequent mixing these two groups of method will degrade performance.
D = pydawg.DAWG()
# fill DAWG with keys
for key in sorted(dict):
D.add_word_unchecked(key)
# prepare values array
V = [None] * len(D)
for key, value in dict.items():
index = D.word2index(key)
assert index is not None
V[index - 1] = value
# lookups are possible now
for word in user_input:
index = D.word2index(word)
if index is not None:
print(word, "=>", V[index - 1])
Returns sets describing DAWG, elements are tuples.
Node tuple:
Edge tuple:
Distribution contains program dump2dot.py that shows how to convert output of this function to graphviz DOT language.
Restore DAWG from binary data. Example:
import pydawg
A = pydawg.DAWG()
with open('dump', 'wb') as f:
f.write(A.bindump())
B = pydawg.DAWG()
with open('dump', 'rb') as f:
B.binload(f.read())
Returns dictionary containing some statistics about underlaying data structure:
Returns some statistics about hash table used by DAWG.
Approx memory occupied by hash table is table_size * element_size + items_count * item_size.
Type of strings accepted and returned by DAWG methods can be either unicode or bytes, depending on compile time settings (preprocessor definition DAWG_UNICODE). Value of module member unicode informs about chosen type.
Library is licensed under very liberal three-clauses BSD license. Some portions has been released into public domain.
Full text of license is available in LICENSE file.