TST -- Ternary search tree code

Back to site ¦ Download

* WHAT IS IT ? *

A ternary search tree (TST) is a nice and simple data structure which provides a way to store strings in order for fast searching. It also supports more advanced string queries. A TST can be seen as a hybrid between a binary search tree and a digital search trie. This package follows explanations given by Jon Bentley and Robert Sedgewick, see reference below.

Essentially each node holds a character, say c, and three child pointers (left, right, middle). The left child's character is less than c while the right child's character is greater than c. A TST is a series of binary trees connected by middle branches. Searching for a character string starts by searching for its first character in the binary tree obtained by following through left and right pointers (ignoring middle branches). If a node in this binary tree matches the character, we know that its middle sub-tree contains all the strings starting with this character and we proceed likewise with the next character.

In this C implementation we use special suffix nodes to save bytes of storage by branch compression. Memory requirements are still high, though. In the example below to locate "ghost" we locate the suffix node holding the string "host" and then one regular string comparison is enough to finish the search (`suffix' here has nothing to do with another data structure with many applications known as suffix tree ). We also apply the red-black tree balancing technique to the binary trees (following Cormen et al. ``Introduction to algorithms''). As a result, tail branches are not always as compressed as can be. Strings are mapped to void*.

FWIW: Since version 0.20, the package also comes with an implementation of the Aho-Corasick string searching algorithm in which TST's play the part of the keyword dictionary/automaton. To run an example, `[g]make ac-example'.

This is not a library, just TST code with room for improvement, released under the revised BSD license since version 0.20

Example: a TST for the 10 words {crypt, dog, doge, dogfood, doggone, dogma, ghost, golf, golfer, gopher}

                                    +--"pher"->0x0
                                    |
                                    |                 +--"er"->0x0
                                    |                 |
         +--[ g]--+--[.o]--+--[.l]--+--[.f]--+--[. ]--+-->0x0
         |                 |
         |                 +--"host"->0x0
         |
         |                          +--[ m]--+--"a"->0x0
         |                          |
+--[.d]--+--[.o]--+--[.g]--+--[.g]--+--"one"->0x0
         |                          |
         |                          |        +--"food"->0x0
         |                          |        |
         |                          +--[ e]--+--""->0x0
         |                                   |
         |                                   +--""->0x0
         |
         +--"crypt"->0x0

See file CHANGES

* REFERENCES *

* About TSTs *

Fast algorithms for searching and sorting strings
by: Jon Bentley and Robert Sedgewick
URL: http://www.cs.princeton.edu/~rs/strings/

Dr Dobb's Journal - Ternary Search Tree
by: Robert Sedgewick and Jon Bentley
URL: http://www.ddj.com/articles/1998/9804/

Implementation in plain C
by: Peter A. Friend
URL: http://www.octavian.org/cs/software.html

Implementation in C++ with Python bindings
by: Nicolas Lehuen
URL: http://www.python.org/pypi/pytst

Implementation in Java within iText library (com.lowagie.tex.pdf.hyphenation)
by: Carlos Villegas
URL: http://www.lowagie.com/iText/

* Aho-Corasick algorithm *

Aho A., Corasick M., Efficient string matching: an aid to bibliographic search, Communications of the ACM, 18(6), 333-340, 1975.

* Of related interest *

Text-Scan, perl extension module
by: Ira Joseph Woodhead
URL: http://search.cpan.org/~iwoodhead/

GNU libavl
by: Ben Pfaff
URL: http://adtinfo.org

AVL Tree Objects for Python
by: Sam Rushing
URL: ftp://squirl.nightmare.com/pub/python/python-ext/avl