Subscribe to Stuck in an Infiniteloop        RSS Feed
-----

An In-Depth Look At Radix Trees

Icon Leave Comment
Radix Trees

Continuing in the same solution space as the Trie, we now look at Radix Trees:

Quote

a radix tree is a data structure that represents a space-optimized trie in which each node that is the only child is merged with its parent. The result is that the number of children of every internal node is at least the radix r of the radix tree, where r is a positive integer and a power x of 2, having x ≥ 1. Unlike in regular tries, edges can be labeled with sequences of elements as well as single elements. This makes radix trees much more efficient for small sets (especially if the strings are long) and for sets of strings that share long prefixes.


In layman's terms, instead of each node containing a single letter and a "chain" of letters leading to a word, we can have many letters in a node and each sequence is rooted at its common prefix. As with the trie, there is an "empty" root node that references the first layer of all the subtrees based on common prefix.

class RadixTree {
private:
  struct Node {
    Node() = default;
    Node(std::string str, Node* parent) noexcept
      : rep{str}, parent{parent} {}
    Node(const Node&) = delete;
    Node& operator=(const Node&) = delete;
    bool print{false};
    std::string rep{""};
    Node* parent{nullptr};
    std::vector<std::unique_ptr<Node>> children{};
    void print_helper(std::ostream& os) const noexcept;
    void lookup_helper(std::vector<std::string>& list, const size_t index, const std::string& word) const noexcept;
    void word_list_helper(std::vector<std::string>& list) noexcept;
  };  
  
public:
  RadixTree() : _root{new Node}, _size{0} {}; 
  size_t size() const noexcept { return _size;}

  void insert(const std::string& item);
  void remove(const std::string& item);
  void lookup(const std::string& item) noexcept;
  std::vector<std::string> word_list() noexcept;
  friend std::ostream& operator<<(std::ostream& os, const RadixTree& rhs);

private:
  bool exists(const std::string& item, std::unique_ptr<Node>** node) noexcept;
  void word_list_helper(std::vector<std::string>& list) noexcept;
  std::unique_ptr<Node> _root;
  size_t _size;
};



Note: the code contained here uses a few notations to indicate status within the tree. After the contents of a node "[1|0]" indicates whether or not the node itself is an inserted word respectively. Indentation with a "->" indicates children nodes.

As an initial case, consider waterfall and water inserted in that order into a radix tree:

1 words in radixTree:
waterfall[1]

2 words in radixTree:
water[1]
  -> fall[1]



Waterfall is inserted in its entirety, then when we go to insert water, we notice that the largest common prefix between the two words is the sequence "water". This becomes the new root of this subtree and "fall" is added as a child node.

The more words that are inserted into the tree, the smaller the root nodes get in terms of content.

Inserting the words test and toast results in the following sequence:

t[0]
  -> est[1]
  -> oast[1]



The most common prefix between toast and test is a single "t". In this regard a radix tree may begin to regress towards the trie.

Insertion

Insertion works on the principle of searching for matches until we either hit the end of the subtree of a match, as far as we can go in a subtree node, or we have zero matches and it is an entire word insertion into a node.

Iterate through the current node's letter, noting matches, once we no longer match, begin looking at children. If we find a match in the first letter of a child, we go into the subtree and repeat our search. Once we arrive at the correct location for insertion, we must evaluate the node we are at. If there are no matches, insert the entire word as a child node. Otherwise split the current node representation based on the total matched length (at that level of the subtree). We derive the prefix, suffix, and a "new" suffix, if applicable.

bool RadixTree::exists(const std::string& item, std::unique_ptr<Node>** node = nullptr) noexcept {
  std::unique_ptr<Node>* cur_node = &_root;
  size_t total_matched_len = 0;
  size_t match_index = 0;
  bool matched_child = false;
  std::string s_item{item.substr(total_matched_len)};
  while(*cur_node != nullptr && total_matched_len < item.length()){
    s_item = item.substr(total_matched_len);
    for(auto&& letter : (*cur_node)->rep){
      if(letter == s_item[match_index]){
        total_matched_len++;
        match_index++;
      }
      else break;
    }
    for(auto&& child : (*cur_node)->children){
      if(child->rep[0] == s_item[match_index]){
        cur_node = &child;
        matched_child = true;
        break;
      }
    }
    if(!matched_child) break;
    match_index = 0; //reset for child node
    matched_child = false;
  }
  //for remove grab the handle to the node we found
  if (node != nullptr) *node = &(*cur_node);
  //did we match all the letter of the search string and does the final substr match the rep (aka suffix match)
  //the latter is important in the case of a longer word being inserted before a shorter word: waterfall, water
  //it is also possible that a word is in the tree, but was not inserted, example insert peas, search for pea 
  return ((*cur_node)->print && total_matched_len == item.length() && s_item.length() == (*cur_node)->rep.length());
}

void RadixTree::insert(const std::string& item) {
  if (exists(item)){
    std::cout << item << " is already in RadixTree." << std::endl;
    return;
  }
  std::unique_ptr<Node>* cur_node = &_root;
  Node* parent = cur_node->get();
  size_t matched_len = 0, total_matched_len = 0;
  do {
    std::string s_item{item.substr(total_matched_len)};
    for(size_t i = 0; (i < s_item.length() && i < (*cur_node)->rep.length()); i++){
      if((*cur_node)->rep[i] == s_item[i]){
        matched_len++;
        total_matched_len++;
      }
      else break; //if we hit a non matching letter, immediately start looking at children
    }
    bool matched_child = false;
    for(auto&& child : (*cur_node)->children){
      //std::cout << "searching for " << item << " :: cnm: " << child->rep[0] << ":" << s_item[matched_len] << std::endl;
      if(child->rep[0] == s_item[matched_len]){//only one child, if any, will match
        //std::cout << "cnm matched, moving to: " << child->rep << std::endl;
        matched_child = true;
        parent = cur_node->get();
        cur_node = &child;
        break;
      }
    }
    //two exit possibilities, we either find the entire word across nodes OR we hit the end of our search
    if(total_matched_len == item.length()) break;
    if(!matched_child) break;
    matched_len = 0; //reset for next layer
  } while(true);

  //std::cout << "searching for " << item << " :: total matches: " << total_matched_len << " ml=" << matched_len << std::endl;
  if (total_matched_len == 0){ //base case: insert the entire word as a new node
    //std::cout << "cn=" << (*cur_node)->rep << " child=" << item << std::endl;
    (*cur_node)->children.push_back(std::unique_ptr<Node>(new Node(item, parent)));
    (*cur_node)->children.back()->print = true;
  }
  else {
    //in this case we either matched a whole word on one or more nodes OR we ran out of things to look at and must insert
    std::unique_ptr<Node>* back = &(*cur_node);
    std::string prefix{(*back)->rep.substr(0, matched_len)}; //root to split on
    std::string suffix{(*back)->rep.substr(matched_len)}; //existing suffix, if any
    std::string new_suffix{item.substr(total_matched_len)}; //new suffix, if any
    //std::cout << "back=" << (*back)->rep << " p=" << prefix << " s=" << suffix << " ns=" << new_suffix << " ml=" << matched_len << std::endl;
    (*back)->rep = prefix;

    if(suffix.length() > 0){
      std::unique_ptr<Node> tmp(new Node(suffix, (*back).get()));
      for(auto&& node : (*back)->children){
        node->parent = tmp.get();
        tmp->children.push_back(std::move(node));
      }
      (*back)->children.clear();
      (*back)->children.push_back(std::move(tmp));
      (*back)->children.back()->print = (*back)->print;
    }
    //edge cases where the original item was printable or not...
    if(new_suffix.length() > 0){
      (*back)->children.push_back(std::unique_ptr<Node>(new Node(new_suffix, (*back).get())));
      (*back)->children.back()->print = true;
      if (suffix.length() > 0) (*back)->print = false;
    }
    //after fixing up the children nodes, check to see if we matched a whole word on a subset of an existing node and mark accordingly
    if(total_matched_len == item.length()) (*back)->print = true;
  }
  _size++;
}



Consider the case of inserting the words peas and peanut in our existing "waterfall" tree:

3 words in radixTree:
water[1]
  -> fall[1]
peas[1]

4 words in radixTree:
water[1]
  -> fall[1]
pea[0]
  -> s[1]
  -> nut[1]



This is a case of the most common prefix not being an inserted word and will generate both a suffix and a new suffix. Here the suffix is "s" of the existing word "peas" and "nut" is the new suffix.

To illustrate all insertion mechanisms, consider the insertion of test, toaster, toasting, toasters:

test[1]

t[0]
  -> est[1]
  -> oaster[1]

t[0]
  -> est[1]
  -> oast[0]
    -> er[1]
    -> ing[1]

t[0]
  -> est[1]
  -> oast[0]
    -> er[1]
      -> s[1]
    -> ing[1]



Deletion

There are three cases to consider when deleting a word from a radix tree:

Quote

1. no children, kill node
2. one child, merge, kill child
3. 2+ children, this means we are "deleting" a subtree root, but there exists longer words, mark print false



When a word is deleted, the tree will make an effort to roll up a single layer of nodes, if applicable. What does this mean? Let's remove water from our existing tree.

The subtree looks like this:

water[1]
  -> fall[1]



When we delete water, the tree notices that there is only a single child on water, and moves the string up and clears any children:

waterfall[1]



//see the exists function in the insert code block above
void RadixTree::remove(const std::string& item){
  std::unique_ptr<Node>* cur_node;
  if (!exists(item, &cur_node)){
    std::cout << item << " not in RadixTree\n";
    return;
  }
  std::cout << "Removing " << item << "/" << (*cur_node)->rep << " [str/node_rep] from RadixTree\n";
  //note: this will not attempt to clean up two levels of nodes depending on order of delete
  //3 cases:
  //  1. no children, kill node
  //  2. one child, merge, kill child
  //  3. 2+ children, this means  we are "deleting" a subtree root, but there exists longer words, mark print false
  if ((*cur_node)->children.size() == 0){
    (*cur_node)->print = false;
     Node* parent = (*cur_node)->parent;
     parent->children.erase(std::remove_if(parent->children.begin(),
                                           parent->children.end(),
                                           [cur_node](std::unique_ptr<Node>& node) {return node->rep == (*cur_node)->rep;} ));
  }
  else if ((*cur_node)->children.size() == 1){
    (*cur_node)->rep.append((*cur_node)->children.back()->rep);
    (*cur_node)->children.clear();
  }
  else {
    (*cur_node)->print = false;
  }
  _size--;
}



Using the sample_dictionary.txt found in the conf directory of the repo for this code, the radix tree looks like this:

23 words in radixTree:
water[1]
  -> fall[1]
p[0]
  -> ea[0]
    -> s[1]
    -> nut[1]
  -> lay[1]
    -> e[0]
      -> r[1]
        -> s[1]
      -> d[1]
    -> pen[1]
  -> ot[1]
    -> tery[1]
      -> barn[1]
    -> s[1]
    -> head[1]
t[0]
  -> est[1]
  -> oast[0]
    -> er[1]
      -> s[1]
    -> ing[1]
ga[0]
  -> te[1]
  -> iter[1]
fail[1]
  -> ure[1]
  -> s[1]



or visually (as is tradition):

Posted Image

Searching and Suggestions

Like the trie, the radix tree can make suggestions based on a word given to it. This is how auto correct dictionaries function. If there is a direct match, it simply reports the match. Otherwise it will attempt to find similar words.

Since words are already subtree'd the search space is well defined assuming the user's first letter is correct. We recurse down the tree marking matches. Then if we have some "close" words we had it to the list of hits and report it back. In this case "close" means a series of character matches with the length within +/1 of the searched for term. In the case of our tree above, if we search for pea, it will suggest peas.Likewise searching for "waterfalls", returns "waterfall", and finally searching for "failed" results in a suggestion of "failure".

The code is set up for an extension of the search algorithms without any additions to the underlying tree structure. An understanding of the layout of the tree is all that is needed for more plug and play search algorithms.


void RadixTree::Node::lookup_helper(std::vector<std::string>& list, const size_t index, const std::string& word) const noexcept{
  int tmp_index = index;
  for(auto&& letter : rep){
    if (letter == word[tmp_index]){
      //std::cout << "t_i: " << tmp_index << std::endl;
      tmp_index++;
    }
    //else break; 
  }
  for(auto&& child : children){
    child->lookup_helper(list, tmp_index, word);
  }
  //if this is an actual word and we matched +/- 1 number of letters
  if (print && ((tmp_index <= (word.length() + 1)) && (tmp_index >= (word.length() - 1)))){
    Node* tmp = parent;
    std::string t_s = rep;
    //std::cout << "t_i: " << tmp_index << " rep: " << rep << std::endl;
    while(tmp != nullptr){
      std::string s_tmp = tmp->rep + t_s;
      t_s = s_tmp;
      tmp = tmp->parent;
    }
    //more culling, is the length of the resultant word for this slot +/- in length to the searched word?
    if((t_s.length() <= word.length() + 1) && (t_s.length() >= word.length() - 1))  list.push_back(t_s);
  }
}

void RadixTree::lookup(const std::string& item) noexcept {
  std::cout << "Searching for " << item << std::endl;
  if(exists(item)) {
    std::cout << "RadixTree contains: " << item << std::endl;
    return;
  }

  std::vector<std::string> hits{};
  for(auto&& node : _root->children){
    if(node->rep[0] == item[0]){
      node->lookup_helper(hits, 0, item);
    }
  }

  std::cout << item << " not found. Did you mean?\n";
  if (hits.size() == 0){
    std::cout << "\tNo suggestions found.\n";
    return;
  }

  std::sort(hits.begin(), hits.end());

  for(auto&& hit : hits){
    std::cout << "\t" << hit << " \n";
  }
  std::cout << std::endl;

}



Sample Driver:

#include "radix.h"

#include <fstream>

void load_dictionary(const std::string& file, RadixTree& radixTree){
  std::cout << "Loading test dictionary..." << std::endl;
  std::ifstream in(file);
  std::string item;
  while(std::getline(in, item)){
    radixTree.insert(item);
    std::cout << radixTree << std::endl;
  }
}

int main(){
  RadixTree radixTree;
  load_dictionary("conf/sample_dict.txt", radixTree);
  std::cout << radixTree << std::endl;
  radixTree.lookup("water");
  radixTree.lookup("waterfall");
  radixTree.lookup("follerskate");
  radixTree.lookup("pea");
  radixTree.lookup("peas");
  radixTree.lookup("peanut");
  radixTree.lookup("waterfalls");
  radixTree.lookup("tes");
  radixTree.lookup("failed");
  return 0;
}



-----------------------------------------------------

For the full code, see my GitHub repo of all these tutorials/blog posts.

-----------------------------------------------------
Happy coding!

0 Comments On This Entry

 

December 2017

S M T W T F S
     12
3456789
101112 13 141516
17181920212223
24252627282930
31      

Tags

    Recent Entries

    Search My Blog

    6 user(s) viewing

    6 Guests
    0 member(s)
    0 anonymous member(s)