Subscribe to Stuck in an Infiniteloop        RSS Feed
-----

An In-Depth Look At Tries

Icon Leave Comment
Two years, one month, and 1 week since the last one of these. This time I don’t have an excuse, I really need to do these more often.

Tries

Quote

In computer science, a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is a kind of search tree—an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings.


Many moons ago, I wrote about suffix arrays. Consider the trie another solution in this space. Rather than slicing the ends off of a string and storing all of the permutations, words are inserted into the trie and available for dictionary type searching. I’m sure no one has seen one recently, but a dictionary used to be this giant paper book that listed all of the words in existence. In order to find one, you started with the first letter, got to its section, then used the second letter to find that section, etc… until you found what you were looking for or discovered that it wasn’t a word (at least according to that dictionary). These type of searches are often employed by auto correct and word suggestion algorithms.

This Trie implementation is simple: there is a root node that represents the empty string. The root has 26 Node pointers, one for each letter in the English language. Each pointer is initialized and itself has 26 letters, so on and so forth. Therefore, any word in English can be represented in this data structure in a clean way. NOTE: this implementation only handles lower case letters. It would be trivial to add support for initial uppercase letters (to represent formal English nouns).

For simplicity each node stores the string "so far" as represented by itself. Otherwise we would have to concatenate the string every time we ran through the trie.

The trie:

Attached Image

What this looks like only looking at one sequence of nodes:

Attached Image

When the end of a word is reached during insertion that node is marked so we know that is printable/an actual word. Otherwise we would have slices and slices of strings with no "end".

The code, trie.h:

#ifndef TRIE_H
#define TRIE_H

#include<iostream>
#include <memory>
#include <string>
#include <vector>
#include <algorithm>
#include <sstream>

class Trie {
private:
  struct Node {
    Node() = default;
    Node(char letter, Node* parent) noexcept
      : rep{(parent ? parent->rep + letter : std::string{letter, 1})},
        parent{parent} {}
    bool print{false};
    std::string rep{""};
    Node* parent{nullptr};
    std::array<std::unique_ptr<Node>, 26> children{{}};
    void print_helper(std::ostream& os) const noexcept;
    void lookup_helper(std::vector<std::string>& list, const size_t& depth) const noexcept;
    void first_letter_diff_look(std::vector<std::string>& list, const int index, const std::string& word) noexcept;
  };


public:
  Trie() : _root{new Node}, _size{0} {};
  size_t size() const noexcept { return _size;}

  bool exists(const std::string& item) noexcept;
  void insert(const std::string& item);
  void remove(const std::string& item);
  void lookup(const std::string& item) noexcept;
  friend std::ostream& operator<<(std::ostream& os, const Trie& rhs);

private:
  std::unique_ptr<Node> _root;
  size_t _size;
};


bool Trie::exists(const std::string& item) noexcept {
  Node* cur_node = _root.get();
  for(auto&& letter : item){
    cur_node = cur_node->children[letter-97].get();
    if (cur_node == nullptr) return false;
  }
  return cur_node->print;
}

void Trie::remove(const std::string& item){
  if (!exists(item)) return;

  std::unique_ptr<Node>* cur_node = &_root;
  for(auto&& letter : item){ //get to the end of the word
    cur_node = &((*cur_node)->children[letter-97]);
  } 

  for(auto&& node : (*cur_node)->children){
    if (node != nullptr){ //there exists some longer word, unmark print
      (*cur_node)->print = false;
      break;
    }
  }
  //does not remove the entire word, either marks it as not printable
  //(in the case of a smaller word: bat/bats with bat removed)
  //or it sets the pointer to null at the end, the other letters still exist
  if((*cur_node)->print) cur_node->reset(nullptr);
  _size--;
}

void Trie::insert(const std::string& item) {
  if (exists(item)){
    std::cout << item << " is already in trie." << std::endl;
    return;
  }

  std::unique_ptr<Node>* cur_node = &_root;
  Node* parent = cur_node->get();
  for(auto&& letter : item){
    cur_node = &((*cur_node)->children[letter-97]);
    if (*cur_node == nullptr) cur_node->reset(new Node(letter, parent));
    parent = cur_node->get();
  } 
  parent->print = true;
  _size++;
}

void Trie::Node::lookup_helper(std::vector<std::string>& list, const size_t& depth) const noexcept {
  for(auto&& node : children){
    if(node != nullptr){
      node->lookup_helper(list, depth);
    } 
  }

  if (print && ((rep.length() >= depth-1) && (rep.length() <= depth+1))){
    list.push_back(rep);
  }
}

void Trie::Node::first_letter_diff_look(std::vector<std::string>& list, const int index, const std::string& word) noexcept{
  if (rep.substr(1) == word) {
    list.push_back(rep);
    return;
  }
  if (children[word[index]-97] != nullptr) children[word[index]-97]->first_letter_diff_look(list, (index+1), word);
}

void Trie::lookup(const std::string& item) noexcept {
  std::cout << "Searching for " << item << std::endl;
  if(exists(item)) {
    std::cout << "Trie contains: " << item << std::endl;
    return;
  }

  //build list of possible suggestions from existing words
  const size_t str_len = item.length();

  std::vector<std::string> hits{};
  for(auto&& node : _root->children){
    if(node != nullptr && node->rep[0] == item[0]){
      node->lookup_helper(hits, str_len);
    }
  }

  //check for words that only different by the first letter
  //use _root rather than the first letter node
  const std::string& suffix_array = item.substr(1);
  for(auto&& node : _root->children){
    if(node != nullptr){
      if (node->children[suffix_array[0]-97] != nullptr){
        node->first_letter_diff_look(hits, 0, suffix_array); 
      }
    }
  }
    
  std::cout << item << " not found. Did you mean?\n";
  if (hits.size() == 0){
    std::cout << "\tNo suggestions found.\n";
    return;
  }

  std::sort(hits.begin(), hits.end());

  for(auto&& hit : hits){
    std::cout << "\t" << hit << " \n";
  }
  std::cout << std::endl;
}

void Trie::Node::print_helper(std::ostream& os) const noexcept {
  if (print) os << rep << "\n";
  for(auto&& node : children){
    if(node != nullptr){
      node->print_helper(os);
    } 
  }
}

std::ostream& operator<<(std::ostream& os, const Trie& rhs){
  std::cout << rhs._size << " words in trie:\n";
  for(auto&& node : rhs._root->children){
    if(node) node->print_helper(os);
  }
  return os;
}

#endif



Search Algorithms

This part is perpetually a work in progress. This is because now that we have the trie, we can write whatever algorithm we want. For example, included with this post are two algorithms:

1. Same starting letter and +/- 1 length of the searched word
2. Different starting letter, same suffix

Both utilize recursion of the helper function to move through the trie.

The driver for the class code:

#include<iostream>
#include <fstream>

#include "trie.h"

void load_dictionary(const std::string& file, Trie& trie){
  std::cout << "Loading test dictionary..." << std::endl;
  std::ifstream in(file);
  std::string item;
  while(std::getline(in, item)){
    trie.insert(item);
  }
}

int main(){
  Trie trie;
  load_dictionary("sample_dict.txt", trie);
  std::cout << trie << std::endl;
  trie.lookup("follerskate");
  trie.lookup("cart");
  trie.lookup("bart");
}




Running with a sample dictionary of 18 words:

Quote

apple
banana
bass
bastion
bat
bats
cat
fox
hat
mart
pizza
print
rollerskate
sand
sandwich
sandwiches
smart
socks


$./Trie

Quote

Searching for follerskate
follerskate not found. Did you mean?
rollerskate

Searching for cart
cart not found. Did you mean?
cat
mart

Searching for bart
bart not found. Did you mean?
bass
bat
bats
mart


One interesting limitation of this implementation is that pointers are not shared across letters. This makes suffix searching more complex than it needs to be. You have to start at the root and check each layer for a possibility and then begin checking the suffix.

--

Happy Coding!

0 Comments On This Entry

 

April 2017

S M T W T F S
            1
2345678
9101112131415
16171819202122
23242526272829
30            

Tags

    Recent Entries

    Search My Blog

    1 user(s) viewing

    1 Guests
    0 member(s)
    0 anonymous member(s)