Dictionary Search Algorithm

Just an interesting question

Page 1 of 1

8 Replies - 23381 Views - Last Post: 22 November 2007 - 09:02 AM

#1 aceofspades686

  • D.I.C Regular

Reputation: 6
  • Posts: 334
  • Joined: 08-October 07

Dictionary Search Algorithm

Posted 16 October 2007 - 05:38 AM

This train of thought was brought on by this thread in VB.NET.

I'm curious as to how one would accomplish a dictionary search for things like spell check and such. Assuming the words were stored in a text file (just as an example), a smaller dictionary (~50,000 words) would be simple enough to cycle through on every lookup, but that seems like it would be incredibly slow, especially considering that you may get up to 400,000 or 500,000 words eventually.

In short, the only way I've been able to come up with would be searching on a per-letter basis. For example, look at the word typed: if it starts with "a", go to the list of words starting with "a". If the second letter is "e", go to the words within the "a" list whose next letter is "e", and so forth, until you either run out of matches or get an exact match.
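(For illustration: that per-letter structure is what's usually called a trie, or prefix tree. A minimal sketch in Python, just to show the idea rather than any particular language's implementation:)

```python
# Minimal trie (prefix tree): the per-letter search described above.
class TrieNode:
    def __init__(self):
        self.children = {}    # letter -> TrieNode
        self.is_word = False  # True if a dictionary word ends at this node

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def contains(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:   # ran out of matches
            return False
    return node.is_word    # exact match only if a word ends here

root = TrieNode()
for w in ["ace", "aces", "apple"]:
    insert(root, w)
```

Lookup cost is proportional to the word's length, not the dictionary's size, which is exactly why the per-letter idea scales past 500,000 words.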

Would this be a viable search algorithm? Or is there a cleaner and faster way of going about this?

More a question out of curiosity than anything; I haven't done much research on it, but I'm curious how others would go about this.


Replies To: Dictionary Search Algorithm

#2 Louisda16th

  • dream.in.assembly.code

Reputation: 15
  • Posts: 1,967
  • Joined: 03-August 06

Re: Dictionary Search Algorithm

Posted 16 October 2007 - 06:02 AM

Here are two Wikipedia links that might prove useful :).
CLICK HERE
AND HERE

#3 aceofspades686

  • D.I.C Regular

Reputation: 6
  • Posts: 334
  • Joined: 08-October 07

Re: Dictionary Search Algorithm

Posted 16 October 2007 - 06:13 AM

Hm, interesting. I never would've thought about an interpolation search for a dictionary, but it makes sense to an extent. Which of these would prove faster in an instance like this? Or is the difference negligible?
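(For what it's worth: plain binary search on a sorted word list is the simpler of the two and a safe default. Interpolation search only beats it, roughly O(log log n) versus O(log n), when the keys are close to uniformly distributed, which English words are not. A quick Python sketch of the binary-search version, using the standard library's `bisect`:)

```python
from bisect import bisect_left

def binary_contains(sorted_words, word):
    """O(log n) membership test on a sorted word list."""
    i = bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

words = sorted(["zebra", "ace", "apple", "aeon", "banana"])
```

For ~500,000 words that's at most about 19 comparisons per lookup, so in practice the difference between the two is likely negligible.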

#4 Louisda16th

  • dream.in.assembly.code

Reputation: 15
  • Posts: 1,967
  • Joined: 03-August 06

Re: Dictionary Search Algorithm

Posted 16 October 2007 - 06:21 AM

LOL. I just searched Wikipedia :). I'm not very sure, sorry.

#5 realNoName

  • D.I.C Regular

Reputation: 7
  • Posts: 343
  • Joined: 04-December 06

Re: Dictionary Search Algorithm

Posted 16 October 2007 - 07:51 PM

Peter Norvig (Director of Research at Google) has an essay on his site about how to write a spell checker... might be a good read.


http://www.norvig.co...ll-correct.html
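(The core idea of that essay, as I remember it: generate every string within one edit of the typo, keep the ones that appear in the dictionary, and rank them by frequency. A stripped-down sketch, with the frequency ranking replaced by a simple tie-break for brevity:)

```python
import string

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, dictionary):
    candidates = edits1(word) & dictionary
    # The real essay picks the highest-frequency candidate; this just
    # picks the alphabetically first one to keep the sketch short.
    return min(candidates) if candidates else word

dictionary = {"spelling", "corrected"}
```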

#6 PsychoCoder

  • Google.Sucks.Init(true);

Reputation: 1641
  • Posts: 19,853
  • Joined: 26-July 07

Re: Dictionary Search Algorithm

Posted 16 October 2007 - 08:10 PM

Actually, what I would do is load the words into a serialized object and hold it in that state, then search the object as I need it rather than the actual text file. When loading it, I would load it via its own BackgroundWorker thread so as not to interrupt the normal processing of the application.

While I'm doing threading anyway, I would put that background worker on its own timer to trigger a reload of the words at a given interval, in case a word was added since the last time the object was loaded. Since it's a background thread, it wouldn't have any noticeable effect on the performance of the application.
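(Since the idea isn't tied to any one language, here's a rough Python sketch of the same pattern: load the word list off the main thread, swap it in under a lock, and reload on a timer. The file path and interval are placeholders:)

```python
import threading
import time

def load_words(path):
    """Read the dictionary file into a set for O(1) membership tests."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

class BackgroundDictionary:
    """Holds the current word set; a daemon thread reloads it periodically."""
    def __init__(self, path, reload_seconds=300.0):
        self.path = path
        self.reload_seconds = reload_seconds
        self._words = set()
        self._lock = threading.Lock()

    def reload(self):
        fresh = load_words(self.path)  # the slow part runs off the main thread
        with self._lock:
            self._words = fresh        # atomic swap under the lock

    def start(self):
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            self.reload()
            time.sleep(self.reload_seconds)

    def contains(self, word):
        with self._lock:
            return word.lower() in self._words
```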

But that's just an idea that bounced around inside my head when I read this :)

#7 Programmist

  • CTO

Reputation: 252
  • Posts: 1,833
  • Joined: 02-January 06

Re: Dictionary Search Algorithm

Posted 19 October 2007 - 08:22 AM

That would work. If the dictionary got too big (millions of words) and you didn't want it to take up too much memory, then you might want to think about a local index/db, and possibly build a local cache based on recent activity. But 500,000 words (using Unicode encoding at an average of 10 characters per word, a number which I'm pulling out of my a$$) would only set you back 10 MB, which is not much these days. Look at how much Firefox takes up; right now mine is sitting at 128 MB.
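(The back-of-envelope math does come out right, assuming 2 bytes per character as in UTF-16:)

```python
words = 500_000
avg_chars = 10       # the guessed average word length above
bytes_per_char = 2   # UTF-16 code unit
total_mb = words * avg_chars * bytes_per_char / 1024 / 1024
# comes to just under 10 MB
```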

#8 Kiriran

  • D.I.C Head

Reputation: 1
  • Posts: 60
  • Joined: 11-April 07

Re: Dictionary Search Algorithm

Posted 21 October 2007 - 02:30 AM

My professor once told me that you can do such a dictionary search with a special type of tree: the BK-tree. But don't ask me how it works :P There's a PS document on the wiki; I'm just reading it now and it's pretty good.
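(A BK-tree organizes words by edit distance, so it finds near-misses rather than just exact matches; the triangle inequality lets it skip whole subtrees. A compact Python sketch of the idea:)

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

class BKTree:
    """Each child edge is labelled with its distance to the parent word."""
    def __init__(self, word):
        self.word = word
        self.children = {}  # distance -> BKTree

    def add(self, word):
        d = levenshtein(word, self.word)
        if d in self.children:
            self.children[d].add(word)
        else:
            self.children[d] = BKTree(word)

    def search(self, word, tolerance):
        """All words within `tolerance` edits of `word`."""
        d = levenshtein(word, self.word)
        results = [self.word] if d <= tolerance else []
        # Triangle inequality: only children with edge labels in
        # [d - tolerance, d + tolerance] can contain matches.
        for dist, child in self.children.items():
            if d - tolerance <= dist <= d + tolerance:
                results += child.search(word, tolerance)
        return results

tree = BKTree("book")
for w in ["books", "cake", "boo", "cape"]:
    tree.add(w)
```

That pruning is what makes it attractive for spell check: a misspelled word's near neighbours are found without scanning the whole dictionary.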

#9 nullonehalf

  • New D.I.C Head

Reputation: 1
  • Posts: 5
  • Joined: 07-November 07

Re: Dictionary Search Algorithm

Posted 22 November 2007 - 09:02 AM

Besides implementing your dictionary as a tree where words are represented by paths from the root node to a leaf, you could also use a hash table. Depending upon the collision resolution scheme you used, you could expect to find a given word in fewer comparisons on average than would be required to traverse the tree.

One effective hash function would be H(L1, L2, WL) := (L1*26 + L2)*10 + min(WL - 2, 9), where L1 and L2 are the first and second letters of the word, respectively, and WL is its length. This is a suggestion from a rather old academic paper, "Computer Programs for Detecting and Correcting Spelling Errors" by James L. Peterson of the University of Texas, and was used to implement some very early spell-checking programs.
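(A direct Python transcription of that function, with letters mapped to 0..25. Treating a missing second letter as 0 for one-letter words is my own assumption, not something from the paper, and note that words shorter than two letters would need special handling anyway since WL - 2 goes negative:)

```python
def peterson_hash(word):
    """H(L1, L2, WL) = (L1*26 + L2)*10 + min(WL - 2, 9), letters as 0..25."""
    l1 = ord(word[0].lower()) - ord('a')
    # Assumption: one-letter words treat the missing second letter as 0.
    l2 = ord(word[1].lower()) - ord('a') if len(word) > 1 else 0
    return (l1 * 26 + l2) * 10 + min(len(word) - 2, 9)
```

With 26*26 letter pairs and 10 length buckets, this spreads the dictionary across up to 6,760 slots, each holding only words sharing a first-two-letters-and-length signature.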

The hash function could also be improved by attempting to exploit known patterns in the English language and assumptions about the expected use and behavior of your program. Is it looking up every word in the document to see if it matches some known word in the dictionary? If this is the case, the hash table could be subdivided into several smaller hash tables organized with very common words (a, the, and, etc.) in the first hash table, and less common words in the subsequent tables, with these tables indexed (probably) by word length.

The possible advantage of the hash table over the tree is that your search will generally be O(1 + n/m), where n/m is the load factor (stored words per table slot); since the dictionary will probably be checking whole documents at a time, the hash table can be expected to run faster, since it can deal with the large volume of common words much more quickly than the tree, which has to blindly traverse each word from scratch for each token. Moreover, the actual word strings will be immediately available for use in the program, whereas any string operations in the tree representation will either require the string to be constructed from the path or will require special tree operations.

The trade-off is that the table will almost certainly take up more storage than the tree.

'\0' .5
