Welcome to Dream.In.Code
Getting Help is Easy!

Join 136,105 Programmers for FREE! Get instant access to thousands of experts, tutorials, code snippets, and more! There are 1,688 people online right now. Registration is fast and FREE... Join Now!




Dictionary Search Algorithm

 
Reply to this topicStart new topic

Dictionary Search Algorithm, Just an interesting question

aceofspades686
16 Oct, 2007 - 04:38 AM
Post #1

D.I.C Regular
Group Icon

Joined: 8 Oct, 2007
Posts: 261


Dream Kudos: 100
My Contributions
This train of thought was brought on by this thread in VB.NET.

I'm curious as to how one would accomplish a dictionary search for things like spell check and such. Assuming they were stored in a text file (just as an example) for smaller dictionaries(~50,000 words), it would be seemingly simple enough to cycle through all the words each time, but that seems like it would be incredibly slow especially considering that you may get up to 400,000 or 500,000 words eventually.

In short the only way I've been able to come up with would be searching on a per-letter basis. For example, look at the word typed, if it starts with "a" go to the list of words starting with "a". If the second letter was "e" go to the words within the "a" list that the next letter was "e" and so forth until you either run out of matches or you get an exact match.

Would this be a viable search algorithm? Or is there a cleaner and faster way of going about this?

Just more a question out of curiosity than anything, I haven't done much research on it but I'm curious how others would go about this.
User is offlineProfile CardPM
+Quote Post

Louisda16th
RE: Dictionary Search Algorithm
16 Oct, 2007 - 05:02 AM
Post #2

 
Group Icon

Joined: 3 Aug, 2006
Posts: 1,790



Thanked: 1 times
Dream Kudos: 755
My Contributions
Here are two wikipedia links that might prove useful smile.gif.
CLICK HERE
AND HERE
User is offlineProfile CardPM
+Quote Post

aceofspades686
RE: Dictionary Search Algorithm
16 Oct, 2007 - 05:13 AM
Post #3

D.I.C Regular
Group Icon

Joined: 8 Oct, 2007
Posts: 261


Dream Kudos: 100
My Contributions
Hm, interesting. I never would've thought about an interpolation search for a dictionary, but it makes sense to an extent. Which of these would prove to be the faster in an instance like this? Or is the difference negligible?
User is offlineProfile CardPM
+Quote Post

Louisda16th
RE: Dictionary Search Algorithm
16 Oct, 2007 - 05:21 AM
Post #4

 
Group Icon

Joined: 3 Aug, 2006
Posts: 1,790



Thanked: 1 times
Dream Kudos: 755
My Contributions
LOL. I just searched wikipedia smile.gif. Im not very sure. Sorry.
User is offlineProfile CardPM
+Quote Post

realNoName
RE: Dictionary Search Algorithm
16 Oct, 2007 - 06:51 PM
Post #5

D.I.C Regular
***

Joined: 4 Dec, 2006
Posts: 299



Thanked: 5 times
My Contributions
Peter Norvig (Director of Research Google) has an essay on his site about how to write a spell check... might be a good read


http://www.norvig.com/spell-correct.html
User is offlineProfile CardPM
+Quote Post

PsychoCoder
RE: Dictionary Search Algorithm
16 Oct, 2007 - 07:10 PM
Post #6

using DIC.Core;
Group Icon

Joined: 26 Jul, 2007
Posts: 8,983



Thanked: 125 times
Dream Kudos: 8600
Expert In: VB, VB.Net, C#, SQL, ASP, ASP.Net, Web Development, HTML, CSS, Win32 API, Javascript, mySQL, J#, Boo.Net

My Contributions
Actually, what I would do is load the words into a serialized object and hold it in that state, then as I need it search the object and not the actual text file. When loading it I would load it via its own BackgroundWorker Thread so as to not interrupt the normal processing of the application.

While I'm doing threading anyways, I would have said background worker on its own timer to initialize the reloading of the words on a given increment, in the event a word was added since the last time the object was loaded, and since it is a background thread it wouldn't have any effect on the performance of the application, well not enough to notice really.

But thats just an idea that bounced around inside my head when I read this smile.gif
User is offlineProfile CardPM
+Quote Post

Programmist
RE: Dictionary Search Algorithm
19 Oct, 2007 - 07:22 AM
Post #7

Four-letter word
Group Icon

Joined: 2 Jan, 2006
Posts: 1,179



Thanked: 6 times
Dream Kudos: 100
Expert In: Java

My Contributions
that would work. If the dictionary got too big (millions), and you didn't want it to take up too much memory, then you might want to think about a local index/db and possibly build a local cache, based on recent activity. But 500,000 words (using unicode encoding at an average of 10 characters per word, a number which I'm pulling out of my a$$) would only set you back 10 MB, which is not much these days. Look at how much Firefox takes up. Right now mine is sitting at 128MB.
User is offlineProfile CardPM
+Quote Post

Kiriran
RE: Dictionary Search Algorithm
21 Oct, 2007 - 01:30 AM
Post #8

New D.I.C Head
*

Joined: 11 Apr, 2007
Posts: 41


My Contributions
My professor once told me that you can do such a dictionary search with a special type of tree: The BK-Tree. But don't ask me how it work tongue.gif There's a PS document on the wiki. Just reading it and it's pretty good
User is offlineProfile CardPM
+Quote Post

nullonehalf
RE: Dictionary Search Algorithm
22 Nov, 2007 - 08:02 AM
Post #9

New D.I.C Head
*

Joined: 7 Nov, 2007
Posts: 5


My Contributions
Besides implementing your dictionary as a tree where words are represented by paths from the root node to a leaf, you could also use a hash table. Depending upon the collision resolution scheme you used, you could expect to find a given word in fewer comparisons on average than would be required to traverse the tree.

One effective hash function would be H(L1, L2, WL) := (L1*26 + L2)*10 + min(WL - 2, 9), where L1 and L2 are the first and second letters of the word, respective, and WL is it length; this is a suggestion from a rather old academic paper, "Computer Programs for Detecting and Correcting Spelling Errors" by James L. Peterson of the University of Texas, and was used to implement some very early spell-checking programs.

The hash function could also be improved by attempting to exploit known patterns in the English language and assumptions about the expected use and behavior of your program. Is it looking up every word in the document to see if it matches some known word in the dictionary? If this is the case, the hash table could be subdivided into several smaller hash tables organized with very common words (a, the, and, etc.) in the first hash table, and less common words in the subsequent tables, with these tables indexed (probably) by word length.

The possible advantage of the hash table over the tree is that your search will generally be O(1 + m/n), where m/n is the ratio of table entries to actual words; since the dictionary will probably be checking whole documents at a time, the hash table can be expected to run faster since it can deal with the large volume of common words much more quickly than can the tree, which will have to blindly each word from scratch for each token. Moreover, the actual word strings will be immediately available for use in the program, whereas any string operations in the tree representation will either require the string to be constructed from the path, or will require special tree operations.

The trade-off is that the table will almost certainly take up more storage than the tree.

'\0' .5
User is offlineProfile CardPM
+Quote Post

Fast ReplyReply to this topicStart new topic
Time is now: 12/1/08 09:12PM

Live Help!

Tutorials

Programming

Web Development

Reference Sheets

Code Snippets

DIC Chatroom

Bye Bye Ads

Monthly Drawing

Thumb Drive

Top Contributors

Top 10 Kudos This Month