Ready for another high octane, explosion-filled and death defying episode of the Programming Underground?!?! Sure you are! Today I give my take on a blog article I stumbled across today called "Computer Algorithms: Linear Search in Sorted Lists" which gives a look at applying the linear search model to something sorted. I will offer a suggestion to perhaps make this a bit better for implementation and perhaps performance reasons. So buckle up and we will take another spin around the Programming Underground!
<Massive explosions and killer stunts to rock music theme music>
Ok, so if you are a reasonably skilled programmer you are probably saying to yourself "Linear search for sorted lists? Are you on crack? Use an algorithm like binary search!". And no, I am not on crack and neither is Stoimen Popov. What he is attempting to say is that for sorted lists which contain often searched for values, why not implement a linear search over these common values and go to the binary search method if the value has not already been found?
What he is describing is a cache. Starting with a sorted list we search for a value and once found, we move it to the top of the list. Next time we search for the value, we do a quick linear search at the top... which would be much quicker than a binary search every time. If it is not found in the first few entries at the top of the list, then you would go to a binary search to find the value in the remaining sorted list.
This approach is sound and definitely on the right track. The problem with it, besides having to go back to the binary search if the items being searched for was not at the top, is that if a person tries to do this on a list that is searched often, eventually you will throw your sorted list into disorganization. Perhaps you search for the value "Tom" a lot and someone else search "John" a lot and someone else "Jack" or "George" or "Ken" etc. Next thing you know you have hundreds of items out of order at the top of your list in a list that was suppose to be sorted. In other words, it is going to degrade over time. Especially if this list is being hit through something that shares the list.
My take on this is that we should look at breaking out this list of commonly searched for values into its own list, leaving the original list intact. Now perhaps this is what the original poster meant but failed to mention.
The thought here is that we will not destroy the order of the original sorted list by moving items we find to the top. Instead we will find the value and put it into a second list of X items. Then when a new search is performed, we linear search this subset list first and if not found then we go to a binary search on the original list. This allows the original list to remain "pure" and also gives us a second list to manipulate.
If we search for a value that was found in our smaller "cached" list, we move it to the top of the cached list and continue on. If it was found in the larger list, we add it to the top of the cached list. Once the cached list reaches X items (we can base the size of this list depending on the size of the original list), we start popping off items from the bottom. Thus the cached list is really a FIFO structure.
My initial attempt would be to make this cache list anywhere from 10 - 20% the size of the original list. So if we had 100 items, the cached item would be 10 to 20 items. However, this will always depend on the type of data you are storing.

The only real two setbacks I could see with this is how to implement a generic enough version of it and that it can take up additional memory (the classic space for performance trade off). Do we make the cache something static between calls? Your language and design will surely dictate how you do this. And do we have the space for this cached list?
I think putting this cache in its own list, broken out from the original list will prevent the original list from degrading, will allow us to manipulate the size of the cached terms, afford us the ability to even share the cache between lists if we wanted and also take advantage of the small linear search before using the binary search theory.
I am glad I stumbled across this article and will be giving it more thought during my implementations of it. It really kicks us out of the rut of seeing a sorted list and immediately skipping linear search as a method. Why not use it for a small subset and then use the binary as a backup? No harm in that I think!
Thanks for reading! :)
If you want more blog entries like this, check out the official blog over on The Coders Lexicon. There you will find more code, more guides and more resources for programmers of all skill levels!
<Massive explosions and killer stunts to rock music theme music>
Ok, so if you are a reasonably skilled programmer you are probably saying to yourself "Linear search for sorted lists? Are you on crack? Use an algorithm like binary search!". And no, I am not on crack and neither is Stoimen Popov. What he is attempting to say is that for sorted lists which contain often searched for values, why not implement a linear search over these common values and go to the binary search method if the value has not already been found?
What he is describing is a cache. Starting with a sorted list we search for a value and once found, we move it to the top of the list. Next time we search for the value, we do a quick linear search at the top... which would be much quicker than a binary search every time. If it is not found in the first few entries at the top of the list, then you would go to a binary search to find the value in the remaining sorted list.
This approach is sound and definitely on the right track. The problem with it, besides having to go back to the binary search if the items being searched for was not at the top, is that if a person tries to do this on a list that is searched often, eventually you will throw your sorted list into disorganization. Perhaps you search for the value "Tom" a lot and someone else search "John" a lot and someone else "Jack" or "George" or "Ken" etc. Next thing you know you have hundreds of items out of order at the top of your list in a list that was suppose to be sorted. In other words, it is going to degrade over time. Especially if this list is being hit through something that shares the list.
My take on this is that we should look at breaking out this list of commonly searched for values into its own list, leaving the original list intact. Now perhaps this is what the original poster meant but failed to mention.
The thought here is that we will not destroy the order of the original sorted list by moving items we find to the top. Instead we will find the value and put it into a second list of X items. Then when a new search is performed, we linear search this subset list first and if not found then we go to a binary search on the original list. This allows the original list to remain "pure" and also gives us a second list to manipulate.
If we search for a value that was found in our smaller "cached" list, we move it to the top of the cached list and continue on. If it was found in the larger list, we add it to the top of the cached list. Once the cached list reaches X items (we can base the size of this list depending on the size of the original list), we start popping off items from the bottom. Thus the cached list is really a FIFO structure.
My initial attempt would be to make this cache list anywhere from 10 - 20% the size of the original list. So if we had 100 items, the cached item would be 10 to 20 items. However, this will always depend on the type of data you are storing.

The only real two setbacks I could see with this is how to implement a generic enough version of it and that it can take up additional memory (the classic space for performance trade off). Do we make the cache something static between calls? Your language and design will surely dictate how you do this. And do we have the space for this cached list?
I think putting this cache in its own list, broken out from the original list will prevent the original list from degrading, will allow us to manipulate the size of the cached terms, afford us the ability to even share the cache between lists if we wanted and also take advantage of the small linear search before using the binary search theory.
I am glad I stumbled across this article and will be giving it more thought during my implementations of it. It really kicks us out of the rut of seeing a sorted list and immediately skipping linear search as a method. Why not use it for a small subset and then use the binary as a backup? No harm in that I think!
Thanks for reading! :)
If you want more blog entries like this, check out the official blog over on The Coders Lexicon. There you will find more code, more guides and more resources for programmers of all skill levels!
8 Comments On This Entry
Page 1 of 1
mojo666
07 December 2011 - 05:03 PM
If you want to maintain performance, the cache will have to be proportional to log(x) where x is the size of the list. If you make it 10 to 20 percent, you will be slowing the search as it takes more time to search such a cache than it would to just do a binary search on the origional list.
AdamSpeight2008
08 December 2011 - 02:30 AM
Make the cache a HashDictionary. O(1) checking, then O(log n) for non-cache.
What and how is "common value" determined? Do they mention that?
What and how is "common value" determined? Do they mention that?
CasiOo
08 December 2011 - 10:26 AM
Wouldn't you be visiting about/more elements with doing a linear search on 10-20% than just doing a binary search?
RetardedGenius
08 December 2011 - 11:46 AM
I think that there are very few situations where this approach would be beneficial. I'd like to a see an example of where this data structure would be more appropriate than a self-balancing binary tree or hash table...
The Architect 2.0
08 December 2011 - 07:50 PM
i think he meant log2(x). which would be 6.6 for a master list of 100.
Ricky65
09 December 2011 - 09:47 AM
Linear search has O(n) complexity in the worst case and O(n/2) complexity in the average case whereas binary search is O(log n) in the worst case and O(1) in the best case. It is therefore obvious why binary search is preferred over linear search for sorted lists.
elgose
09 December 2011 - 11:43 PM
The Architect 2.0 is right, since we're dealing with computer science base-2 should be implied.
While I see a slight bit of benefit in this design under specific conditions, the fact that your data is sorted really kills any need for this. A dataset of 10,000,000 items would only require at MOST 24 comparisons to find the item or realize it doesn't exist using your typical binary search. This is the same reason your cache would need to be limited to log(x) [base-2] at largest... if it's any larger, say a cache of 25 for 10million items, then your cache-only linear search is actually less efficient than a no cache at all implementation if that item is in the 25th cache slot (and especially if it's in the non-cache list). Even if your cache is log(x), this means if the item isn't in the cache (24 comparisons), then you have to check the non-cache list of 9,999,976 [10mill-24].
If the item isn't in the list, then you need log(9999976) more comparisons - which is 24. So instead of 24, you needed 24 for the cache and another 24 for the non-cache.
If you're concerned about most recently found items being closer to the "top", you might actually be more interested in a splay tree. On average, performance is that of a normal binary tree, O(log n), and items go to the root of the list when they are found.
While I see a slight bit of benefit in this design under specific conditions, the fact that your data is sorted really kills any need for this. A dataset of 10,000,000 items would only require at MOST 24 comparisons to find the item or realize it doesn't exist using your typical binary search. This is the same reason your cache would need to be limited to log(x) [base-2] at largest... if it's any larger, say a cache of 25 for 10million items, then your cache-only linear search is actually less efficient than a no cache at all implementation if that item is in the 25th cache slot (and especially if it's in the non-cache list). Even if your cache is log(x), this means if the item isn't in the cache (24 comparisons), then you have to check the non-cache list of 9,999,976 [10mill-24].
If the item isn't in the list, then you need log(9999976) more comparisons - which is 24. So instead of 24, you needed 24 for the cache and another 24 for the non-cache.
If you're concerned about most recently found items being closer to the "top", you might actually be more interested in a splay tree. On average, performance is that of a normal binary tree, O(log n), and items go to the root of the list when they are found.
Page 1 of 1
Tags
My Blog Links
Recent Entries
Recent Comments
-
-
Narek Babadjanyan
on May 06 2013 09:40 AM
Demo of Twitter Application-Only OAuth Authentication Using Java
-
-
-
Curtis Rutland
on Jan 29 2013 01:54 PM
Stash Data Away In Controls With The Tag Property Using VB.NET
Search My Blog
0 user(s) viewing
0 Guests
0 member(s)
0 anonymous member(s)
0 member(s)
0 anonymous member(s)
|
|



8 Comments








|