
Linear Search for Sorted Lists: My Take

Ready for another high-octane, explosion-filled and death-defying episode of the Programming Underground?!?! Sure you are! Today I give my take on a blog article I stumbled across called "Computer Algorithms: Linear Search in Sorted Lists", which looks at applying linear search to sorted data. I will offer a suggestion that may make the idea a bit better for both implementation and performance reasons. So buckle up and we will take another spin around the Programming Underground!

<Massive explosions and killer stunts set to rock theme music>

Ok, so if you are a reasonably skilled programmer you are probably saying to yourself "Linear search for sorted lists? Are you on crack? Use an algorithm like binary search!" And no, I am not on crack, and neither is Stoimen Popov. What he is suggesting is that for sorted lists containing frequently searched-for values, why not run a linear search over those common values first and fall back to the binary search method if the value has not already been found?

What he is describing is a cache. Starting with a sorted list, we search for a value and, once found, move it to the top of the list. The next time we search for that value, a quick linear scan at the top finds it... much quicker than running a full binary search every time. If the value is not in the first few entries at the top of the list, we fall back to a binary search over the remaining sorted portion.
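Something like this minimal Python sketch is how I read his approach (the function name and the top_size parameter are my own illustrative choices, not from the original article):

import bisect

def search_with_move_to_front(items, target, top_size=4):
    # Linear scan of the (possibly unsorted) hot region at the top.
    top = min(top_size, len(items))
    for i in range(top):
        if items[i] == target:
            items.insert(0, items.pop(i))  # refresh its spot at the top
            return True
    # Binary search the remainder, assumed to still be in sorted order.
    j = bisect.bisect_left(items, target, lo=top)
    if j < len(items) and items[j] == target:
        items.insert(0, items.pop(j))      # promote the hit to the top
        return True
    return False

Note the "assumed to still be in sorted order" comment; that assumption is exactly where this scheme runs into trouble, as we will see next.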

This approach is sound and definitely on the right track. The problem with it, besides having to fall back to the binary search when the item being searched for is not at the top, is that if you do this on a list that is searched often, you will eventually throw your sorted list into disarray. Perhaps you search for the value "Tom" a lot, someone else searches for "John" a lot, and someone else "Jack" or "George" or "Ken", etc. Next thing you know, you have hundreds of items out of order at the top of a list that was supposed to be sorted. In other words, it is going to degrade over time, especially if the list is shared and hit by many different searchers.

My take is that we should break this list of commonly searched-for values out into its own list, leaving the original list intact. Now perhaps this is what the original poster meant but failed to mention.

The thought here is that we will not destroy the order of the original sorted list by moving found items to the top. Instead, we will find the value and put it into a second list of X items. Then when a new search is performed, we linearly search this subset list first and, if the value is not found, go to a binary search on the original list. This allows the original list to remain "pure" and also gives us a second list to manipulate.

If we search for a value that is found in our smaller "cached" list, we move it to the top of the cached list and continue on. If it is found in the larger list, we add it to the top of the cached list. Once the cached list reaches X items (we can base X on the size of the original list), we start popping items off the bottom. Strictly speaking, that makes the cached list an LRU (least recently used) structure rather than a plain FIFO: recent hits stay near the top and stale values fall off the bottom.
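Here is a minimal sketch of my version of the idea in Python (the class and method names are mine, purely for illustration):

import bisect

class CachedSortedSearch:
    def __init__(self, sorted_items, cache_size=10):
        self.items = sorted_items   # original list: stays sorted, never reordered
        self.cache = []             # small list of recently found values, hottest first
        self.cache_size = cache_size

    def search(self, target):
        # 1) Quick linear pass over the small cache.
        for i, value in enumerate(self.cache):
            if value == target:
                self.cache.insert(0, self.cache.pop(i))  # move the hit to the top
                return True
        # 2) Fall back to binary search on the pristine sorted list.
        j = bisect.bisect_left(self.items, target)
        if j < len(self.items) and self.items[j] == target:
            self.cache.insert(0, target)     # add the new hit to the top
            if len(self.cache) > self.cache_size:
                self.cache.pop()             # evict the stalest value off the bottom
            return True
        return False

The key design point is that self.items is only ever read, never reordered, so the binary search fallback stays valid no matter how churned the cache gets.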

My initial attempt would be to make this cache list anywhere from 10 to 20% of the size of the original list. So if we had 100 items, the cache would hold 10 to 20 items. However, the right size will always depend on the type of data you are storing.
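For example, using the sketch above with the 10% rule of thumb (again, the data here is purely illustrative):

# Build a 100-name sorted list with a few "popular" values mixed in.
names = sorted(["George", "Jack", "John", "Ken", "Tom"]
               + [f"User{i:03d}" for i in range(95)])
searcher = CachedSortedSearch(names, cache_size=max(1, len(names) // 10))

print(searcher.search("Tom"))  # first lookup: binary search, then cached
print(searcher.search("Tom"))  # second lookup: found at the very top of the cache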


The only two real setbacks I can see with this are figuring out how to implement a generic enough version of it, and the additional memory it takes (the classic space-for-performance trade-off). Do we make the cache something static between calls? Your language and design will surely dictate how you do this. And do we have the space for this cached list?

I think putting this cache in its own list, broken out from the original, will prevent the original list from degrading, allow us to tune the number of cached terms, afford us the ability to even share the cache between lists if we wanted, and still take advantage of the small-linear-search-before-binary-search idea.

I am glad I stumbled across this article and will be giving it more thought in my own implementations. It really kicks us out of the rut of seeing a sorted list and immediately ruling out linear search as a method. Why not use it on a small subset and then use binary search as a backup? No harm in that, I think!

Thanks for reading! :)

If you want more blog entries like this, check out the official blog over on The Coders Lexicon. There you will find more code, more guides and more resources for programmers of all skill levels!

8 Comments On This Entry


mojo666

07 December 2011 - 05:03 PM
If you want to maintain performance, the cache will have to be proportional to log(x), where x is the size of the list. If you make it 10 to 20 percent, you will be slowing the search down, as it takes more time to scan such a cache than it would to just do a binary search on the original list.

AdamSpeight2008

08 December 2011 - 02:30 AM
Make the cache a HashDictionary. O(1) checking, then O(log n) for non-cache.
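One way to read that suggestion in Python (an illustrative sketch only; a real cache would also need an eviction policy to stay bounded):

import bisect

def search_with_hash_cache(sorted_items, cache, target):
    if target in cache:                           # O(1) average membership check
        return True
    i = bisect.bisect_left(sorted_items, target)  # O(log n) fallback
    if i < len(sorted_items) and sorted_items[i] == target:
        cache.add(target)                         # remember the hit for next time
        return True
    return False

# Usage: found = search_with_hash_cache(names, set(), "Tom")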

What is a "common value", and how is it determined? Do they mention that?

CasiOo

08 December 2011 - 10:26 AM
Wouldn't you be visiting about as many, or more, elements by doing a linear search over 10-20% of the list than by just doing a binary search?

RetardedGenius

08 December 2011 - 11:46 AM
I think that there are very few situations where this approach would be beneficial. I'd like to see an example of where this data structure would be more appropriate than a self-balancing binary tree or hash table...

Martyr2

08 December 2011 - 03:07 PM
Well, the common value is determined by the person designing the system. It could be integers where we know certain numbers come up more often than others. It could be certain company names for a shipping business (maybe "IBM" comes up more often than "mom and pop shop") because they do more shipping via IBM. There may be times where you don't know what the common values are, and they only show themselves during production use.

Now I see people slightly confusing this with "search smaller list vs search larger list", when that is not what he is saying at all. Obviously if the cache is big enough, you don't gain anything over doing something like a binary search on the original list.

The trick here is that the cache holds common data values. Can I find a previously found value faster by scanning the cache linearly than by making enough chop-in-half operations on the larger list? The more recently a value was searched for, the faster this approach will be. The last value searched for will be found on the FIRST iteration through the cache, versus several halving steps of a binary search on the original list.

Now I can't say for sure that the size of the cache should be the log of the size of the list. For this implementation to really run at its worst, it would have to either 1) not find the value in the cache and then have to start anew with a binary search on the original list, or 2) find the item at the bottom of the cache.

Honestly I don't see log(x) being at all right, because if I made the size of the list 100 elements, this would mean log(100), which is 2. I could still find a value faster with a linear search of a 4-item list than it would take to find the value with a binary search (fewer operations). Maybe 10% is still too big, but it would have to be bigger than log(x). Even with a 1,000-item list, log(1000) would only be 3.

But again it would really depend on what a common value is and how common it is given the domain of the list.

I guess what they were trying to say, and what I am reiterating in this blog post, is that in the best case scenario I can find the value at the top of the cache and short-circuit the need for even a fast binary search. In the worst case scenario I don't find it and have to resort to doing the original binary search. You will certainly have to test this structure against other methods. What we don't want to do is degrade an existing sorted list for this type of implementation.

Thanks for the great input guys! :)

The Architect 2.0

08 December 2011 - 07:50 PM
I think he meant log2(x), which would be about 6.6 for a master list of 100.

Ricky65

09 December 2011 - 09:47 AM
Linear search is O(n) in the worst case and takes about n/2 comparisons in the average case, whereas binary search is O(log n) in the worst case and O(1) in the best case. It is therefore obvious why binary search is preferred over linear search for sorted lists.

elgose

09 December 2011 - 11:43 PM
The Architect 2.0 is right; since we're dealing with computer science, base-2 should be implied.

While I see a slight bit of benefit in this design under specific conditions, the fact that your data is sorted really kills any need for it. A dataset of 10,000,000 items would only require at MOST 24 comparisons to find an item, or to realize it doesn't exist, using a typical binary search. This is the same reason your cache would need to be limited to log(x) [base-2] at the largest... if it's any larger, say a cache of 25 for 10 million items, then your cache-only linear search is actually less efficient than a no-cache implementation whenever the item is in the 25th cache slot (and especially when it's in the non-cache list). Even if your cache is log(x) in size, a miss costs 24 comparisons before you even touch the non-cache list of 9,999,976 [10mill-24] items.

If the item isn't in the list at all, then you need log(9,999,976) more comparisons - which is again 24. So instead of 24 total, you needed 24 for the cache and another 24 for the non-cache.
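A quick sanity check of those figures in Python:

import math

# Binary search needs about log2(n) comparisons, rounded up.
print(math.ceil(math.log2(10_000_000)))  # 24 for the full ten-million-item list
print(math.ceil(math.log2(9_999_976)))   # still 24 for the non-cache remainder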

If you're concerned about most recently found items being closer to the "top", you might actually be more interested in a splay tree. On average, performance is that of a normal binary tree, O(log n), and items move to the root of the tree when they are accessed.
