Subscribe to Martyr2's Programming Underground        RSS Feed
***** 2 Votes

Introduction to a SkipList Data Structure in Java

Icon 12 Comments
Hello everyone! The other day I was approached by macosxnerd101, a newly minted moderator here at DIC, and asked to contribute to a new thread coming up on Java data structures. Now most of the new programmers here know of the basic data structures like binary trees, arrays, heaps or stacks. Many of these topics I knew would be covered by other contributors to the thread. So I decided to introduce you to one that is based on our old friend the linked list, but with a new twist. It is called the SkipList (aka JumpList). We will talk a bit about what it is, how to build one in Java and what the heck they are good for. So sit right back and lets throw around some data structures on this episode of the Programming Underground!

<Nightwish's song Eva but sang by a man with a really high voice theme music>

To understand what a Skiplist is, we have to understand what a linked list is. For those of you who don't know, a linked list is a collection of objects which are linked together by one "node" pointing to the next one in the chain. Think of it as a single file congo line. You put your hands on the hips of the person in front of you and they do likewise to the one in front of them. Starting at one end of the line, we can reach each node by following who they are holding onto.

Now as you can imagine a double linked list points in both directions. It points to the one in front of it and following it as well. This allows us to start at the beginning of the line or at the end of the line and still reach a node going either direction. To form a double linked list we simple make sure each node is pointing to the node we are adding to the list and that the one we are adding is pointing back to the one before it.

A SkipList (also known as a JumpList when the links are geometrically spaced) starts with the classic linked list. However, instead of just pointing to the node that it follows, or is in front of it, it can also point to nodes much further down the line. By having these extra pointers we can literally "skip" nodes in the line if we know they are not what we are after. If we know we need the 50th person in the congo line, why start at the beginning and move through the first 49 people? What if we asked the person at the start of the line "Can you point out who the 50th person is?" and then go straight there.

By having these extra pointers spaced out at our discretion, we can speed up lookups within a linked list. Depending on how many pointers and how we organize them, we can drastically increase operations in a linked list. Just keep in mind however that we have to take care of the pointers as we modify the list and this means writes/deletes may need to make additional operations from time to time.

In our example below we create a list of 100 nodes in our linked list. Each one of them is simply carrying an integer representation of where in the line they are. So we have nodes marked from 1 to 100. We are going to setup our nodes to include four extra pointers. Not all nodes in the chain will need to have these point to an actual object. Instead we will set them only on certain nodes. Below is a picture of how our setup works...

Posted Image

As you can see from the image above, our first node points to node number 2, but also will be set to point to node 25 (quarter of the way through the list) and also point to node number 50 (half way through the list). Node 25 will point to the next quarter which is also node number 50. Number 50 points to nodes 100 (next half) and also 75 (next quarter).

These new pointers are then used as "express lanes" when it comes to finding a value. If we are looking for node number 77, instead of looping through 76 nodes to get to it, we can ask "Is the first half way node still less than 77?" Yes, it is 50. So we jump all the way to node 50. Once there we ask that node "Is the next half way node still less than 77?" In this case it would be no because the next node is 100. We then ask "Ok, is the next quarter node less than 77?" Yes, it is 75. So we jump to 75. "Is the next quarter less than 77?" No, the next quarter is 100.

Now that we have no more express lanes to jump through, all we have left is our single linked list. We then go to 76 and finally reach 77. The overall process has lead to 4 iterations rather than 76 iterations if we had gone through one by one. This process can also work in reverse starting from the tail since we are also double linked through our previous pointers.

How might we accomplish this task using Java? Well... here is an example...


public class SkipListExample {
    private static Node Head = null;
    private static Node Tail = null;
    private static Node Current = null;
	
    // Records previous quarter or half mark
    private static Node quarter = null;
    private static Node half = null;
	
    public static void main(String[] args) {
        int ListCount = 100;
		
        // Setup a standard double linked list manually
        for (int i = 1; i <= ListCount; i++) {
            Node newNode = new Node();
            newNode.data = i;
			
		
            // Assign to head node if this is the first node
            if (Head == null) { 
                Head = newNode; 
                Tail = newNode; 
                Current = newNode;
				
                quarter = newNode;
                half = newNode;
            }
            else {
                Current.next = newNode;
                newNode.prev = Current;
				
                // Add a quarter pointer if this node is a multiple of 25
                if ((i % 25) == 0) { 
                    newNode.prevQuarter = quarter; 
                    quarter.nextQuarter = newNode;
					
                    quarter = newNode;
                }

                // Add a half pointer if this node is a multiple of 50
                if ((i % 50) == 0) { 
                    newNode.prevHalf = half; 
                    half.nextHalf = newNode;
					
                    half = newNode;
                }
				
                // Set current node to be the next in line, set the tail
                // and then loop back around for next node.
                Current = newNode;
                Tail = newNode;
            }	
        }
		

        // Run some tests
        System.out.println("Find number 7... ");
        FindNumber(7);
		
        System.out.println("Find Number 33... ");
        FindNumber(33);
		
        System.out.println("Find number 67... ");
        FindNumber(67);
		
        System.out.println("Find number 101... ");
        FindNumber(101);	
    }
	
	
    // Searches our skip lists to locate our number
    public static void FindNumber(int num) {
        Node currentNode = Head;
        boolean Found = false;
		
        while (currentNode != null) {
            // Simply show what nodes we have checked in our search
            System.out.println("Checked node with data: " + currentNode.data);

            // End search if value is greater than the value we are looking for.
            if (currentNode.data > num) { break; }
			
			
            // Check our various pointers to see if a jump would get us closer.
            if (currentNode.data != num) {
                if ((currentNode.nextHalf != null) && (currentNode.nextHalf.data <= num)) { currentNode = currentNode.nextHalf; }
                else if ((currentNode.nextQuarter != null) && (currentNode.nextQuarter.data <= num)) { currentNode = currentNode.nextQuarter; }
                else { currentNode = currentNode.next; }
            }
            else {
                Found = true;
                break;
            }
        }
		
        // Report our findings
        if (Found) { System.out.println("Number Found!"); }
        else { System.out.println("Number wasn't found!"); }
    }
}


// Our Node object with prev/next pointers and jump pointers
class Node {
    public int data;
    public Node next = null;
    public Node prev = null;
	
    // Specialized skip list pointers
    public Node nextHalf = null;
    public Node prevHalf = null;
	
    public Node nextQuarter = null;
    public Node prevQuarter = null;
}



The code above starts off by first creating a 100 node double linked list using our class "Node" (as seen towards the end). In addition to each next/prev pointer we have also added nextHalf/prevHalf and nextQuarter/prevQuarter. As we are building the list we are testing the value given to each node to see if it is a multiple of 25 (presenting a quarter of our 100 node list) and also if it is a multiple of 50 (start, half way or end of our list). At these special nodes we also set their corresponding skip list pointers. At each one of these special nodes we also save a reference it for when we reach the next "milestone". We can use this reference to then get at it and set its next pointer while at the same time being able to set the previous pointer for the node we are currently adding. Think of that part like a bookmark of sorts.

Once we are done building the list, we have all our pointers in place and now we are ready to test them out with a test "FindNumber" function. Each time we visit a node we print out a small message so we can see how many nodes are visited during our search. The function goes through each node and takes the quickest path to its destination. It first tests the half way node and uses it if it can, otherwise it will look to the next quarter pointer and then finally the individual node pointers.

How can this sort of system break down? Well, if you misplace your pointers, perhaps don't space them out properly or decide to add/remove nodes from the list without updating pointers then you can quickly find yourself cutting down the efficiency. For instance if you were to take our list, just the way it is, and made it 1 million nodes you would be able to look up values under 100 quite fast, but finding node 402,199 would still require a sequential lookup from node 100 onwards.

Other skipping schemes you could implement would be even or odd, marking prime numbers, square numbers, showing multiples or perhaps built in decision making. This type of data structure allows a SORTED linked list to lookup and locate nodes more quickly than a sequential process. Skip list structures can also be thought of as tree data structures but with multiple branches. Either way, if you have a lot of nodes in a sorted order and are going to be using it to find values multiple times, consider a skiplist. Depending on the data and how often it is accessed, this could even be faster than a binary search algorithm!

Hope you enjoyed this introduction to skip lists! Thanks again for reading! :)

If you want more blog entries like this, check out the official blog over on The Coders Lexicon. There you will find more code, more guides and more resources for programmers of all skill levels!

12 Comments On This Entry

Page 1 of 1

Dogstopper Icon

13 July 2010 - 09:40 PM
That's really cool Martyr2! Nice topic and good turorial! I look forward to the next one!
0

nano-gilmour Icon

14 July 2010 - 12:53 AM
When searching for a node that doesn't exist in the list your code will traverse the whole list, to fix that you have to stop whenever you encounter a node with data greater than the data your looking for.
Other than that it's a nice tutorial :D
1

macosxnerd101 Icon

14 July 2010 - 04:24 AM
Martyr2, this was an awesome read, especially first thing after waking up. I had played around conceptually with this idea when I was starting out with Linked Lists to optimize for binary search, and I came to the conclusion that it was too much work and too array-like. So this tutorial was definitely very enlightening and very interesting, as it showed otherwise. Keep up the good work! :^:
0

xclite Icon

14 July 2010 - 05:58 AM
Skiplists are definitely an interesting data structure (probably one of my favorites) and I was excited to see this featured - well done.
1

Martyr2 Icon

14 July 2010 - 10:16 AM

nano-gilmour, on 13 July 2010 - 10:53 PM, said:

When searching for a node that doesn't exist in the list your code will traverse the whole list, to fix that you have to stop whenever you encounter a node with data greater than the data your looking for.
Other than that it's a nice tutorial :D



You are right... it is one of the down sides of this method and again, the skipping worst case scenario here is that all nodes will have to be checked for sure. In other words, worse case scenario this data structure is linear. One optimization would certainly be what you described of performing a check and stopping once a value is greater than I am searching for. It is a great point.

Do keep in mind however that if the value is not in the list, but it is greater than the lowest value it will utilize the skips to get close to where the value "should have been". So in many cases the structure will still cut down the search. For instance if I am looking for value 78 and it wasn't in my list of 100. I would still be utilizing the skip to 50, the skip to 75 and then it would go sequentially one by one until the end of the list. Cutting my iterations down significantly still from having to search all 100.

But point taken. I have added the check. Thanks for pointing it out. :)
0

athlon32 Icon

14 July 2010 - 01:31 PM
This is great! Very easy to read, and no details left out. The code was also very clean, and totally understandable (unlike other data structure tutorials I've seen). This deserves five stars! Keep up the good work. :)
0

jignesh272 Icon

08 August 2010 - 04:19 PM
Awesome tutorial.. looking forward for next one. :phone:
0

Jayman Icon

17 August 2010 - 08:22 PM
That was a fantastic read. Very well thought out and written.

I really enjoyed it. Keep up the excellent work.
1

Munawwar Icon

23 September 2010 - 01:58 PM
Actually, skip lists aren't extremely structured like the way you have shown it. The level heights are decided based on a probability factor.

William Pugh (the one who invented skip lists) calls it a "A Probabilistic Alternative to Balanced Trees" (You can find the paper here).

The problem with balanced trees is that it spends a lot of time restructuring itself (on node insertions and deletions). Balanced trees aims at minimizing search time by making a 'perfect' structure - I mean highly optimized.

Skip lists aims at using probability to make a structure that is reasonably good (means faster insertions and deletions)....so as to save time by NOT restructuring the whole data structure. But this obviously makes a search operation on a skip list slightly slower as compared to balanced trees.

So...whats the moral of the story? We saved some time on insertions and deletions.Does it matter?
Well, it depends upon the task at hand. If you are trying to optimize search, then use trees.If you want faster insertions and deletions then use skip lists.

There is one problem with skip lists though. Remember that, a insertion/deletion requires a search operation. Since skip lists don't maintain a perfect structure, there is a probability that over time, the structure deteriorates and makes the search operation considerably slow (and in turn making insertions/deletions slow).In such a case, a restructuring will be needed...But heck, it only happens once in a blue moon!

To understand how skip lists work read the paper here or watch this MIT video lecture.

A side topic:
I was thinking, how about deciding level heights based on search frequency...somewhat like a cache....hmmm
0

Martyr2 Icon

03 October 2010 - 10:33 PM
Actually I do mention that the structure of a typical list is determined by spacing the pointers at the builders discretion. This discretion I refer to is the probability factor. You would fit your pointers and "express lanes" to more closely match the probability of certain items being searched for. Even the video you refer to mentions that if I want to give all items an equal weight I would look to uniformly distribute the pointers. So the example is certainly valid given I want the "ideal" structure that gives equal weight.

But thanks for the additional info. Will make great review material. :)
0

Munawwar Icon

04 October 2010 - 06:48 AM

Quote

By having these extra pointers spaced out at our discretion, we can speed up lookups within a linked list.
...
...
...
Other skipping schemes you could implement would be even or odd, marking prime numbers, square numbers, showing multiples or perhaps built in decision making. This type of data structure allows a SORTED linked list to lookup and locate nodes more quickly than a sequential process. Skip list structures can also be thought of as tree data structures but with multiple branches. Either way, if you have a lot of nodes in a sorted order and are going to be using it to find values multiple times, consider a skiplist. Depending on the data and how often it is accessed, this could even be faster than a binary search algorithm!

Oops, Sorry, I didn't see that. I word searched 'probability' :P, and didn't find you mentioning it :D.
Anyway, Martyr2, this is a great introduction to skip lists.

I read my last post where I said:

Quote

Since skip lists don't maintain a perfect structure, there is a probability that over time, the structure deteriorates and makes the search operation considerably slow (and in turn making insertions/deletions slow).In such a case, a restructuring will be needed...But heck, it only happens once in a blue moon!

Nah, I am wrong. In a probabilistic approach, the structure does deteriorates over time - and it is common thing. I think, the skip list should be restructured at regular intervals - say after every 1000 insertion operations or so - hmm.
0

charpede Icon

24 October 2011 - 07:41 PM
After learning earlier today what a Linked List is from macoxnerd it was very refreshing to see a tutorial that was well written, and about a not so commonly thought of topic. This was a great read and a fantastic thing that should help me in the future. Thanks for posting!
0
Page 1 of 1