Subscribe to Stuck in an Infiniteloop        RSS Feed
***** 1 Votes

An Application of Suffix Arrays

Icon 4 Comments
I was on CodeAnthem the other day and the following query was presented:

Quote

Given a string, return the longest repeating substring, i.e. for the string "When life hands you lemonade, make lemons", the longest repeating substring is "lemon".

(More accurately it is " lemon", but for the purposes of this post we'll ignore surrounding white space, more on this later).

So how does one approach this problem? My research led me to conclude that the best avenue of attack was a suffix array.

Algorithm:
1. Create an array of every suffix of the given string
2. Sort
3. Iterate through the array comparing the current index to the previous index
4. Assemble a temporary string, continue until the characters do not match, keep a track of the longest substring thus far
5. If the length of the assembled string exceeds the counter add it to a collection of "possibles"
6. Return the longest string out of the possibles.



For a visualization, let's start with the word "banana".

Quick sidebar: The problem might specify that the repeating substring must not overlap, but that's not a specification here, so the longest repeating substring is "ana" rather then "an" or "na".


Create a suffix array:

The easiest way to do this is to simply strip off the leading character and store that substring:

public static String findLongestSubstring(String value) {
		String[] strings = new String[value.length()];
		String longestSub = "";
		
		//strip off a character, add new string to array
		for(int i = 0; i < value.length(); i++){
			strings[i] = new String(value.substring(i));
		}
		//debug/visualization
		//before sort
		for(int i = 0; i < strings.length; i++){
			System.out.println(strings[i]);
		}
//...continued later



Banana's suffix array:

Quote

banana
anana
nana
ana
na
a


This in and of itself isn't really helpful, but once we sort it...

Sort:

//let's not reinvent the wheel
        Arrays.sort(strings);
		System.out.println();
		
		//debug/visualization
		//after sort
		for(int i = 0; i < strings.length; i++){
			System.out.println(strings[i]);
		}



Banana's suffix array becomes:

Quote

a
ana
anana
banana
na
nana


Now that it's sorted, repeating sections of the string become more apparent.

Iteration and comparison:

Since we don't need to verify or check overlap all we need to check is the current index and the one that precedes it.

Vector<String> possibles = new Vector<String>();
		String temp = "";
		int curLength = 0, longestSoFar = 0;
		
		/*
		 * now that the array is sorted compare the letters
		 * of the current index to those above, continue until 
		 * you no longer have a match, check length and add
		 * it to the vector of possibilities
		 */	
		for(int i = 1; i < strings.length; i++){
			for(int j = 0; j < strings[i-1].length(); j++){
				if (strings[i-1].charAt(j) != strings[i].charAt(j)){
					break;
				}
				else{
					temp += strings[i-1].charAt(j);
					curLength++;
				}
			}
			//this could alleviate the need for a vector
			//since only the first and subsequent longest 
			//would be added; vector kept for simplicity
			if (curLength >= longestSoFar){
				longestSoFar = curLength;
				possibles.add(temp);
			}
			temp = "";
			curLength = 0;
		}



Search possibles:

We now have some possibilities for longest substring. Iterate through the vector and grab the longest one. We are concerned with whitespace up until the end, so trim() the result and return:

//iterate through the vector to find the longest one
		int max = 0;
		for(int i = 0; i < possibles.size();i++){
			//debug/visualization
			System.out.println(possibles.elementAt(i));
			if (possibles.elementAt(i).length() > max){ 
				max = possibles.elementAt(i).length();
				longestSub = possibles.elementAt(i);
			}
		}
		System.out.println();
		//concerned with whitespace up until this point
		// "lemon" not " lemon" for example
		return longestSub.trim(); 
	}



Output:

Quote

//vector contents
a
ana

//end result
ana



Full code with several examples:
(added more print statements for visualization)

public class Test{
	public static void main(String[] args){
		System.out.println(findLongestSubstring("i like ike"));
		System.out.println();
		System.out.println(findLongestSubstring("madam i'm adam"));
		System.out.println();
		System.out.println(findLongestSubstring("When life hands you lemonade, make lemons"));
		System.out.println();
		System.out.println(findLongestSubstring("abracadabra"));
		System.out.println();
		System.out.println(findLongestSubstring("banana"));
	}
	public static String findLongestSubstring(String value) {
		String[] strings = new String[value.length()];
		String longestSub = "";
		
		//strip off a character, add new string to array
		for(int i = 0; i < value.length(); i++){
			strings[i] = new String(value.substring(i));
		}
		
		//debug/visualization
		//before sort
		for(int i = 0; i < strings.length; i++){
			System.out.println(strings[i]);
		}
		
		Arrays.sort(strings);
		System.out.println();
		
		//debug/visualization
		//after sort
		for(int i = 0; i < strings.length; i++){
			System.out.println(strings[i]);
		}
		
		Vector<String> possibles = new Vector<String>();
		String temp = "";
		int curLength = 0, longestSoFar = 0;
		
		/*
		 * now that the array is sorted compare the letters
		 * of the current index to those above, continue until 
		 * you no longer have a match, check length and add
		 * it to the vector of possibilities
		 */	
		for(int i = 1; i < strings.length; i++){
			for(int j = 0; j < strings[i-1].length(); j++){
				if (strings[i-1].charAt(j) != strings[i].charAt(j)){
					break;
				}
				else{
					temp += strings[i-1].charAt(j);
					curLength++;
				}
			}
			//this could alleviate the need for a vector
			//since only the first and subsequent longest 
			//would be added; vector kept for simplicity
			if (curLength >= longestSoFar){
				longestSoFar = curLength;
				possibles.add(temp);
			}
			temp = "";
			curLength = 0;
		}
		
		System.out.println();
		//debug/visualization
		System.out.println("Longest string length from possibles: " + longestSoFar);
		
		System.out.println();
		//iterate through the vector to find the longest one
		int max = 0;
		for(int i = 0; i < possibles.size();i++){
			//debug/visualization
			System.out.println(possibles.elementAt(i));
			if (possibles.elementAt(i).length() > max){ 
				max = possibles.elementAt(i).length();
				longestSub = possibles.elementAt(i);
			}
		}
		System.out.println();;
		//concerned with whitespace up until this point
		// "lemon" not " lemon" for example
		return longestSub.trim(); 
	}
}



Output sans debug printing:

Quote

ike
adam
lemon
abra
ana



--
Happy coding!

4 Comments On This Entry

Page 1 of 1

Martyr2 Icon

07 September 2010 - 09:50 AM
Very very nice. I must say that it is a beautiful and elegant implementation. Looks like it could be easily ported over to another language too. All around A+.

:)
0

Mercurial Icon

09 September 2010 - 12:50 AM
Nice solution, I like it.
0

Tuishimi Icon

11 September 2010 - 08:58 AM
"Return the longest, repeating substring". It is nice, but aren't you reusing letters in "banana"? Shouldn't the longest be "an" or "na"?
0

KYA Icon

11 September 2010 - 01:06 PM
@Tuishimi "ana" repeats if you don't count the overlapping 'a'. One could avoid overlapping with a quick conditional while checking for letter patterns

@Martyr2 and Mercurial Thanks! :)
0
Page 1 of 1

November 2014

S M T W T F S
      1
2345678
9101112131415
16171819202122
2324 25 26272829
30      

Tags

    Recent Entries

    Recent Comments

    Search My Blog

    2 user(s) viewing

    2 Guests
    0 member(s)
    0 anonymous member(s)