7 Replies - 12442 Views - Last Post: 05 December 2012 - 04:20 PM

#1 nick2price • D.I.C Lover

multithreaded hashmap

Posted 30 March 2012 - 01:59 PM

I had an interview today, and an interesting topic came up. This company deals with a lot of PDF files, each of which is scanned into a String. One operation they have to do is store every word in the file in a HashMap, with the word as the key and its count as the value; if a word occurs again, its count is incremented. So say we had the words

then
did
wonder
did
how

it would be
then | 1
did | 2
wonder | 1
how | 1
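
A minimal sketch of that counting step in Java, assuming the words have already been extracted from the scanned String (the class name is just for illustration):

import java.util.HashMap;
import java.util.Map;

public class SingleFileCount {
   public static void main(String[] args) {
      // Words as they might come out of one scanned PDF.
      String[] words = {"then", "did", "wonder", "did", "how"};
      Map<String, Integer> counts = new HashMap<>();
      for (String word : words) {
         // Insert the word with count 1, or add 1 to the existing count.
         counts.merge(word, 1, Integer::sum);
      }
      counts.forEach((w, n) -> System.out.println(w + " | " + n));
   }
}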

Now if they have 10 PDF files, processing them one after another is slow, roughly ten times the cost of a single file. So one way they improve efficiency is to use threading. The count of words is not for a single file, but for all files. So if there is a did in file 2, the single hashmap would increase to did | 3. This means that the threads cannot update the map independently, because if they both read in the word "did" in their individual files at the same time, and picked up the hashmap at the same time, they would both see did | 3 and write back did | 4, when after both updates the result should be did | 5. So there needs to be some sort of coordination between the threads: one thread cannot update the hashmap while another thread is partway through its own read-and-update.
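
A minimal sketch of that lost update, assuming a plain HashMap shared between two threads (the outcome depends on scheduling, so some runs will still print 5):

import java.util.HashMap;
import java.util.Map;

public class LostUpdateDemo {
   public static void main(String[] args) throws InterruptedException {
      Map<String, Integer> counts = new HashMap<>();
      counts.put("did", 3);

      // Unsynchronized read-modify-write: both threads may read 3
      // and both write back 4, losing one of the increments.
      Runnable increment = () -> counts.put("did", counts.get("did") + 1);

      Thread t1 = new Thread(increment);
      Thread t2 = new Thread(increment);
      t1.start();
      t2.start();
      t1.join();
      t2.join();

      System.out.println("did | " + counts.get("did")); // 5 expected, 4 possible
   }
}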

To me personally, this seems like a strange approach to take, although when a business needs to be efficient, alternatives need to be created. I was always under the impression that HashMaps were not thread-safe, so performing an operation which requires so many boundaries to be set around the threads must be risky.

Just wondering what people think of this approach, and what other alternatives they might use. Remember though, the key is efficiency, so it is no good processing one file at a time.

Replies To: multithreaded hashmap

#2 macosxnerd101 • Self-Trained Economist

Re: multithreaded hashmap

Posted 30 March 2012 - 02:16 PM

The java.util.concurrent.ConcurrentHashMap class might be what you're looking for. There are quite a few thread-safe collections in the java.util.concurrent package.
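
For the word-count case that might look something like this sketch (merge is atomic on a ConcurrentHashMap, so no update is lost; it needs Java 8 or later):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class SharedWordCount {
   private final ConcurrentMap<String, Integer> counts = new ConcurrentHashMap<>();

   // Safe to call from many threads at once: merge performs the
   // read-modify-write atomically for the given key.
   public void countWord(String word) {
      counts.merge(word, 1, Integer::sum);
   }

   public int countOf(String word) {
      return counts.getOrDefault(word, 0);
   }
}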

#3 nick2price • D.I.C Lover

Re: multithreaded hashmap

Posted 30 March 2012 - 03:24 PM

I looked into that, and it would definitely solve any issues with thread safety, so that's a viable option. I am going to sleep on it and see if I can come up with another approach. For some reason I don't really like threads, so I am trying to think of a way this could be avoided. But for now it is the only thing I can think of that can perform concurrent operations, so it's the only option. I also don't know the exact details of the efficiency of the different Collections, so I want to check whether a HashMap would be the best choice here. I doubt something like an ArrayList would be any good, because counting the words would require quite a few passes over it.
I will think about this and see what I can come up with.

Cheers

#4 nick2price • D.I.C Lover

Re: multithreaded hashmap

Posted 31 March 2012 - 01:13 PM

I'm at the stage of reading and sorting a single file. Now I'm just working out the best way to process multiple files concurrently.

package javaapplication1;

import java.io.*;
import java.util.*;

public class WordCount {

   // Note: TextReader is not part of the JDK; it is a helper class (with
   // eof(), peek(), getAnyChar() and getAlpha() methods) that must be on
   // the classpath for this to compile.
   static TextReader in;
   static PrintWriter out;

   // Holds one word together with the number of times it has been seen.
   static class WordData {

      String word;
      int count;

      WordData(String w) {
         word = w;
         count = 1;
      }
   } // end class WordData

   // Orders WordData objects by descending count.
   static class CountCompare implements Comparator<WordData> {
      public int compare(WordData data1, WordData data2) {
         return data2.count - data1.count;
      }
   } // end class CountCompare


   public static void main(String[] args) {
      String[] args1 = {"Coursework2.txt", "out.txt"};
      openFiles(args1);

      TreeMap<String, WordData> words = new TreeMap<>();

      readWords(in, words);

      List<WordData> wordsByCount = new ArrayList<>(words.values());
      Collections.sort(wordsByCount, new CountCompare());

      out.println("Words found in the file named \"" + args1[0] + "\".\n");
      out.println("The number of times that the word occurred in the");
      out.println("file is given in parentheses after the word.\n\n");
      out.println("The words from the file in alphabetical order:\n");

      printWords(out, words.values());  // Print words alphabetically.

      out.println("\n\nThe words in order of frequency:\n");

      printWords(out, wordsByCount);  // Print words by frequency count.

      if (out.checkError()) {
         System.out.println("An error occurred while writing the data.");
         System.out.println("Output file might be missing or incomplete.");
         System.exit(1);
      }

      System.out.println(words.size() + " distinct words were found.");

   } // end main()


   static void printWords(PrintWriter outStream, Collection<WordData> wordData) {
      for (WordData data : wordData) {
         outStream.println("   " + data.word + " (" + data.count + ")");
      }
   }


   static void openFiles(String[] args1) {

      if (args1.length != 2) {
         System.out.println("Error: Please specify file names on command line.");
         System.exit(1);
      }
      try {
         in = new TextReader(new FileReader(args1[0]));
      }
      catch (IOException e) {
         System.out.println("Error: Can't open input file " + args1[0]);
         System.exit(1);
      }
      try {
         out = new PrintWriter(new FileWriter(args1[1]));
      }
      catch (IOException e) {
         System.out.println("Error: Can't open output file " + args1[1]);
         System.exit(1);
      }
   } // end openFiles()


   static void readWords(TextReader inStream, Map<String, WordData> words) {
      try {
         while (true) {
            // Skip any non-letter characters before the next word.
            while (!inStream.eof() && !Character.isLetter(inStream.peek()))
               inStream.getAnyChar();
            if (inStream.eof())
               break;  // Exit because there is no more data to read.
            String word = inStream.getAlpha();
            word = word.toLowerCase();
            WordData data = words.get(word);
            if (data == null) {
               words.put(word, new WordData(word));
            }
            else {
               data.count = data.count + 1;
            }
         }
      }
      catch (TextReader.Error e) {
         System.out.println("An error occurred while reading the data.");
         System.out.println(e.toString());
         System.exit(1);
      }

   } // end readWords()


} // end class WordCount

#5 Ghlavac • D.I.C Addict

Re: multithreaded hashmap

Posted 31 March 2012 - 03:42 PM

I would go with a divide-and-conquer approach: for each PDF I'd spawn a separate thread with its own hashmap, then merge them all when each has completed its task, so the file reads are all done in parallel and then merged together in series, perhaps?
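
A sketch of that idea, assuming each PDF has already been extracted to a plain String (the sample documents here are made up):

import java.util.*;
import java.util.concurrent.*;

public class PerFileWordCount {

   // Count one document into a private map: no other thread sees it,
   // so no locking is needed while counting.
   static Map<String, Integer> countOne(String text) {
      Map<String, Integer> local = new HashMap<>();
      for (String word : text.toLowerCase().split("[^a-z]+")) {
         if (!word.isEmpty()) {
            local.merge(word, 1, Integer::sum);
         }
      }
      return local;
   }

   public static void main(String[] args) throws Exception {
      List<String> documents = Arrays.asList("then did wonder", "did how did");

      // One task per document; the counting runs in parallel.
      ExecutorService pool = Executors.newFixedThreadPool(documents.size());
      List<Future<Map<String, Integer>>> results = new ArrayList<>();
      for (String doc : documents) {
         results.add(pool.submit(() -> countOne(doc)));
      }

      // Merge the per-file maps in series once the parallel part is done.
      Map<String, Integer> total = new HashMap<>();
      for (Future<Map<String, Integer>> result : results) {
         for (Map.Entry<String, Integer> e : result.get().entrySet()) {
            total.merge(e.getKey(), e.getValue(), Integer::sum);
         }
      }
      pool.shutdown();
      System.out.println(total); // e.g. {wonder=1, did=3, how=1, then=1}
   }
}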

#6 nick2price • D.I.C Lover

Re: multithreaded hashmap

Posted 31 March 2012 - 03:48 PM

The only problem with that approach is that although you are reading in the files concurrently, the single merged result you are left with still covers the contents of 10 files. So producing and going through this single merged result could take a similar amount of time as going through the files separately. So in terms of efficiency, they would need to remain 10 individual files, analysed concurrently.

#7 TheGDeveloper • D.I.C Head

Re: multithreaded hashmap

Posted 09 November 2012 - 06:20 AM

No, his solution is the more efficient one. I have run this type of code many times and this is the fastest approach. The time you gain by minimising locking on the hashmap while the different documents are being processed is greater than the cost of merging the hashmaps at the end. And the final merging can be done in parallel too.
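
A sketch of merging in parallel as well, assuming the per-document maps have already been built as in the earlier reply; each worker folds its own map into a shared ConcurrentHashMap, and merge keeps each per-key update atomic:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ParallelMerge {

   static Map<String, Integer> mergeAll(List<Map<String, Integer>> perFile)
         throws InterruptedException {
      ConcurrentMap<String, Integer> total = new ConcurrentHashMap<>();
      List<Thread> workers = new ArrayList<>();
      for (Map<String, Integer> local : perFile) {
         // Each thread merges one per-file map; merge is atomic per key.
         workers.add(new Thread(() ->
               local.forEach((word, n) -> total.merge(word, n, Integer::sum))));
      }
      for (Thread t : workers) t.start();
      for (Thread t : workers) t.join();
      return total;
   }
}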

#8 cfoley • Cabbage

Re: multithreaded hashmap

Posted 05 December 2012 - 04:20 PM

9 months late but anyway...

Is counting the words really the slow part? I would have thought it would be reading the files, maybe parsing them but certainly not counting words. Did anyone profile the application before jumping on the threading bandwagon?

Threading doesn't change the complexity class of the algorithm, only the constant. If I can come up with a faster algorithm then that's at least as good as multithreading -- better because you might also be able to multithread that!

The hash map approach implies a three-pass (maybe two-pass) process: one pass to convert the PDF to a string, one to extract words from the String, and another to calculate the hash function on each word. I guess the first two could be combined to make it a two-pass process.

Instead of counting with hash maps, how about an alpha tree (a character trie)? One pass to get a string from the PDF, another to iterate over the string while traversing the tree. If you hit a space (or other non-word character), increment the node you are on and jump back to the root node. You can make this a one-pass algorithm if you can apply it to a stream of characters from the PDF file. Then you can multithread it too, if you have to.
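
A rough sketch of that tree, assuming words are plain a-z and everything else is a boundary (counts live on the node where a word ends):

public class CountingTrie {

   private final CountingTrie[] children = new CountingTrie[26];
   private int count; // number of words that end at this node

   // One pass over the text: walk down the tree letter by letter; on a
   // non-letter, record a word ending at the current node and jump back
   // to the root. Call this on the root node.
   public void countWords(CharSequence text) {
      CountingTrie node = this;
      for (int i = 0; i < text.length(); i++) {
         char c = Character.toLowerCase(text.charAt(i));
         if (c >= 'a' && c <= 'z') {
            int idx = c - 'a';
            if (node.children[idx] == null) {
               node.children[idx] = new CountingTrie();
            }
            node = node.children[idx];
         } else if (node != this) {
            node.count++;
            node = this;
         }
      }
      if (node != this) {
         node.count++; // the text ended mid-word
      }
   }

   // Look up one word's count (assumes the word is all letters).
   public int countOf(String word) {
      CountingTrie node = this;
      for (char c : word.toLowerCase().toCharArray()) {
         node = node.children[c - 'a'];
         if (node == null) {
            return 0;
         }
      }
      return node.count;
   }
}

Feeding "then did wonder did how" to countWords would give countOf("did") == 2, for example.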

Of course, if you haven't profiled your application then most of this will be a waste of time anyway!
