7 Replies - 6874 Views - Last Post: 25 January 2010 - 05:58 AM

#1 tip120

  • New D.I.C Head

Reputation: 0
  • Posts: 15
  • Joined: 11-November 09

Best way to parse massive amounts of data

Posted 13 December 2009 - 09:09 AM

So, I have a massive text file, around 12 gigabytes, with data formatted like so:

data1 data2
data3 data4

I've tried several different ways to search this file, using grep, Perl, and MySQL.

I tried loading the data into a MySQL database, with an index, and optimizing the database. Searching for a string took about 6 minutes.
Using grep took about 3 minutes.
The Perl script I posted below takes about 2 minutes 30 seconds.

#!/usr/local/bin/perl
use strict;
use warnings;

print "Input search string: ";
my $input = <STDIN>;
chomp($input);
print "Searching for $input. Please wait.\n";

# Scan the file line by line and bail out at the first match.
open my $fh, '<', './file1' or die "Can't open ./file1: $!";
while (my $linelook = <$fh>) {
        if ($linelook =~ m/^.*\b($input)\b.*$/i) {
                close $fh;
                die "String found: $linelook\n";
        }
}
print "String not found.\n";
close $fh;

This is the code I have been using. I would like to find a more efficient way to search this file.

The data in the file is largely file names and md5sums, and I want to be able to search for either file name or md5sum.

However, I would settle for just being able to search the md5sum efficiently.

My idea is to convert the md5sum from hex to decimal and store it in the database, since MySQL is much faster at searching int fields than char fields. However, I don't know how to store a 128-bit integer in MySQL, nor do I know how to convert a 32-character hex string into a 128-bit decimal integer.
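
For example, something like this is the kind of conversion I'm picturing (just a sketch, untested, and I'm not sure Math::BigInt is even the right tool for it):

#!/usr/local/bin/perl
use strict;
use warnings;
use Math::BigInt;

# An md5sum is 32 hex characters, i.e. 128 bits.
my $md5 = 'd41d8cd98f00b204e9800998ecf8427e';   # md5 of the empty string
my $as_int = Math::BigInt->from_hex('0x' . $md5);
print "$as_int\n";   # the full 128-bit value in decimal

Even if that works, I'd still need a MySQL column type big enough to hold the result.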

Any ideas?


Replies To: Best way to parse massive amounts of data

#2 dsherohman

  • Perl Parson

Reputation: 227
  • Posts: 654
  • Joined: 29-March 09

Re: Best way to parse massive amounts of data

Posted 14 December 2009 - 05:50 AM

Although I can suggest some micro-optimizations on the Perl code you posted (mostly just tuning the regex), I'd be willing to bet that the vast majority of the run time is being spent on file I/O. From the numbers you've given, I would expect that just reading the file without doing any processing on it would take nearly three minutes.

The only way you're going to get a significant speedup, then, is to avoid having to read in the whole thing.

A linear scan (like what grep or your posted Perl code do) requires you to read in the whole file, with no possible optimization aside from bailing out once a match is found (which your Perl does; this is why it's faster than grep and able to find a match in less time than is required to read the entire file from disk).

To get away from having to do linear scans, you need to either sort or index your data. Since you want to be able to search on more than one thing, indexing (like, say, with a database, so you don't have to write all the indexing and search code yourself) is probably the way to go.

Which brings us to your 6-minute runtime with MySQL. This is pretty much as expected - if it takes almost three minutes to read the original file, then it will take at least another three minutes to write the same amount of data back out to the database file, plus a bit of overhead for processing and indexing the data. This should be a one-time cost, though, and (even with 12G of data, provided it's indexed on the field you're searching) subsequent search times should be in the sub-second range. If it's taking you six minutes for each and every search using MySQL, then you're doing it wrong. Post that code here and we can show you how to do it right.
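
In the meantime, here's a rough sketch of the kind of one-time indexing and indexed lookup I mean (untested, and the table/column names and connection details are just guesses based on what you've posted):

#!/usr/local/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=yourdb', 'user', 'password',
                       { RaiseError => 1 });

# One-time cost: index the column you search on.
$dbh->do('CREATE INDEX idx_hash ON proglist (hash)');

# After that, an exact-match lookup can use the index and should be fast.
my $row = $dbh->selectrow_arrayref(
    'SELECT filename, hash FROM proglist WHERE hash = ?',
    undef, 'd41d8cd98f00b204e9800998ecf8427e');
print $row ? "Found: @$row\n" : "Not found.\n";

The important part is that the WHERE clause is an exact match (=) on an indexed column; a LIKE with a leading wildcard can't use the index.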

#3 tip120

  • New D.I.C Head

Reputation: 0
  • Posts: 15
  • Joined: 11-November 09

Re: Best way to parse massive amounts of data

Posted 14 December 2009 - 10:18 AM

dsherohman, on 14 Dec, 2009 - 04:50 AM, said:

If it's taking you six minutes for each and every search using MySQL, then you're doing it wrong. Post that code here and we can show you how to do it right.


I used LOAD DATA to load the data into the DB, then generated an index on the table I wanted to search, and ran OPTIMIZE on the whole thing.

After that, I ran a few select * from proglist where hash LIKE whatever; queries, and each one took around six minutes.

#4 dsherohman

  • Perl Parson

Reputation: 227
  • Posts: 654
  • Joined: 29-March 09

Re: Best way to parse massive amounts of data

Posted 15 December 2009 - 05:09 AM

tip120, on 14 Dec, 2009 - 05:18 PM, said:

After that, I did a few select * from proglist where hash LIKE whatever;, and all of them took around six minutes.

Perhaps I'm misunderstanding what you want to do, then. The code you initially posted was scanning through the file to locate a single item, but here it sounds like you're trying to retrieve all records, not just a single item. If that's the case, then you've got it backwards. :) Scanning through a plain text file will generally be the fastest way to get all records, while a database will usually be the fastest way to locate and retrieve just one of them.

Assuming you've already built the database,
#!/usr/local/bin/perl
use strict;
use warnings;

use DBI;

print "Input search string: ";
my $input = <>;
chomp($input);
print "Searching for $input. Please wait.\n";

my $dbh = DBI->connect([database details here]);
my $sth = $dbh->prepare('SELECT * FROM proglist WHERE hash = ?');
$sth->execute($input);
my @record = $sth->fetchrow_array;
$sth->finish;
if (@record) {
  print "String found: @record\n";
} else {
  print "String not found.\n";
}
should be equivalent to your posted code and run a good deal faster (assuming no typos on my part; I haven't actually run it). Even if you're running a query that can't use an index (which may arise if you're using LIKE), it should be reasonably fast - I just test-ran a query against an unindexed column of a 1.3 million record table and had results in about a second.

Going back to the I/O issue again, though, if your Perl/MySQL program is running that slow for you on single-record lookups, then first run the query manually from the command-line mysql interface to determine whether it's Perl or MySQL that's being slow, then target your optimization efforts accordingly.
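
For the Perl side of that comparison, a timing wrapper along these lines (untested sketch; fill in your own connection details) will tell you how long the query takes when run through DBI:

#!/usr/local/bin/perl
use strict;
use warnings;
use DBI;
use Time::HiRes qw(gettimeofday tv_interval);

my $dbh = DBI->connect('dbi:mysql:database=yourdb', 'user', 'password',
                       { RaiseError => 1 });

my $t0  = [gettimeofday];
my @row = $dbh->selectrow_array(
    'SELECT * FROM proglist WHERE hash = ?',
    undef, 'd41d8cd98f00b204e9800998ecf8427e');
printf "Query returned %d fields in %.3f seconds\n", scalar(@row), tv_interval($t0);

If that number roughly matches what the command-line client reports, the time is being spent in MySQL rather than in Perl.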

#5 tip120

  • New D.I.C Head

Reputation: 0
  • Posts: 15
  • Joined: 11-November 09

Re: Best way to parse massive amounts of data

Posted 18 December 2009 - 08:59 AM

Sorry if I'm being a little confusing. :(

Both my Perl script and the select statement were meant to get one line or one row, respectively.

select * from proglist where hash LIKE "md5sum here";

It returned the filename, ID, and md5sum (all three columns) from the database for the row where the entered md5sum is found. I got the last hash in the DB and ran that select statement a couple of times, and it averaged about 6 minutes each time, searching some 211 million rows. Searching the plaintext file with the Perl script (with a tweaked regex, thanks for that, I don't know wtf I was thinking) takes around 3 minutes.

I ended up dropping the database though, because when I was trying to index it I ran out of HD space. :(

I think I've come up with a better solution than both the ones I had tried anyway, if it works, I'll post it. :)

This post has been edited by tip120: 18 December 2009 - 09:00 AM


#6 aaronx

  • New D.I.C Head

Reputation: 0
  • Posts: 3
  • Joined: 24-January 10

Re: Best way to parse massive amounts of data

Posted 24 January 2010 - 03:50 AM

Hi,

Sorry for the late reply; I've only just started looking around here and thought I'd share my 2 cents.

My background is more from the database (Oracle mainly) world.

Regarding the comments about databases, a few things:
1. Creating an index on data that will return less than 3-5% of the entire data set is usually not worth it
2. The type of index you choose will make a difference; I'm guessing that might be why you ran out of HDD space
3. I don't think MySQL can create function-based indexes, but if your database can, they greatly speed up queries (on an Oracle database at work, we had a query that always looked at the first 6 chars of a string; with around 1.5M rows, a function-based index on substr(colname, 1, 6) cut the cost in half!)
4. LIKE queries are usually not very efficient, so avoid them if possible
5. Are you calculating the "hash" column on the fly? If you are, you could store a materialized view, or store the actual hash at insertion time, as in the sketch below.
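
As a Perl sketch of point 5 (purely illustrative; the table and column names are placeholders, and I haven't run this against MySQL):

#!/usr/local/bin/perl
use strict;
use warnings;
use DBI;
use Digest::MD5;

my $dbh = DBI->connect('dbi:mysql:database=yourdb', 'user', 'password',
                       { RaiseError => 1 });
my $sth = $dbh->prepare('INSERT INTO proglist (filename, hash) VALUES (?, ?)');

my $file = './somefile';
open my $fh, '<', $file or die "Can't open $file: $!";
binmode $fh;
my $hash = Digest::MD5->new->addfile($fh)->hexdigest;   # md5 of the file contents
close $fh;

# The hash is computed once, at insertion time, never at query time.
$sth->execute($file, $hash);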


Hope you found a solution; I'd be keen to hear what it was.

A.


#7 dsherohman

  • Perl Parson

Reputation: 227
  • Posts: 654
  • Joined: 29-March 09

Re: Best way to parse massive amounts of data

Posted 25 January 2010 - 03:54 AM

aaronx, on 24 Jan, 2010 - 10:50 AM, said:

1. Creating an index on data that will return less than 3-5% of the entire data set is usually not worth it

Really? Or did you mean "more" rather than "less"? I'm not a DBA, just a programmer who uses databases a lot, but, logically, it makes more sense to me that an index would be most useful when returning a small portion of the available data (letting you jump straight to the rows you're interested in) and would fade into meaninglessness as you get closer to returning the entire data set (you have to fetch them all anyhow, so it's just another layer of indirection creating more overhead).

#8 baavgai

  • Dreaming Coder

Reputation: 7506
  • Posts: 15,556
  • Joined: 16-October 07

Re: Best way to parse massive amounts of data

Posted 25 January 2010 - 05:58 AM

An index will make a query "where field='Xzy'" orders of magnitude faster. However, "where field like '%Xzy%'" simply will not use that index; it can't. At that point it's pure brute-force searching. If your database consists of something like a table with a single text field per line, you really aren't using it as a database.

Judging from the data shown, fields are separated by whitespace. So, if your structure were somewhat intelligent...

Something like:
Table: TextDoc
   TextDocId (PK)
   Description
   ...

Table: TextDocData
   TextDocId (PK, FK)
   Sequence (PK)
   Token



You could find all occurrences of $input significantly faster than a grep.
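
A lookup against that layout might look something like this (just a sketch, assuming an index on Token and made-up connection details):

#!/usr/local/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=yourdb', 'user', 'password',
                       { RaiseError => 1 });

my $input = 'd41d8cd98f00b204e9800998ecf8427e';

# Exact match on an indexed Token column - no '%...%' scan required.
my $rows = $dbh->selectall_arrayref(
    'SELECT d.TextDocId, d.Description
       FROM TextDoc d
       JOIN TextDocData t ON t.TextDocId = d.TextDocId
      WHERE t.Token = ?',
    undef, $input);

print "Found: @$_\n" for @$rows;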
