
#1 saintmagician

Good data structure / library to use for analysing LOTS of data

Posted 14 April 2011 - 08:03 PM

Hey, I'm not sure if I'm posting on the right board. I need some generic pointers about how to tackle my problem. I picked the C# board because at the moment, I'm looking to code in C#. This doesn't have to be the case, so no need to restrict any suggestions to C# or OO languages.

I've got tuples of (game, card, location).
Game is a number between 1 and 100k
Card is a number between 1 and 144
Location is a number between 1 and 40
The values for 'game' and 'location' uniquely define the tuple (i.e. each tuple has a unique combination of game value and location value).

I have about 4 million of these tuples. I need to analyze this data, and do things such as:
- For a pair of cards 'a' and 'b', how often do they appear with the same 'game' value? e.g. The pair of tuples (100, a, -) and (100, b, -) would be one such occurrence. Ideally I'd like to do this for about 10k different possible pairs of cards.
- For a given card 'a', how often does it appear with a given location value 'b'?
- For a group of 6 cards, is it true that at least 2 of them always appear in the same game? (i.e. for all tuples with a particular 'game' value, do there exist at least two tuples whose card values belong to that group?)

The above are just examples of the kinds of things I need to do with my information. I don't really need help implementing that.

What I need help with is how to store the data (the 4 million tuples) and how to query it, e.g. count how many tuples have card=x, or how many have (game=x)&&(card=y).

I'm familiar with a range of programming paradigms and happy to do research into anything new. But with no particular experience handling lots of data, databases, etc., I'm just not too sure where to start. I'm hoping someone can provide some experienced pointers as to what kind of data structures I should be looking at, any existing libraries or packages that could help me, etc.


Replies To: Good data structure / library to use for analysing LOTS of data

#2 Curtis Rutland

Posted 14 April 2011 - 08:06 PM

Quote

What I need help with, is some suggestions on how to store the data (the 4 million tuples) and how to query the data


How do you currently have it stored?

If I were to do this in C#, I'd probably use LINQ to Objects...though 4 million pieces of data might be a little much to hold in memory at once.

#3 saintmagician

Posted 14 April 2011 - 08:11 PM

Well, currently I'm still collecting the data, so it's stored in txt files generated by a script.

I'm just thinking ahead, because once I started collecting the data, I realized how much of it I would have. Then I realized I had no idea how I would store the data in memory or how I'd go about querying it.

The raw data is a total of about 12 million int values.

This post has been edited by saintmagician: 14 April 2011 - 08:12 PM


#4 Momerath

Posted 14 April 2011 - 08:46 PM

12 million int values is only 48MB, easy to hold in memory all at once. Just create a struct with the three fields.
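Here is a minimal sketch of such a struct, assuming the value ranges given in the first post. The names `GameTuple` and `TupleStore` are made up for illustration. With three `int` fields each tuple is 12 bytes, which is where the 48MB figure for 4 million tuples comes from; Card and Location would also fit in a `byte` if you wanted to shrink it further.

```csharp
using System;

// Hypothetical name; the fields mirror the (game, card, location) tuple.
public struct GameTuple
{
    public int Game;      // 1 .. 100,000
    public int Card;      // 1 .. 144
    public int Location;  // 1 .. 40

    public GameTuple(int game, int card, int location)
    {
        Game = game;
        Card = card;
        Location = location;
    }
}

public static class TupleStore
{
    // Three 4-byte ints per tuple.
    public const int BytesPerTuple = 3 * sizeof(int);

    // Rough size in bytes of an array holding `count` tuples
    // (ignores the small fixed overhead of the array object itself).
    public static long EstimateBytes(long count)
    {
        return count * BytesPerTuple;
    }
}
```

A plain `GameTuple[]` of 4 million elements is then one contiguous block of about 48 million bytes.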

As for the rest, the right approach and data structure really depend on what you need to do with the data.

I wrote some test code, and for 4 million tuples it took 50 milliseconds to answer this one for 10,000 pairs: "For a pair of cards 'a' and 'b', how often do they appear with the same 'game' value?" It will probably take you longer, as I was just counting the pairs; I didn't do anything with the count after I computed it. This was done on a quad-core AMD using the TPL (Task Parallel Library).
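For reference, one way to count pairs could look like the sketch below. This is not the test code mentioned above (that wasn't posted); it's a single-threaded illustration with made-up names (`PairCounter`, `CountGamesPerPair`), and it assumes "how often" means "in how many games do both cards appear". It uses C# 7 tuple syntax. Since each game has at most 40 tuples (one per location), a game contributes at most 780 card pairs, so building the full 144 x 144 table in one pass is cheap, and any of the 10k pairs is then a simple lookup.

```csharp
using System;
using System.Collections.Generic;

public static class PairCounter
{
    public const int MaxCard = 144;

    // Returns counts[a, b] = number of games that contain both card a and card b.
    public static int[,] CountGamesPerPair(
        IEnumerable<(int Game, int Card, int Location)> tuples)
    {
        // Step 1: collect the distinct cards seen in each game.
        var cardsByGame = new Dictionary<int, HashSet<int>>();
        foreach (var t in tuples)
        {
            HashSet<int> seen;
            if (!cardsByGame.TryGetValue(t.Game, out seen))
            {
                seen = new HashSet<int>();
                cardsByGame[t.Game] = seen;
            }
            seen.Add(t.Card);
        }

        // Step 2: for every game, bump the counter of each pair of its cards.
        var counts = new int[MaxCard + 1, MaxCard + 1];
        foreach (var gameCards in cardsByGame.Values)
        {
            var list = new List<int>(gameCards);
            for (int i = 0; i < list.Count; i++)
            {
                for (int j = i + 1; j < list.Count; j++)
                {
                    counts[list[i], list[j]]++;
                    counts[list[j], list[i]]++; // keep the table symmetric
                }
            }
        }
        return counts;
    }
}
```

After building the table once, `counts[a, b]` answers the question for any pair without rescanning the data.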

This post has been edited by Momerath: 14 April 2011 - 09:17 PM


#5 Curtis Rutland

Posted 14 April 2011 - 09:10 PM

Well, if that's all it is, then yes, you can load it all up in memory. The IO might take a bit of time, but you can easily hold it.

Then LINQ will very much be your friend. It uses a pretty direct syntax. For example, let's say you have a data structure like this:

public class DataPoint{
  public int Game {get;set;}
  public int Card {get;set;}
  public int Location {get;set;}
}


And you had a list like this:

List<DataPoint> points = new List<DataPoint>();
//fill the list through some means


You'd be able to perform a query like so:

var query = from p in points
            group p by p.Game into g
            let cards = g.Select(x => x.Card)
            where cards.Contains(1) && cards.Contains(2)
            select g.Key;



Now query contains (or will once it's evaluated; LINQ uses deferred execution) the Game value of every game that contains both card 1 and card 2.

You can perform many queries like this, and the deferred execution is great: it takes almost no runtime to construct the query, which is only enumerated as needed. The only time query would actually be evaluated is when you start looping through it (or call some other enumerating method, like Count or All). Keep in mind that it's re-evaluated each time you enumerate it, so call ToList() if you need the results more than once.
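To tie that back to the questions in the first post, here's a rough sketch of how three of them might look in LINQ to Objects. The `Queries` class and its method names are hypothetical, and `DataPoint` is repeated only so the snippet compiles on its own. These are written for readability rather than speed; on 4 million items you'd want to measure before relying on them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Same shape as the DataPoint class above.
public class DataPoint
{
    public int Game { get; set; }
    public int Card { get; set; }
    public int Location { get; set; }
}

public static class Queries
{
    // Games containing both cards. ToList() forces evaluation once,
    // so later loops over the result do not re-run the grouping.
    public static List<int> GamesWithBoth(IEnumerable<DataPoint> points, int a, int b)
    {
        var query = from p in points
                    group p by p.Game into g
                    let cards = g.Select(x => x.Card)
                    where cards.Contains(a) && cards.Contains(b)
                    select g.Key;
        return query.ToList();
    }

    // How often a given card appears at a given location.
    public static int CountCardAtLocation(IEnumerable<DataPoint> points, int card, int location)
    {
        return points.Count(p => p.Card == card && p.Location == location);
    }

    // True if every game has at least `min` tuples whose card is in cardGroup.
    public static bool EveryGameHasAtLeast(
        IEnumerable<DataPoint> points, ICollection<int> cardGroup, int min)
    {
        return points.GroupBy(p => p.Game)
                     .All(g => g.Count(p => cardGroup.Contains(p.Card)) >= min);
    }
}
```

Passing a `HashSet<int>` as `cardGroup` keeps the membership check cheap.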

#6 saintmagician

Posted 14 April 2011 - 09:11 PM

I've never used LINQ before, but from what you describe, it does seem like it will meet my requirements. I'll take a closer look at it. Thanks!

#7 Curtis Rutland

Posted 15 April 2011 - 12:13 PM

For anyone interested:

This topic spawned a large discussion, which got a bit off topic. If you're interested in reading/taking part, that discussion has been moved here:

http://www.dreaminco...of-data%26quot/
