6 Replies - 386 Views - Last Post: 26 April 2019 - 03:50 AM

#1 bobsmith76   User is offline

  • D.I.C Regular

Reputation: 11
  • View blog
  • Posts: 378
  • Joined: 14-February 17

general speed problem with big data

Posted 24 April 2019 - 12:26 AM

I'm new to working with big data and consequently don't understand much of what I'm doing. I have a data set of 14 billion words of text, and it takes me roughly 7 hours to loop through it all. My main problem is that almost all of the time is sucked up in opening and saving files. It took me roughly as long just to move the 800 gigs of data from one disk to another as it did to loop through all the words and collect stats on them.

I'm going to learn a new language anyway. Right now I'm using Python, which is among the slowest of languages, and I plan on learning C and Java, which are among the fastest, but I have a feeling that C and Java are not going to speed things up, since it seems that 90% of my time is consumed by saving files to the disk.

I've done experiments in upgrading from one computer to another. In one experiment the exact same program ran twice as fast when I upgraded from a 2011 Intel processor to a 2016 Intel processor (I forget the exact specs). So I'm a little skeptical that upgrading hardware will solve things. I'm also almost certainly going to get a cheap laptop for $200 and put Linux on it, so we'll see if that helps. But in any case, I need to find some way to loop through 800 gigs of data in about 1 hour.
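For reference, a stripped-down timing pass over one file would look roughly like this; the file names and the "stats" step are placeholders, not my real script:

import time

t0 = time.perf_counter()
with open("chunk_0001.txt", encoding="utf-8") as f:
    text = f.read()
t1 = time.perf_counter()

stats = [line.split() for line in text.splitlines()]  # stand-in for the real per-word stats
t2 = time.perf_counter()

with open("chunk_0001.out", "w", encoding="utf-8") as out:
    out.write(repr(stats))                            # stand-in for the real save
t3 = time.perf_counter()

print(f"read {t1 - t0:.2f}s  process {t2 - t1:.2f}s  save {t3 - t2:.2f}s")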


Replies To: general speed problem with big data

#2 baavgai   User is offline

  • Dreaming Coder

Reputation: 7449
  • View blog
  • Posts: 15,442
  • Joined: 16-October 07

Re: general speed problem with big data

Posted 24 April 2019 - 04:50 AM

This might be a job for a database. An RDBMS is optimized for a few things: read speed, write speed, indexing.

If saving files is your bottleneck, then you might consider reading from one physical drive and writing to another physical drive. If you are reading and writing heavily on the same drive, your processes are waiting for each other. Note that parallelism won't solve the problem if your issue is drive thrashing; it might even exacerbate it.

Also, your bottleneck could actually be memory. If you're loading a ton of data into memory, then that will not only slow down your entire machine but also thrash the drive that has the paging file on it.
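To make the database idea concrete, here is a minimal SQLite sketch, assuming the stats are simple per-word counts; the table and file names are made up:

import sqlite3
from collections import Counter

conn = sqlite3.connect("words.db")
conn.execute("CREATE TABLE IF NOT EXISTS word_count (word TEXT PRIMARY KEY, n INTEGER)")

# Count one chunk of the corpus, streaming line by line.
counts = Counter()
with open("chunk_0001.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

# One transaction per file instead of rewriting whole files.
# The upsert syntax needs SQLite 3.24 or newer.
conn.executemany(
    "INSERT INTO word_count (word, n) VALUES (?, ?) "
    "ON CONFLICT(word) DO UPDATE SET n = n + excluded.n",
    counts.items(),
)
conn.commit()
conn.close()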

#3 bobsmith76   User is offline

  • D.I.C Regular

Reputation: 11
  • View blog
  • Posts: 378
  • Joined: 14-February 17

Re: general speed problem with big data

Posted 25 April 2019 - 02:06 AM

baavgai, on 24 April 2019 - 11:50 AM, said:

This might be a job for a database. An RDBMS is optimized for a few things: read speed, write speed, indexing.

If saving files is your bottleneck, then you might consider reading from one physical drive and writing to another physical drive. If you are reading and writing heavily on the same drive, your processes are waiting for each other. Note that parallelism won't solve the problem if your issue is drive thrashing; it might even exacerbate it.

Also, your bottleneck could actually be memory. If you're loading a ton of data into memory, then that will not only slow down your entire machine but also thrash the drive that has the paging file on it.


Thanks a lot for your help. I think my problem is just a general lack of knowledge. I might be ok when it comes to coding but I really don't understand issues related to handling big data properly. I don't even know what drive thrashing is. Where do I read up on this issue?

Also, I'm not sure that I am loading a ton of data into memory. Each file is between 10 megs and 100 megs; I open one, process it, then close it. So at any given time I'm pretty sure that I do not have all that much memory being used.
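For what it's worth, a quick check like this would tell me how much memory one file's worth of processing actually peaks at (the path is a placeholder):

import tracemalloc

tracemalloc.start()

with open("chunk_0001.txt", encoding="utf-8") as f:
    rows = [line.split() for line in f]   # roughly what I do per file

current, peak = tracemalloc.get_traced_memory()
print(f"current {current / 1e6:.1f} MB, peak {peak / 1e6:.1f} MB")
tracemalloc.stop()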

What about upgrading hardware? I figure that since this project is essentially my life right now, I might as well go big and maybe invest in some technology in the price range of $2000. However, I've upgraded hardware before and it only sped things up by a factor of 2.

#4 modi123_1   User is online

  • Suitor #2

Reputation: 15113
  • View blog
  • Posts: 60,480
  • Joined: 12-June 08

Re: general speed problem with big data

Posted 25 April 2019 - 07:01 AM

Quote

I don't even know what drive thrashing is. Where do I read up on this issue?

The Googles! http://lmgtfy.com/?q=drive+thrashing :D

Quote

I open one, process it, then close it. So at any given time I'm pretty sure that I do not have all that much memory being used.

What does "process it" mean? Is something being stored in memory afterwards?

#5 bobsmith76   User is offline

  • D.I.C Regular

Reputation: 11
  • View blog
  • Posts: 378
  • Joined: 14-February 17

Re: general speed problem with big data

Posted 26 April 2019 - 12:11 AM

"Process it" means just performing some simple operation on the info, such as changing a row in a text file to a Python list. This is followed by saving it to a pickle.
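In code, one pass is roughly this (the file names are simplified, not the real ones):

import pickle

rows = []
with open("chunk_0001.txt", encoding="utf-8") as f:
    for line in f:
        rows.append(line.split())          # one text row -> one Python list

with open("chunk_0001.pkl", "wb") as out:
    pickle.dump(rows, out, protocol=pickle.HIGHEST_PROTOCOL)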

This post has been edited by andrewsw: 26 April 2019 - 12:52 AM
Reason for edit: removed previous quote


#6 andrewsw   User is offline

  • never lube your breaks

Reputation: 6798
  • View blog
  • Posts: 28,102
  • Joined: 12-December 12

Re: general speed problem with big data

Posted 26 April 2019 - 12:53 AM

Note that there is a Reply button further down the page; it is not necessary to quote the previous post in full.

#7 baavgai   User is offline

  • Dreaming Coder

Reputation: 7449
  • View blog
  • Posts: 15,442
  • Joined: 16-October 07

Re: general speed problem with big data

Posted 26 April 2019 - 03:50 AM

bobsmith76, on 26 April 2019 - 02:11 AM, said:

changing a row in a text file to a Python list.

Potentially expensive.

bobsmith76, on 26 April 2019 - 02:11 AM, said:

This is followed by saving it to a pickle.

Almost certainly more expensive than it needs to be.

Also, related to memory issues, google "python pickle memory leak".

This also has interesting implications:

Quote

The pickle module keeps track of the objects it has already serialized, so that later references to the same object won’t be serialized again.
-- https://docs.python....ary/pickle.html
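A tiny illustration of that memo behaviour: an object referenced twice is written once and comes back as a shared object.

import pickle

row = ["alpha", "beta"]
data = [row, row]                  # the same list object referenced twice

restored = pickle.loads(pickle.dumps(data))
print(restored[0] is restored[1])  # True: the memo kept the shared reference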

