13 Replies - 1274 Views - Last Post: 24 September 2018 - 06:43 PM

#1 bobsmith76   User is offline

  • D.I.C Regular

Reputation: 11
  • View blog
  • Posts: 314
  • Joined: 14-February 17

threading not speeding things up

Posted 22 September 2018 - 03:05 AM

I've got a terabyte of data, so it's very important that I figure out how to loop through it quickly. Slatkin, in Effective Python, discusses how threading is sometimes no faster than running the work serially, and then shows how to speed things up. I can't figure out whether 'select' is an essential part of his program or not. Here's what he has:

# Example 6
import select, socket
from threading import Thread
from time import time

# Creating the socket is specifically to support Windows. Windows can't do
# a select call with an empty list.
def slow_systemcall():
    select.select([socket.socket()], [], [], 0.1)


# Example 7
start = time()
for _ in range(5):
    slow_systemcall()
end = time()
print('Took %.3f seconds' % (end - start))


# Example 8
start = time()
threads = []
for _ in range(5):
    thread = Thread(target=slow_systemcall)
    thread.start()
    threads.append(thread)


# Example 9
def compute_helicopter_location(index):
    pass

for i in range(5):
    compute_helicopter_location(i)
for thread in threads:
    thread.join()
end = time()
print('Took %.3f seconds' % (end - start))




And that speeds things up by a factor of 5. I used his approach for my needs: I eliminated the helicopter part, and I also eliminated the 'select' part, though I'm not sure what it does. In any case, my serial code and my threading code are equally fast. Here's a simplified version of what I have:

from threading import Thread

str3 = "iweb_wlp_01_ote/"

threads = []
for i in range(2000):
    thread = Thread(target=build_dict_json, args=(str3, "01/", i))
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()



And the function build_dict_json just loops through a txt file, turns it into a dictionary, and then stores it as a JSON file, nothing fancy.
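
A minimal sketch of that kind of function, just for context (the file naming and the tab-separated line format here are assumptions, not the real code):

import json

def build_dict_json(prefix, sub, i):
    # read one tab-separated text file and write it back out as JSON
    d = {}
    with open('%s%s%d.txt' % (prefix, sub, i)) as f:
        for line in f:
            key, _, value = line.partition('\t')
            d[key] = value.strip()
    with open('%s%s%d.json' % (prefix, sub, i), 'w') as out:
        json.dump(d, out)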

Is This A Good Question/Topic? 0

Replies To: threading not speeding things up

#2 andrewsw   User is offline

  • Entwickler
  • member icon

Reputation: 6604
  • View blog
  • Posts: 26,911
  • Joined: 12-December 12

Re: threading not speeding things up

Posted 23 September 2018 - 12:09 AM

Forums are for questions. You can create a blog for updates. We also have snippets and tutorials if you want to contribute.

Or, if I've misunderstood, then please provide your question.
Was This Post Helpful? 0

#3 bobsmith76   User is offline

  • D.I.C Regular

Reputation: 11
  • View blog
  • Posts: 314
  • Joined: 14-February 17

Re: threading not speeding things up

Posted 23 September 2018 - 12:22 AM

andrewsw, on 23 September 2018 - 12:09 AM, said:

Forums are for questions. You can create a blog for updates. We also have snippets and tutorials if you want to contribute.

Or, if I've misunderstood, then please provide your question.


My question is: why is it not speeding up?
Was This Post Helpful? 0

#4 Salem_c   User is offline

  • void main'ers are DOOMED
  • member icon

Reputation: 2219
  • View blog
  • Posts: 4,302
  • Joined: 30-May 10

Re: threading not speeding things up

Posted 23 September 2018 - 02:16 AM


First, you need to understand what CPU-Bound, Memory-Bound and IO-Bound means.
https://en.wikipedia.../wiki/CPU-bound
https://stackoverflo...-i-o-bound-mean

> I've got a terrabyte of data
So it's likely to be on spinning disks rather than an SSD.
https://en.wikipedia...characteristics
A 4-core 2.5GHz machine is going to be executing one instruction every 10^-10 seconds.
Combined hard disk seek and rotation latencies are 10^-2 seconds.
That's 8 orders of magnitude slower - it's like 1 second compared to 3 YEARS.

> so it's very important that I figure out how to loop through it quickly.
How quickly?

This is about as fast as it's possible to read a file sequentially (just read and ignore).
$ dd if=lubuntu-18.04-desktop-i386.iso of=/dev/null bs=65536
16510+0 records in
16510+0 records out
1081999360 bytes (1.1 GB, 1.0 GiB) copied, 0.165472 s, 6.5 GB/s


If this takes your machine 1 hour to do this, then no amount of threading is going to magically transform the run time into 5 seconds. The task at hand is I/O bound in the worst possible way.

Unless there is some property of the data you can usefully exploit, like an internal record structure that allows you to skip over irrelevant records very quickly, the maxed-out transfer rate of your hard disk is just one elephant in the room.

What are you doing with the data once you've read it? If you store even just 1% of your 1TB of data, that's 10GB of RAM you're going to need (excluding any data structure overheads). If you've got that much RAM, fine. If you don't, you're getting burned twice as the data once read from your file now finds itself back on the hard disk in the swap file.

But if your testing is showing 1 hour just to read the file, but 10 hours to do all the processing, then you MAY have some opportunity to make use of threads.

> for i in range(2000):
>     thread = Thread(target=build_dict_json, args = (str3, "01/", i))
A few points.
1. Just creating a thread has an overhead, so be sure you're doing a meaningful amount of work to offset that.
2. Be very wary of creating more threads than you have cores (see the sketch below). Having lots of threads waiting for things to happen is fine. Having lots of threads waiting for CPU time is just going to burn time in context switches.
3. Unless you have a NUMA architecture, all your busy threads will still compete for the same memory. The various memory caches will help to a point.
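
To make point 2 concrete, a rough sketch (not code from this thread, and untested; build_dict_json is left as a stub for the per-file work from the first post). concurrent.futures lets you cap the worker count at the number of cores, and since CPython's global interpreter lock stops threads from running Python bytecode in parallel, a process pool is what usually buys a CPU-bound speedup:

import os
from concurrent.futures import ProcessPoolExecutor

def build_dict_json(prefix, sub, i):
    pass  # stub: the real per-file work goes here

if __name__ == '__main__':
    str3 = "iweb_wlp_01_ote/"
    # one worker per core; more would just fight over CPU time
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        futures = [pool.submit(build_dict_json, str3, "01/", i)
                   for i in range(2000)]
        for f in futures:
            f.result()  # re-raises any exception raised in a worker
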
Was This Post Helpful? 5

#5 bobsmith76   User is offline

  • D.I.C Regular

Reputation: 11
  • View blog
  • Posts: 314
  • Joined: 14-February 17

Re: threading not speeding things up

Posted 23 September 2018 - 02:37 AM

Salem_c, on 23 September 2018 - 02:16 AM, said:

First, you need to understand what CPU-Bound, Memory-Bound and IO-Bound means.

I'm pretty sure it's CPU bound. It's just looping through a text file and transforming it into a dictionary.


Quote

> so it's very important that I figure out how to loop through it quickly.
How quickly?


It's a corpus of text. So I'd eventually like to index the top 5000 words. I'm hoping I can get that done in less than a week.



Quote

This is about as fast as it's possible to read a file sequentially (just read and ignore).
$ dd if=lubuntu-18.04-desktop-i386.iso of=/dev/null bs=65536
16510+0 records in
16510+0 records out
1081999360 bytes (1.1 GB, 1.0 GiB) copied, 0.165472 s, 6.5 GB/s


If this takes your machine 1 hour to do this, then no amount of threading is going to magically transform the run time into 5 seconds. The task at hand is I/O bound in the worst possible way.

Don't understand what you mean.


Quote

What are you doing with the data once you've read it?

I'm going to collect statistics on how words are used.


Quote

If you store even just 1% of your 1TB of data, that's 10GB of RAM you're going to need (excluding any data structure overheads). If you've got that much RAM, fine. If you don't, you're getting burned twice as the data once read from your file now finds itself back on the hard disk in the swap file.

Don't understand.

Here's a related question:

I read somewhere that only one python program can work its way through the virtual machine at a time. So I figured why not just use two virtual machines at once. I've succeeded in using two virtual machines at once, but the program is just as fast as putting two programs through the same machine. So when I run the program from start to finish it takes 40 seconds, but when I chop it in half and run each half on two separate terminals it only takes 25 seconds. But I'm assuming that both halves are using the same machine. I figure if I can get the halves through different machines then it would speed things up.

Maybe I'm really not putting two programs through one machine. After all, it seems to be too easy. Normally when I try something new with computers it takes 10 hours for it to work, whereas in this case I got it to work on the first try. In any case, here is how my files are organized:


cool_folder > env > bin
my_folder > my_file.py
include
lib > python 3.6 > etc

cool_folder2 > venv > bin
include
lib > python 3.6 > etc
> my_folder2 > my_file2.py


So I run my_file.py and my_file2.py at the same time but they both finish at the same time as when I run my_file.py after waiting just a split second to start the second one.
Was This Post Helpful? 0

#6 baavgai   User is offline

  • Dreaming Coder
  • member icon


Reputation: 7257
  • View blog
  • Posts: 15,138
  • Joined: 16-October 07

Re: threading not speeding things up

Posted 23 September 2018 - 06:22 AM

Threading isn't magic. Indeed, it actually has overhead. Your problem must first lend itself to being broken down into units of work that can be done simultaneously. In your example, you've met that criterion with discrete calls: but they're socket calls!?!

More generally, processes can have bottlenecks. If the bottleneck is in your control, you can fix it. If not...

I can write a web site crawler and make it considerably faster with async calls. However, the fastest web crawler in the world won't convince the web site it's crawling to respond any faster. Indeed, if I spam a site it might even slow down, either burdened with so many simultaneous requests or in automatic response to a potential DoS attack.

The bottleneck in a network request will generally be the latency of the network. Threads can allow you to make more calls at once, but how can they improve network latency, which is where the bulk of your time will usually be spent?
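
To put that in concrete terms, a sketch (the URLs are placeholders): a small thread pool lets several requests sit waiting on the network at once, but no individual response arrives any faster than the site chooses to serve it.

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

urls = ['https://example.com/page%d' % i for i in range(20)]  # placeholder URLs

def fetch(url):
    # each call spends most of its time waiting on the network
    with urlopen(url, timeout=10) as resp:
        return len(resp.read())

# the threads overlap the waiting, not the transfer itself
with ThreadPoolExecutor(max_workers=8) as pool:
    sizes = list(pool.map(fetch, urls))
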
Was This Post Helpful? 0

#7 ndc85430   User is offline

  • I think you'll find it's "Dr"
  • member icon

Reputation: 890
  • View blog
  • Posts: 3,592
  • Joined: 13-June 14

Re: threading not speeding things up

Posted 23 September 2018 - 08:10 AM

bobsmith76, on 23 September 2018 - 10:37 AM, said:

Salem_c, on 23 September 2018 - 02:16 AM, said:

First, you need to understand what CPU-Bound, Memory-Bound and IO-Bound means.

I'm pretty sure it's CPU bound. It's just looping through a text file and transforming it into a dictionary.


This is precisely why you need to read the links that Salem_c posted. Your file resides on a disk, doesn't it? There's a latency associated with reading data from the disk into memory, so your process has to wait for those I/O operations to complete before it can do its work on the data.

Quote

Quote

If you store even just 1% of your 1TB of data, that's 10GB of RAM you're going to need (excluding any data structure overheads). If you've got that much RAM, fine. If you don't, you're getting burned twice as the data once read from your file now finds itself back on the hard disk in the swap file.

Don't understand.


Then you should read about paging.

This post has been edited by ndc85430: 23 September 2018 - 08:11 AM

Was This Post Helpful? 0

#8 Salem_c   User is offline

  • void main'ers are DOOMED
  • member icon

Reputation: 2219
  • View blog
  • Posts: 4,302
  • Joined: 30-May 10

Re: threading not speeding things up

Posted 23 September 2018 - 10:54 AM

I started with 100 copies of the works of Shakespeare, downloaded from Project Gutenberg. The result is a file about 0.5GB in size.

To give some idea of the relative time cost increments:

To just shovel the bytes from one place to another as quickly as possible.
$ time cat all-shakespeare100.txt > /dev/null
real	0m0.102s
user	0m0.004s
sys	0m0.096s



The simplest read a line at a time and split into words.
$ time wc all-shakespeare100.txt
 14968900  95989400 585240400 all-shakespeare100.txt
real	0m6.556s
user	0m6.472s
sys	0m0.080s



Now some simple python.
#!/usr/bin/python
import timeit
import sys

# Just split a line into words, roughly the same
# total number of words as in the file.
def test1():
    totalWords = 0
    for i in range(15998233):   # gives approx same total words in the text
        line = "The Complete Works of William Shakespeare"
        for word in line.split():
            totalWords += 1
    print("test1:Total words={}".format(totalWords))

# Just fetching from a file
def test2():
    totalLines = 0
    with open('all-shakespeare100.txt','r') as f:
        for line in f:
            totalLines += 1
    print("test2:Total lines={}".format(totalLines))

# Delta overhead of a word frequency histogram
histogram = {}
def test3():
    totalWords = 0
    with open('all-shakespeare100.txt','r') as f:
        for line in f:
            for word in line.split():
                try:
                    histogram[word] += 1
                except KeyError:
                    histogram[word] = 1
                totalWords += 1
    print("test3:Total words={}".format(totalWords))
    print("test3:Total unique words={}".format(len(histogram)))
    print("test3:MemUsed={}".format(sys.getsizeof(histogram)))


t1 = timeit.timeit('test1()',setup="from __main__ import test1",number=1)
t2 = timeit.timeit('test2()',setup="from __main__ import test2",number=1)
t3 = timeit.timeit('test3()',setup="from __main__ import test3",number=1)
print("Split only={0:.2f}".format(t1))
print("Read only={0:.2f}".format(t2))
print("Read, split and count={0:.2f}".format(t3))


Results
test1:Total words=95989398
test2:Total lines=14968900
test3:Total words=95989400
test3:Total unique words=75941
test3:MemUsed=3146008
Split only=8.20
Read only=0.88
Read, split and count=22.43


Reading the file (this is from an SSD) is relatively quick in Python, and not that much slower than 'cat'.
Just splitting into words is also comparable to the 'wc' utility.
The biggest part seems to be the histogram - but I'm no Python expert so my usage might be rubbish.
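
(For what it's worth, the standard library has a ready-made counter type; whether it beats the try/except version above would need measuring on the same file.)

from collections import Counter

histogram = Counter()
with open('all-shakespeare100.txt') as f:
    for line in f:
        histogram.update(line.split())   # counts every word on the line

print(histogram.most_common(10))         # the ten most frequent words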

But as a ball-park figure, it seems it could process 1TB in 10 to 15 hours.

> It's a corpus of text. So I'd eventually like to index the top 5000 words.
> I'm hoping I can get that done in less than a week.
So do you have an existing program which according to your testing on smaller data sets will take a month?

Or is this an exercise in premature optimisation?
Programs which run once are seldom worth the effort to optimise. The days you spend trying to optimise it will never buy back the few hours of run-time you might save.

I mean, this thread is a day old, so if you'd just started the first simple code you had when you started the thread, it would likely be already finished.


On a different tack, is this corpus a single file, or lots of files?

Let's say you have split all the corpus files into 4 directories (because you have a 4-core machine). You could then very simply do
program.py set1/* > results1.txt &
program.py set2/* > results2.txt &
program.py set3/* > results3.txt &
program.py set4/* > results4.txt &


You get all the parallelism that's worth having without having to touch a line of code.
You go away and wait for these 4 background processes to complete.

The code to merge the 4 results*.txt files is then a simple step.
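
A sketch of that merge step, assuming each results file holds one 'word count' pair per line (the output format is an assumption):

import glob
from collections import Counter

total = Counter()
for path in glob.glob('results*.txt'):
    with open(path) as f:
        for line in f:
            word, count = line.split()
            total[word] += int(count)

for word, count in total.most_common(5000):
    print(word, count)
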
Was This Post Helpful? 1

#9 bobsmith76   User is offline

  • D.I.C Regular

Reputation: 11
  • View blog
  • Posts: 314
  • Joined: 14-February 17

Re: threading not speeding things up

Posted 23 September 2018 - 03:30 PM

Quote

Programs which run once are seldom worth the effort to optimise. The days you spend trying to optimise it will never buy back the few hours of run-time you might save.
I mean, this thread is a day old, so if you'd just started the first simple code you had when you started the thread, it would likely be already finished.

No, because I'm going to have to loop through it regularly. I'm just trying to find the best way to loop through it.


Quote

On a different tack, is this corpus a single file, or lots of files?

Let's say you have split all the corpus files into 4 directories (because you have a 4-core machine). You could then very simply do
program.py set1/* > results1.txt &
program.py set2/* > results2.txt &
program.py set3/* > results3.txt &
program.py set4/* > results4.txt &


You get all the parallelism that's worth having without having to touch a line of code.
You go away and wait for these 4 background processes to complete.

It's 100 folders, each with about 1000 files. I don't see how this is parallelism, but I don't understand the code you're writing. Are these command lines, since the command line does use &? If the * is some kind of operator that enables parallelism then I would seriously like to know.


Quote

So do you have an existing program which according to your testing on smaller data sets will take a month?
Or is this an exercise in premature optimisation?


Actually I found this code that, to my astonishment, speeds things up by a factor of 3000. I've checked the output several times to make sure it's right, since it just seems too good to be true.


import os

def child(i):
    put_function_here(str3, "01/", i)
    os._exit(0)  # else the child falls back into the parent loop

def parent():
    for i in range(0, 10):
        print(i)
        newpid = os.fork()
        if newpid == 0:
            child(i)
    # note: nothing here waits for the children to finish
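
For comparison, a self-contained sketch of the same fork pattern in which the parent also waits for every child (the sleeping do_work function is just a stand-in for the real per-file work). The parent loop above returns as soon as the forks are launched, so a timer wrapped around it alone measures only the cost of forking, not the children's work:

import os
import time

def do_work(i):
    time.sleep(0.5)  # stand-in for the real per-file processing

def child(i):
    do_work(i)
    os._exit(0)      # make sure the child never re-enters the parent loop

def parent():
    pids = []
    for i in range(10):
        pid = os.fork()
        if pid == 0:          # child branch
            child(i)
        pids.append(pid)      # parent branch: remember the child's pid
    for pid in pids:
        os.waitpid(pid, 0)    # block until every child has exited

start = time.time()
parent()
print('Took %.3f seconds' % (time.time() - start))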



Maybe it's due to this sentence from Lutz's Programming Python:

Quote

Because forking is ingrained in the Unix programming model, this script works well on Unix, Linux, and modern Macs.


And I am using a Mac. In any case, I would seriously like to know why this code is so fast.

Quote

This is precisely why you need to read the links that Salem_c posted. Your file resides on a disk, doesn't it? There's a latency associated with reading data from the disk into memory, so your process has to wait for those I/O operations to complete before it can do its work on the data.

It said that CPU bound refers to operations that are intense on crunching numbers. Since I'm just converting lines of text into a dictionary, I thought that would fall under CPU. In any case, as far as I/O bound goes, I don't see how any computer program is not I/O bound, since all computer programs take an input and return an output.

baavgai, on 23 September 2018 - 06:22 AM, said:

Threading isn't magic. Indeed, it actually has overhead. Your problem must first lend itself to being broken down into units of work that can be done simultaneously. In your example, you've met that criterion with discrete calls: but they're socket calls!?!

I don't know what a socket call is.


Quote

Threads can allow you to make more calls at once, but how can they improve network latency, which is where the bulk of your time will usually be spent?

I'm just opening text files and converting them into a dictionary so I don't see how this relates to network latency.

This post has been edited by bobsmith76: 23 September 2018 - 03:24 PM

Was This Post Helpful? 0

#10 Salem_c   User is offline

  • void main'ers are DOOMED
  • member icon

Reputation: 2219
  • View blog
  • Posts: 4,302
  • Joined: 30-May 10

Re: threading not speeding things up

Posted 23 September 2018 - 10:58 PM

> Are these command lines, since the command line does use &?
Yes, these are command lines.
And yes, the & means that the command is running in the background. The only effect of that is to immediately return a command prompt so you can type in another command.

You could just open multiple console windows and type individual commands into each one, and forego the use of the & at the end of each line.

> If the * is some kind of operator that enables parallelism then I would seriously like to know.
No, it's just filename globbing.
It's a way of specifying a lot of files without having to write out the full filename of every single file.

> It's 100 folders each with about 1000 files.
Now that you know what glob patterns are, can you write, say, 10 different patterns which encompass all the files?

If you can, you can construct a series of command lines to launch many script instances (all running in parallel) on some subset of your entire corpus.
Was This Post Helpful? 0

#11 modi123_1   User is offline

  • Suitor #2
  • member icon



Reputation: 14423
  • View blog
  • Posts: 57,818
  • Joined: 12-June 08

Re: threading not speeding things up

Posted 23 September 2018 - 11:21 PM

Crazy idea, store things in a database and not in 100 folders with 1000 files.
Was This Post Helpful? 0

#12 bobsmith76   User is offline

  • D.I.C Regular

Reputation: 11
  • View blog
  • Posts: 314
  • Joined: 14-February 17

Re: threading not speeding things up

Posted 24 September 2018 - 12:49 AM

Salem_c, on 23 September 2018 - 10:58 PM, said:

> Are these command lines, since the command line does use &?
Yes, these are command lines.
And yes, the & means that the command is running in the background. The only effect of that is to immediately return a command prompt so you can type in another command.

You could just open multiple console windows and type individual commands into each one, and forego the use of the & at the end of each line.


Actually, I tried that method and it only sped things up by 33%. Also, when I used 5 windows instead of 2, I got the same performance. But that's OK, the forking method seems to be the answer.
Was This Post Helpful? 0

#13 astonecipher   User is offline

  • Senior Systems Engineer
  • member icon

Reputation: 2669
  • View blog
  • Posts: 10,657
  • Joined: 03-December 12

Re: threading not speeding things up

Posted 24 September 2018 - 08:41 AM

There is no magic bullet. Whatever you do, you are still bound by what the machine can actually do. So, figure out a different way to organize the data, or get a MUCH better machine to run it against. It doesn't matter if you have 300 programs; the processor, memory, and hard drive have physical limitations on what they can do.
Was This Post Helpful? 0

#14 bobsmith76   User is offline

  • D.I.C Regular

Reputation: 11
  • View blog
  • Posts: 314
  • Joined: 14-February 17

Re: threading not speeding things up

Posted 24 September 2018 - 06:43 PM

Salem_c, on 23 September 2018 - 10:58 PM, said:

And yes, the & means that the command is running in the background. The only effect of that is to immediately return a command prompt so you can type in another command.


I just want to double check that the & operator allows two programs to run in parallel. Because in my early days of programming, before I understood that I could run a program by specifying the full path, I would change the directory to the folder that the program was in, then run the program. And I would use the & to separate the two commands. I suppose putting the change directory command first will still succeed in changing the directory even if the next program runs in parallel, since changing directories happens so quickly.

modi123_1, on 23 September 2018 - 11:21 PM, said:

Crazy idea, store things in a database and not in 100 folders with 1000 files.


I bought the data and that is how the guy organized it. I guess he wanted the data to be as universal as possible, text files being the format that every language can read.

Also I haven't yet got a comment on this observation:

Quote

Here's a related question:

I read somewhere that only one python program can work its way through the virtual machine at a time. So I figured why not just use two virtual machines at once. I've succeeded in using two virtual machines at once, but the program is just as fast as putting two programs through the same machine. So when I run the program from start to finish it takes 40 seconds, but when I chop it in half and run each half on two separate terminals it only takes 25 seconds. But I'm assuming that both halves are using the same machine. I figure if I can get the halves through different machines then it would speed things up.

Maybe I'm really not putting two programs through one machine. After all, it seems to be too easy. Normally when I try something new with computers it takes 10 hours for it to work, whereas in this case I got it to work on the first try. In any case, here is how my files are organized:


cool_folder > env > bin
my_folder > my_file.py
include
lib > python 3.6 > etc

cool_folder2 > venv > bin
include
lib > python 3.6 > etc
> my_folder2 > my_file2.py


So I run my_file.py and my_file2.py at the same time but they both finish at the same time as when I run my_file.py after waiting just a split second to start the second one.


It turns out that I deleted the env folder inside cool_folder, put my_file.py directly in cool_folder, and ran my_file.py, and it ran anyway. So I guess I had not really set up two virtual machines.
Was This Post Helpful? 0
