Most efficient way to go through a file
Page 1 of 114 Replies - 1595 Views - Last Post: 07 January 2013 - 06:17 PM
#1
Most efficient way to go through a file
Posted 06 January 2013 - 03:30 PM
I am thinking loading the whole thing each time would be a bit bunch but if I want to move between quotes I need to read it all out into a list. Would that be best? Read the entire file to a list then randomly choose from the list or query the file each time?
Replies To: Most efficient way to go through a file
#2
Re: Most efficient way to go through a file
Posted 06 January 2013 - 03:37 PM
from random import shuffle
with open(filename) as f:
lines = f.readlines()
shuffle(lines)
print(lines.pop()) #prints a random line and removes it from the list of lines
EDIT:
From the requirements you listed, this should work. However, give us more detail on the application and I can probably give you a better solution :-P
This post has been edited by atraub: 06 January 2013 - 03:43 PM
#3
Re: Most efficient way to go through a file
Posted 06 January 2013 - 04:13 PM
f.readlines() may cause problems if the file is large. Or maybe i am talking rubbish.
Remember that you can always test the performance of a method using timeit module, i.e.
import timeit timerObject = timeit.Timer(stmt = "", setup = "from __main__ import method") totalTime = timerObject.timeit(number = 1000)
This post has been edited by darek9576: 06 January 2013 - 04:17 PM
#4
Re: Most efficient way to go through a file
Posted 06 January 2013 - 07:31 PM
atraub, on 06 January 2013 - 05:37 PM, said:
from random import shuffle
with open(filename) as f:
lines = f.readlines()
shuffle(lines)
print(lines.pop()) #prints a random line and removes it from the list of lines
EDIT:
From the requirements you listed, this should work. However, give us more detail on the application and I can probably give you a better solution :-P
I just want a small windows application to pull random quotes from a file and display them. I think your solution will work to prevent the same from being displayed mroe than once per time the application is started. However I will need to test the length of the list and if zero then reread the file. Or have two lists.
#5
Re: Most efficient way to go through a file
Posted 06 January 2013 - 11:24 PM
find a random number between 0 and the length of the file in bytes
have the file object set it's pointer to that spot
find the end of the current line
read the next line
example:
>>> import os >>> os.path.getsize(filename) 117635 >>> import random >>> spot = random.randint(0,os.path.getsize(filename)) >>> spot 116192 >>> f.seek(spot) >>> f.readline() /\/\/\/\/\/\/\/\/ >>> rand_line = f.readline()
1 possible gotcha is if "spot" is in the last line of your file, in which case you'll need a new spot.
#6
Re: Most efficient way to go through a file
Posted 07 January 2013 - 04:24 AM
If you want to pick a random line with equal probability and you don't want to/can't read the file into memory, you'll probably have to do something like this:
- Iterate over the lines to count them
- Pick a random number between 0 and the number of lines (exclusive)
- Iterate over the lines again with an index and return the current line when the index equals the random number
To avoid picking duplicate lines, you'd need to keep a list of lines that you have already picked. So every time you choose a line number, you check whether its in that list. If it is, pick another one.
Alternatively to iterating over the file twice, you could store the starting indices of each line in a list when you iterate the first time. Then after picking a random number, you can directly seek to the starting index of that line. This will consume more memory than the first approach though (one integer per line in the file - so still significantly less memory than reading the whole file assuming the lines aren't ultra short), so you'd be trading runtime for memory consumption here.
Using this approach you can avoid picking duplicates by removing a line's entry from the index list after picking it.
PS: The most efficient way as far as runtime goes is to read the whole file into memory using readlines as atraub initially suggested. Only if the file is too large to fit into memory, do the other alternatives make sense.
This post has been edited by sepp2k: 07 January 2013 - 04:30 AM
#7
Re: Most efficient way to go through a file
Posted 07 January 2013 - 06:16 AM
If you're going to read all the lines to count them, then read again for the seek, you're in the worse case scenario for speed, but memory won't be clobbered by super large files.
If you read all the lines into memory, then pick from that list, memory is hit hardest, but you'll win on speed as long as you have it.
I like to split the difference between the two: Read all the lines, keep that running count. However, for each line read, you do a random call based on the total read thus far. If you roll 0, keep the line. This should give you an random distribution across the file.
Best case for a flat file? Make an index file. The index file simply consists of the starting positions of each line in the data file. Read that into memory and randomly pick. Use the position to seek the line in the file. You can read the index once and remove values that you've used, for that effect you want.
Hope this helps.
#8
Re: Most efficient way to go through a file
Posted 07 January 2013 - 06:52 AM
baavgai, on 07 January 2013 - 02:16 PM, said:
When you say "keep the line" do you mean "select that line as your randomly chosen line"? If so, won't that always select the first line? Or at least select early lines with a much higher probability than later ones? Because the total read so far after the first line will be 1 and randrange(1) will always give you 0, while randint(0,1) will still give you a 50% chance of 0 - and if the first line has a 50% chance of being picked that's clearly not an equal distribution if there are more than 2 lines.
#9
Re: Most efficient way to go through a file
Posted 07 January 2013 - 07:31 AM
sepp2k, on 07 January 2013 - 08:52 AM, said:
Granted, it's initially counter intuitive. Sort of. Rather than thinking of it as a connected set, think of each random choosing as unconnected. Yes, the first item get's picked first. Then the next item has a 50% chance of knocking it off. The next, 1/3 chance and so on. Each item should have the same chance of being picked.
That is, I think. I don't have a mathematical proof handy, but I do have a computer.
Test program:
import random
def randomPick1(items):
index = random.randint(0,len(items)-1)
return (index, items[index])
def randomPick2(items):
count = 0
(pickIndex, pickItem) = (0, None)
for item in items:
pick = random.randint(0,count)
if pick==0:
(pickIndex, pickItem) = (count, item)
count += 1
return (pickIndex, pickItem)
def doTest(name, items, testSize, randPick):
hits = [ 0 for i in range(len(items)) ]
for i in range(testSize):
(index, item) = randPick(items)
hits[index] += 1
print(name + ': ' + ' '.join([ str(i) for i in hits ]))
def main():
items = [ i+100 for i in range(10) ]
testSize = 1000
doTest("Rand from list", items, testSize, randomPick1)
doTest("Rand from iter", items, testSize, randomPick2)
main()
Test results:
Rand from list: 97 99 94 105 108 107 93 97 93 107 Rand from iter: 121 101 88 108 106 107 90 83 100 96 Rand from list: 92 106 82 109 87 121 98 110 92 103 Rand from iter: 98 97 116 99 90 100 100 118 80 102 Rand from list: 107 92 90 103 113 90 94 95 113 103 Rand from iter: 109 95 90 91 96 104 96 104 100 115
Only eyeballing, it looks reasonably sound. I'd be happy for confirmation either way, I'm just running on intuition and what my computer tells me.
#10
Re: Most efficient way to go through a file
Posted 07 January 2013 - 07:46 AM
#11
Re: Most efficient way to go through a file
Posted 07 January 2013 - 08:47 AM
#12
Re: Most efficient way to go through a file
Posted 07 January 2013 - 10:55 AM
Try something like:
SELECT * FROM quotes ORDER BY RANDOM() LIMIT 1;
#13
Re: Most efficient way to go through a file
Posted 07 January 2013 - 12:02 PM
import random, sqlite3
quotelist = []
conn = sqlite3.connect('quotes.sqlite3')
with conn:
cur = conn.cursor()
cur.execute("select * from QUOTES")
rows = cur.fetchall()
for r in rows:
quotelist.append(r[1])
cur.close()
print(random.choice(quotelist))
#14
Re: Most efficient way to go through a file
Posted 07 January 2013 - 01:39 PM
Right, the first code I offered would be for something like:
def init(conn):
conn.executescript('''
create table foo (
foo_id integer primary key,
msg text unique
);
insert into foo(msg) values('Alpha');
insert into foo(msg) values('Bravo');
insert into foo(msg) values('Charlie');
insert into foo(msg) values('Delta');
insert into foo(msg) values('Echo');
insert into foo(msg) values('Foxtrot');
insert into foo(msg) values('Golf');
insert into foo(msg) values('Hotel');
insert into foo(msg) values('India');
insert into foo(msg) values('Juliet');
''')
def randomPick(conn):
cur = conn.cursor()
cur.execute('select msg from foo order by random() limit 1')
result = cur.fetchone()[0]
cur.close()
return result
conn = sqlite3.connect(":memory:")
init(conn)
for i in range(10):
print randomPick(conn)
It doesn't to any dup prevention, but it's simple.
Another option might be to just leave a cursor open:
def randomPickWithState(conn):
cur = conn.cursor()
cur.execute('select msg from foo order by random()')
row = cur.fetchone()
while row:
yield row[0]
row = cur.fetchone()
cur.close()
conn = sqlite3.connect(":memory:")
init(conn)
gen = randomPickWithState(conn)
for item in gen:
print item
This could get messy, though. If that cursor collapses, it's game over.
A lookup table for a given user would be the most reliable, if more complex.
def init(conn):
conn.executescript('''
create table foo (
foo_id integer primary key,
msg text unique
);
insert into foo(msg) values('Alpha');
insert into foo(msg) values('Bravo');
insert into foo(msg) values('Charlie');
insert into foo(msg) values('Delta');
insert into foo(msg) values('Echo');
insert into foo(msg) values('Foxtrot');
insert into foo(msg) values('Golf');
insert into foo(msg) values('Hotel');
insert into foo(msg) values('India');
insert into foo(msg) values('Juliet');
create table session (
seq integer primary key,
foo_id integer
);
''')
def getSeqMsg(conn):
cur = conn.cursor()
cur.execute('''
select a.seq, b.msg
from session a
left outer join foo b on a.foo_id=b.foo_id
where a.seq = ( select min(seq) from session );
''')
row = cur.fetchone()
cur.close()
if not row:
conn.executescript('''
delete from session;
insert into session(foo_id)
select foo_id
from foo
order by random();
''')
row = getSeqMsg(conn)
return row
def randomPick(conn):
(seq, msg) = getSeqMsg(conn)
cur = conn.cursor()
cur.execute('delete from session where seq = ?;', (seq,))
cur.close()
return msg
conn = sqlite3.connect(":memory:")
init(conn)
for i in range(20):
print randomPick(conn)
This has the advantage of state. You will never get a repeat until you're done with a list, even on reboot.
#15
Re: Most efficient way to go through a file
Posted 07 January 2013 - 06:17 PM
|
|

New Topic/Question
Reply




MultiQuote





|