14 Replies - 6450 Views - Last Post: 07 January 2013 - 06:17 PM Rate Topic: -----

#1 blank_program  Icon User is offline

  • D.I.C Regular
  • member icon

Reputation: 11
  • View blog
  • Posts: 284
  • Joined: 22-July 09

Most efficient way to go through a file

Posted 06 January 2013 - 03:30 PM

I have an idea for a small program that will display random lines from a text file. A database or XML would be far too heavy for this project; however, the file may be large. I want to read out a single line and display it, but it has to be random. For now it doesn't matter how many times a quote has been displayed, but in the future I may add that, so maybe SQLite would be good; I am unsure. For simplicity, a flat text file will work for now.

I am thinking that loading the whole thing each time would be a bit much, but if I want to move between quotes I need to read it all out into a list. Would that be best? Read the entire file into a list and then choose randomly from the list, or query the file each time?

Is This A Good Question/Topic? 0

Replies To: Most efficient way to go through a file

#2 atraub  Icon User is offline

  • Pythoneer
  • member icon

Reputation: 759
  • View blog
  • Posts: 2,010
  • Joined: 23-December 08

Re: Most efficient way to go through a file

Posted 06 January 2013 - 03:37 PM

This is pseudo code, but something like this would probably work
from random import shuffle
with open(filename) as f:
    lines = f.readlines()
shuffle(lines)

print(lines.pop()) #prints a random line and removes it from the list of lines



EDIT:
From the requirements you listed, this should work. However, give us more detail on the application and I can probably give you a better solution :-P

This post has been edited by atraub: 06 January 2013 - 03:43 PM

Was This Post Helpful? 1

#3 darek9576  Icon User is offline

  • D.I.C Lover

Reputation: 198
  • View blog
  • Posts: 1,693
  • Joined: 13-March 10

Re: Most efficient way to go through a file

Posted 06 January 2013 - 04:13 PM

Does Python support reading a random line of the file?

f.readlines() may cause problems if the file is large. Or maybe I am talking rubbish.

Remember that you can always test the performance of a method using the timeit module, e.g.:

import timeit
timerObject = timeit.Timer(stmt="method()", setup="from __main__ import method")
totalTime = timerObject.timeit(number=1000)



This post has been edited by darek9576: 06 January 2013 - 04:17 PM

Was This Post Helpful? 0

#4 blank_program  Icon User is offline

  • D.I.C Regular
  • member icon

Reputation: 11
  • View blog
  • Posts: 284
  • Joined: 22-July 09

Re: Most efficient way to go through a file

Posted 06 January 2013 - 07:31 PM

atraub, on 06 January 2013 - 05:37 PM, said:

This is pseudo code, but something like this would probably work
from random import shuffle
with open(filename) as f:
    lines = f.readlines()
shuffle(lines)

print(lines.pop()) #prints a random line and removes it from the list of lines



EDIT:
From the requirements you listed, this should work. However, give us more detail on the application and I can probably give you a better solution :-P

I just want a small Windows application to pull random quotes from a file and display them. I think your solution will work to prevent the same quote from being displayed more than once per run of the application. However, I will need to test the length of the list and, if it's zero, reread the file. Or keep two lists.
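A minimal sketch of that refill idea — a generator that reshuffles whenever the pool runs dry (the quotes list here is only illustrative):

```python
import random

def quote_cycler(lines):
    """Yield lines in random order forever, reshuffling once every line has been shown."""
    pool = []
    while True:
        if not pool:                 # list exhausted: refill and reshuffle
            pool = list(lines)
            random.shuffle(pool)
        yield pool.pop()

quotes = ["alpha", "beta", "gamma"]
cycler = quote_cycler(quotes)
first_round = {next(cycler) for _ in range(len(quotes))}  # each quote exactly once
```

The single list plus refill replaces the two-list bookkeeping: nothing repeats within a round, and the generator never runs out.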
Was This Post Helpful? 0

#5 atraub  Icon User is offline

  • Pythoneer
  • member icon

Reputation: 759
  • View blog
  • Posts: 2,010
  • Joined: 23-December 08

Re: Most efficient way to go through a file

Posted 06 January 2013 - 11:24 PM

Well, no. To my knowledge, Python does not natively support jumping to the nth line, but if you're clever there are some workarounds. I can think of one off the top of my head... but I don't particularly like it. As you read this, keep in mind that by sharing it I am in no way suggesting I like this approach or think it's a particularly good idea. I think an SQLite database is probably better for this sort of thing at this point. That said, this would stay in line with your requirements:


  • Find a random number between 0 and the length of the file in bytes
  • Have the file object set its pointer to that spot
  • Read to the end of the current line (and discard it, since you probably landed mid-line)
  • Read the next line
example:
>>> import os
>>> os.path.getsize(filename)
117635
>>> import random
>>> spot = random.randint(0,os.path.getsize(filename))
>>> spot
116192
>>> f = open(filename)
>>> f.seek(spot)
>>> f.readline()  # discard the rest of the line we landed in
/\/\/\/\/\/\/\/\/
>>> rand_line = f.readline()


One possible gotcha: if "spot" lands in the last line of your file, there is no next line to read, and you'll need to pick a new spot.
Was This Post Helpful? 0

#6 sepp2k  Icon User is online

  • D.I.C Lover
  • member icon

Reputation: 2151
  • View blog
  • Posts: 3,308
  • Joined: 21-June 11

Re: Most efficient way to go through a file

Posted 07 January 2013 - 04:24 AM

Note that using that approach, lines that come after long lines will be more likely to be picked than lines coming after short lines. And the first line will never be picked (unless you change the logic so that if the random spot lands on the last line, it chooses the first line instead of picking a new spot).

If you want to pick a random line with equal probability and you don't want to/can't read the file into memory, you'll probably have to do something like this:

  • Iterate over the lines to count them
  • Pick a random number between 0 and the number of lines (exclusive)
  • Iterate over the lines again with an index and return the current line when the index equals the random number
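A rough sketch of those three steps, using an in-memory file so the example is self-contained (a real file object opened for reading would work the same way):

```python
import io
import random

def random_line_two_pass(f):
    """Pick one line uniformly at random without holding the whole file in memory."""
    count = sum(1 for _ in f)            # pass 1: count the lines
    target = random.randrange(count)     # random index in [0, count)
    f.seek(0)
    for index, line in enumerate(f):     # pass 2: walk back to that line
        if index == target:
            return line.rstrip("\n")

f = io.StringIO("one\ntwo\nthree\n")
line = random_line_two_pass(f)
```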


To avoid picking duplicate lines, you'd need to keep a list of lines that you have already picked. So every time you choose a line number, you check whether it's in that list. If it is, pick another one.

Alternatively to iterating over the file twice, you could store the starting indices of each line in a list when you iterate the first time. Then after picking a random number, you can directly seek to the starting index of that line. This will consume more memory than the first approach though (one integer per line in the file - so still significantly less memory than reading the whole file assuming the lines aren't ultra short), so you'd be trading runtime for memory consumption here.

Using this approach you can avoid picking duplicates by removing a line's entry from the index list after picking it.
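That offset-index idea might look like this sketch; popping the chosen offset is what prevents repeats. (For a real file you'd open it in binary mode so the recorded offsets are byte-accurate.)

```python
import io
import random

def line_offsets(f):
    """One pass over the file, recording the position where each line starts."""
    f.seek(0)
    offsets, pos = [], 0
    for line in f:
        offsets.append(pos)
        pos += len(line)
    return offsets

def pop_random_line(f, offsets):
    """Seek straight to a random line start; drop that offset so the line can't repeat."""
    pos = offsets.pop(random.randrange(len(offsets)))
    f.seek(pos)
    return f.readline().rstrip("\n")

f = io.StringIO("one\ntwo\nthree\n")
offsets = line_offsets(f)
picked = {pop_random_line(f, offsets) for _ in range(3)}
```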

PS: The most efficient way as far as runtime goes is to read the whole file into memory using readlines as atraub initially suggested. Only if the file is too large to fit into memory, do the other alternatives make sense.

This post has been edited by sepp2k: 07 January 2013 - 04:30 AM

Was This Post Helpful? 0

#7 baavgai  Icon User is online

  • Dreaming Coder
  • member icon

Reputation: 5908
  • View blog
  • Posts: 12,814
  • Joined: 16-October 07

Re: Most efficient way to go through a file

Posted 07 January 2013 - 06:16 AM

You can go to a random position in a file, but since your lines are variable length, it's hard to know where to go...

If you're going to read all the lines to count them, then read them again for the seek, you're in the worst-case scenario for speed, but memory won't be clobbered by super-large files.

If you read all the lines into memory, then pick from that list, memory is hit hardest, but you'll win on speed as long as you have it.

I like to split the difference between the two: read all the lines, keeping a running count. However, for each line read, you do a random call based on the total read thus far. If you roll 0, keep the line. This should give you a random distribution across the file.

Best case for a flat file? Make an index file. The index file simply consists of the starting position of each line in the data file. Read that into memory and pick randomly. Use the position to seek to the line in the data file. You can read the index once and remove values that you've used, if you want that no-repeat effect.
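A sketch of that index-file scheme, with hypothetical file names; each index entry is the starting byte offset of one line in the data file:

```python
import os
import random
import struct
import tempfile

def build_index(data_path, index_path):
    """Write each line's starting byte offset as a little-endian 8-byte integer."""
    with open(data_path, "rb") as data, open(index_path, "wb") as idx:
        pos = 0
        for line in data:
            idx.write(struct.pack("<q", pos))
            pos += len(line)

def random_line(data_path, index_path):
    """Read the (small) index into memory, pick an offset, seek, read one line."""
    with open(index_path, "rb") as idx:
        raw = idx.read()
    offsets = struct.unpack("<%dq" % (len(raw) // 8), raw)
    with open(data_path, "rb") as data:
        data.seek(random.choice(offsets))
        return data.readline().decode().rstrip("\n")

tmp = tempfile.mkdtemp()
data_path = os.path.join(tmp, "quotes.txt")
index_path = os.path.join(tmp, "quotes.idx")
with open(data_path, "wb") as f:
    f.write(b"one\ntwo\nthree\n")
build_index(data_path, index_path)
picked = random_line(data_path, index_path)
```

Keeping the offsets in a list and deleting each one as it's used would give the no-repeat behavior described above.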

Hope this helps.
Was This Post Helpful? 0

#8 sepp2k  Icon User is online

  • D.I.C Lover
  • member icon

Reputation: 2151
  • View blog
  • Posts: 3,308
  • Joined: 21-June 11

Re: Most efficient way to go through a file

Posted 07 January 2013 - 06:52 AM

baavgai, on 07 January 2013 - 02:16 PM, said:

I like to split the difference between the two: Read all the lines, keep that running count. However, for each line read, you do a random call based on the total read thus far. If you roll 0, keep the line. This should give you a random distribution across the file.


When you say "keep the line" do you mean "select that line as your randomly chosen line"? If so, won't that always select the first line? Or at least select early lines with a much higher probability than later ones? Because the total read so far after the first line will be 1 and randrange(1) will always give you 0, while randint(0,1) will still give you a 50% chance of 0 - and if the first line has a 50% chance of being picked that's clearly not an equal distribution if there are more than 2 lines.
Was This Post Helpful? 0

#9 baavgai  Icon User is online

  • Dreaming Coder
  • member icon

Reputation: 5908
  • View blog
  • Posts: 12,814
  • Joined: 16-October 07

Re: Most efficient way to go through a file

Posted 07 January 2013 - 07:31 AM

sepp2k, on 07 January 2013 - 08:52 AM, said:

If so, won't that always select the first line? Or at least select early lines with a much higher probability than later ones?


Granted, it's initially counterintuitive. Sort of. Rather than thinking of it as a connected set, think of each random choice as unconnected. Yes, the first item gets picked first. Then the next item has a 50% chance of knocking it off. The next, a 1/3 chance, and so on. Each item should have the same chance of being picked.

That is, I think. I don't have a mathematical proof handy, but I do have a computer.

Test program:
import random

def randomPick1(items):
	index = random.randint(0,len(items)-1)
	return (index, items[index])

def randomPick2(items):
	count = 0
	(pickIndex, pickItem) = (0, None)
	for item in items:
		pick = random.randint(0,count)
		if pick==0:
			(pickIndex, pickItem) = (count, item)
		count += 1
	return (pickIndex, pickItem)

def doTest(name, items, testSize, randPick):
	hits = [ 0 for i in range(len(items)) ]
	for i in range(testSize):
		(index, item) = randPick(items)
		hits[index] += 1
	print(name + ': ' + ' '.join([ str(i) for i in hits ]))

def main():
	items = [ i+100 for i in range(10) ]
	testSize = 1000
	doTest("Rand from list", items, testSize, randomPick1)
	doTest("Rand from iter", items, testSize, randomPick2)

main()



Test results:
Rand from list: 97 99 94 105 108 107 93 97 93 107
Rand from iter: 121 101 88 108 106 107 90 83 100 96

Rand from list: 92 106 82 109 87 121 98 110 92 103
Rand from iter: 98 97 116 99 90 100 100 118 80 102

Rand from list: 107 92 90 103 113 90 94 95 113 103
Rand from iter: 109 95 90 91 96 104 96 104 100 115



Only eyeballing, it looks reasonably sound. I'd be happy for confirmation either way, I'm just running on intuition and what my computer tells me. :P
Was This Post Helpful? 1

#10 sepp2k  Icon User is online

  • D.I.C Lover
  • member icon

Reputation: 2151
  • View blog
  • Posts: 3,308
  • Joined: 21-June 11

Re: Most efficient way to go through a file

Posted 07 January 2013 - 07:46 AM

baavgai, on 07 January 2013 - 03:31 PM, said:

Then the next item has a 50% chance of knocking it off.


Ah! So you don't just return an element if the random number was 0. That's what I missed from the initial description.
Was This Post Helpful? 0

#11 blank_program  Icon User is offline

  • D.I.C Regular
  • member icon

Reputation: 11
  • View blog
  • Posts: 284
  • Joined: 22-July 09

Re: Most efficient way to go through a file

Posted 07 January 2013 - 08:47 AM

Alright, so I am using sqlite3. I have a column named ID which stores a number; the next column holds the quotes. I am trying to figure out how to read the number of rows just once at startup, so I can then pull out only the item I need, but I'm having no luck. Otherwise my random number could be larger than the number of rows in the table.
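For the row-count part, SELECT COUNT(*) does it in one query. A sketch against an in-memory database with a made-up QUOTES table (column names assumed from your description):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table QUOTES (ID integer primary key, quote text)")
conn.executemany("insert into QUOTES(quote) values (?)",
                 [("alpha",), ("beta",), ("gamma",)])

# One query gives the row count, so the random ID can't fall out of range.
(count,) = conn.execute("select count(*) from QUOTES").fetchone()
row_id = random.randint(1, count)   # SQLite assigns IDs starting at 1
(quote,) = conn.execute("select quote from QUOTES where ID = ?", (row_id,)).fetchone()
```

This assumes the IDs are gapless; once rows can be deleted, picking by ID breaks down and a different selection strategy is needed.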
Was This Post Helpful? 0

#12 baavgai  Icon User is online

  • Dreaming Coder
  • member icon

Reputation: 5908
  • View blog
  • Posts: 12,814
  • Joined: 16-October 07

Re: Most efficient way to go through a file

Posted 07 January 2013 - 10:55 AM

Well, hell, that just makes it too easy. :P

Try something like:
SELECT * FROM quotes ORDER BY RANDOM() LIMIT 1;


Was This Post Helpful? 0

#13 blank_program  Icon User is offline

  • D.I.C Regular
  • member icon

Reputation: 11
  • View blog
  • Posts: 284
  • Joined: 22-July 09

Re: Most efficient way to go through a file

Posted 07 January 2013 - 12:02 PM

Here we go with my solution. I spent some time with Google to get this. I figure it will work well enough, since I won't have much data in this database for the time being, and it should be OK if I do add more. Critiques welcome.

import random, sqlite3

quotelist = []

conn = sqlite3.connect('quotes.sqlite3')
with conn:
    cur = conn.cursor()
    cur.execute("select * from QUOTES")

    rows = cur.fetchall()

    for r in rows:
        quotelist.append(r[1])
    
    cur.close()

print(random.choice(quotelist))


Was This Post Helpful? 0

#14 baavgai  Icon User is online

  • Dreaming Coder
  • member icon

Reputation: 5908
  • View blog
  • Posts: 12,814
  • Joined: 16-October 07

Re: Most efficient way to go through a file

Posted 07 January 2013 - 01:39 PM

Ah, yeah, with that fetchall you're kind of right back where you started.

Right, the first code I offered would be for something like:
import sqlite3

def init(conn):
	conn.executescript('''
		create table foo (
			foo_id integer primary key, 
			msg text unique
		);
		insert into foo(msg) values('Alpha');
		insert into foo(msg) values('Bravo');
		insert into foo(msg) values('Charlie');
		insert into foo(msg) values('Delta');
		insert into foo(msg) values('Echo');
		insert into foo(msg) values('Foxtrot');
		insert into foo(msg) values('Golf');
		insert into foo(msg) values('Hotel');
		insert into foo(msg) values('India');
		insert into foo(msg) values('Juliet');
		''')

def randomPick(conn):
	cur = conn.cursor()
	cur.execute('select msg from foo order by random() limit 1')
	result = cur.fetchone()[0]
	cur.close()
	return result

conn = sqlite3.connect(":memory:")
init(conn)
for i in range(10):
	print(randomPick(conn))



It doesn't do any dup prevention, but it's simple.

Another option might be to just leave a cursor open:
def randomPickWithState(conn):
	cur = conn.cursor()
	cur.execute('select msg from foo order by random()')
	row = cur.fetchone()
	while row:
		yield row[0]
		row = cur.fetchone()
	cur.close()

conn = sqlite3.connect(":memory:")
init(conn)
gen = randomPickWithState(conn)
for item in gen:
	print(item)



This could get messy, though. If that cursor collapses, it's game over.

A lookup table for a given user would be the most reliable, if more complex.
def init(conn):
	conn.executescript('''
		create table foo (
			foo_id integer primary key, 
			msg text unique
		);
		insert into foo(msg) values('Alpha');
		insert into foo(msg) values('Bravo');
		insert into foo(msg) values('Charlie');
		insert into foo(msg) values('Delta');
		insert into foo(msg) values('Echo');
		insert into foo(msg) values('Foxtrot');
		insert into foo(msg) values('Golf');
		insert into foo(msg) values('Hotel');
		insert into foo(msg) values('India');
		insert into foo(msg) values('Juliet');
		
		create table session (
			seq integer primary key, 
			foo_id integer
		);
		''')

def getSeqMsg(conn):
	cur = conn.cursor()
	cur.execute('''
		select a.seq, b.msg
			from session a
				left outer join foo b on a.foo_id=b.foo_id
			where a.seq = ( select min(seq) from session );
		''')
	row = cur.fetchone()
	cur.close()
	if not row:
		conn.executescript('''
			delete from session;
		
			insert into session(foo_id)
				select foo_id
					from foo
					order by random();
			''')
		row = getSeqMsg(conn)
	return row


def randomPick(conn):
	(seq, msg) = getSeqMsg(conn)
	cur = conn.cursor()
	cur.execute('delete from session where seq = ?;', (seq,))
	cur.close()
	return msg

conn = sqlite3.connect(":memory:")
init(conn)
for i in range(20):
	print(randomPick(conn))



This has the advantage of state. You will never get a repeat until you're done with a list, even on reboot.
Was This Post Helpful? 0

#15 atraub  Icon User is offline

  • Pythoneer
  • member icon

Reputation: 759
  • View blog
  • Posts: 2,010
  • Joined: 23-December 08

Re: Most efficient way to go through a file

Posted 07 January 2013 - 06:17 PM

I smell a Python challenge being inspired here.
Was This Post Helpful? 0
