7 Replies - 735 Views - Last Post: 29 August 2011 - 10:22 PM Rate Topic: -----

#1 tootypegs  Icon User is offline

  • D.I.C Head

Reputation: 1
  • View blog
  • Posts: 239
  • Joined: 09-October 07

Basic comparison of lists

Posted 22 August 2011 - 08:26 AM

I have 2 lists which I have populated with unknown content through file processing, for the example just say I have 'listofwords' and 'listofterms'. I want to see if any items in 'listofwords' appear as any part of any items stored in 'listofterms'. Basically I want to see if any item exists as a substring in any of the lists.

'listofwords' - dog, cat, mouse, rattrap
'listofterms' -dogbowl, rat, stick

So with the above figures I would like to highlight that 'rat' and 'dog' are common to both lists. Is there a way of comparing values?


So far I have my 2 lists populated - might be different variable names but the concept is the same.

for line in CompareCBR.readlines():
            spos = line.find(': - ')
            if spos > -1:
              epos = line.find('\n')
              if epos > -1:
                resource = line[spos +4 : epos]
            ListofText.append(resource)   
            
                        
for lines in Checkformatches.readlines():
            spos1 = lines.find('rdf:ID="')
            if spos1 > -1:
              epos1 = lines.find('">')
              if epos1 > -1:
                resources = lines[spos1 +8 : epos1]
                print resources
            ListofOWLwords.append(resources)




Cheers guys

Is This A Good Question/Topic? 0
  • +

Replies To: Basic comparison of lists

#2 Motoma  Icon User is offline

  • D.I.C Addict
  • member icon

Reputation: 452
  • View blog
  • Posts: 796
  • Joined: 08-June 10

Re: Basic comparison of lists

Posted 22 August 2011 - 08:32 AM

The simple approach would be to iterate over both lists checking each item:


matches = []

for search_item in list1:
  for match_item in list2:
    if search_item in match_item:
      matches.append(search_item)
      break

for search_item in list2:
  for match_item in list1:
    if search_item in match_item:
      matches.append(search_item)
      break



More advanced methods would entail using dictionaries of n-grams, and is probably overkill for what you need.
Was This Post Helpful? 0
  • +
  • -

#3 Python_4_President  Icon User is offline

  • D.I.C Regular

Reputation: 53
  • View blog
  • Posts: 321
  • Joined: 13-August 11

Re: Basic comparison of lists

Posted 23 August 2011 - 05:20 PM

import re

list1 = ['thief', 'white-collar', 'brown', 'elderberry', 'berry', 'ice', 'cake', 'jule']
list2 = ['brown-dwarf', 'jule-thief', 'white-collar crime', 'berry pie cake icecream', 'brown-noise']

match_list = [word for word in list1 if re.search(word, str(list1))]





Less is more.
Was This Post Helpful? 0
  • +
  • -

#4 Python_4_President  Icon User is offline

  • D.I.C Regular

Reputation: 53
  • View blog
  • Posts: 321
  • Joined: 13-August 11

Re: Basic comparison of lists

Posted 23 August 2011 - 05:57 PM

Well, apparently I can't edit that post... so here is the rest of it.


OPs code reads more like C than like Python... Here's an explanation of list comprehension and a brief one of regular expressions in case OP has never heard of them before.



brackets '[]' form lists, as you're well aware. Naturally, a list comprehension gives you a list.

This list is built of <word>s.

<word>s are gotten by iterating through a list and optionally checking for some condition.

[<word> for <word> in <word_list> if <word> is spanish or <word> == 'blue' or <word> in <bad_word_list> and <word> not in <facebook_post>] 



You can also modify <word> before it gets added to the returned list...
[<word>.lower().upper().strip('aeiou') for <word> in <Annual_Financial_Report> if <word> in <panic_phrases>]





and regular expressions are a good way to find more complicated things.

dicstr = "Regular expressions are the single coolest quantity in the known universe..." 
results = re.search("([rR]egular).*(are).*(cool).*(universe).*", dicstr).groups()
print results

four_word_pattern = re.compile("([rR]egular).*(are).*(cool).*(universe).*")
results = four_word_pattern.search(dicstr).groups()
print results


Was This Post Helpful? 1
  • +
  • -

#5 chemicalfan  Icon User is offline

  • D.I.C Head

Reputation: 4
  • View blog
  • Posts: 88
  • Joined: 16-October 09

Re: Basic comparison of lists

Posted 24 August 2011 - 12:59 AM

Is the re module quicker than using straight-up iteration? I've always thought importing modules was best avoided if reasonably possible (not sure why though!)
Was This Post Helpful? 0
  • +
  • -

#6 Python_4_President  Icon User is offline

  • D.I.C Regular

Reputation: 53
  • View blog
  • Posts: 321
  • Joined: 13-August 11

Re: Basic comparison of lists

Posted 24 August 2011 - 06:34 PM

Well... Let's think about how all this crazy stuff works in the first place..

imports get added to a hash table which contains the hashed values for all of the everythings in your program that reference something else. (that's why print != Print, and True != true)

When you import something, the hash table grows. The large the hash table is, the slower the lookups could potentially be (these things are super-mega-really optimized).


Python allows you lots of ways to control how things are imported.


Consider this example:

from classfile_with_a_million_methods import *

print poorly_formatted_report()





Here, we use exactly one method from a class file containing at least a million methods, but we import the entire file.
Now, when poorly_formatted_report is looked up, an additional million entries or so must be considered.

However, we can control what we import:
from classfile_with_a_million_methods import poorly_formatted_report

print poorly_formatted_report()





If you think of functions/methods as tools like screwdrivers and barrel vises, and classes like realms of technology (IE, my surfacemount rework kit, my high-volume wood processing kit, my truck kit, etc), then "import *" is like saying, "I'm gunna bring along my entire friggin' garage."

That's pretty silly if all you need to do is draw a straight line on a piece of paper.

You can avoid the overhead of looking up a function defined in an external file by assigning the result of the lookup to a variable. This removes the need to look up the function name in a hash table every time you call the function.
Instead, the lookup is performed once and a direct reference to the function is stored in local memory.


So, for example:
import math
import time

MAX_REPLAY = 1000
MAX_TEST = 1000000

sum_of_all_lookup_times = 0
sum_of_all_nolookup_times = 0

for x in xrange(MAX_REPLAY):
    start_time_lookup = time.clock()
    for i in xrange(1,MAX_TEST):
        test = math.ceil(i/3.14159)   
    stop_time_lookup = time.clock()
    
    
    lookup_time = stop_time_lookup - start_time_lookup
    sum_of_all_lookup_times += lookup_time
    
    
    start_time_no_lookup = time.clock()
    
    #Here's the thing I'm talking about... 
    turbo = math.ceil
    
    
    for i in xrange(1, MAX_TEST):
        test = turbo(i/3.14159)
    stop_time_no_lookup = time.clock()
    
    
    nolookup_time = stop_time_no_lookup - start_time_no_lookup
    sum_of_all_nolookup_times += nolookup_time
    
    
    print "Lookup completed in: ", lookup_time
    print "Nolookup completed in: ", nolookup_time
    
    
    

print "If you call functions in loops, then your time on ", MAX_TEST, " lookups is: ", sum_of_all_lookup_times
print "If you assign the result of the lookup to a variable, then your time on ", MAX_TEST, " iterations is: ", sum_of_all_nolookup_times






You can see that the former takes a bit longer than the latter.

Here are my results:
Average Lookup completed in: 0.29226998897 seconds
Average Nolookup completed in: 0.264197716412 seconds
If you call functions in loops, then your time on 1000000 lookups is: 292.0536114 seconds
If you assign the result of the lookup to a variable, then your time on 1000000 iterations is: 262.922570567 seconds



So let's look at regular expressions then...






A list of words and things...
big_list = """
View this email with images.


American Science & Surplus - Incredible Stuff, Unbelievable Prices!
Specials     Specials     New Stuff     On Sale     Must Go
Specials
x     
Shop by Category
-     

Adult & Kid Toys
Arts & Crafts
Batteries & Power
Communications & Electronics
Containers
Drives & Wheels
Electrical Parts
Hardware
House & Garden
Kits & Models
Lab Supplies & Equipment
Lamps, Sockets & Wiring
Materials, Magnets
Militaria
Motors, Blowers & Pumps
Office Supplies
Optics
Out & About
School Items
Switches
Tools
Robot Partz
Help Project Exploration
    -
-- 
    x     
Great sale & surplus items are like Cubs' World Series tickets...they don't come very often, so grab the chance when you can.


AS&S...Offering bargains since Rabbit Warstler played shortstop for the Boston Bees.
-
CELESTRON SPOTTING SCOPE<br>15-45X zoom!<br>With tripos and aluminum carry case!<br>ON $77.95<br>(wa     x     

CELESTRON SPOTTING SCOPE
15-45X zoom!
With tripod and aluminum carry case!
ON SALE! $77.95
(was $89.95)

Product Details Go
-
WET/DRY AUTO VACUUM<br>Plugs into cig lighter!<br>9 foot cord!<br>ON SALE! $8.95<br>(was $9.95)     x     

WET/DRY AUTO VACUUM
Plugs into cig lighter!
Van-friendly 9 foot cord!
ON SALE! $8.95
(was $9.95)

Product Details Go
-
VINTAGE HAVERSACK<br>US Army Infantry circa 1898<br>Spanish-American War era!<br>ON $69.50<br>(was     x     

VINTAGE HAVERSACK
US Army Infantry circa 1898!
Spanish-American War era!
ON SALE! $69.50
(was $75.00)

Product Details Go
-
SURVIVAL BINOCULARS<br>Plus compass, signal mirror, LED light!<br>Avoid 127 hours scen!<br>ON<br>(wa     x     

SURVIVAL BINOCULARS
Plus compass, signal mirror, LED light!
Avoid 127 hours scenarios!
ON SALE! $3.95
(was $4.95)

Product Details Go
-


UHF-HDTV ANTENNA

Amplified too!
ON SALE! $22.50
(was $24.50)

Product Details Go
-
MONTY PYTHON CARD GAME<br>Silly walk over and buy this!<br>That no ordinary card game!<br>ON<br>(was     x     

MONTY PYTHON CARD GAME
Silly walk over and buy this!
That's no ordinary card game!
ON SALE! $15.00
(was $17.95)

Product Details Go
-
4-PIECE LOUPE SET<br>2X to 8X!<br>With storage case!<br>ON SALE! $5.00<br>(was $6.50)     x     

4-PIECE LOUPE SET
2X to 8X!
With storage case!
ON SALE! $5.00
(was $6.50)

Product Details Go
-
TANK PERISCOPE PRISM<br>Enourmous optical quality glass!<br>Rainbows for everyone!<br>ON<br>(was $10     x     

TANK PERISCOPE PRISM
Enormous optical quality glass!
Rainbows for everyone!
ON SALE! $7.50
(was $10.75)

Product Details Go
-
    x
Specials
-- 

American Science & Surplus
7410 N. Lehigh Ave
Niles, IL 60714
    888-SCIPLUS (888-724-7587)
Annual sale ends August 22, 2011
Call Us! THANKS!

 


Click Here to be removed from this list
""".split()





SRC for the regex demo:


import time
import re


regex = "PYTHON"


MAX_REPLAY = 1000
MAX_TEST = 100000

sum_of_all_regex_times = 0
sum_of_all_normal_times = 0
sum_of_all_awesome_times = 0
sum_of_all_normal2_times = 0


###############################################################
for x in xrange(MAX_REPLAY):
    start_regex = time.clock()
    for element in big_list:
        if(re.search(regex, element)):
            found = element
    print found
    stop_regex = time.clock()
    regex_time = stop_regex - start_regex
    sum_of_all_regex_times += regex_time
###############################################################


###############################################################   
    regex2_time = time.clock()
    found = [word for word in big_list if re.search(regex, word)]
    print found
    regex2_stop_time = time.clock()
    total_regex2 = regex2_stop_time - regex2_time
    sum_of_all_awesome_times += total_regex2
###############################################################  
    
   
############################################################### 
    start_time_normal = time.clock()   
    if regex in big_list:
        found = big_list[big_list.index(regex):big_list.index(regex)+len(regex)]
    print found
    stop_time_normal = time.clock()
    normal_time = stop_time_normal - start_time_normal
    sum_of_all_normal_times += normal_time
###############################################################


###############################################################
    start_time_normal2 = time.clock()
    for element in big_list:    
        if regex in element:
            found = element
    print found
    stop_time_normal2 = time.clock()
    normal2_time = stop_time_normal2 - start_time_normal2
    sum_of_all_normal2_times += normal2_time
###############################################################
    
    print "-"*40
    print "regex completed in: ", regex_time, " seconds"
    print "elementwise in test completed in: ", normal_time, " seconds"
    print "straight_up iteration completed in: ", normal2_time, " seconds"
    print "regex2 completed in ", total_regex2, " seconds"
    print "-"*40
    
print "\n\n"  
print '|'*80
print "regex2 finished ", MAX_TEST, " iterations in: ", sum_of_all_awesome_times, " seconds"
print "Regular Regular Expressions finish ", MAX_TEST, " iterations in: ", sum_of_all_regex_times, " seconds"
print "In test finishes ", MAX_TEST, " iterations in: ", sum_of_all_normal_times, " seconds"
print "elementwise in test finishes ", MAX_TEST, " iterations in: ", sum_of_all_normal2_times, " seconds"
print '|'*80






times for regex techniques and iterative techniques

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
regex2 finished 100000 iterations in: 0.507297829685 seconds
Regular Regular Expressions finish 100000 iterations in: 0.469704052771 seconds
In test finishes 100000 iterations in: 0.171675723601 seconds
elementwise in test finishes 100000 iterations in: 0.0573768939973 seconds
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



So, yes, "straight-up" iteration goes pretty quick. However, it becomes more difficult to look for multiple things, or multiple types of things, with just string/list builtins, and it is here that regular expressions quickly begin to surpass other methods.

This post has been edited by Python_4_President: 24 August 2011 - 06:38 PM

Was This Post Helpful? 2
  • +
  • -

#7 baavgai  Icon User is offline

  • Dreaming Coder
  • member icon

Reputation: 5848
  • View blog
  • Posts: 12,707
  • Joined: 16-October 07

Re: Basic comparison of lists

Posted 25 August 2011 - 05:19 AM

Tests? Welcome to DIC, where we love tests!

You left out one I wanted to see, though. Also, you kind of lied to me. Your code loops MAX_REPLAY times, but tells me it did MAX_TEST times.

The test I wanted to see was compiled regex. Because, honestly, if your were using regex on anything significant, that's what you'd use. Also, I'd prefer time.time(). Which led me to the next thing, compulsively rewriting you code.

Here's the code:
regex = "PYTHON"
regexCompiled = re.compile(regex)

def timeProcess(func, printResult):
	# start_time = time.clock()
	start_time = time.time()
	result = func()
	# elapsed = time.clock() - start_time
	elapsed = time.time() - start_time
	if printResult:
		print result
	return elapsed

def testRegex():
	for element in big_list:
		if(re.search(regex, element)):
			return element

def testRegex2():
	return [word for word in big_list if re.search(regex, word)]

def testRegexReal():
	for element in big_list:
		if(regexCompiled.search(element)):
			return element

def testNormal():
	if regex in big_list:
		return big_list[big_list.index(regex):big_list.index(regex)+len(regex)]
	return None

def testNormal2():
	for element in big_list:    
		if regex in element:
			return element
	return None

def doSingleTestIteration(tests, printResult):
	def show(s): 
		if printResult: 
			print(s)
	def testElapsed(name, func):
		elapsed = timeProcess(func, printResult)
		show("{} completed in: {} seconds".format(name, elapsed))
		return elapsed
	show("-"*40)
	results = [ testElapsed(name, func) for (name, func) in tests ]
	show("-"*40)
	return results

def doTests(MAX_TEST, printResult = False):
	tests = ( 
		("regex2", testRegex), ("Regular Regular Expressions", testRegex),
		("In test finishes", testNormal), ("elementwise", testNormal2),
		("regex3", testRegexReal),
		)
	testsIndex = range(len(tests))
	totals = [ 0 for i in testsIndex ]
	for i in range(MAX_TEST):
		runTotal = doSingleTestIteration(tests, printResult)
		for j in testsIndex:
			totals[j] += runTotal[j]
			
	print "\n\n"  
	print '|'*80
	for i in testsIndex:
		name = tests[i][0]
		print "{} finished {} iterations in: {} seconds".format(name, MAX_TEST, totals[i])
	print '|'*80

doTests(1000)



Results:
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
regex2 finished 1000 iterations in: 0.802315235138 seconds
Regular Regular Expressions finished 1000 iterations in: 0.801429271698 seconds
In test finishes finished 1000 iterations in: 0.0280725955963 seconds
elementwise finished 1000 iterations in: 0.0402429103851 seconds
regex3 finished 1000 iterations in: 0.170344114304 seconds
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



Taking out the compile time of even the simplest expression clearly improves things a bit.
Was This Post Helpful? 1
  • +
  • -

#8 Python_4_President  Icon User is offline

  • D.I.C Regular

Reputation: 53
  • View blog
  • Posts: 321
  • Joined: 13-August 11

Re: Basic comparison of lists

Posted 29 August 2011 - 10:22 PM

Yes, compiling does shave off a few 100ths of a second.


UPC_Compiled: 0.307229042053
UPC JIT: 0.311157226562
SISA_Compiled: 0.908787488937
SISA JIT: 0.915666103363
RISA_Compiled: 0.872503995895
RISA JIT: 0.880665779114



Now, who knows the best way to achieve what I've done in the following code without using regular expressions?

For example, find all the UPCs for Summer Fresh products...


import re
import time

cumulative_upc_compiled_time = 0
cumulative_upc_jit_time = 0
cumulative_sisa_compiled_time = 0
cumulative_sisa_jit_time = 0
cumulative_risa_compiled_time = 0
cumulative_risa_jit_time = 0


edi = "ISA*00*          *00*          *12*12345678       *12*1234567890    *110815*1623*U*00400*0000004*0*P*>^GS*GP*12345678*1234567890*20110815*1623*4*X*004010UCS^ST*880*0002^G01*20110815*0061785*20110801*55555555^G62*10*20110810^G27*M****COMMON^G23*01*3***2**10**30*33242*1836684****2% 10 DAYS NET 30^G25*PP*02*DEST^N1*ST*SOME STORE*9*058796333444^N3*111 Some Street^N4*Some City*SS*55555^N1*RE*Some Other Store*9*66666666^N3*Some Other Street.^N4*H-Town*TX*77099^N1*BT*Someone to Bill,*9*34234555234234^N3*Some Address^N4*Some city*AK*66066^G17*81*CA*15.68*123456789*UP*123456789****81*CA^G69*LAVENDER^G20*12^G72*90*02***1.57*81*CA^G17*75*CA*14.41*123456780*UP*123456783****75*CA^G69*DRAIN^G20*9^G72*90*02***1.44*75*CA^G17*156*CA*15.68*123456781*UP*123456783****156*CA^G69*PINE^G20*12^G72*90*02***1.57*156*CA^G17*105*CA*24*123456785*UP*123456788****105*CA^G69*LAVENDER PASSION^G20*6^G72*90*02***2.4*105*CA^G17*70*CA*24*123456782*UP*123456783****70*CA^G69*CARRIBEAN BREEZE^G20*6^G72*90*02***2.4*70*CA^G17*90*CA*14*123456786*UP*123456787****90*CA^G69*NATURE FRESH^G20*12^G72*90*02***1.4*90*CA^G17*90*CA*14*123456783*UP*123456785****90*CA^G69*SUMMER FRESH^G20*12^G72*90*02***1.4*90*CA^G17*270*CA*14*123456782*UP*123456783****270*CA^G69*SPRING FRESH^G20*12^G72*90*02***1.4*270*CA^G17*83*CA*14.2*123456786*UP*123456782****83*CA^G69*BABY ESSENCE^G20*12^G72*90*02***1.42*83*CA^G17*180*CA*14*123456782*UP*123456783****180*CA^G69*VIOLET BOUQUET^G20*12^G72*90*02***1.4*180*CA^G17*35*CA*21.84*123456782*UP*123456784****35*CA^G69*DISINFECTANT^G20*6^G72*90*02***2.18*35*CA^G17*60*CA*16.96*123456786*UP*123456788****60*CA^G69*DISINFECTANT^G20*15^G72*90*02***1.7*60*CA^G31*1295*CA*41423.74*LB*1261.63*CF^G33*4454555^SE*66*0002^GE*1*4^IEA*1*0000004^"


UPC_pattern = re.compile("[0-9]{9}")
Sender_pattern = re.compile("[a-zA-Z0-9 *]{35}([0-9 ]{14})")
Receiver_pattern = re.compile("[a-zA-Z0-9 *]{54}([0-9 ]{14})")


for i in xrange(5000):
    start = time.time()
    UPC_pattern.findall(edi)
    upc_compiled_time = time.time() - start
    
    
    start = time.time()
    re.findall("[0-9]{9}", edi)
    upc_jit_time = time.time() - start
    
    start = time.time()
    Sender_pattern.findall(edi)
    sisa_compiled_time = time.time() - start
    
    start = time.time()
    re.findall("[a-zA-Z0-9 *]{35}([0-9 ]{14})", edi)
    sisa_jit_time = time.time() - start
    
    start = time.time()
    Receiver_pattern.findall(edi)
    risa_compiled_time = time.time() - start
    
    start = time.time()
    re.findall("[a-zA-Z0-9 *]{54}([0-9 ]{14})", edi)
    risa_jit_time = time.time() - start
    
    
    cumulative_upc_compiled_time += upc_compiled_time
    cumulative_upc_jit_time += upc_jit_time
    
    cumulative_sisa_compiled_time += sisa_compiled_time
    cumulative_sisa_jit_time += sisa_jit_time
    
    cumulative_risa_compiled_time += risa_compiled_time
    cumulative_risa_jit_time += risa_jit_time
    
    
    
    
print "UPC_Compiled: " + str(cumulative_upc_compiled_time)
print "UPC JIT: " + str(cumulative_upc_jit_time)
print "SISA_Compiled: " + str(cumulative_sisa_compiled_time)
print "SISA JIT: " + str(cumulative_sisa_jit_time)
print "RISA_Compiled: " + str(cumulative_risa_compiled_time)
print "RISA JIT: " + str(cumulative_risa_jit_time)



This post has been edited by Python_4_President: 29 August 2011 - 10:24 PM

Was This Post Helpful? 0
  • +
  • -

Page 1 of 1