Help improving code efficiency

My code works, but it's slow. I'd appreciate your thoughts and critique.


#1 RoseWyvern

Posted 01 March 2010 - 11:17 PM

I'm trying to speed up this program; any thoughts? For now, I'm working on changing the for loops into sentinel-controlled while loops, along the lines of the sketch below.
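
For instance, here's the kind of rewrite I have in mind for the search loop in my isAble function below (an untested sketch; the able flag acts as the sentinel):

#sentinel-controlled version of the search loop in isAble:
#stop scanning as soon as one disqualifying term turns up
def isAble(cleanSourceCode):
    text = removeTags(cleanSourceCode).lower()
    notAble = "m.d. nurse phd".split()
    able = True
    i = 0
    while able and i < len(notAble):
        if notAble[i] in text:
            able = False
        i += 1
    return able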

The documentation I read suggested that urlopen can be slow, but I doubt rewriting this program in a threaded fashion would help.

Also, if you see anything that seems off in the programming, let me know; I'm self-taught, and my work doesn't get critiqued very often.

#This contains the functions for scraping the jobs section of the Moffitt website


from urllib import *
from re import *

#This function takes a url and returns the html source

def getHTML(url):
    urlBuffer = urlopen(url)
    HTML = urlBuffer.read()
    urlBuffer.close()
    return HTML

#This function takes the main Moffitt page source and extracts the
#individual job URLs

#It also strips the trailing single quote from each match
def getMoffittURLs(html):
    x = findall('http://tbe.taleo.net/NA4/ats/careers/requisition.*\'', html)
    x = map(lambda y: y.replace('\'', ''), x)
    return x

#function to determine if requirements are out of bounds
def isAble(cleanSourceCode):
    #strip the tags, then search the lowercased text
    cleanSourceCode = removeTags(cleanSourceCode)
    able = True
    notAble = "m.d. nurse phd".split()
    for i in range(len(notAble)):
        if notAble[i] in cleanSourceCode.lower():
            able = False
    return able

#removes html tags
def removeTags(feed):
    p = compile(r'<.*?>')
    return p.sub('', feed)

#returns the title 
def getTitle(feed):
    a = findall('<td colspan=2><b>.*</b>', feed)
    if a:
        #strip the tags from the first match; calling str() on the
        #whole list would leave brackets and quotes in the title
        return removeTags(a[0])
    return ''

def main():
    mUrl = 'http://tbe.taleo.net/NA4/ats/careers/searchResults.jsp?org=MOFFITT&cws=1'
    html = getHTML(mUrl)
    links = getMoffittURLs(html)
    for i in links:
        job = getHTML(i)
        if isAble(job):
            title = getTitle(job)
            print title + " is open"

if __name__ == '__main__':
    main()

Thanks

Replies To: Help improving code efficiency

#2 dsherohman

Re: Help improving code efficiency

Posted 02 March 2010 - 03:30 AM

Two thoughts:

RoseWyvern, on 02 March 2010 - 06:17 AM, said:

The documentation I read suggested that urlopen can be slow, but I doubt rewriting this program in a threaded fashion would help.

Also, if you see anything that seems off in the programming, let me know; I'm self-taught, and my work doesn't get critiqued very often.


Have you profiled your code to see where it's spending its time? Unless you know that, you're likely to spend hours optimizing away 99% of the run time of something that takes a millisecond (saving 0.00099 seconds), while overlooking another spot where the code spends 10 seconds and five minutes of work could cut its run time by 10% (saving a full second).
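
In Python the standard library's cProfile module makes that easy; something like this (a sketch, assuming the main() function from your post) prints a per-function breakdown sorted by cumulative time:

#run main() under the profiler and sort the report by cumulative time
import cProfile
cProfile.run('main()', sort='cumulative')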

Since you mentioned urlopen, if your code is spending 99% of its time waiting for pages to download (which is actually fairly likely), then nothing you do to the rest of the code will make it faster; the only way to get a significant improvement in runtime performance is to reengineer the program to retrieve multiple documents in parallel. (This doesn't necessarily mean "use threads", btw. The C "select" system call can monitor multiple connections simultaneously from within a single process/thread, and Python exposes the same facility in its standard select module.)
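
For illustration, here's a minimal threaded sketch in Python 2 (the fetch_all helper and its names are mine, not from your code); the select/asyncore route would avoid threads entirely:

#fetch several URLs in parallel, one thread per URL
#(a sketch; fine for a handful of pages, not for hundreds)
import threading
from urllib import urlopen

def fetch(url, results, i):
    #each thread downloads one page and stores it by index
    results[i] = urlopen(url).read()

def fetch_all(urls):
    results = [None] * len(urls)
    threads = [threading.Thread(target=fetch, args=(u, results, i))
               for i, u in enumerate(urls)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()    #wait for every download to finish
    return results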

#removes html tags
def removeTags(feed):
    p = compile(r'<.*?>')
    return p.sub('', feed)



Non-greedy quantifiers (.*? or .+?) can cause massive backtracking in the regex engine, which will absolutely kill regex performance. If you mean "a <, followed by any number of characters, ending at the first >", then remember that the intervening characters can't be > (since that terminates the match) and save the regex engine some work by using a negated character class instead:
	p = compile(r'<[^>]*>')


While the non-greedy quantifier version produces the correct result, it forces the regex engine to test, at every character between < and >, whether the rest of the pattern could match there, backtracking and expanding the quantifier each time it can't. The negated character class version runs straight to the first following > in a single pass and stops, since no other match is possible.
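
If you want to see the difference for yourself, here's a quick (hypothetical) comparison using the timeit module; the exact numbers will depend on the input:

#time both regexes against a chunk of tag-heavy text
import timeit
setup = "import re; html = '<td class=x>cell</td>' * 500"
print timeit.timeit("re.sub(r'<.*?>', '', html)", setup, number=100)
print timeit.timeit("re.sub(r'<[^>]*>', '', html)", setup, number=100)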

Of course, the best way to handle this would be to use an actual HTML parsing library because HTML is complex and has a variety of corner cases that will trip up simple regex-based attempts, but this way should work well enough for simply removing all tags in most cases.
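
For instance, here's a sketch using Python 2's standard HTMLParser module (in Python 3 it lives in html.parser), which copes with cases the regex can't, such as > inside an attribute value:

#strip tags with a real parser instead of a regex
from HTMLParser import HTMLParser

class TagStripper(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.pieces = []
    def handle_data(self, data):
        #called with the text between tags
        self.pieces.append(data)
    def get_text(self):
        return ''.join(self.pieces)

def remove_tags(feed):
    stripper = TagStripper()
    stripper.feed(feed)
    return stripper.get_text()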
