1 Replies - 7703 Views - Last Post: 30 April 2010 - 04:55 AM

#1 gretty

Python: Get a URL using regular expressions

Posted 29 April 2010 - 11:51 PM

Hello

I am using regular expressions to grab URLs from a string (of HTML code). It is going well and I seem to be grabbing the full URL, but
I also get a '"' character at the end of it. Do you know how I can get rid of the '"' character at the end of my URL?

Example of problem:

Quote

I get this when I extract a url from a string
http://google.com"

I want to get this
http://google.com


My regular expression:
import re

def find_urls(string):
    """ Extract all URLs from a string & return as a list """

    url_list = re.findall(r'(?:http://|www.).*?["]',string)
    return url_list




Replies To: Python: Get a URL using regular expressions

#2 baavgai

Re: Python: Get a URL using regular expressions

Posted 30 April 2010 - 04:55 AM

You're assuming your URL ends with "? That's probably a poor choice. However, if you want to do it, just grab everything that's not the terminator, like so:
re.findall(r'(?:http://|www\.)[^"]+', html)
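To make the difference concrete, here is a small comparison of the two patterns. The `sample` string is made up for illustration and is not from the original post; treat this as a sketch rather than a complete URL matcher.

```python
import re

# Hypothetical sample input, invented for this illustration.
sample = '<a href="http://google.com">Google</a>'

# Lazy match that runs up to AND including the closing quote,
# which is why the quote ends up in the result.
with_quote = re.findall(r'(?:http://|www\.).*?["]', sample)

# Negated character class: consumes everything that is not a quote,
# so the match stops just before the terminator.
without_quote = re.findall(r'(?:http://|www\.)[^"]+', sample)

print(with_quote)     # ['http://google.com"']
print(without_quote)  # ['http://google.com']
```

The only change is that the terminator is excluded from the match instead of being consumed by it.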



I would probably scan until I hit a character that clearly can't be part of a url. e.g.
re.findall(r'(?:http://|www\.)[^"\' ]+', html)



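There are certainly a few more characters like that. Being inclusive (matching only characters you know can appear in a URL) rather than exclusive (stopping at a list of known terminators) might be the way to go. Also, that "|www" business is suspect. Figure out if you're looking for real URLs with a protocol at the front or just the contents of anchor tags and go from there.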
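A rough sketch of the inclusive approach might look like the following. The name `find_urls_inclusive` and the exact character set are assumptions made for illustration: the set is loosely based on the unreserved and reserved characters of RFC 3986, with the single quote deliberately left out so that single-quoted attributes still terminate the match. It also assumes you only want URLs that start with a protocol.

```python
import re

# Assumption: a URL starts with http:// or https:// and continues for as long
# as the characters belong to this allowed set. The single quote is omitted
# on purpose so it acts as a terminator, like the double quote.
URL_PATTERN = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&()*+,;=%]+")

def find_urls_inclusive(text):
    """Return every substring of text that looks like an http(s) URL."""
    return URL_PATTERN.findall(text)

# Hypothetical input, invented for this illustration.
text = 'a "http://google.com/search?q=a%20b" b \'https://example.org/x\' c'
print(find_urls_inclusive(text))
```

Note that a pattern like this will still pick up trailing punctuation such as a full stop at the end of a sentence, since those characters are legal in URLs.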


If it's really "<a href" you're looking for, here's some code that may work for you.
import re

def find_urls(html):
	url_list = []
	for tag in re.findall(r'< *a +href[^>]+', html): # find the a tags
		m = re.search(r'href *= *[\'"]([^\'"]+)', tag) # find the href inside this tag
		if m:
			url_list.append(m.group(1)) # add the sub match
	return url_list
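The same idea can also be written as a single pattern with a capture group. This is only a sketch of a variant, not the code from the post: the name `find_hrefs` is made up, and unlike the two-step version it allows other attributes before `href` and ignores case. Regular expressions remain a fragile way to read HTML, so expect it to miss unusual markup such as unquoted attribute values.

```python
import re

# Assumption: href values are quoted with either ' or ".
# \b after "a" keeps tags such as <abbr> from matching.
HREF_PATTERN = re.compile(r'<\s*a\b[^>]*?\bhref\s*=\s*["\']([^"\']+)', re.IGNORECASE)

def find_hrefs(html):
    """Return the href values of all anchor tags found in html."""
    # With one capture group, findall returns just the captured text.
    return HREF_PATTERN.findall(html)

# Hypothetical input, invented for this illustration.
page = ('<a href="http://google.com">Google</a> '
        '<A class="x" HREF=\'http://example.org/page\'>Ex</A> '
        '<img src="http://img.example/i.png">')
print(find_hrefs(page))  # ['http://google.com', 'http://example.org/page']
```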

