Downloading Links from a Webpage with Python


19 Replies - 888 Views - Last Post: 24 November 2013 - 11:11 AM

#16 andrewsw


Posted 23 November 2013 - 02:51 PM

Is links a list containing a single element, that is, one long string? Or are there separate links that just don't stop at the end of the link tag? Please clarify. Posting some samples from the links list would help.

#17 MillyH


Posted 23 November 2013 - 03:08 PM

My script returns 13 links. I've uploaded a picture of what they currently look like, but what I want them to look like is:

http://www.bbc.co.uk/
http://www.bbc.co.uk/news/

They currently appear like this:
<a href="http://www.bbc.co.uk/aboutthebbc/" >About the BBC</a> </li> <li> <a href="http://www.bbc.co.uk/bbctrust/" >BBC Trust</a> </li> <li> <a href="http://www.bbc.co.uk/cbbc/" >CBBC</a> </li> <li> <a href="http://www.bbc.co.uk/cbeebies/" >CBeebies</a> </li> <li> <a href="http://www.bbc.co.uk/food/" >Food</a> </li> <li> <a href="http://www.bbc.co.uk/health/" >Health</a> </li> <li> <a href="http://www.bbc.co.uk/history/" >History</a> </li> <li> <a href="http://www.bbc.co.uk/iplayer/" >iPlayer</a> </li> <li> <a href="http://www.bbc.co.uk/radio/" >iPlayer Radio</a> </li> <li> <a href="http://www.bbc.co.uk/learning/" >Learning</a> </li> <li> <a href="http://www.bbc.co.uk/music/" >Music</a> </li> <li> <a href="http://news.bbc.co.uk/local/hi/default.stm"

I'm trying to get just the links to appear.

This is my current code:
links = re.findall(r'\<a.*href\=.*http\:', page)


#18 andrewsw


Posted 23 November 2013 - 04:03 PM

You shouldn't use regex to do this - it is unreliable and messy. In particular, your regex isn't stopping when it should and it will require a more complex regex to achieve this than you currently have.

Use, for example, BeautifulSoup. You'll end up with code similar to this:

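import httplib2
from bs4 import BeautifulSoup, SoupStrainer  # Beautiful Soup 4 (pip install beautifulsoup4)

# Fetch the page; request() returns (response headers, body)
http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

# Parse only the <a> tags, then print the href of each tag that has one
soup = BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a'))
for link in soup.find_all('a', href=True):
    print(link['href'])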

With winter coming, soup is good for you ;)

Beautiful Soup docs

#19 Ryano121


Posted 23 November 2013 - 04:13 PM

If you know that the HTML you are parsing is always well formed (meaning you always have double quotes around your hrefs, and so on), then the regex approach isn't that hard.

href=\"(.*?)\"


Your current regex has a greediness problem. The * will match as many characters as it can, which in this case means it runs on past the end of the link you wanted and swallows everything up to the last place on the line where the rest of the pattern can still match. You therefore get one long match where you expected many short ones. Put a ? after the * to turn the greediness off, so the engine selects the minimum number of characters instead.
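
To make the greedy versus non-greedy difference concrete, here is a small sketch. The page string is a shortened, made-up sample in the style of the markup posted in #17, not the real BBC page, and the variable names are only for illustration.

```python
import re

# Made-up sample in the style of post #17: two links on a single line.
page = ('<li> <a href="http://www.bbc.co.uk/food/" >Food</a> </li> '
        '<li> <a href="http://www.bbc.co.uk/health/" >Health</a> </li>')

# Greedy: .* runs to the LAST double quote on the line,
# so a single match spans both links.
greedy = re.findall(r'href="(.*)"', page)

# Non-greedy: .*? stops at the FIRST closing quote,
# so each link produces its own match.
lazy = re.findall(r'href="(.*?)"', page)

print(len(greedy))  # 1
for url in lazy:
    print(url)
```

With the non-greedy pattern the loop prints the two URLs on separate lines, which is the output format asked for earlier in the thread.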

In addition you can simplify your regex if you just remove all the whitespace in the html before you do the matching.

If you want a more robust approach, one which should work even if there are problems with the HTML, then you should go with the library-based approach.
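
As a rough illustration of why a parser copes better with sloppy markup, the sketch below uses the html.parser module from Python's standard library rather than BeautifulSoup, so it runs without installing anything. The LinkCollector class and the messy sample string are hypothetical and exist only for this example; the sample uses single quotes, an unquoted value, and upper-case tag names, all of which the double-quote regex misses.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names before calling this.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


# Made-up markup: single-quoted href, unquoted upper-case HREF, and an anchor with no href.
messy = ("<a href='http://www.bbc.co.uk/'>Home</a>"
         "<A HREF=http://www.bbc.co.uk/news/>News</A>"
         "<a name=\"top\">no link here</a>")

collector = LinkCollector()
collector.feed(messy)
print(collector.links)

# The double-quote regex from above finds nothing in the same input.
regex_links = re.findall(r'href="(.*?)"', messy)
print(regex_links)
```

BeautifulSoup gives you the same tolerance with less code, plus searching and navigation on top, which is why it is usually the more convenient choice for real pages.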

#20 MillyH


Posted 24 November 2013 - 11:11 AM

Thank you for all your help :)