Downloading Links from a Webpage with Python

19 Replies - 948 Views - Last Post: 24 November 2013 - 11:11 AM

#1 MillyH
Posted 22 November 2013 - 04:04 PM

Hi,

I am relatively new to the Python language, but I have created two scripts whose purpose is to download a webpage and list any/all links within it. The webpage script works, and both scripts run without any errors; however, the second script returns no links no matter what website URL I give it. Does anyone know why? Could you look at the code and see what is wrong? Thanks

Webpage script
import sys, urllib
def getWebpage(url):
    print '[*] getWebpage()'
    url_file = urllib.urlopen(url)
    page = url_file.read()
    return page
def main():
    sys.argv.append('http://www.funeralformyfat.tumblr.com')
    if len(sys.argv) != 2:
        print '[-] Usage: webpage_get URL'
        return
    else:
        print getWebpage(sys.argv[1])

if __name__ == '__main__':
    main()


Links Script
import sys, urllib
def print_links(page):
    print '[*] print_links()'
    links = re.findall(r'\<a.*href\=.*http\:.+', page)
    links.sort()
    print '[+]', str(len(links)), 'HyperLinks Found:'

    for link in links:
        print link
    
def main():
    sys.argv.append('http://www.funeralformyfat.tumblr.com')
    if len(sys.argv) != 2:
        print '[-] Usage: webpage_links URL'
        return
        page = webpage_get.getWebpage(sys.argv[1])
        print_links(page)

        
if __name__ == '__main__':
    main()



Replies To: Downloading Links from a Webpage with Python

#2 Papillon
Posted 23 November 2013 - 05:55 AM

It seems you simply forgot an else statement.
    if len(sys.argv) != 2:
        print '[-] Usage: webpage_links URL'
        return
        page = webpage_get.getWebpage(sys.argv[1])
        print_links(page)

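As it is now, the function returns on invalid input and otherwise does nothing at all, because the last two lines sit after the return inside the if block and are never reached.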

#3 andrewsw
Posted 23 November 2013 - 06:32 AM

You could use an else clause, or indent it correctly:

    if len(sys.argv) != 2:
        print '[-] Usage: webpage_links URL'
        return
    page = webpage_get.getWebpage(sys.argv[1])
    print_links(page)
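
The difference between the two indentations can be checked with a small sketch. The function names broken_main and fixed_main are made up for illustration, and they return strings instead of downloading anything, so this only demonstrates the control flow, not the original scripts. print is written as a function so the sketch runs on Python 2.7 and Python 3 alike.

```python
from __future__ import print_function

def broken_main(argv):
    if len(argv) != 2:
        return 'usage'
        # Still inside the if block and after return: never executed.
        return 'links for ' + argv[1]
    # With exactly two arguments the if is skipped and nothing is left to run,
    # so the function falls off the end and returns None.

def fixed_main(argv):
    if len(argv) != 2:
        return 'usage'
    # Dedented: runs whenever the usage check passes.
    return 'links for ' + argv[1]

args = ['webpage_links', 'http://example.com']
print(broken_main(args))
print(fixed_main(args))
```

With the broken indentation the call produces nothing useful even when the argument count is right, which matches the "no links" symptom in the first post.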


#4 MillyH
Posted 23 November 2013 - 07:04 AM

Hi, thanks. I'd figured out my issue, but now when I run my code I get the error NameError: global name 'webpage_get' is not defined. I think this is because my second script isn't picking up my first one, but I'm unsure how to resolve this. Can you help?

#5 andrewsw
Posted 23 November 2013 - 07:13 AM

Currently there is no connection between these two scripts - neither imports the other - which is why you get the error.

Do you need to have these as two separate scripts? If not, just merge them. Otherwise, one will need to import the other.

#6 MillyH
Posted 23 November 2013 - 12:04 PM

I'd prefer to keep them as two scripts just so I don't get overwhelmed with code. It's the importing of scripts I'm not sure of. How would I import one into another?

#7 andrewsw
Posted 23 November 2013 - 12:11 PM

In the same way that you can import sys:

import yourfile

Only one of the main() functions will run, though, because only the file you run directly has __name__ set to '__main__'. So I would still modify the files: effectively, one should be treated as a library (a module) and the other as the main application.
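
As a rough, self-contained illustration of the library/application split, the sketch below writes a stand-in webpage_get.py to a temporary folder and imports it. The stand-in body of getWebpage and the example.com URL are assumptions made so that nothing is downloaded; in the real setup the file would simply sit next to the links script and a plain import webpage_get would do.

```python
from __future__ import print_function
import importlib
import os
import sys
import tempfile

# Stand-in for the first script. The file name (webpage_get.py) is what
# "import webpage_get" looks for, so the two names must match.
module_source = '''
def getWebpage(url):
    # Stand-in body: the real script would download the page here.
    return '<html>page for ' + url + '</html>'

def main():
    print('main() of webpage_get ran')

if __name__ == '__main__':
    main()
'''

folder = tempfile.mkdtemp()
with open(os.path.join(folder, 'webpage_get.py'), 'w') as handle:
    handle.write(module_source)

# Python only finds modules in folders listed on sys.path.
sys.path.insert(0, folder)
webpage_get = importlib.import_module('webpage_get')

# Inside the imported file __name__ is 'webpage_get', not '__main__',
# so its main() did not run during the import.
print(webpage_get.__name__)
page = webpage_get.getWebpage('http://example.com')
print(page)
```

If Python reports that the module does not exist, the usual causes are a file name that differs from the name in the import statement, or a file that lives in a folder not listed in sys.path.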

#8 MillyH
Posted 23 November 2013 - 12:25 PM

I thought that, and I've been doing that, but it tells me it doesn't exist. Thanks for answering anyway.


#9 MillyH
Posted 23 November 2013 - 12:39 PM

Hi, thanks for all your help. I managed to get my code to run without any errors, but it's not returning any links no matter the URL. Can you see why?

import sys, urllib, re
import getWebpage
def print_links(page):
    print '[*] print_links()'
    links = re.findall(r'\<a.*href\=.*http\:.+', page)
    links.sort()
    print '[+]', str(len(links)), 'HyperLinks Found:'

    for link in links:
        print link
    
def main():
    sys.argv.append('http://http://www.bbc.co.uk/')
    if len(sys.argv) != 2:
        print '[-] Usage: webpage_links URL'
        return
    else:
        page = webpage_get.getWebpage(sys.argv[1])
        print_links(page)



#10 andrewsw
Posted 23 November 2013 - 12:47 PM

Do you receive any errors? In particular:

webpage_get.getWebpage(sys.argv[1])

What is webpage_get? The file you are importing is named getWebpage (supposedly).

Anyway, assuming the code runs, print out len(sys.argv) and perhaps loop through and print these arguments. If the length is not 2 then this is the first issue to correct - otherwise your code that prints the links will never be called.




What does your code print out currently anyway?
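
A minimal way to do the check suggested above is sketched here. The helper name describe_args is made up for illustration; it just formats the argument count and each argument so they can be printed before the usage check runs.

```python
from __future__ import print_function
import sys

def describe_args(argv):
    # One line for the count, then one line per argument.
    lines = ['len(argv) = %d' % len(argv)]
    for index, value in enumerate(argv):
        lines.append('argv[%d] = %r' % (index, value))
    return lines

# Show what this process actually received on the command line.
for line in describe_args(sys.argv):
    print(line)
```

If the count printed is not 2, the usage branch is taken and the code that prints the links is never called.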


#11 MillyH
Posted 23 November 2013 - 01:09 PM

My first script downloads a webpage, and the second is meant to take any/all links contained within that webpage and show them to me, or at least that's what I'm attempting to do.


#12 MillyH
Posted 23 November 2013 - 01:16 PM

The length is equal to 1, so that's where the problem is coming from.

#13 MillyH
Posted 23 November 2013 - 02:16 PM

Hi, thanks. I've fixed it and it returns links but can I ask one more question?

This is my regular expression currently:

links = re.findall(r'\<a.*href\=.*http\:', page)


I've tried many ways, but the links keep coming back as one long continuous run of text. Do you know what regular expression would put each link on its own line so they're easy to read? I've tried a lot of different expressions so far but haven't been able to get it to list them.

#14 Ryano121
Posted 23 November 2013 - 02:22 PM

One list on a separate line?

#15 MillyH
Posted 23 November 2013 - 02:31 PM

I mean each link on its own line. Right now they come out as one continuous line of links and headers, so I can't read them. I want to format the output so that every link sits on a separate line and is easy to read.
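
One likely cause of the single long line is that '.*' is greedy: when several anchors share one line of HTML, a single match stretches from the first '<a' to the last 'http:' on that line. The sketch below contrasts that with a narrower pattern that stays inside one tag and captures only the URL. The sample HTML is made up, and the pattern assumes double-quoted, absolute href values; for real pages an HTML parser is generally more robust than a regular expression.

```python
from __future__ import print_function
import re

# Made-up sample: three anchors on a single line, as minified HTML often is.
page = ('<p><a href="http://example.com/b">B</a> text '
        '<a class="x" href="http://example.com/a">A</a> '
        '<a href="/relative">C</a></p>')

# Pattern from the thread: the greedy '.*' swallows everything between the
# first '<a' and the last 'http:' on the line, so two links become one match.
greedy = re.findall(r'\<a.*href\=.*http\:', page)

# Narrower pattern: [^>]* cannot leave the current tag, and the group
# captures only the URL between the quotes.
links = re.findall(r'<a\b[^>]*\bhref="(https?://[^"]+)"', page)
links.sort()

print('[+]', len(links), 'HyperLinks Found:')
for link in links:
    print(link)
```

Because each entry in links is now just a URL, the existing for loop prints one link per line.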