Search for URLS in a webpage with RE

 

obNiko
2 May, 2007 - 01:45 AM
Post #1

Hi all,
I'm new to the Python language.

First, I want to mention that when I tried to search for my problem, an error occurred.
QUOTE

An error occurred!
Error: HTTP Error: Unsupported HTTP response status 502 Bad Gateway (soapclient->response has contents of the response)


My problem is:

I'm working on an app that takes the URL of a site, searches that site
for all kinds of links (http://, news://, ftp://, www.), and prints them.

I tried these functions:
CODE

def getSource(Host,Path):
    file = urllib.urlopen("http://" + Host + Path);
    text = file.read();
    return text;

def seekLinks(source):
    ex = "[http://|www.|ftp://|news://].[\.htm|\.com]";
    r = re.compile(ex,re.DOTALL |  re.IGNORECASE);
    for item in re.findall(r, source):
        print item;


I succeeded in getting the source code;
however, when I ran the "seekLinks" function I got
a lot of results that contained only 3 characters.

For example:
CODE

e-M
nSt
nam
nam
wmo
ent
nam
src
htt
na.
s/C
eat
s/O
tec
e-M
nSt
wmo
ent
nam
wSc
tAc
sam
eDo
tio
sho
htt
ww.
.co
/go
eCo
nec
/em
ect
/bo
/ht


Is my regex wrong?
Waiting for help. Thanks to whoever helps.


William_Wilson
RE: Search For URLS In A Webpage With RE
5 May, 2007 - 06:56 PM
Post #2

It seems about right, except you don't escape the . in www.
I don't think that will solve the problem, but it's a start.

Could you post some of the results you were expecting? Then maybe we can work out what the code is grabbing from the URLs.
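
To illustrate the escaping point, here is a minimal sketch. The sample text is made up for the example. Note that this applies to a dot outside square brackets; inside a character class the dot is already literal.

```python
import re

# An unescaped dot matches any character (except a newline by default).
unescaped = re.compile(r"www.")
# An escaped dot matches only a literal ".".
escaped = re.compile(r"www\.")

# Hypothetical sample text, not taken from the real page.
text = "wwwX www.example.com"

print(unescaped.findall(text))  # also picks up "wwwX"
print(escaped.findall(text))    # only the real "www." prefix
```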

JellyBean
RE: Search For URLS In A Webpage With RE
8 May, 2007 - 12:39 AM
Post #3

QUOTE(obNiko @ 2 May, 2007 - 02:45 AM) *

ex = "[http://|www.|ftp://|news://].[\.htm|\.com]";

When you put a word in square brackets, the regex parser treats it as a character class and matches any single one of its characters, not the whole word. I think a better solution is to use standard brackets (parentheses), as follows:
(http://|www\.|ftp://|news://).+\.(com|htm?)

It worked for me and can easily be modified to include other TLDs and service types.

Hope this helps!
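
A small sketch of why the original pattern returns 3-character results: each [...] matches exactly one character, so the pattern as a whole is "one character from the first set, any one character, one character from the second set". The sample string is an assumption for illustration, and `(?:...)` is a non-capturing group used here so that findall returns the matched text.

```python
import re

# Original pattern: two character classes with a "." between them.
class_pattern = re.compile(
    r"[http://|www.|ftp://|news://].[\.htm|\.com]",
    re.DOTALL | re.IGNORECASE,
)

# Grouped alternation: matches one of the listed prefixes as a whole word.
group_pattern = re.compile(r"(?:http://|www\.|ftp://|news://)")

# Hypothetical sample text, not taken from the real page.
sample = 'name="src" http://www.example.com/'

class_matches = class_pattern.findall(sample)
group_matches = group_pattern.findall(sample)

print(class_matches)  # every item is exactly 3 characters long
print(group_matches)  # whole prefixes such as "http://"
```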

obNiko
RE: Search For URLS In A Webpage With RE
18 May, 2007 - 08:22 AM
Post #4

QUOTE(JellyBean @ 8 May, 2007 - 01:39 AM) *

QUOTE(obNiko @ 2 May, 2007 - 02:45 AM) *

ex = "[http://|www.|ftp://|news://].[\.htm|\.com]";

When you put a word in square brackets, the regex parser treats it as a character class and matches any single one of its characters, not the whole word. I think a better solution is to use standard brackets (parentheses), as follows:
(http://|www\.|ftp://|news://).+\.(com|htm?)

It worked for me and can easily be modified to include other TLDs and service types.

Hope this helps!


Here is some code from the Python Shell:
CODE

>>> import re, urllib;
>>> re.purge();
>>> HOST = raw_input("The HOST:");
The HOST:www.website.com
>>> PATH = raw_input("The path:");
The path:/
>>> def getSource(Host,Path):
    file = urllib.urlopen("http://" + Host + Path);
    text = file.read();
    return text;

>>> def seekLinks(source):
       for item in re.findall("(http://|www\.|ftp://|news://).+\.(com|htm?)", source):
                 print item;

                 
>>> source = getSource(HOST,PATH);
>>> links = seekLinks(source);
>>>


As you can see, there is no output.

I expect results like this:
CODE

http://website.com/newFile.html
http://website.com/moreblabla.html
http://otherwebsite.com/

And so on.

Thanks for the help.
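
Two things may be worth checking here; this is a sketch under assumptions, not a diagnosis of the session above. First, when a pattern contains capturing groups, re.findall returns the groups (as tuples) and not the full matched text. Second, `.+` is greedy and runs to the last possible `.com` or `.htm` on a line. The `source` string, the `seek_links` name, and the `URL_PATTERN` below are hypothetical stand-ins; the character set that ends a URL is a simplifying assumption.

```python
import re

# Hypothetical page source standing in for the downloaded HTML.
source = (
    '<a href="http://website.com/newFile.html">one</a>\n'
    '<a href="http://website.com/moreblabla.html">two</a>\n'
    '<a href="http://otherwebsite.com/">three</a>\n'
)

# With capturing groups, findall yields one tuple of groups per match.
grouped = re.findall(r"(http://|www\.|ftp://|news://).+\.(com|htm?)", source)
print(grouped)

# Non-capturing groups (?:...) make findall return the whole match.
# [^\s"'<>]+ stops at whitespace, quotes, or angle brackets instead of
# running along the line the way .+ does.
URL_PATTERN = re.compile(
    r"(?:http://|ftp://|news://|www\.)[^\s\"'<>]+", re.IGNORECASE
)

def seek_links(text):
    """Return every URL-like string found in text."""
    return URL_PATTERN.findall(text)

links = seek_links(source)
for link in links:
    print(link)
```

Returning the list from the function, and printing it in the caller, also means `links = seek_links(source)` holds the results and not None.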