Perl Web Crawler

Cbeppe

30 Oct, 2009 - 03:08 AM
Post #1

Hey,

I'm quite new to programming, and to OO programming especially. Nonetheless, I'm trying to write a very simple spider for web crawling. Here's the code:

CODE

#!C:\Perl\bin\perl

use warnings;

BEGIN {
    open my $file1,"+>>", ("links.txt");
    select($file1);  
}
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;

#The Url I want it to start at;
$URL = "http://www.computersecrets.eu.pn/";

#Request and receive contents of a web page;
for ($URL) {
$contents = get ($URL);
$browser = LWP::UserAgent->new('IE 6');
$browser->timeout(10);    
my $request = HTTP::Request->new(GET => $URL);
my $response = $browser->request($request);

#Tell me if there is an error;
if ($response->is_error()) {printf "%s\n", $response->status_line;}
$contents = $response->content();

#Extract the links from the HTML;
my ($page_parser) = HTML::LinkExtor->new(undef, $URL);
    $page_parser->parse($contents)->eof;
    @links = $page_parser->links;

#Print the link to a links.txt file;
foreach $link (@links) {print "$$link[2]\n";}
}


The problem is that I can't seem to figure out how to take the @links array and put it back into $URL. As it is now, it will fetch the links and print them, but it won't follow them. Any ideas as to how I can get it to follow the links it finds?

Thanks for your help, guys ;)



dsherohman

RE: Perl Web Crawler

30 Oct, 2009 - 05:23 AM
Post #2

$URL is your problem... It's a scalar value (indicated by the leading "$") and scalars can only hold one thing at a time. You can't put @links into $URL because @links is a list of values, not a single value.

Try this instead:
CODE

#!C:\Perl\bin\perl

use strict; # You always want to include both strict and warnings
use warnings;

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;

# There was no reason for this to be in a BEGIN block (and there
# are a few good reasons for it not to be)
open my $file1, "+>>", "links.txt" or die "Cannot open links.txt: $!";
select($file1);

#The Url I want it to start at;
# Note that I've made this an array, @urls, rather than a scalar, $URL
my @urls = ('http://www.computersecrets.eu.pn/');

# I'm not positive, but this should only need to be set up once, not
# on every pass through the loop
my $browser = LWP::UserAgent->new(agent => 'IE 6');  # options are key => value pairs
$browser->timeout(10);

#Request and receive contents of a web page;
# Need to use a while loop instead of a for loop because @urls will
# be changing as we go
while (@urls) {
  my $url = shift @urls;
  my $request = HTTP::Request->new(GET => $URL);
  my $response = $browser->request($request);

  #Tell me if there is an error;
  if ($response->is_error()) {printf "%s\n", $response->status_line;}
  my $contents = $response->content();

  #Extract the links from the HTML;
  my ($page_parser) = HTML::LinkExtor->new(undef, $url);
  $page_parser->parse($contents)->eof;
  @links = $page_parser->links;

  #Print the link to a links.txt file;
  foreach $link (@links) {
    push @urls, $$link[2];  # Add link to list of urls before printing it
    print "$$link[2]\n";
  }

  # This next line is MANDATORY - spidering a site as fast as you can
  # will probably bog it down, may crash it, may get your IP address
  # blacklisted (I've written monitors in the past which do just that),
  # and is absolutely certain to piss the admins off.
  sleep 60;
}

(Untested code, but the changes are simple enough that it should work at least as well as the code you posted.)

Also, you might want to take a look at WWW::Mechanize, which has a lot of this kind of functionality built into it already, as well as some other things you forgot and I didn't bother to add, such as checking robots.txt. CPAN searches for things like "spider", "crawler", or "robot" may also turn up modules which will help to minimize the amount of new code you need to write.
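
To see how the while/shift/push loop above behaves without touching the network, here is a minimal sketch. The %fake_site hash and its page names are invented for illustration and stand in for the real HTTP request and link extraction; the only point is how shift and push turn @urls into a work queue.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a web site: each "page" maps to the links on it.
# There are no cycles here, because this sketch does no duplicate tracking.
my %fake_site = (
    '/'      => [ '/about', '/news' ],
    '/about' => [ '/team' ],
    '/news'  => [],
    '/team'  => [],
);

my @urls = ('/');    # the queue starts with a single page
my @crawl_order;     # records the order in which pages were processed

while (@urls) {                           # true while the queue is non-empty
    my $url = shift @urls;                # take the next page off the front
    push @crawl_order, $url;
    push @urls, @{ $fake_site{$url} };    # append its links to the back
}

print "$_\n" for @crawl_order;
```

Because new links go on the back and work comes off the front, pages are handled in the order they were discovered (breadth-first).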

Cbeppe

RE: Perl Web Crawler

30 Oct, 2009 - 05:52 AM
Post #3

Thanks for that. I fixed some declarations that were needed because of strict. It crawls for the given 60 seconds, but it doesn't seem to get any links. The file that it prints to tells me "400 URL missing". I tried changing the initial site, but that wasn't the problem. Can you see any place where this problem might be caused?
CODE

#!C:\Perl\bin\perl

use strict;
use warnings;

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;

open my $file1,">>", ("links.txt");
select($file1);  

my @urls = ('http://www.youtube.com/');

my $browser = LWP::UserAgent->new('IE 6');
$browser->timeout(10);

while (@urls) {
  my $url = shift @urls;
  my $request = HTTP::Request->new(GET => my $URL);
  my $response = $browser->request($request);

  if ($response->is_error()) {printf "%s\n", $response->status_line;}
  my $contents = $response->content();

  my ($page_parser) = HTML::LinkExtor->new(undef, $url);
  $page_parser->parse($contents)->eof;
  my @links = $page_parser->links;


  foreach my $link (my @links) {
    push @urls, $$link[2];
    print "my $$link[2]\n";
  }
  sleep 60;
}


dsherohman

RE: Perl Web Crawler

31 Oct, 2009 - 06:08 AM
Post #4

QUOTE(Cbeppe @ 30 Oct, 2009 - 01:52 PM)
Thanks for that. I fixed some declarations that were needed because of strict. It crawls for the given 60 seconds, but it doesn't seem to get any links. The file that it prints to tells me "400 URL missing". I tried changing the initial site, but that wasn't the problem. Can you see any place where this problem might be caused?

CODE
  my $url = shift @urls;
  my $request = HTTP::Request->new(GET => my $URL);


Perl variable names are case-sensitive. The first of these two lines grabs the first item from @urls and puts it into the variable $url (lower-case). The second then creates a brand-new variable named $URL (upper-case) and attempts to retrieve it; since this variable hasn't been initialized, it has no value, which is why you get the "400 URL missing" error.

Once I changed the second line to
CODE
my $request = HTTP::Request->new(GET => $url);
I was able to successfully retrieve http://www.youtube.com/ and 232 links were found.

The output isn't working in your code because
QUOTE
CODE
  foreach my $link (my @links) {
creates a new (and empty) @links array rather than using the existing one, so there's nothing for the loop to print. Once I removed that "my" the 232 links were printed.

Always remember that whenever you use the "my" keyword, you are creating a new variable. If you're trying to re-use an existing variable and strict complains about it, then you've probably misspelled the variable's name (either that or it's gone out of scope), which is exactly the sort of thing that strict is intended to help guard against.
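
A small self-contained sketch of both pitfalls described above, using made-up data (the file names and the example.com URL are placeholders, not anything from the original script):

```perl
use strict;
use warnings;

my @links = ('a.html', 'b.html');

# Correct: iterates over the existing @links.
my $outer_runs = 0;
foreach my $link (@links) {
    $outer_runs++;
}

# The mistake from the thread: "my @links" inside the parentheses declares
# a brand-new, empty array, so the loop body never runs. The bare block
# keeps that new array from clashing with the one declared above.
my $inner_runs = 0;
{
    foreach my $link (my @links) {
        $inner_runs++;
    }
}

# Case sensitivity: $url and $URL are two unrelated variables.
my $url = 'http://example.com/';
my $URL;    # declared but never assigned, so it is undef

print "loop over existing array ran $outer_runs times\n";
print "loop over 'my' array ran $inner_runs times\n";
print "\$URL is ", (defined $URL ? 'defined' : 'undefined'), "\n";
```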

A couple other side notes:

- The reason I changed your original $URL variable to $url is that Perl convention is to use all-lowercase variable names. While you can use uppercase in variable names and perl won't complain, other Perl programmers are likely to look at you funny if you ever end up sharing code with them or writing Perl in a team environment.

- Looking over the list of the 232 links found on YouTube's front page, I noticed another detail you'll need to address if you want to build your own spider: duplicate tracking. Keep a list of URLs that you've already visited (I recommend using a hash for this) and skip them if you run across them again. I can add this to your code on request to show you how to do it, but I'll give you a chance to try to work it out for yourself first. :)

Cbeppe

RE: Perl Web Crawler

31 Oct, 2009 - 09:28 AM
Post #5

Thank you very much. It works perfectly. Instead of your suggestion to make sure it skips links it crawled before, I was going to limit it to a specific domain. This is because I plan to use it to check links on my site. I'm not 100% sure how to do that, but I would think I'm going to need some "if, else"s. One thing I don't have a clue about is how to insert the domain. Nonetheless, I will do some experiments and see what I can do. (But if what I said is completely wrong, do tell me :D )

For now, though, thanks so much for your help. It's much appreciated :)
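
For the domain restriction described above, one possible approach is a small filter applied before a link is queued. The sketch below is only an illustration: host_of and on_domain are hypothetical helper names, the sample links are made up, and the regex handles plain http(s) URLs only (no user:password@ part). The CPAN URI module and its host method would be a more robust way to get the host name.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the host name out of an absolute http(s) URL
# with a simple regex. Returns undef for anything else (mailto:, relative
# paths, and so on).
sub host_of {
    my ($url) = @_;
    return $url =~ m{^https?://([^/:?#]+)}i ? lc($1) : undef;
}

# Hypothetical helper: true if $url lives on exactly $domain.
sub on_domain {
    my ($url, $domain) = @_;
    my $host = host_of($url);
    return (defined($host) && $host eq lc($domain)) ? 1 : 0;
}

my $domain = 'computersecrets.eu.pn';
my @found  = (
    'http://computersecrets.eu.pn/tips.html',
    'http://www.youtube.com/',
    'mailto:someone@example.com',
    'http://ComputerSecrets.eu.pn/contact.html',
);

# Keep only the links worth queueing.
my @same_site = grep { on_domain($_, $domain) } @found;

print "$_\n" for @same_site;
```

Inside the crawler loop, the same test could guard the push, e.g. `push @urls, $$link[2] if on_domain($$link[2], $domain);`. Note that this exact-match check treats www.computersecrets.eu.pn as a different host.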

Cbeppe

RE: Perl Web Crawler

31 Oct, 2009 - 10:12 AM
Post #6

Ok, I suddenly got this beautiful idea to disregard your tip and use an array instead. Mostly because I have no clue how to do it with a hash.

Here's the main part of the code now:
CODE

my @urls = ('http://computersecrets.eu.pn/');
my @visited;
my $browser = LWP::UserAgent->new();
$browser->timeout(5);

while (@urls) {
  my $url = shift @urls;
  if ($url = @visited){
    die;
    }
  my $request = HTTP::Request->new(GET => $url);
  my $response = $browser->request($request);

  if ($response->is_error()) {printf "%s\n", $response->status_line;}
  my $contents = $response->content();

  my ($page_parser) = HTML::LinkExtor->new(undef, $url);
  $page_parser->parse($contents)->eof;
  my @links = $page_parser->links;
  
  foreach my $link (@links) {
    print "$$link[2]\n";
    push @urls, $$link[2];
    push @visited, $$link[2];
  }
  sleep 60;
}
close $file1;


As you can see, the point of it is to push the link it just visited to the new array called "@visited". Then when it comes back around, it checks to see if "$url" and "@visited" match, and if they do, it is supposed to skip to the next part.

Unfortunately, my sense of logic and Perl don't always get along, and this is the case here. The link only returns an "error 400". I'd appreciate it if someone could look it over and tell me where/why it's messing up.

Thanks a lot ;)

dsherohman

RE: Perl Web Crawler

1 Nov, 2009 - 05:25 AM
Post #7

QUOTE(Cbeppe @ 31 Oct, 2009 - 06:12 PM)
Ok, I suddenly got this beautiful idea to disregard your tip and use an array instead. Mostly because I have no clue how to do it with a hash.

The problem with using an array here (aside from your implementation of it not working :) ) is that checking an array to see whether it contains a value requires you to check the first value in the array to see if it matches, then check the second for a match, then the third, and so on. You can use the "grep" command to automate this, but, if you've got more than a handful of items in the array, it gets a bit slow. With a hash, you can determine whether a value is there or not just as quickly, no matter how many or how few values it contains.

The main advantage of arrays over hashes is that arrays maintain the order of their contents, while hashes will rearrange them into a random order. (Not really random, actually, but unpredictable enough that you may as well treat it as random for most purposes.) So use arrays when you care about the order of things and hashes when you only care whether it's there or not.
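
A quick illustration of that trade-off, with made-up paths as data: the array keeps its order (and its duplicates), while the hash answers "is it there?" directly.

```perl
use strict;
use warnings;

my @queue = ('/c', '/a', '/b', '/a');   # array: order and duplicates preserved

my %visited;                            # hash: fast membership test, no order
$visited{$_} = 1 for @queue;

# exists() answers "is it there?" with a single lookup.
my $has_a = exists $visited{'/a'} ? 1 : 0;
my $has_z = exists $visited{'/z'} ? 1 : 0;

# keys() comes back in no particular order, so sort it for stable output.
my @unique = sort keys %visited;

print "queue:  @queue\n";
print "unique: @unique\n";
print "has /a: $has_a, has /z: $has_z\n";
```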

QUOTE(Cbeppe @ 31 Oct, 2009 - 06:12 PM)
As you can see, the point of it is to push the link it just visited to the new array called "@visited". Then when it comes back around, it checks to see if "$url" and "@visited" match, and if they do, it is supposed to skip to the next part.


...except that's not what it checks.

CODE
if ($url = @visited){
has two problems:

1) "Context" is a big deal in Perl and, because $url is a scalar, that expression is evaluated in "scalar context" - basically, $url only holds a single value, so @visited gets turned into a single value so they can interact with each other. When you evaluate an array in scalar context like that, you get the number of items in the array. For example, if @visited contains ('foo', 'bar', 'baz'), then @visited will become the value 3 in scalar context.

2) "=" is the assignment operator, so it's not comparing $url and @visited at all, it's getting the number of items in @visited and assigning that value to $url. With the sample array from the last paragraph, this would be equivalent to "if ($url = 3)". The end result is that this expression will be true (causing the program to die) if there's anything in @visited and false (causing $url to be set to 0, which leads to your "error 400") if @visited is empty. You need to use "eq" to compare strings with each other: "if ($url eq $other_url)"
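
Both points can be checked in a few lines. This is just a demonstration with placeholder values (the example.com URL is not from the original script):

```perl
use strict;
use warnings;

my @visited = ('foo', 'bar', 'baz');

# An array in scalar context yields its element count.
my $count = @visited;

# "=" assigns: $url is overwritten with the count, and the condition
# is true simply because 3 is a true value.
my $url = 'http://example.com/';
my $assignment_branch_taken = 0;
if ($url = @visited) {
    $assignment_branch_taken = 1;
}

# "eq" compares two strings and leaves both untouched.
my $other = 'http://example.com/';
my $same  = ($other eq 'http://example.com/') ? 1 : 0;

print "count: $count\n";
print "url is now: $url\n";
print "assignment branch taken: $assignment_branch_taken\n";
print "eq comparison true: $same\n";
```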

So... First, here's a working way to determine whether $url is in @visited:
CODE
if (grep { $_ eq $url } @visited) {
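The grep function will alias $_ to each element of the target list (@visited) in turn and return those elements where "$_ eq $url" is true. Inside an "if", it is evaluated in scalar context and returns the number of matching elements instead, so the condition is true whenever there is at least one match. But, like I mentioned above, this will get slower and slower as @visited grows, since it needs to check each value in the array individually.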
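
Here is a short sketch of how grep behaves in each context, again with placeholder URLs:

```perl
use strict;
use warnings;

my @visited = (
    'http://example.com/',
    'http://example.com/a',
    'http://example.com/',
);
my $url = 'http://example.com/';

# List context: grep returns the matching elements themselves.
my @matches = grep { $_ eq $url } @visited;

# Scalar context (which is what "if" supplies): grep returns how many matched.
my $hits = grep { $_ eq $url } @visited;

# No match gives 0, which is false, so an "if" body would be skipped.
my $missing = grep { $_ eq 'http://example.com/nope' } @visited;

print "matching elements: ", scalar(@matches), "\n";
print "hits: $hits, missing: $missing\n";
```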

Now that you know how to do that, here's how I would do the duplicate check using a hash:
CODE
my @urls = ('http://computersecrets.eu.pn/');
my %visited;  # The % sigil indicates it's a hash
my $browser = LWP::UserAgent->new();
$browser->timeout(5);

while (@urls) {
  my $url = shift @urls;

  # Skip this URL and go on to the next one if we've
  # seen it before
  next if $visited{$url};
    
  my $request = HTTP::Request->new(GET => $url);
  my $response = $browser->request($request);

  # No real need to invoke printf if we're not doing
  # any formatting
  if ($response->is_error()) {print $response->status_line, "\n";}
  my $contents = $response->content();

  # Now that we've got the url's content, mark it as
  # visited
  $visited{$url} = 1;

  my ($page_parser) = HTML::LinkExtor->new(undef, $url);
  $page_parser->parse($contents)->eof;
  my @links = $page_parser->links;

  foreach my $link (@links) {
    print "$$link[2]\n";
    push @urls, $$link[2];
  }
  sleep 60;
}
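
One possible refinement, offered as a sketch rather than a drop-in replacement: the loop above only marks a URL once it has been fetched, so the same link can still be queued many times before its turn comes. Marking URLs when they are queued avoids that. The %fake_site hash and links_on helper below are hypothetical stand-ins for the real request and HTML::LinkExtor step, so the example runs without a network.

```perl
use strict;
use warnings;

# Hypothetical stand-in for fetching a page and extracting its links.
# Note the cycles: every page links back to '/'.
my %fake_site = (
    '/'      => [ '/about', '/news', '/' ],
    '/about' => [ '/', '/news' ],
    '/news'  => [ '/', '/about' ],
);

sub links_on {
    my ($url) = @_;
    return @{ $fake_site{$url} || [] };
}

my @urls = ('/');
my %seen = ('/' => 1);    # mark the start page as soon as it is queued
my @crawl_order;

while (@urls) {
    my $url = shift @urls;
    push @crawl_order, $url;

    foreach my $link (links_on($url)) {
        next if $seen{$link}++;    # already queued or visited: skip it
        push @urls, $link;
    }
    # A real crawler would sleep here between requests, as in the code above.
}

print "$_\n" for @crawl_order;
```

Despite the cycles, each page is processed exactly once and the queue never holds a duplicate.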


Cbeppe

RE: Perl Web Crawler

1 Nov, 2009 - 06:00 AM
Post #8

Once again I can only say thank you for your help. I'm realizing that the project might have been a little over my head, but then again, that's how you learn.

Again, Thank you so much. You really helped me a lot with this.

dsherohman

RE: Perl Web Crawler

2 Nov, 2009 - 04:58 AM
Post #9

QUOTE(Cbeppe @ 1 Nov, 2009 - 02:00 PM)
Once again I can only say thank you for your help.

No problem. Helping people with Perl problems is why I come to this site.

QUOTE(Cbeppe @ 1 Nov, 2009 - 02:00 PM)
I'm realizing that the project might have been a little over my head, but then again, that's how you learn.

As it is written in The Cult of Done Manifesto,
QUOTE
Pretending you know what you're doing is almost the same as knowing what you are doing, so just accept that you know what you're doing even if you don't and do it.


Welcome to the world of Perl! I hope you have as much fun with it as I have.