Perl Web Crawler

16 Replies - 21674 Views - Last Post: 26 January 2018 - 01:44 PM

#1 Cbeppe

  • D.I.C Head

Reputation: 31
  • Posts: 216
  • Joined: 16-September 09

Perl Web Crawler

Posted 30 October 2009 - 04:08 AM

Hey,

I'm quite new to programming, and to OO programming especially. Nonetheless, I'm trying to write a very simple spider for web crawling. Here's the code:

#!C:\Perl\bin\perl

use warnings;

BEGIN {
	open my $file1,"+>>", ("links.txt");
	select($file1);  
}
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;

#The Url I want it to start at;
$URL = "http://www.computersecrets.eu.pn/";

#Request and receive contents of a web page;
for ($URL) {
$contents = get ($URL);
$browser = LWP::UserAgent->new('IE 6');
$browser->timeout(10);	
my $request = HTTP::Request->new(GET => $URL);
my $response = $browser->request($request);

#Tell me if there is an error;
if ($response->is_error()) {printf "%s\n", $response->status_line;}
$contents = $response->content();

#Extract the links from the HTML;
my ($page_parser) = HTML::LinkExtor->new(undef, $URL);
	$page_parser->parse($contents)->eof;
	@links = $page_parser->links;

#Print the link to a links.txt file;
foreach $link (@links) {print "$$link[2]\n";}
} 


The problem is that I can't seem to figure out how to take the @links array and put it back into $URL. As it is now, it will fetch the links and print them, but it won't follow them. Any ideas as to how I can get it to follow the links it finds?

Thanks for your help guys ;)

Is This A Good Question/Topic? 0

Replies To: Perl Web Crawler

#2 dsherohman

  • Perl Parson

Reputation: 227
  • Posts: 654
  • Joined: 29-March 09

Re: Perl Web Crawler

Posted 30 October 2009 - 06:23 AM

$URL is your problem... It's a scalar value (indicated by the leading "$") and scalars can only hold one thing at a time. You can't put @links into $URL because @links is a list of values, not a single value.

Try this instead:
#!C:\Perl\bin\perl

use strict; # You always want to include both strict and warnings
use warnings;

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;

# There was no reason for this to be in a BEGIN block (and there
# are a few good reasons for it not to be)
open my $file1,"+>>", ("links.txt");
select($file1);  

#The Url I want it to start at;
# Note that I've made this an array, @urls, rather than a scalar, $URL
my @urls = ('http://www.computersecrets.eu.pn/');

# I'm not positive, but this should only need to be set up once, not
# on every pass through the loop
my $browser = LWP::UserAgent->new('IE 6');
$browser->timeout(10);

#Request and receive contents of a web page;
# Need to use a while loop instead of a for loop because @urls will
# be changing as we go
while (@urls) {
  my $url = shift @urls;
  my $request = HTTP::Request->new(GET => $URL);
  my $response = $browser->request($request);

  #Tell me if there is an error;
  if ($response->is_error()) {printf "%s\n", $response->status_line;}
  my $contents = $response->content();

  #Extract the links from the HTML;
  my ($page_parser) = HTML::LinkExtor->new(undef, $url);
  $page_parser->parse($contents)->eof;
  @links = $page_parser->links;

  #Print the link to a links.txt file;
  foreach $link (@links) {
	push @urls, $$link[2];  # Add link to list of urls before printing it
	print "$$link[2]\n";
  }

  # This next line is MANDATORY - spidering a site as fast as you can
  # will probably bog it down, may crash it, may get your IP address
  # blacklisted (I've written monitors in the past which do just that),
  # and is absolutely certain to piss the admins off.
  sleep 60;
}


(Untested code, but the changes are simple enough that it should work at least as well as the code you posted.)

Also, you might want to take a look at WWW::Mechanize, which has a lot of this kind of functionality built into it already, as well as some other things you forgot and I didn't bother to add, such as checking robots.txt. CPAN searches for things like "spider", "crawler", or "robot" may also turn up modules which will help to minimize the amount of new code you need to write.
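
For comparison, here is a minimal sketch (mine, not from the original post, and untested) of the same fetch-and-extract step done with WWW::Mechanize; the start URL is just the one used earlier in the thread:

#!/usr/bin/perl
# Rough sketch only - WWW::Mechanize handles the request, the error
# checking, and the link extraction that the hand-rolled code above
# does with LWP::UserAgent + HTML::LinkExtor.
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 0 );  # don't die on HTTP errors
$mech->get('http://www.computersecrets.eu.pn/');

if ($mech->success) {
    # links() returns WWW::Mechanize::Link objects; url_abs() gives the
    # absolute URL, already resolved against the page's base.
    print $_->url_abs, "\n" for $mech->links;
}
else {
    print $mech->status, " - could not fetch the page\n";
}
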
Was This Post Helpful? 0

#3 Cbeppe

  • D.I.C Head

Reputation: 31
  • Posts: 216
  • Joined: 16-September 09

Re: Perl Web Crawler

Posted 30 October 2009 - 06:52 AM

Thanks for that. I fixed some declarations that were needed because of strict. It crawls for the given 60 seconds, but it doesn't seem to get any links. The file that it prints to tells me "400 URL missing". I tried changing the initial site, but that wasn't the problem. Can you see where this problem might be caused?
#!C:\Perl\bin\perl

use strict; 
use warnings;

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;

open my $file1,">>", ("links.txt");
select($file1);  

my @urls = ('http://www.youtube.com/');

my $browser = LWP::UserAgent->new('IE 6');
$browser->timeout(10);

while (@urls) {
  my $url = shift @urls;
  my $request = HTTP::Request->new(GET => my $URL);
  my $response = $browser->request($request);

  if ($response->is_error()) {printf "%s\n", $response->status_line;}
  my $contents = $response->content();

  my ($page_parser) = HTML::LinkExtor->new(undef, $url);
  $page_parser->parse($contents)->eof;
  my @links = $page_parser->links;


  foreach my $link (my @links) {
	push @urls, $$link[2];
	print "my $$link[2]\n";
  }
  sleep 60;
}


Was This Post Helpful? 0

#4 dsherohman

  • Perl Parson

Reputation: 227
  • Posts: 654
  • Joined: 29-March 09

Re: Perl Web Crawler

Posted 31 October 2009 - 07:08 AM

Cbeppe, on 30 Oct, 2009 - 01:52 PM, said:

Thanks for that. I fixed some declarations that were needed because of strict. It crawls for the given 60 seconds, but it doesn't seem to get any links. The file that it prints to tells me "400 URL missing". I tried changing the initial site, but that wasn't the problem. Can you see where this problem might be caused?

  my $url = shift @urls;
  my $request = HTTP::Request->new(GET => my $URL);


Perl variable names are case-sensitive. The first of these two lines grabs the first item from @urls and puts it into the variable $url (lower-case). The second then creates a brand-new variable named $URL (upper-case) and attempts to retrieve it; since this variable hasn't been initialized, it has no value, which is why you get the "400 URL missing" error.

Once I changed the second line to
my $request = HTTP::Request->new(GET => $url);
I was able to successfully retrieve http://www.youtube.com/ and 232 links were found.

The output isn't working in your code because

  foreach my $link (my @links) {

creates a new (and empty) @links array rather than using the existing one, so there's nothing for the loop to print. Once I removed that "my" the 232 links were printed.

Always remember that whenever you use the "my" keyword, you are creating a new variable. If you're trying to re-use an existing variable and strict complains about it, then you've probably misspelled the variable's name (either that or it's gone out of scope), which is exactly the sort of thing that strict is intended to help guard against.
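
Here is a two-line illustration (mine, not from the thread) of what strict buys you in exactly this situation:

use strict;
use warnings;

my $url = 'http://example.com/';
print $ur1;   # typo - with strict this is a compile-time error:
              #   Global symbol "$ur1" requires explicit package name
              # Without strict it would silently print nothing
              # (just an "uninitialized value" warning, if warnings are on).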

A couple other side notes:

- The reason I changed your original $URL variable to $url is that Perl convention is to use all-lowercase variable names. While you can use uppercase in variable names and perl won't complain, other Perl programmers are likely to look at you funny if you ever end up sharing code with them or writing Perl in a team environment.

- Looking over the list of the 232 links found on youtube's front page, I noticed another detail you'll need to address if you want to build your own spider: Duplicate tracking. Keep a list of URLs that you've already visited (I recommend using a hash for this) and skip them if you run across them again. I can add this to your code on request to show you how to do it, but I'll give you a chance to try to work it out for yourself first. :)
Was This Post Helpful? 0

#5 Cbeppe

  • D.I.C Head

Reputation: 31
  • Posts: 216
  • Joined: 16-September 09

Re: Perl Web Crawler

Posted 31 October 2009 - 10:28 AM

Thank you very much. It works perfectly. Instead of your suggestion to make it skip links it has crawled before, I was going to limit it to a specific domain. This is because I plan to use it to check links on my site. I'm not 100% sure how to do that, but I would think I'm going to need some "if, else"s. One thing I don't have a clue about is how to insert the domain. Nonetheless, I will do some experiments and see what I can do. (But if what I said is completely wrong, do tell me :D )

For now, though, thanks so much for your help. It's very appreciated :)
Was This Post Helpful? 0

#6 Cbeppe

  • D.I.C Head

Reputation: 31
  • Posts: 216
  • Joined: 16-September 09

Re: Perl Web Crawler

Posted 31 October 2009 - 11:12 AM

Ok, I suddenly got this beautiful idea to disregard your tip and use an array instead. Mostly because I have no clue how to do it with a hash.

Here's the main part of the code now:
my @urls = ('http://computersecrets.eu.pn/');
my @visited;
my $browser = LWP::UserAgent->new();
$browser->timeout(5);

while (@urls) {
  my $url = shift @urls;
  if ($url = @visited){
	die;
	}
  my $request = HTTP::Request->new(GET => $url);
  my $response = $browser->request($request);

  if ($response->is_error()) {printf "%s\n", $response->status_line;}
  my $contents = $response->content();

  my ($page_parser) = HTML::LinkExtor->new(undef, $url);
  $page_parser->parse($contents)->eof;
  my @links = $page_parser->links;
  
  foreach my $link (@links) {
	print "$$link[2]\n";
	push @urls, $$link[2];
	push @visited, $$link[2];
  }
  sleep 60;
}
close $file1;


As you can see, the point of it is to push the link it just visited to the new array called "@visited". Then when it comes back around, it checks to see if "$url" and "@visited" match, and if they do, it is supposed to skip to the next part.

Unfortunately, my sense of logic and Perl don't always get along, and this is the case here. The link only returns an "error 400". I'd appreciate it if someone could look it over and tell me where/why it's messing up.

Thanks a lot ;)
Was This Post Helpful? 0

#7 dsherohman

  • Perl Parson

Reputation: 227
  • Posts: 654
  • Joined: 29-March 09

Re: Perl Web Crawler

Posted 01 November 2009 - 06:25 AM

Cbeppe, on 31 Oct, 2009 - 06:12 PM, said:

Ok, I suddenly got this beautiful idea to disregard your tip and use an array instead. Mostly because I have no clue how to do it with a hash.

The problem with using an array here (aside from your implementation of it not working :) ) is that checking an array to see whether it contains a value requires you to check the first value in the array to see if it matches, then check the second for a match, then the third, and so on. You can use the "grep" command to automate this, but, if you've got more than a handful of items in the array, it gets a bit slow. With a hash, you can determine whether a value is there or not just as quickly, no matter how many or how few values it contains.

The main advantage of arrays over hashes is that arrays maintain the order of their contents, while hashes will rearrange them into a random order. (Not really random, actually, but unpredictable enough that you may as well treat it as random for most purposes.) So use arrays when you care about the order of things and hashes when you only care whether it's there or not.
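
A small side-by-side (mine, not part of the original reply) of the two membership tests being described:

# Array check: grep has to walk every element - gets slower as the list grows.
my @visited_list = ('http://a.example/', 'http://b.example/');
my $url = 'http://b.example/';
print "seen (array)\n" if grep { $_ eq $url } @visited_list;

# Hash check: one keyed lookup, effectively constant time regardless of size.
my %visited = map { $_ => 1 } @visited_list;
print "seen (hash)\n" if $visited{$url};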

Cbeppe, on 31 Oct, 2009 - 06:12 PM, said:

As you can see, the point of it is to push the link it just visited to the new array called "@visited". Then when it comes back around, it checks to see if "$url" and "@visited" matches, and if they do, it is supposed to skip to the next part.


...except that's not what it checks.

if ($url = @visited){
has two problems:

1) "Context" is a big deal in Perl and, because $url is a scalar, that expression is evaluated in "scalar context" - basically, $url only holds a single value, so @visited gets turned into a single value so they can interact with each other. When you evaluate an array in scalar context like that, you get the number of items in the array. e.g., If @visited contains ('foo', 'bar', 'baz'), then @visited will become the value 3 in scalar context.

2) "=" is the assignment operator, so it's not comparing $url and @visited at all, it's getting the number of items in @visited and assigning that value to $url. With the sample array from the last paragraph, this would be equivalent to "if ($url = 3)". The end result is that this expression will be true (causing the program to die) if there's anything in @visited and false (causing $url to be set to 0, which leads to your "error 400") if @visited is empty. You need to use "eq" to compare strings with each other: "if ($url eq $other_url)"

So... First, here's a working way to determine whether $url is in @visited:
if (grep { $_ eq $url } @visited) {
The grep function will assign each element of the target list (@visited) to $_ in turn and return those elements where "$_ eq $url" is true. But, like I mentioned above, this will get slower and slower as @visited grows, since it needs to check each value in the array individually.

Now that you know how to do that, here's how I would do the duplicate check using a hash:
my @urls = ('http://computersecrets.eu.pn/');
my %visited;  # The % sigil indicates it's a hash
my $browser = LWP::UserAgent->new();
$browser->timeout(5);

while (@urls) {
  my $url = shift @urls;

  # Skip this URL and go on to the next one if we've
  # seen it before
  next if $visited{$url};
	
  my $request = HTTP::Request->new(GET => $url);
  my $response = $browser->request($request);

  # No real need to invoke printf if we're not doing
  # any formatting
  if ($response->is_error()) {print $response->status_line, "\n";}
  my $contents = $response->content();

  # Now that we've got the url's content, mark it as
  # visited
  $visited{$url} = 1;

  my ($page_parser) = HTML::LinkExtor->new(undef, $url);
  $page_parser->parse($contents)->eof;
  my @links = $page_parser->links;

  foreach my $link (@links) {
	print "$$link[2]\n";
	push @urls, $$link[2];
  }
  sleep 60;
}

Was This Post Helpful? 1

#8 Cbeppe

  • D.I.C Head

Reputation: 31
  • Posts: 216
  • Joined: 16-September 09

Re: Perl Web Crawler

Posted 01 November 2009 - 07:00 AM

Once again I can only say thank you for your help. I'm realizing that the project might have been a little over my head, but then again, that's how you learn.

Again, thank you so much. You really helped me a lot with this.
Was This Post Helpful? 0

#9 dsherohman

  • Perl Parson

Reputation: 227
  • Posts: 654
  • Joined: 29-March 09

Re: Perl Web Crawler

Posted 02 November 2009 - 05:58 AM

Cbeppe, on 1 Nov, 2009 - 02:00 PM, said:

Once again I can only say thank you for your help.

No problem. Helping people with Perl problems is why I come to this site.

Cbeppe, on 1 Nov, 2009 - 02:00 PM, said:

I'm realizing that the project might have been a little over my head, but then again, that's how you learn.

As it is written in The Cult of Done Manifesto,

Pretending you know what you're doing is almost the same as knowing what you are doing, so just accept that you know what you're doing even if you don't and do it.


Welcome to the world of Perl! I hope you have as much fun with it as I have.
Was This Post Helpful? 0

#10 learningperl27

  • New D.I.C Head

Reputation: 0
  • Posts: 3
  • Joined: 01-April 11

Re: Perl Web Crawler

Posted 02 April 2011 - 11:43 PM

Thanks for your time. It was easy to understand, but I have a couple of doubts.


1. What do these lines do?

while (@urls) {
  my $url = shift @urls;

2. Why are we marking the visited url as 1? If that is the case, then won't the key value be 1 for all the visited urls, and then how will all the urls be available in the hash table?

$visited{$url} = 1;

3. I believe link[2] stores all the links, but why is it $$? What does it mean? And why do we have [2]? What does it represent?

print "$$link[2]\n";

4. How do I store the url and the link as a hash table?

push @urls, $$link[2];



I am sorry if my questions are very silly; I am new to programming and I have no one to explain these things to me.

It would be a great help if you could help me understand better, and also point me to some links so I can understand the Perl concepts better.

Thanks a lot for this help.
Was This Post Helpful? 0

#11 dsherohman

  • Perl Parson

Reputation: 227
  • Posts: 654
  • Joined: 29-March 09

Re: Perl Web Crawler

Posted 04 April 2011 - 04:57 AM

learningperl27, on 03 April 2011 - 07:43 AM, said:

1. What do these lines do?

while (@urls) {
my $url = shift @urls;


The first line sets up a loop that will repeat until @urls is empty. Like I mentioned in an earlier post, an array in scalar context (boolean tests are done in scalar context) returns the number of elements in the array. The number 0 is considered false and all other numbers are true, so "@urls" will evaluate as true if it contains anything, false if it's empty.

The second line uses "shift" to remove the first item from @urls ($urls[0]) and put it into $url.

The reason I'm doing it this way instead of with "for my $url (@urls)" is that we'll also be adding items to the end of @urls as we go about our spidering. "for" would just grab the contents of @urls when the loop starts, so it would miss items added later, whereas "while" will always look at the current contents of the array.
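
Here is a tiny demonstration (mine, not from the original reply) of that difference in behaviour:

# Items pushed onto the queue while the loop is running are still picked up,
# which is exactly what the crawler relies on.
my @queue = ('start');
while (@queue) {
    my $item = shift @queue;
    print "processing $item\n";
    push @queue, 'discovered' if $item eq 'start';   # found more work mid-loop
}
# prints: processing start
#         processing discovered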

learningperl27, on 03 April 2011 - 07:43 AM, said:

2. Why are we marking the visited url as 1? If that is the case, then won't the key value be 1 for all the visited urls, and then how will all the urls be available in the hash table?

$visited{$url} = 1;


There's no real meaning behind the 1, "$visited{$url} = 'waffles';" would work just as well. All that matters here is that it's set to a true value (that is, any value other than undef, "", the number 0, or the string "0") - the real point of this line is to create the hash entry $visited{$url} so that we can see if it exists later. (While we could use a false value, that would require changing line 11 to "next if exists $visited{$url};", which is less English-like and reads less naturally to me.)

I don't follow what you're asking after the "if that is the case"; a $visited{$url} entry will exist and it will have the value 1 for every URL that is retrieved. Or, wait... Are you thinking that 1 is being used as the hash key, which must be unique for each entry? If so, you've got it backwards - "$visited{$url} = 1" creates a hash key whose name is the value of $url (which will be unique, because line 11 would have moved us on to the next URL if it already existed) and associates the value 1 with it (which can be duplicated without issues).
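
A minimal sketch (mine) of that key/value relationship:

my %visited;
$visited{'http://example.com/'}    = 1;   # the URL is the key, 1 is a throwaway true value
$visited{'http://example.com/faq'} = 1;   # duplicate *values* are perfectly fine

print "seen\n" if $visited{'http://example.com/'};       # key exists, value is true - prints
print "new\n"  if !$visited{'http://example.com/jobs'};  # no such key - prints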

learningperl27, on 03 April 2011 - 07:43 AM, said:

3. I believe link[2] stores all the links, but why is it $$? What does it mean? And why do we have [2]? What does it represent?

print "$$link[2]\n";


OK, this is the complicated part...

Do you know anything about references in Perl? If not, take a look at perldoc perlref and perldoc perlreftut for complete details but, in simple terms, a reference is a scalar which points to another value, usually an array or a hash:

my @arr = qw( foo bar baz ); # @arr is an array
my $ref;                     # $ref is a scalar
$ref = \@arr;                # \ gets a reference to @arr
                             # The reference is a scalar, so $ref can hold it

my @arr2 = @$ref;            # @$ref gets the array that $ref points to,
                             # then we store a copy of that array in @arr2

print $arr[1];               # "bar"
print $$ref[1];              # Also "bar" because @$ref is the same as @arr



So, with that in mind, $$link[2] gets the third item from the array referenced by $link. I know from the HTML::LinkExtor docs that the ->links method returns a list of references to arrays containing information about the links it found, so line 29 ("foreach my $link (@links) {") loops over that list.

As for why it's [2]... I don't see that stated in the docs, so I assume that I probably used Data::Dumper to examine the data referenced by $link and saw that the linked URL was the third item in the array.
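
To make the shape concrete, here is a tiny self-contained sketch (mine - the sample arrayref mimics, roughly, what links() returns for a single link):

use Data::Dumper;
# Each element of @links is an array reference of the form
#   [ tag_name, attribute_name => value, ... ]
# For example, for <a href="http://example.com/page"> it is roughly:
my $link = [ 'a', 'href', 'http://example.com/page' ];
print Dumper($link);     # $VAR1 = [ 'a', 'href', 'http://example.com/page' ];
print $$link[2], "\n";   # the URL - the same thing as $link->[2]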

learningperl27, on 03 April 2011 - 07:43 AM, said:

4. How do I store the url and the link as a hash table?

push @urls, $$link[2];


@urls is the list of links to examine, not the list of links that have been visited, so that will be empty when the program finishes, as I explained in my response to question #1.

And, as I explained in response to #2, "$visited{$url} = 1;" stores the visited URLs in the %visited hash.

But I should have shown you how to retrieve the list of URLs from the hash when it's done spidering. I guess that slipped my mind because it's already printing each URL as it's visited, so I assumed that was all the output you needed. In any case, to get a list of all the URLs after it's finished, just use "keys %visited":
my @visited_urls = keys %visited;    # put them into an array...
print join "\n", sort keys %visited; # print them one per line in alphabetical order...
report_visits(keys %visited);        # pass them to a sub...
launch_icbm_at(get_coords($_)) for keys %visited;
                                     # nuke all the servers...


Was This Post Helpful? 0

#12 binaryking

  • New D.I.C Head

Reputation: 0
  • Posts: 1
  • Joined: 17-March 12

Re: Perl Web Crawler

Posted 17 March 2012 - 08:11 AM

Hey,
I found this code snippet really awesome. But I want the spider to not add URLs from yahoo.com. How do I do that?
Was This Post Helpful? 0

#13 Cbeppe

  • D.I.C Head

Reputation: 31
  • Posts: 216
  • Joined: 16-September 09

Re: Perl Web Crawler

Posted 30 March 2012 - 08:11 AM

This topic is REALLY old, so you might have been better off starting a new thread, but anyway...

Here's how to check if your link is a yahoo domain in general:
if ($link =~ m/yahoo\.com/i) {
    # Is Yahoo.com
} else {
    # Add to file
}



You could also adapt that regex to match other yahoo domains, for example:

if ($link =~ m/yahoo\.\w{2,3}/){
    # Is yahoo.xxx
}
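
If you want something stricter than a bare regex, here is a hedged alternative sketch (mine, assuming $link holds the URL as a string, as in the snippets above) using the URI module, which LWP already pulls in, to compare the host directly:

use URI;

# Assumption: $link is the URL string being checked.
my $host = eval { URI->new($link)->host } || '';
if ($host =~ /(?:^|\.)yahoo\.com$/i) {
    # Host is yahoo.com or one of its subdomains - skip it
} else {
    # Add to file
}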



Hope this helps, although this thread should probably be locked now...
- Cbeppe

EDIT: Code actually works now ;)

This post has been edited by Cbeppe: 30 March 2012 - 08:17 AM

Was This Post Helpful? 0

#14 sayhello

  • D.I.C Regular

Reputation: 4
  • Posts: 272
  • Joined: 12-November 17

Re: Perl Web Crawler

Posted 26 January 2018 - 03:16 AM

Many thanks for this code snippet - it is a valuable tutorial for Perl newbies like me.

I have gone through all the code and I have gained a lot.

Again - many thanks.

Keep up the great work here - I guess that you have helped a lot of learners.


Regards, say
Was This Post Helpful? 0

#15 sayhello

  • D.I.C Regular

Reputation: 4
  • Posts: 272
  • Joined: 12-November 17

Re: Perl Web Crawler

Posted 26 January 2018 - 03:38 AM

By the way - what about tweaking this nice code snippet so that it gets links out of this site: http://europa.eu/you...ganisation#open


Love to hear from you.
Was This Post Helpful? 0