Web Crawler

6 Replies - 2227 Views - Last Post: 05 August 2009 - 06:00 AM

#1 crohole

Posted 24 July 2009 - 03:25 AM

I have code that crawls all links on all pages, but I don't know how to make it crawl only internal links and skip external ones.

Please help me.

Replies To: Web Crawler

#2 RudiVisser

Posted 24 July 2009 - 03:37 AM

If it crawls links, then it won't distinguish between them.

Please post code and errors and also what you've tried.

#3 Wimpy

Posted 25 July 2009 - 06:04 AM

There should be at least two ways of checking this.

1. Check whether the URL is relative or belongs to the site by parsing the string in the "href" attribute.
2. Add markup to the HTML so that the links themselves tell you whether they're internal or external, perhaps like this:
<a href="someurl" rel="external">Some Url</a>
<a href="otherurl" rel="internal">Other Url</a>

I use this markup in my own HTML. It lets me open external links in a new window with JavaScript instead of the "target" attribute, which the XHTML Strict DTD does not allow.
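
For the first option, a minimal sketch in PHP could look like the following. The function name is_internal_link and the host example.com are made up for illustration. The sketch assumes a link counts as internal when it has no host part (a relative URL) or when its host equals the site's host, ignoring case and a leading "www.". Subdomains such as blog.example.com are treated as external here, which may or may not be what you want.

```php
<?php
// Hypothetical helper (not from the thread): returns true when $href is a
// relative URL or an absolute URL on $siteHost.
function is_internal_link($href, $siteHost)
{
    $href = trim($href);

    // Skip empty links, in-page anchors and schemes a crawler cannot fetch.
    if ($href === '' || $href[0] === '#'
        || preg_match('/^(mailto|javascript|tel):/i', $href)) {
        return false;
    }

    // Protocol-relative URLs ("//host/path") get a scheme so they parse consistently.
    if (strpos($href, '//') === 0) {
        $href = 'http:' . $href;
    }

    $host = parse_url($href, PHP_URL_HOST);
    if ($host === null || $host === false) {
        // No host component: treated as a relative URL on the current site.
        return true;
    }

    // Host names are case-insensitive; a leading "www." is ignored on both sides.
    $linkHost = preg_replace('/^www\./i', '', strtolower($host));
    $ownHost  = preg_replace('/^www\./i', '', strtolower($siteHost));

    return $linkHost === $ownHost;
}
```

The site host itself can be taken from the start URL with parse_url($startUrl, PHP_URL_HOST).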

This post has been edited by Wimpy: 25 July 2009 - 06:05 AM


#4 crohole

Posted 26 July 2009 - 12:21 AM

OK, this is the code. It still crawls all links on all pages, and the same link ends up stored more than once.
crawler.php
<?php
if (isset($_GET['url']))
{
	include("db.php");
	mysql_connect($host,$username,$password)
		or die("Could not connect to MySQL server");
	mysql_select_db($database) or die(mysql_error()."Could not select database");

	// Fetch the start page and split its HTML at every href attribute
	$file=file_get_contents($_GET['url']);
	$links=preg_split('/(href\=\'|href\=\"|href\=)/is',$file);

	// Store the start URL as already crawled (stage 1)
	mysql_query("INSERT INTO `indextemp` SET `url`='".mysql_real_escape_string($_GET['url'])."', `stage`='1'");

	// $links[0] is the HTML before the first href, so start at 1
	$id=1;
	while (isset($links[$id]))
	{
		// Cut the fragment at the first quote, '>' or space so only the URL remains
		$links[$id]=preg_replace("/([^\'])\'(.*)/is",'$1',$links[$id]);
		$links[$id]=preg_replace("/([^\"])\"(.*)/is",'$1',$links[$id]);
		$links[$id]=preg_replace("/([^\>])\>(.*)/is",'$1',$links[$id]);
		$links[$id]=preg_replace("/([^ ])\ (.*)/is",'$1',$links[$id]);
		$links[$id]=preg_replace("/([^\'])\'(.*)/i",'$1',$links[$id]);
		$links[$id]=preg_replace("/([^\"])\"(.*)/i",'$1',$links[$id]);
		$links[$id]=preg_replace("/([^\>])\>(.*)/i",'$1',$links[$id]);
		$links[$id]=preg_replace("/([^ ])\ (.*)/i",'$1',$links[$id]);

		// Queue the URL (stage 0) unless it is already in the table
		$ifexists=mysql_query("SELECT * FROM `indextemp` WHERE `url`='".mysql_real_escape_string($links[$id])."'");
		if (mysql_num_rows($ifexists)==0 && strlen($links[$id])>16)
		{
			mysql_query("INSERT INTO `indextemp` SET `url`='".mysql_real_escape_string($links[$id])."', `stage`='0'");
			echo $links[$id]."<br>";
		}
		$id+=1;
	}

	unset ($links);

	// Keep crawling until no uncrawled (stage 0) URLs remain
	$continue=1;
	while ($continue==1)
	{
		$sqllinksa=mysql_query("SELECT * FROM `indextemp` WHERE `stage`='0'");
		while ($sqllinks=mysql_fetch_array($sqllinksa))
		{
			$file=file_get_contents($sqllinks['url']);
			$links=preg_split('/(href\=\'|href\=\"|href\=)/is',$file);

			// Mark this URL as crawled
			mysql_query("UPDATE `indextemp` SET `stage`='1' WHERE `url`='".mysql_real_escape_string($sqllinks['url'])."'");

			$id=1;
			while (isset($links[$id]))
			{
				$links[$id]=preg_replace("/([^\'])\'(.*)/is",'$1',$links[$id]);
				$links[$id]=preg_replace("/([^\"])\"(.*)/is",'$1',$links[$id]);
				$links[$id]=preg_replace("/([^\>])\>(.*)/is",'$1',$links[$id]);
				$links[$id]=preg_replace("/([^ ])\ (.*)/is",'$1',$links[$id]);
				$links[$id]=preg_replace("/([^\'])\'(.*)/i",'$1',$links[$id]);
				$links[$id]=preg_replace("/([^\"])\"(.*)/i",'$1',$links[$id]);
				$links[$id]=preg_replace("/([^\>])\>(.*)/i",'$1',$links[$id]);
				$links[$id]=preg_replace("/([^ ])\ (.*)/i",'$1',$links[$id]);

				$ifexist=mysql_query("SELECT * FROM `indextemp` WHERE `url`='".mysql_real_escape_string($links[$id])."'");
				if (strlen($links[$id])>5 && mysql_num_rows($ifexist)==0)
				{
					mysql_query("INSERT INTO `indextemp` SET `url`='".mysql_real_escape_string($links[$id])."', `stage`='0'");
					echo $links[$id]."<br>";
				}
				$id+=1;
			}
		}

		// Stop once every stored URL has been crawled
		$checkcontinue=mysql_query("SELECT * FROM `indextemp` WHERE `stage`='0'");
		if (mysql_num_rows($checkcontinue)==0)
		{
			$continue=0;
			break;
		}
	}
}
echo "<form><input type='text' name='url' size=50><input type='submit' value='index'></form>";
?>

db.php
<?php
$username='user';
$password='';
$host='localhost';
$database='craw1';
?>



The database name is craw1. The table (indextemp in the code) has two fields: url and stage.


Please tell me how to make it crawl only internal links and save each link to the database only once.
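
On the "each link only once" part, a rough sketch of one possible approach follows. It is not taken from the thread: the names normalize_url and Frontier are invented for illustration, and the sketch keeps the list of seen URLs in memory rather than in the indextemp table. The idea is to reduce every URL to one canonical spelling before comparing, so that "HTTP://Example.com/" and "http://example.com/#top" count as the same page.

```php
<?php
// Hypothetical helper: reduce an absolute URL to a canonical form so that
// trivially different spellings of the same address compare equal.
function normalize_url($url)
{
    $p = parse_url(trim($url));
    if ($p === false || !isset($p['scheme'], $p['host'])) {
        return null; // not an absolute URL; it must be resolved against its page first
    }
    $port  = isset($p['port']) ? ':' . $p['port'] : '';
    $path  = isset($p['path']) ? $p['path'] : '/';
    $query = isset($p['query']) ? '?' . $p['query'] : '';

    // Scheme and host are case-insensitive; the fragment (#...) is dropped
    // because it is never sent to the server.
    return strtolower($p['scheme']) . '://' . strtolower($p['host']) . $port . $path . $query;
}

// Hypothetical in-memory queue that accepts each normalized URL only once.
class Frontier
{
    private $seen = array();   // normalized URL => true
    private $queue = array();  // URLs still waiting to be crawled

    // Returns true when the URL was new and has been queued.
    public function add($url)
    {
        $key = normalize_url($url);
        if ($key === null || isset($this->seen[$key])) {
            return false;
        }
        $this->seen[$key] = true;
        $this->queue[] = $key;
        return true;
    }

    // Returns the next URL to crawl, or null when the queue is empty.
    public function next()
    {
        return array_shift($this->queue);
    }
}
```

If the list has to stay in MySQL, a similar effect can be had by storing the normalized URL and letting the database reject repeats, for example with a UNIQUE index on the url column combined with INSERT IGNORE.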

#5 Wimpy

Posted 26 July 2009 - 05:08 AM

You would have to do something like this:
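<?php
// $file is assumed to hold the HTML of the page being crawled
$internal_links = array();
$links=preg_split('/(href\=\'|href\=\"|href\=)/is',$file);
foreach($links as $l)
{
	// parse the link here and set $is_internal to true when it
	// is relative or points to the same site
	if($is_internal)
	{
		$internal_links[] = $l;
	}
}
?>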
I still don't see why I should help you, though, since you haven't shown any effort of your own.
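
One way the "parse the link" step in the skeleton above could be filled in is sketched below. This is not necessarily what the poster had in mind: it swaps preg_split for PHP's DOMDocument class (which needs the DOM extension), and the function name extract_internal_links and its $pageUrl parameter are invented for the example. A link is kept when it has no host of its own or when its host matches the host of the page it was found on.

```php
<?php
// Hypothetical helper: returns the internal links found in $html, where
// $pageUrl is the address the HTML was fetched from.
function extract_internal_links($html, $pageUrl)
{
    $siteHost = strtolower((string) parse_url($pageUrl, PHP_URL_HOST));
    $internal_links = array();

    // Real-world HTML is rarely valid, so collect parser errors quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $a) {
        $l = trim($a->getAttribute('href'));
        if ($l === '' || $l[0] === '#') {
            continue; // no target, or an anchor within the same page
        }

        // parse the link
        $parts = parse_url($l);
        if ($parts === false) {
            continue; // too malformed to classify
        }
        if (isset($parts['scheme'])
            && !in_array(strtolower($parts['scheme']), array('http', 'https'))) {
            continue; // mailto:, javascript: and similar
        }

        // Relative URLs have no host; absolute ones must match the page's host.
        $is_internal = !isset($parts['host'])
            || strtolower($parts['host']) === $siteHost;

        if ($is_internal) {
            $internal_links[] = $l;
        }
    }

    return $internal_links;
}
```

In crawler.php, a function like this would take the place of the preg_split line and the preg_replace calls that follow it, so that only the returned links are inserted into indextemp. Relative links would still need to be turned into absolute URLs before being passed to file_get_contents.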

#6 crohole

Posted 29 July 2009 - 12:18 AM

But where must I put that code?

Please point me in the right direction.

#7 New_User

Posted 05 August 2009 - 06:00 AM

Hi all! I have written a web crawler, and the website it has to work on uses JavaScript links that call a function which looks like "__post(arg1,arg2)".
When I click them, a new page opens, but the URL does not change, and I don't know how to access those pages from my code. Can anybody help me?
Thanks!
P.S. Sorry for my English... :)
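
A possible first step for links like these is to pull the arguments of each __post(...) call out of the page source, as in the sketch below. The post does not say what __post does with its arguments, so everything past the extraction is guesswork: if the function submits a form, the crawler would have to send the same request itself, and reading the site's JavaScript would show which URL and field names it uses. The helper name extract_post_calls is made up, and the sketch assumes simple arguments without commas or parentheses inside them.

```php
<?php
// Hypothetical helper: collects the argument list of every __post(...) call
// that appears in the HTML, for example in href or onclick attributes.
function extract_post_calls($html)
{
    $calls = array();

    if (preg_match_all('/__post\s*\(([^)]*)\)/', $html, $m)) {
        foreach ($m[1] as $argList) {
            $args = array();
            foreach (explode(',', $argList) as $arg) {
                // Strip whitespace and surrounding quotes from each argument.
                $args[] = trim($arg, " \t\n\r'\"");
            }
            $calls[] = $args;
        }
    }

    return $calls;
}
```

Each entry in the result is one call's arguments in order, which can then be mapped to whatever request the site's own __post function sends.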

This post has been edited by New_User: 05 August 2009 - 06:02 AM
