1 Replies - 1197 Views - Last Post: 13 March 2013 - 01:23 PM

#1 squibby

Speed up query time in scraper script (php)

Posted 13 March 2013 - 12:55 PM

I have written the following code to scrape information from Yahoo Local business listings and return it to me in a table and a CSV file. As the results are paginated on the site (it only returns 10 results per page), I have to query the site with my search terms, plugging them into the URL along with a variable that sets the start page.

This is OK for result sets of maybe 20 or 30. If I have a large result set, e.g. all schools in London, there is so much data that the script has to loop through many pages of results and it times out.

Is there a way to send out just one query and bypass all the looping?

For example, if a query returns 1960 results, my script needs to loop 196 times, requesting the data for each page in turn. This is really inefficient.
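To make the arithmetic concrete, here is a minimal sketch of how the page count and the start offsets work out. The function name `page_offsets` and the `$perPage` parameter are made up for illustration and are not part of the script below; the stepping (0, then 11, then +10 each time) simply mirrors what the script does with `$startfrom`.

```php
<?php
// Sketch only: page_offsets() and $perPage are hypothetical names for illustration.

/**
 * Return the start offsets that would be requested for a given result count.
 * Mirrors the script's stepping: 0 for the first page, then 11, 21, 31, ...
 */
function page_offsets(int $totalResults, int $perPage = 10): array
{
    // One request per page of results, rounding up for a partial last page.
    $pages = (int) ceil($totalResults / $perPage);

    $offsets = [];
    $start = 0;
    for ($i = 1; $i <= $pages; $i++) {
        $offsets[] = $start;
        // The first step is one larger than the rest, as in the script.
        $start += ($start === 0) ? $perPage + 1 : $perPage;
    }
    return $offsets;
}

$offsets = page_offsets(1960);
echo count($offsets), " requests\n";                     // 196 requests
echo implode(', ', array_slice($offsets, 0, 3)), "\n";   // 0, 11, 21
```

So the number of requests grows linearly with the result count, which is why the large searches run into the time limit.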

I would really appreciate any suggestions from any PHP gurus on here. Thanks for reading.



<!DOCTYPE HTML>
<html>
<head>

<style>

body {font-family: 'Lucida Sans Unicode', 'Lucida Grande', sans-serif;}

	
table.archive {width:980px;position:relative;border-width: 0px;border-spacing: 0px;border-style: none;border-color: gray;border-collapse: collapse;background-color: white;border-left:solid 2px #fafafa;border-right:solid 2px #fafafa;border-bottom:solid 2px #fafafa;margin-top:10px;margin-bottom:10px;margin-left:auto;margin-right:auto;}
table.archive th {border-width: 1px;padding: 0px;border-bottom: solid 1px #fafafa;border-color:  #fafafa;background-color: #fafafa;text-align:left;padding:10px;}

table.archive tr:hover td {background-color: yellow; color: #000;}
table.archive tr {border-bottom:solid 1px  #fafafa;}

table.archive td {padding:10px;font-size:12px;}



.info {width:960px; padding:10px;border:solid 1px silver;margin-bottom:10px;font-size:0.8em;margin-left:auto;margin-right:auto;}

.form {width:960px; padding:10px;border:solid 1px silver;margin-bottom:10px;font-size:0.7em;margin-left:auto;margin-right:auto;}

</style>

</head>



<body>

<div class = "info">
<p>Quick Scraping Tool</p>
</div>


<div class = "form">
	<form method = "POST" action = "index.php" >
	<label>Type (e.g. electrician, massage, chinese, wine): </label><input type = "text" name = "industry">
	<label>Area: (e.g Clitheroe, leeds, blackburn) </label><input type = "text" name = "area">
	<input type = "submit" value = "get" name = "submit">
	</form>
</div>



<?php


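// Read the search terms and encode them for use as URL path segments.
$industry = isset($_POST['industry']) ? rawurlencode(trim($_POST['industry'])) : '';
$area     = isset($_POST['area']) ? rawurlencode(trim($_POST['area'])) : '';

// Nothing to look up until the form has been submitted with both fields filled in.
if ($industry === '' || $area === '') {
	exit("</body>\n</html>");
}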


$startfrom = 0;

include('simple_html_dom.php');

// Create DOM from URL
$html = file_get_html('http://uk.local.yahoo.com/'.$area.'/'.$industry.'/search-16342.html?fr=sfp&cb='.$startfrom.'');
if (!$html) {
	exit("<div class ='info'>The results page could not be loaded</div></body></html>");
}

// Find the number of results: the count is taken from the sixth word of the heading.
$results = $html->find('div#top h1', 0)->plaintext;
$split_results = explode(' ', $results);
$number_of_results = isset($split_results[5]) ? (int) str_replace(",", "", $split_results[5]) : 0;

// Determine how many results pages there will be (10 results per page).
$pages = (int) ceil($number_of_results / 10);
if ($pages == 0){
	echo "<div class ='info'>There were no results found - try different search terms</div>";
}

// Loop over every results page and append each listing to $articles.
$articles = array();
for ($i = 1; $i <= $pages; $i++) {
	$html = file_get_html('http://uk.local.yahoo.com/'.$area.'/'.$industry.'/search-16342.html?fr=sfp&cb='.$startfrom.'');
	//echo 'http://uk.local.yahoo.com/Lancashire/'.$area.'/'.$industry.'/search-16342.html?fr=sfp&cb='.$startfrom."<br>";

	foreach ($html->find('li.vcard') as $article) {
		$item['name']   = $article->find('a.fn', 0)->plaintext;
		$item['number'] = $article->find('h3.tel', 0)->plaintext;
		$item['addr']   = $article->find('p.street-address', 0)->plaintext;
		$item['pcode']  = $article->find('p.postal-code', 0)->plaintext;
		$articles[] = $item;
	}

	// Increment the start offset for the next page's URL (0, 11, 21, ...).
	if ($startfrom == 0) {
		$startfrom = $startfrom + 11;
	} else {
		$startfrom = $startfrom + 10;
	}
}

	
 

echo "<table class = 'archive'>
		<tr>
			<th>Company</th>
			<th>Address</th>
			<th>Postcode</th>
			<th>Number</th>
		</tr>";

foreach($articles as $item){
	
	echo "<tr>";
		echo "<td>".$item['name']."</td>";
		echo "<td>".$item['addr']."</td>";
		echo "<td>".$item['pcode']."</td>";
		echo "<td>".$item['number']."</td>";
	echo "</tr>";
	
}

echo "</table>";



// Write the results to a downloadable CSV file
$list = $articles;

$fp = fopen('file.csv', 'w');

foreach ($list as $fields) {
    fputcsv($fp, $fields);
}

// Close the handle so the file is fully written before it is linked below.
fclose($fp);


echo "<div class = 'info'>Download as CSV file <a href = 'file.csv'>here</a></div>";

echo "<div class = 'info'>There were ".$pages." pages scraped <br> There are ".$number_of_results." companies that match your search terms</div>";
?>

</body>
</html>








Replies To: Speed up query time in scraper script (php)

#2 modi123_1

Re: Speed up query time in scraper script (php)

Posted 13 March 2013 - 01:23 PM

We will not help you violate Yahoo's TOS by scraping content. I am closing the topic. Do not persist in asking for help on illegal activities. If you have further questions on 'why' feel free to shoot me a pm.