Web Scraper/Spider

working but not sure how ;p


#1 Xioshin

Posted 11 December 2008 - 02:08 AM

Hey, I'm working on a very basic web scraper (not really a spider yet, because I'm not building a list of links to process). I got the results I wanted, but I'm not sure whether I just got lucky.


Basically I'm looking for every .jpg on my personal website.
My code is:

$url = "http://www.progressiongames.com"; // URL of the site to scrape (my personal website)

$page = file_get_contents($url); // downloads the HTML and puts it into a variable called $page

$pattern = '/ src="\/warimages\/(.+).jpg" /';

preg_match_all($pattern, $page, $results, PREG_PATTERN_ORDER); //uses pattern to check the web page and puts results in the $results array

$size = count($results) - 1;

$counter = 0;
foreach ($results[$size] as $value) {
	echo "{$counter}: {$value}<BR>";
	$counter = $counter + 1;
	
}



My question is this: for some reason I get back two layers inside $results, 0 and 1. It seems as though the first, $results[0][n], holds the entire text the pattern matched, and the deeper one, $results[1][n], holds the actual string I was looking for.

Because I wanted the deepest element, I took the size of the array (2) and subtracted 1 to compensate for indexing starting at 0.
So this is working, but did I get lucky? Will I ever have more than 2 sub-arrays in $results, or will I only ever get the full match and then the results?

OUTPUT:
0: main1
1: main2
2: tier2main
3: main3
4: killexp4
5: main5
6: pic3
7: pic4
8: pic5
9: pic6
10: pic7
11: killexp2
12: killexp3
13: killexp



#2 Martyr2

Posted 11 December 2008 - 11:30 AM

On php.net, look at the PREG_PATTERN_ORDER flag for preg_match_all. The reason you see two dimensions is that the first dimension holds groups of matches. The first set of results is the whole match, as you noticed, but because your pattern contains a group, (.+), the second set holds what that group matched each time: the filename minus the extension.
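To see that layout without hitting the live site, here is a minimal sketch that runs the original pattern against a made-up two-line snippet. The sample markup in $page is an assumption for illustration (only the file names come from the thread), and the two tags sit on separate lines because the greedy (.+) would otherwise run across both of them.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$page = "<img src=\"/warimages/main1.jpg\" alt=\"\">\n"
      . "<img src=\"/warimages/pic3.jpg\" alt=\"\">\n";

// The pattern from the original post, unchanged.
$pattern = '/ src="\/warimages\/(.+).jpg" /';

preg_match_all($pattern, $page, $results, PREG_PATTERN_ORDER);

// $results[0] holds every full match, $results[1] holds what (.+) captured.
foreach ($results[0] as $i => $whole) {
    echo "{$i}: whole match = '{$whole}', group 1 = '{$results[1][$i]}'\n";
}
```

With PREG_PATTERN_ORDER the outer index is the group number and the inner index is the match number, so $results[1][0] is the first file name found.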

Each subscript of the array corresponds to a group you specify in the pattern. For instance, let's modify your pattern a smidge to this: $pattern = '/(.{2,3}) src="\/warimages\/(.+).jpg" /'; Now the results array has 3 sub-arrays:

Array
(
	[0] => Array
		(
			[0] => img src="/warimages/main1.jpg" 
			[1] => img src="/warimages/main2.jpg" 
			[2] => img src="/warimages/tier2main.jpg" 
			[3] => img src="/warimages/main3.jpg" 
			[4] => img src="/warimages/killexp4.jpg" 
			[5] => img src="/warimages/main5.jpg" 
			[6] => img src="/warimages/pic3.jpg" 
			[7] => img src="/warimages/pic4.jpg" 
			[8] => img src="/warimages/pic5.jpg" 
			[9] => img src="/warimages/pic6.jpg" 
			[10] => img src="/warimages/pic7.jpg" 
			[11] => img src="/warimages/killexp2.jpg" 
			[12] => img src="/warimages/killexp3.jpg" 
			[13] => img src="/warimages/killexp.jpg" 
		)

	[1] => Array
		(
			[0] => img
			[1] => img
			[2] => img
			[3] => img
			[4] => img
			[5] => img
			[6] => img
			[7] => img
			[8] => img
			[9] => img
			[10] => img
			[11] => img
			[12] => img
			[13] => img
		)

	[2] => Array
		(
			[0] => main1
			[1] => main2
			[2] => tier2main
			[3] => main3
			[4] => killexp4
			[5] => main5
			[6] => pic3
			[7] => pic4
			[8] => pic5
			[9] => pic6
			[10] => pic7
			[11] => killexp2
			[12] => killexp3
			[13] => killexp
		)

)



Notice that the second element of the results array is an array full of "img". What's the deal? Well, (.{2,3}) is a group, and it says to capture the 2 or 3 characters that come right before " src", which in an image tag is "img" most of the time. So subscript 0 contains the full matches, subscript 1 contains the results of our first group, (.{2,3}), and subscript 2 contains the results of your second group, (.+).
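A quick way to check this yourself is to run both patterns on the same input and compare the sizes of the result arrays. This is only a sketch: the one-line $page is a hypothetical stand-in for the real page, and the variable names are made up for the example.

```php
<?php
// Hypothetical one-line sample standing in for the real page.
$page = "<img src=\"/warimages/main1.jpg\" alt=\"\">\n";

$oneGroup  = '/ src="\/warimages\/(.+).jpg" /';          // original pattern
$twoGroups = '/(.{2,3}) src="\/warimages\/(.+).jpg" /';  // modified pattern

preg_match_all($oneGroup, $page, $a, PREG_PATTERN_ORDER);
preg_match_all($twoGroups, $page, $b, PREG_PATTERN_ORDER);

echo count($a), "\n"; // 2: full match + one group
echo count($b), "\n"; // 3: full match + two groups
echo $b[1][0], "\n";  // img   (first group)
echo $b[2][0], "\n";  // main1 (second group)
```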

So the more groups you put in your pattern, the more result arrays you will see. For your given pattern you can assume two arrays will be produced, but if you change the pattern you can easily end up with many more. With plain numbered groups, count($results) - 1 always points at the last group, so your approach only works as long as the filename group comes last.
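If you would rather not depend on the position of the group at all, one option is to give the group a name and index by that name. The sketch below is not the pattern from the thread: it swaps in a named group (?<file>...), tightens (.+) to [^"]+ and escapes the dot before jpg. The group name "file" and the sample markup are assumptions for illustration.

```php
<?php
// Hypothetical sample markup; the file names are taken from the thread's output.
$page = "<img src=\"/warimages/main1.jpg\" alt=\"\">\n"
      . "<img src=\"/warimages/killexp.jpg\" alt=\"\">\n";

// (?<file>...) is a named group; [^"]+ stops at the closing quote,
// and \. matches a literal dot instead of any character.
$pattern = '/(.{2,3}) src="\/warimages\/(?<file>[^"]+)\.jpg" /';

preg_match_all($pattern, $page, $results, PREG_PATTERN_ORDER);

// Index by name, so adding or reordering groups later does not break the loop.
foreach ($results['file'] as $i => $name) {
    echo "{$i}: {$name}<BR>";
}
```

Note that a named group is stored under both its name and its number, so here $results has the keys 0, 1, 'file' and 2. That makes count($results) - 1 equal to 3, which is not a valid key, and is one more reason to index the group explicitly.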

Hope I have made this clear enough.

"At DIC we be group matching code ninjas... capty and sloth, chris and kya, nykc and olive.... gabehabe and... and... crap no one will match. Oh well." :snap:

This post has been edited by Martyr2: 11 December 2008 - 11:31 AM
