My code to do this is quite simple all it requires is some knowledge of Regular expressions and HTML. If not don't worry its all supplied with the code.
So here is my code I will explain it after.
Imports System.Web
Imports System.Net
Imports System.IO
Imports System.Text.RegularExpressions
Imports System.Threading
Public Class Spider_Class
Private Const Regex_Href_Code As String = "href\s*=\s*(?:""(?<match>[^""]*)""|(?<match>\S+))"
Public Shared Function RunSpider(ByVal StartURL As String, ByVal Depth As Integer) As ArrayList
Return iSpiderRun(StartURL, Depth)
End Function
Private Shared Function iSpiderRun(ByVal URL As String, ByVal Depth As Integer) As ArrayList
Dim ReturnValue As New ArrayList
Try
Dim CurrentDepth As Integer = 0
Dim URLsToSpider As New ArrayList
' Declare a temporary holding space for scanned URL's.
Dim SpiderdUrls As New ArrayList
' Start the ball rolling.
Dim NewUrlList As New ArrayList
NewUrlList.AddRange(iSpiderURL(URL))
Dim URLs As String = ""
For Each item As String In NewUrlList
URLs = URLs & vbCrLf & item
Next
' MsgBox(URLs)
' Simple check, should always return false, but just to be safe...
If Not SpiderdUrls.Contains(URL) Then
SpiderdUrls.Add(URL)
End If
' Now for the fun
Do Until CurrentDepth = Depth
Dim xcor As New ArrayList
For Each item As String In NewUrlList
Try
URLsToSpider.AddRange(iSpiderURL(item))
Catch ex As Exception
URLsToSpider.Add("Error Whilst processing: " & item & ". The error was: " & ex.Message)
End Try
Next
CurrentDepth = CurrentDepth + 1
SpiderdUrls.AddRange(NewUrlList)
NewUrlList = New ArrayList
NewUrlList.AddRange(URLsToSpider)
URLsToSpider = New ArrayList
Loop
ReturnValue.AddRange(SpiderdUrls)
Catch ex As Exception
End Try
Return ReturnValue
End Function
Private Shared Function iSpiderURL(ByVal URL As String) As ArrayList
Dim ReturnValue As New ArrayList
Dim BaseURI As New Uri(URL)
Dim Page As String = String.Empty
Try
Try
Dim PageRequest As HttpWebRequest = CType(WebRequest.Create(BaseURI), HttpWebRequest)
'Request a response
Dim PageResponse As HttpWebResponse = PageRequest.GetResponse
'Response Recived...
Dim PageReader As New StreamReader(PageResponse.GetResponseStream)
'Read the page to a local varible
Page = PageReader.ReadToEnd
'Tidy Up
PageReader.Close()
' Now Onto the regex stuff...
Catch ex As Exception
Try
Dim WebReader As New WebClient()
Page = WebReader.DownloadString(URL)
Catch eix As Exception
' MsgBox(eix.Message)
Return ReturnValue
Exit Function
End Try
End Try
MsgBox(Page, MsgBoxStyle.Exclamation, "Page.View.Eception.Unheld")
Dim HrefRegex As New Regex(Regex_Href_Code, RegexOptions.IgnoreCase Or RegexOptions.Compiled)
Dim HrefMatch As Match = HrefRegex.Match(Page)
Do While HrefMatch.Success = True
Dim Link As String = HrefMatch.Groups(1).Value
If Link.Substring(0, 1) = "#" Then
'Ignore (Its a Bookmark)
Else
'Use it!!!
Dim Absolute As Boolean = False
If Link.Length > 8 Then
Dim Scheme As String = Uri.UriSchemeHttp & "://>/"
End If
End If
HrefMatch = HrefMatch.NextMatch
Loop
Catch ex As Exception
'MsgBox(ex.Message, MsgBoxStyle.Critical, "IspiderURL")
End Try
Return ReturnValue
End Function
End Class
As you may have noticed I have commented out all of the message boxes this is because this code creates so many errors when its just running its not worth it. If I was going to do this code for a business I would have added events and raised errors that way.
You may have worked out that this code asks for a depth:
Public Shared Function RunSpider(ByVal StartURL As String, ByVal Depth As Integer) As ArrayList
Return iSpiderRun(StartURL, Depth)
End Function
The depth is basically my way of limiting the number of links opened. So if you set the site depth to 3 then this is what will happen; What this code does is return all the links in a web page, and adds them to a list to be checked:
The first site will be spiderd for URL's, they will be added to a running list of URL's pending for checking. All the links in that pending list will then be spiderd, and those links will be added to the pending list, then finally those links will be checked and added to the pending list.
Bare in mind that to do this we are creating http connections which do rather consume ram and processor time.
So if we had a depth of 3 and each site we visited had 10 links to other pages that have 10 links then we would end up with:
1010x1010x1010 which is a very big number, and when I first set up this I tested to see if I could go to a depth of 100 URLS and that is *deep breath*,
100000000000000000000000000000000000000000000000000000000000000000000000000000000000000100
Which is too long to fit in here. Oh and by the way it crashed my computer, if you didnt guess
Okay so now you know that you should not go much above 10 URL's, I will tell you the uses of this code, it can be used for working out all the links inside of a domain if you exclude all the outbound links, this way you could physically build a map of a site.
Okay so this piece of code:
Public Shared Function RunSpider(ByVal StartURL As String, ByVal Depth As Integer) As ArrayList
Return iSpiderRun(StartURL, Depth)
End Function
This is just a nice way for users to access the main function of this class as it allows for clean code viewing.
Private Shared Function iSpiderURL(ByVal URL As String) As ArrayList
Dim ReturnValue As New ArrayList
Dim BaseURI As New Uri(URL)
Dim Page As String = String.Empty
Try
Try
Dim PageRequest As HttpWebRequest = CType(WebRequest.Create(BaseURI), HttpWebRequest)
'Request a response
Dim PageResponse As HttpWebResponse = PageRequest.GetResponse
'Response Recived...
Dim PageReader As New StreamReader(PageResponse.GetResponseStream)
'Read the page to a local varible
Page = PageReader.ReadToEnd
'Tidy Up
PageReader.Close()
' Now Onto the regex stuff...
Catch ex As Exception
Try
Dim WebReader As New WebClient()
Page = WebReader.DownloadString(URL)
Catch eix As Exception
' MsgBox(eix.Message)
Return ReturnValue
Exit Function
End Try
End Try
MsgBox(Page, MsgBoxStyle.Exclamation, "Page.View.Eception.Unheld")
Dim HrefRegex As New Regex(Regex_Href_Code, RegexOptions.IgnoreCase Or RegexOptions.Compiled)
Dim HrefMatch As Match = HrefRegex.Match(Page)
Do While HrefMatch.Success = True
Dim Link As String = HrefMatch.Groups(1).Value
If Link.Substring(0, 1) = "#" Then
'Ignore (Its a Bookmark)
Else
'Use it!!!
Dim Absolute As Boolean = False
If Link.Length > 8 Then
Dim Scheme As String = Uri.UriSchemeHttp & "://>/"
' If Link.Substring(0, Scheme.Length) Then
'Absolute = True
' End If
' If Absolute = True Then
'ReturnValue.Add(Link)
' Else
Dim NewUri As New Uri(BaseURI, Link)
ReturnValue.Add(NewUri.ToString)
' End If
End If
End If
HrefMatch = HrefMatch.NextMatch
Loop
Catch ex As Exception
'MsgBox(ex.Message, MsgBoxStyle.Critical, "IspiderURL")
End Try
Return ReturnValue
End Function
This code basically uses regex to find all of the URL's in a page and returns an arraylist, the reason why this code is separate from the 'ispiderrun' function is that it is much MUCH cleaner to do this, if you wanted to you could change the regex string, to search for something else in the page.
And finally:
Private Shared Function iSpiderRun(ByVal URL As String, ByVal Depth As Integer) As ArrayList
Dim ReturnValue As New ArrayList
Try
Dim CurrentDepth As Integer = 0
Dim URLsToSpider As New ArrayList
' Declare a temporary holding space for scanned URL's.
Dim SpiderdUrls As New ArrayList
' Start the ball rolling.
Dim NewUrlList As New ArrayList
NewUrlList.AddRange(iSpiderURL(URL))
Dim URLs As String = ""
For Each item As String In NewUrlList
URLs = URLs & vbCrLf & item
Next
' MsgBox(URLs)
' Simple check, should always return false, but just to be safe...
If Not SpiderdUrls.Contains(URL) Then
SpiderdUrls.Add(URL)
End If
' Now for the fun
Do Until CurrentDepth = Depth
Dim xcor As New ArrayList
For Each item As String In NewUrlList
Try
URLsToSpider.AddRange(iSpiderURL(item))
Catch ex As Exception
URLsToSpider.Add("Error Whilst processing: " & item & ". The error was: " & ex.Message)
End Try
Next
CurrentDepth = CurrentDepth + 1
SpiderdUrls.AddRange(NewUrlList)
NewUrlList = New ArrayList
NewUrlList.AddRange(URLsToSpider)
URLsToSpider = New ArrayList
Loop
ReturnValue.AddRange(SpiderdUrls)
Catch ex As Exception
End Try
Return ReturnValue
End Function
This code is fairly hard to understand simply because if what we are doing. Remember a collection cannot be modified whilst being looped through, so what we do is put all of our newly found URL's into a different collection; 'SpiderdUrls' which we then juggle into the 'NewUrlList' arraylist. But then I had the problem that all the new URL's were not checked, so that's where the depth comes in, the bigger loop ensures that we get all the results we want.
Okay that's the code over, I would be interested to know how far people can go with depth in running this code. Also another thing to know is that this is a very high ram operation, VERY HIGH. So if you own a spare computer that is idle, run this code on it rather than your own computer.
I think I covered everything there, feel free to modify and redistribute any code, though if your selling it be careful about multi-threading.
Enjoy






MultiQuote



|