
Creating a Web Spider: Shows you how to spider the World Wide Web

#1 Asscotte (D.I.C Addict)

Reputation: 35 • Posts: 610 • Joined: 08-February 09

Posted 08 March 2010 - 12:30 PM

Okay, some of you reading this may already be aware of what a web spider is. Basically, it is a piece of code, normally run on a server with plenty of RAM; Google is a good example. Spiders only follow links they are given: you supply the root address of your website, they detect all the links in the page and add them to a collection, then check the new links for more links, and so on.

My code to do this is quite simple; all it requires is some knowledge of regular expressions and HTML. If you don't have that, don't worry, it's all supplied with the code.

So here is my code; I will explain it afterwards.

Imports System.Web
Imports System.Net
Imports System.IO
Imports System.Text.RegularExpressions
Imports System.Threading


Public Class Spider_Class

    Private Const Regex_Href_Code As String = "href\s*=\s*(?:""(?<match>[^""]*)""|(?<match>\S+))"

    Public Shared Function RunSpider(ByVal StartURL As String, ByVal Depth As Integer) As ArrayList
        Return iSpiderRun(StartURL, Depth)
    End Function


    Private Shared Function iSpiderRun(ByVal URL As String, ByVal Depth As Integer) As ArrayList
        Dim ReturnValue As New ArrayList
        Try
            Dim CurrentDepth As Integer = 0

            Dim URLsToSpider As New ArrayList


            ' Declare a temporary holding space for scanned URLs.
            Dim SpiderdUrls As New ArrayList
            ' Start the ball rolling.
            Dim NewUrlList As New ArrayList
            NewUrlList.AddRange(iSpiderURL(URL))

            Dim URLs As String = ""
            For Each item As String In NewUrlList
                URLs = URLs & vbCrLf & item
            Next
            '    MsgBox(URLs)
            ' Simple check, should always return false, but just to be safe...
            If Not SpiderdUrls.Contains(URL) Then
                SpiderdUrls.Add(URL)
            End If
            ' Now for the fun
            Do Until CurrentDepth = Depth

                For Each item As String In NewUrlList
                    Try
                        URLsToSpider.AddRange(iSpiderURL(item))
                    Catch ex As Exception
                        URLsToSpider.Add("Error Whilst processing: " & item & ". The error was: " & ex.Message)
                    End Try
                Next

                CurrentDepth = CurrentDepth + 1

                SpiderdUrls.AddRange(NewUrlList)
                NewUrlList = New ArrayList
                NewUrlList.AddRange(URLsToSpider)
                URLsToSpider = New ArrayList
            Loop
            ReturnValue.AddRange(SpiderdUrls)
        Catch ex As Exception
        End Try
        Return ReturnValue
    End Function


    Private Shared Function iSpiderURL(ByVal URL As String) As ArrayList
        Dim ReturnValue As New ArrayList
        Dim BaseURI As New Uri(URL)
        Dim Page As String = String.Empty

        Try
            Try
                Dim PageRequest As HttpWebRequest = CType(WebRequest.Create(BaseURI), HttpWebRequest)
                'Request a response
                Dim PageResponse As HttpWebResponse = CType(PageRequest.GetResponse(), HttpWebResponse)
                'Response received...
                Dim PageReader As New StreamReader(PageResponse.GetResponseStream)
                'Read the page into a local variable
                Page = PageReader.ReadToEnd
                'Tidy up
                PageReader.Close()
                PageResponse.Close()
                ' Now on to the regex stuff...
            Catch ex As Exception
                ' Fall back to the simpler WebClient if the HttpWebRequest fails.
                Try
                    Dim WebReader As New WebClient()
                    Page = WebReader.DownloadString(URL)
                Catch eix As Exception
                    '  MsgBox(eix.Message)
                    Return ReturnValue
                End Try
            End Try

            ' MsgBox(Page, MsgBoxStyle.Exclamation, "Page.View.Exception.Unhandled")

            Dim HrefRegex As New Regex(Regex_Href_Code, RegexOptions.IgnoreCase Or RegexOptions.Compiled)
            Dim HrefMatch As Match = HrefRegex.Match(Page)


            Do While HrefMatch.Success
                Dim Link As String = HrefMatch.Groups("match").Value

                If Link.Length = 0 OrElse Link.StartsWith("#") Then
                    'Ignore it (it's a bookmark/fragment)
                Else
                    'Use it! Resolve relative links against the page's base URI.
                    If Link.Length > 8 Then
                        Dim NewUri As New Uri(BaseURI, Link)
                        ReturnValue.Add(NewUri.ToString)
                    End If
                End If

                HrefMatch = HrefMatch.NextMatch
            Loop
        Catch ex As Exception
            'MsgBox(ex.Message, MsgBoxStyle.Critical, "IspiderURL")
        End Try

        Return ReturnValue
    End Function
End Class



As you may have noticed, I have commented out all of the message boxes. This is because the code hits so many errors while it is running that popping up a box for each one just is not worth it. If I were writing this for a business, I would have added events and raised the errors that way instead.
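If you are curious what the event route would look like, here is a minimal sketch, assuming a shared event added inside the class; the event name and signature are my own invention, not part of the original code:

    ' Additions to Spider_Class (sketch only):
    Public Shared Event SpiderError(ByVal FailedUrl As String, ByVal Message As String)

    Private Shared Sub ReportError(ByVal FailedUrl As String, ByVal ex As Exception)
        ' Callers subscribe with AddHandler Spider_Class.SpiderError, AddressOf ...
        ' and decide for themselves how to log or display the error.
        RaiseEvent SpiderError(FailedUrl, ex.Message)
    End Sub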

You may have worked out that this code asks for a depth:
   Public Shared Function RunSpider(ByVal StartURL As String, ByVal Depth As Integer) As ArrayList
        Return iSpiderRun(StartURL, Depth)
    End Function



The depth is basically my way of limiting the number of links opened. Remember that the code returns all the links in a web page and adds them to a list to be checked. So if you set the depth to 3, this is what will happen:

The first site will be spidered for URLs, which are added to a running list of URLs pending checking. Every link in that pending list will then be spidered, and the links found will be added to the pending list; finally those links will be checked and their links added to the pending list in turn.
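For example, a minimal usage sketch (the URL here is just a placeholder):

    Dim Results As ArrayList = Spider_Class.RunSpider("http://www.example.com/", 3)
    For Each FoundUrl As String In Results
        Console.WriteLine(FoundUrl)
    Next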

Bear in mind that to do this we are creating HTTP connections, which consume a fair amount of RAM and processor time.

So if we had a depth of 3, and every page we visited had 10 links to other pages that each have 10 more links, we would end up requesting:

10 x 10 x 10 = 1,000 pages at the deepest level (1,110 requests in total, counting the shallower passes).

When I first set this up I tested whether I could go to a depth of 100, which at 10 links per page works out at, *deep breath*, 10^100 pages at the final level: a 1 followed by 100 zeros, which is too long to fit in here. Oh, and by the way, it crashed my computer, if you didn't guess :stupid:.
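If you want to check the arithmetic yourself, here is a rough helper of my own (not part of the spider) that estimates the total number of requests, assuming every page yields the same number of links:

    Private Shared Function EstimateRequests(ByVal LinksPerPage As Integer, ByVal Depth As Integer) As Double
        Dim Total As Double = 0
        ' Level 1 is the links on the start page, level 2 their links, and so on.
        For Level As Integer = 1 To Depth
            Total += Math.Pow(LinksPerPage, Level)
        Next
        Return Total
    End Function

    ' EstimateRequests(10, 3) = 10 + 100 + 1000 = 1110 requests.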

Okay, so now you know not to go much above a depth of 10, I will tell you the uses of this code. It can work out all the links inside a domain if you exclude all the outbound links; that way you could physically build a map of a site.
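Something like this helper would do the filtering; SameDomainOnly is my own sketch, not part of the class above:

    Private Shared Function SameDomainOnly(ByVal Urls As ArrayList, ByVal RootUrl As String) As ArrayList
        Dim Filtered As New ArrayList
        Dim RootHost As String = New Uri(RootUrl).Host
        For Each item As String In Urls
            Try
                ' Keep only links whose host matches the starting domain.
                If New Uri(item).Host = RootHost Then
                    Filtered.Add(item)
                End If
            Catch ex As Exception
                ' Skip entries that are not valid URLs (e.g. the error strings
                ' the spider adds when a page fails to download).
            End Try
        Next
        Return Filtered
    End Function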

Okay so this piece of code:
   Public Shared Function RunSpider(ByVal StartURL As String, ByVal Depth As Integer) As ArrayList
        Return iSpiderRun(StartURL, Depth)
    End Function



This is just a nice way for users to access the main function of the class; it keeps the calling code clean.

  Private Shared Function iSpiderURL(ByVal URL As String) As ArrayList
        Dim ReturnValue As New ArrayList
        Dim BaseURI As New Uri(URL)
        Dim Page As String = String.Empty

        Try
            Try
                Dim PageRequest As HttpWebRequest = CType(WebRequest.Create(BaseURI), HttpWebRequest)
                'Request a response
                Dim PageResponse As HttpWebResponse = CType(PageRequest.GetResponse(), HttpWebResponse)
                'Response received...
                Dim PageReader As New StreamReader(PageResponse.GetResponseStream)
                'Read the page into a local variable
                Page = PageReader.ReadToEnd
                'Tidy up
                PageReader.Close()
                PageResponse.Close()
                ' Now on to the regex stuff...
            Catch ex As Exception
                ' Fall back to the simpler WebClient if the HttpWebRequest fails.
                Try
                    Dim WebReader As New WebClient()
                    Page = WebReader.DownloadString(URL)
                Catch eix As Exception
                    '  MsgBox(eix.Message)
                    Return ReturnValue
                End Try
            End Try

            ' MsgBox(Page, MsgBoxStyle.Exclamation, "Page.View.Exception.Unhandled")

            Dim HrefRegex As New Regex(Regex_Href_Code, RegexOptions.IgnoreCase Or RegexOptions.Compiled)
            Dim HrefMatch As Match = HrefRegex.Match(Page)


            Do While HrefMatch.Success
                Dim Link As String = HrefMatch.Groups("match").Value

                If Link.Length = 0 OrElse Link.StartsWith("#") Then
                    'Ignore it (it's a bookmark/fragment)
                Else
                    'Use it! Resolve relative links against the page's base URI.
                    If Link.Length > 8 Then
                        Dim NewUri As New Uri(BaseURI, Link)
                        ReturnValue.Add(NewUri.ToString)
                    End If
                End If

                HrefMatch = HrefMatch.NextMatch
            Loop
        Catch ex As Exception
            'MsgBox(ex.Message, MsgBoxStyle.Critical, "IspiderURL")
        End Try

        Return ReturnValue
    End Function



This code uses a regex to find all of the URLs in a page and returns them in an ArrayList. The reason it is separate from the iSpiderRun function is that this is much, MUCH cleaner; and if you wanted to, you could change the regex string to search for something else in the page.
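For instance, swapping the href pattern for a src pattern would make the same routine collect image and script URLs instead; this pattern is just an illustration, not something from my original code:

    Private Const Regex_Src_Code As String = "src\s*=\s*(?:""(?<match>[^""]*)""|(?<match>\S+))"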

And finally:
  Private Shared Function iSpiderRun(ByVal URL As String, ByVal Depth As Integer) As ArrayList
        Dim ReturnValue As New ArrayList
        Try
            Dim CurrentDepth As Integer = 0

            Dim URLsToSpider As New ArrayList


            ' Declare a temporary holding space for scanned URLs.
            Dim SpiderdUrls As New ArrayList
            ' Start the ball rolling.
            Dim NewUrlList As New ArrayList
            NewUrlList.AddRange(iSpiderURL(URL))

            Dim URLs As String = ""
            For Each item As String In NewUrlList
                URLs = URLs & vbCrLf & item
            Next
            '    MsgBox(URLs)
            ' Simple check, should always return false, but just to be safe...
            If Not SpiderdUrls.Contains(URL) Then
                SpiderdUrls.Add(URL)
            End If
            ' Now for the fun
            Do Until CurrentDepth = Depth

                For Each item As String In NewUrlList
                    Try
                        URLsToSpider.AddRange(iSpiderURL(item))
                    Catch ex As Exception
                        URLsToSpider.Add("Error Whilst processing: " & item & ". The error was: " & ex.Message)
                    End Try
                Next

                CurrentDepth = CurrentDepth + 1

                SpiderdUrls.AddRange(NewUrlList)
                NewUrlList = New ArrayList
                NewUrlList.AddRange(URLsToSpider)
                URLsToSpider = New ArrayList
            Loop
            ReturnValue.AddRange(SpiderdUrls)
        Catch ex As Exception
        End Try
        Return ReturnValue
    End Function



This code is fairly hard to understand, simply because of what we are doing. Remember that a collection cannot be modified while it is being looped through, so we put all of our newly found URLs into a separate collection and then juggle them into the NewUrlList ArrayList once the pass is finished, with 'SpiderdUrls' keeping everything already visited. But then I had the problem that all the new URLs were never themselves checked, and that is where the depth comes in: the outer loop ensures we get all the results we want.
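Stripped of everything else, the list-swapping pattern looks like this (a sketch, with the link-finding elided):

    Dim Current As New ArrayList    ' the list we loop over, never modified in the loop
    Dim NextPass As New ArrayList   ' newly found URLs go here instead

    For Each item As String In Current
        ' ... find the links in item and add them to NextPass ...
    Next

    ' Swap: the new finds become the work list for the next pass.
    Current = NextPass
    NextPass = New ArrayList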

Okay, that's the code over. I would be interested to know how far people can push the depth when running this. Another thing to know is that this is a very memory-hungry operation, VERY hungry. So if you own a spare computer that sits idle, run this code on it rather than on your own machine.
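One way to cut the load down, as a sketch: skip any URL that is already in the SpiderdUrls list before requesting it again. This check is my own suggestion layered on top of the inner loop the code already has:

    For Each item As String In NewUrlList
        If Not SpiderdUrls.Contains(item) Then
            URLsToSpider.AddRange(iSpiderURL(item))
        End If
    Next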

I think I covered everything there. Feel free to modify and redistribute any of the code, though if you're selling it, be careful about multi-threading.
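If you do take the spider off your main thread, here is a minimal sketch using the System.Threading import the class already has; the URL and method names are placeholders of mine:

    Private Sub StartSpider()
        Dim SpiderThread As New Thread(AddressOf SpiderWorker)
        SpiderThread.IsBackground = True
        SpiderThread.Start()
    End Sub

    Private Sub SpiderWorker()
        Dim Results As ArrayList = Spider_Class.RunSpider("http://www.example.com/", 3)
        ' Marshal Results back to the UI thread (e.g. with Control.Invoke)
        ' before touching any form controls.
    End Sub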

Enjoy :D ,


Replies To: Creating a Web Spider

#2 Guest_polas*

Posted 02 May 2010 - 10:06 AM

Where do I put this code? :( Why did you post code without an explanation? Tell me, I'm new, so where do I put this code? :( There are no buttons or anything, so what the hell is this code for then? Please help me, or show people how to really do it, with buttons, a ListBox and so on. Thanks :)

#3 TechKid (D.I.C Head)

Reputation: 3 • Posts: 82 • Joined: 04-September 10

Posted 25 November 2012 - 09:53 AM

Great code, great tutorial. I will try this! :)
