
Extract Images & Links from a webbrowser: The 'src' and 'href' properties

#1 Jack Eagles1

Posted 20 April 2010 - 08:22 AM

Hi, in my second tutorial, I'll be showing you how to extract the sources of images and the targets of links from a webpage.

NOTE: You should have some basic knowledge of the webbrowser component before you start this tutorial. You should also have some basic knowledge of how HTML works.



1: Start up VB2008 (I use express edition), and create a new project. Name it whatever you want (I called it WebpageExtractor)
2: Make your form quite large (Enough to fill most of the screen)
3: Add a splitcontainer to your form. Next set its Orientation property to Horizontal. In the bottom part of the splitcontainer, add a Webbrowser.
4: In the top part of the splitcontainer, add two buttons, and a textbox.
5: Call the above objects whatever you want. I called the webbrowser WB, the first button btnGo, the second button btnExtract, and the textbox txtURL
6: Add a listbox to your form. Call it LstMain

7: Double click btnGo and add this code to the Click event of the button:
'Tell the webbrowser to navigate to the text in the URL textbox
WB.Navigate(txtURL.Text)


Whenever we click btnGo, the webbrowser will navigate to whatever text there is in the txtURL textbox. This text must be a URL.

8: In the Click event for btnExtract, Add this code:
'For every link in the current document...
For Each ele As HtmlElement In WB.Document.Links
    'Get whatever text there is in the 'href' attribute
    Dim eletarget As String = ele.GetAttribute("href")
    'Add it to the listbox
    LstMain.Items.Add(eletarget)
    'Carry on to the next link
Next



Press F5 to run your project. Now navigate to a website with your browser, and click btnExtract. You should now see the targets of all the links on the webpage.

So what did we do there?

First, we got all the links in the browser, and got the 'href' attribute. In HTML, the href attribute represents the target of the link (where the link will tell the browser to navigate to). So we just extracted the information in the 'href' attribute, and then added it to the listbox.



9: Now add a new button to your project, and name it 'btnGetImageSources'
10: Double click the button to go to the code view, and add this code to the Click event of the button:
'For every HtmlElement (such as a textbox, a button, or an image) in the current document...
For Each ele As HtmlElement In WB.Document.All
    'Get whatever text there is in the 'src' attribute
    Dim imgsrc As String = ele.GetAttribute("src")
    'Make sure it might be an image... rather crude but works...
    'Elements with no 'src' attribute are skipped
    If Not String.IsNullOrEmpty(imgsrc) Then
        'Compare in lower case so that .JPG and .jpg both match
        Dim lowersrc As String = imgsrc.ToLower()
        'If the source contains .jpg, .png, .gif or .bmp
        If lowersrc.Contains(".jpg") OrElse lowersrc.Contains(".png") OrElse _
           lowersrc.Contains(".gif") OrElse lowersrc.Contains(".bmp") Then
            'Add the source to the listbox
            LstMain.Items.Add(imgsrc)
        End If
    End If
Next



Press F5 to run your project. Now navigate to a website with your browser, and click btnGetImageSources. You should now see sources of all the pictures on the webpage in your listbox.



So what did we do there?

Firstly, we got all the HtmlElements in the webbrowser document, and then checked if they had a 'src' attribute (if an element doesn't have a source attribute it's definitely not an image. If it does have a source, then it might be an image, but there are other elements which have sources, such as scripts and frames, so this is not a definitive test). Next, if the element had a source, then we checked if the source contained '.jpg', '.gif', '.png' or '.bmp'. If the source contained one of those strings then we added it to the listbox.
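
If you'd rather keep the extension check in one place, here is a minimal sketch of it as a standalone function. The module name ImageExtensionSketch and the function name LooksLikeImage are made up for this example; they are not part of the tutorial project. It makes the same crude 'contains' check described above, so it is still not a definitive test.

```vb
Imports System

Module ImageExtensionSketch
    'Hypothetical helper (not part of the tutorial project):
    'returns True if the 'src' text contains .jpg, .png, .gif or .bmp
    Function LooksLikeImage(ByVal src As String) As Boolean
        'No 'src' attribute at all means it's definitely not an image
        If String.IsNullOrEmpty(src) Then Return False
        'Compare in lower case so that .JPG and .jpg both match
        Dim lowerSrc As String = src.ToLower()
        For Each ext As String In New String() {".jpg", ".png", ".gif", ".bmp"}
            If lowerSrc.Contains(ext) Then Return True
        Next
        Return False
    End Function
End Module
```

Inside the loop you could then write something like: If LooksLikeImage(ele.GetAttribute("src")) Then LstMain.Items.Add(ele.GetAttribute("src"))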


Instances where this code will not work:

Sometimes, images are declared like this:
<img src="/WebsiteBase/Favicon.ico">


Rather than like this:
<img src="http://www.mywebsite.co.uk/WebsiteBase/Favicon.ico">


This is because a website is laid out much like the Windows filesystem: a reference to a location or file doesn't always have to quote the whole path, because the place the reference is made from already supplies the first part of it (that's the way I understand it; I may be wrong, but I'm pretty sure I'm right).

So if an image is declared like the first example, the program will just get the 'src' attribute as it was written, rather than the whole path of the image.

I think that I'm correct in saying that this also applies to the 'href' attribute of links.
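
One possible way around this is to combine the address of the current page with the attribute text, using the .NET Uri class. This is only a sketch: the module name RelativeUrlSketch and the function name ResolveUrl are made up for this example, and the index.html page address in Main is an assumed one.

```vb
Imports System

Module RelativeUrlSketch
    'Hypothetical helper (not part of the tutorial project):
    'combines the page address with a 'src' or 'href' value
    Function ResolveUrl(ByVal pageUrl As Uri, ByVal attributeValue As String) As String
        Dim absolute As Uri = Nothing
        'TryCreate resolves a relative value against the page address;
        'a value that is already a full URL comes back unchanged
        If Uri.TryCreate(pageUrl, attributeValue, absolute) Then
            Return absolute.AbsoluteUri
        End If
        'If it can't be resolved, fall back to the text as it was written
        Return attributeValue
    End Function

    Sub Main()
        'Assumed page address, for illustration only
        Dim page As New Uri("http://www.mywebsite.co.uk/WebsiteBase/index.html")
        'Prints http://www.mywebsite.co.uk/WebsiteBase/Favicon.ico
        Console.WriteLine(ResolveUrl(page, "/WebsiteBase/Favicon.ico"))
    End Sub
End Module
```

In the tutorial project you could pass WB.Url as the page address, for example ResolveUrl(WB.Url, ele.GetAttribute("src")), once a page has finished loading.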


Replies To: Extract Images & Links from a webbrowser

#2 harley16s

Posted 20 April 2011 - 08:00 PM

It's not working with the IsNothing string.
