10 Replies - 601 Views - Last Post: 08 October 2012 - 01:04 PM

#1 NotQuiteThereYet

  • New D.I.C Head

Reputation: 0
  • Posts: 5
  • Joined: 07-October 12

Getting a list of URLs of all images that appear on a webpage?

Posted 07 October 2012 - 12:36 PM

Hi,

I'm trying to use VB.NET (2010) to get the absolute URL of each image that appears on a specific webpage. So far, I've figured out how to get all of the URLs inside HTML <img> tags, like so...

        For Each SeparateImage As HtmlElement In WebBrowser1.Document.Images
            ListBox1.Items.Add(SeparateImage.GetAttribute("src"))
        Next


That works perfectly. But what I can't figure out is how to extract image URLs that appear within CSS styles. Like this...

background-image:url('image.jpg');


Does anyone know of a simple way to do this? I would need to extract the image URLs not only from inline CSS code, but from external stylesheets as well.

I reckon that one way to do it would be to grab the source code of the entire HTML page and its related CSS stylesheets, and then parse out all of the image URLs using a bunch of string splits and/or regex. But figuring out the correct absolute URL of each image could get pretty complicated, because of all the different kinds of "relative" URL paths I may come across. For example...

background-image:url('image.jpg');

background-image:url('/image.jpg');

background-image:url('./image.jpg');

background-image:url('../image.jpg');

background-image:url('../otherdirectory/image.jpg');


So... it would be really nice if something like this existed...

        For Each CSS_Style As HtmlElement In WebBrowser1.Document.Styles
            ListBox1.Items.Add(CSS_Style.GetAttribute("background-image"))
        Next


(I know the above code doesn't work... it's just an example). So... does anyone know how I might be able to accomplish something like that? Or have any other ideas that don't involve mind-numbing amounts of regex and logic? :)

Thanks in advance!


Replies To: Getting a list of URLs of all images that appear on a webpage?

#2 modi123_1

  • Suitor #2

Reputation: 8948
  • Posts: 33,544
  • Joined: 12-June 08

Posted 07 October 2012 - 12:40 PM

You know the path of the site, right? Append that to the image paths from the CSS. Sure, there's a bit of trickery when dealing with the ../ but it should be as simple as smashing two strings together.

#3 NotQuiteThereYet

Posted 08 October 2012 - 07:44 AM

modi123_1, on 07 October 2012 - 12:40 PM, said:

You know the path of the site, right? Append that to the image paths from the CSS. Sure, there's a bit of trickery when dealing with the ../ but it should be as simple as smashing two strings together.


No, the sites will vary, and will not be known ahead of time (they will be input by the end user). So, I will have to account for every possible scenario in the code.

I was really hoping that there was a simple way to do this. It's strange that MS would make it easy to extract any HTML element from a page, but not have an easy way to also extract CSS styles from a page.

#4 modi123_1

Posted 08 October 2012 - 08:08 AM

Quote

No, the sites will vary, and will not be known ahead of time (they will be input by the end user).

*sigh* Yes, but at some point the user will tell the application the location of the site to point to, right? That is when it 'knows'.. that is when it stores that directory and then rips apart the CSS. It's nothing but a bit of regex or string manipulation and might take thirty minutes to write up.
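
For what it's worth, the "rips apart the CSS" step could look roughly like the sketch below. It assumes the stylesheet text is already in a string, and the names CssUrlSketch and ExtractCssUrls are made up for illustration. It is only a rough pattern: it does not skip CSS comments, handle escaped characters, or filter out non-image url(...) values such as fonts.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CssUrlSketch
    ' Matches url(...) with optional single or double quotes around the value.
    ' The named group "u" captures the raw (possibly relative) URL text.
    Private ReadOnly UrlPattern As New Regex(
        "url\(\s*['""]?(?<u>[^'"")]+?)['""]?\s*\)",
        RegexOptions.IgnoreCase)

    ' Hypothetical helper: returns every url(...) value found in the CSS text.
    Public Function ExtractCssUrls(ByVal cssText As String) As List(Of String)
        Dim found As New List(Of String)
        For Each m As Match In UrlPattern.Matches(cssText)
            found.Add(m.Groups("u").Value)
        Next
        Return found
    End Function

    Sub Main()
        Dim css As String =
            "div { background-image:url('image.jpg'); } " &
            "p { background: url(../otherdirectory/image.jpg) no-repeat; }"
        ' Prints image.jpg, then ../otherdirectory/image.jpg
        For Each raw As String In ExtractCssUrls(css)
            Console.WriteLine(raw)
        Next
    End Sub
End Module
```

The values that come out are still relative, so they need to be resolved against the right base address afterwards.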

#5 NotQuiteThereYet

Posted 08 October 2012 - 09:49 AM

modi123_1, on 08 October 2012 - 08:08 AM, said:

Quote

No, the sites will vary, and will not be known ahead of time (they will be input by the end user).

*sigh* Yes, but at some point the user will tell the application the location of the site to point to, right? That is when it 'knows'.. that is when it stores that directory and then rips apart the CSS. It's nothing but a bit of regex or string manipulation and might take thirty minutes to write up.


Sorry, I misunderstood what you were asking. It might take YOU 30 minutes to code, but I'm not a VB "guru" such as yourself, so it would probably take me the better part of an afternoon to figure out how to do all that. Which is why I was looking for a more efficient solution. If I can't find one... guess I will bite the bullet and roll up my sleeves. I'm just surprised that MS does not provide a built-in function to accomplish this.

#6 modi123_1

Posted 08 October 2012 - 09:55 AM

I am not sure what sort of built-in functionality you were expecting? A CSS file (if using a typical web request stream) is just a giant ol' string... to the compiler a string is a string is a string, right? Why does "<body><div id="a"><img src='foo.jpg'></div></body>" mean anything more than "foo.jpg"?

#7 trevster344

  • The Peasant

Reputation: 224
  • Posts: 1,499
  • Joined: 16-March 11

Posted 08 October 2012 - 10:17 AM

There's no built in function? Ha! That's what OOP is all about, YOU make the function and encapsulate it in some nifty object for yourself so you never have to recreate it, and just take it from project to project haha. You just need to look at it piece by piece and establish an algorithm. That's what object oriented programming is all about.

#8 AdamSpeight2008

  • MrCupOfT

Reputation: 2241
  • Posts: 9,412
  • Joined: 29-May 08

Posted 08 October 2012 - 10:40 AM

It doesn't matter what programming language you use for the final implementation. The algorithm is still the same.

 Input: URL
Output: Image URLs

  For Each Img Tag get imgurl
    If imgurl is relative then
      imgurl = Get_non_relative_version(parent_path, imgurl)
    End If
    yield imgurl
  Next
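
A rough VB.NET rendering of the pseudocode above, as a sketch only. The names ResolveSketch and ToAbsoluteUrls are invented for illustration, and it leans on the framework's Uri.TryCreate overload that takes a base Uri, which resolves relative values and passes already-absolute ones through, so no separate "is relative" test is needed. It returns a List rather than yielding, since VB 2010 has no iterator support.

```vbnet
Imports System
Imports System.Collections.Generic

Module ResolveSketch
    ' baseUrl: the address the raw URLs are relative to. That is the page for
    ' <img> tags, and the stylesheet's own address for url(...) values found
    ' in an external CSS file.
    Public Function ToAbsoluteUrls(ByVal baseUrl As String,
                                   ByVal rawUrls As IEnumerable(Of String)) As List(Of String)
        Dim baseUri As New Uri(baseUrl)
        Dim results As New List(Of String)
        For Each raw As String In rawUrls
            Dim resolved As Uri = Nothing
            ' Skip anything that cannot be combined into a valid URI.
            If Uri.TryCreate(baseUri, raw, resolved) Then
                results.Add(resolved.AbsoluteUri)
            End If
        Next
        Return results
    End Function
End Module
```

The raw values could come from the src attributes collected in the first post, for example by passing them in as a List(Of String).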



#9 NotQuiteThereYet

Posted 08 October 2012 - 10:55 AM

modi123_1, on 08 October 2012 - 09:55 AM, said:

I am not sure what sort of built-in functionality you were expecting?


See the last code block in my original post. Something like that would make it much easier than trying to figure out all of the possible relative URL paths (of which there are many), and then having to write code to handle each one.

Anyways, it appears that something like that does not exist, so I guess it's time for me to get off the forum and build it myself!

Thanks anyways.

#10 dotINSolution

  • New D.I.C Head

Reputation: 6
  • Posts: 16
  • Joined: 25-September 12

Posted 08 October 2012 - 11:06 AM

The reason it isn't there is possibly because the list of CSS properties keeps growing. As was said above, it won't take a long time to write your own. Just parse the values of those properties.

For taking care of relative URLs, I remember doing something similar in a past project; you can use the Uri class.

The Uri(Uri, String) constructor should help you, where the first parameter is the base URL and the second is the relative URL :)

Dim url As New Uri(New Uri("http://www.website.com/somepage.html"), "/image.jpg")
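
As a quick check of that constructor against the five path styles from the first post, here is a small console sketch. The base address with a /dir/ segment is made up so that the ../ cases have somewhere to go, and the module name UriDemo is just a placeholder. The first argument has to be a Uri, not a plain string.

```vbnet
Imports System

Module UriDemo
    Sub Main()
        ' Hypothetical page address; the subdirectory makes the ../ cases visible.
        Dim baseUri As New Uri("http://www.website.com/dir/somepage.html")
        Dim samples() As String = {"image.jpg", "/image.jpg", "./image.jpg",
                                   "../image.jpg", "../otherdirectory/image.jpg"}
        For Each relative As String In samples
            ' Uri(Uri, String) resolves the relative part against the base.
            Dim absolute As New Uri(baseUri, relative)
            Console.WriteLine(relative & " -> " & absolute.AbsoluteUri)
        Next
        ' Prints:
        ' image.jpg -> http://www.website.com/dir/image.jpg
        ' /image.jpg -> http://www.website.com/image.jpg
        ' ./image.jpg -> http://www.website.com/dir/image.jpg
        ' ../image.jpg -> http://www.website.com/image.jpg
        ' ../otherdirectory/image.jpg -> http://www.website.com/otherdirectory/image.jpg
    End Sub
End Module
```

One thing to watch: for url(...) values inside an external stylesheet, the base should be the stylesheet's own address rather than the page's, because that is what relative URLs in CSS files are resolved against.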



#11 NotQuiteThereYet

Posted 08 October 2012 - 01:04 PM

dotINSolution, on 08 October 2012 - 11:06 AM, said:

The reason it isn't there is possibly because the list of CSS properties keeps growing. As was said above, it won't take a long time to write your own. Just parse the values of those properties.

For taking care of relative URLs, I remember doing something similar in a past project; you can use the Uri class.

The Uri(Uri, String) constructor should help you, where the first parameter is the base URL and the second is the relative URL :)

Dim url As New Uri(New Uri("http://www.website.com/somepage.html"), "/image.jpg")



Thanks for the tip!