Screen Scraping certain websites

Some sites will not allow me to screen scrape

13 Replies - 4162 Views - Last Post: 04 September 2009 - 07:35 PM

#1 fremgenc

Posted 01 June 2009 - 01:55 PM

Hello,

I am trying to screen scrape Dell's website because I would like to automatically update my database to match it. However, I suspect Dell blocks screen scraping, because with EVERY other site I try I can at least download the site's information.

Any ideas? There must be workarounds, because browsers can access Dell's site (obviously), but I cannot seem to do it programmatically.

Replies To: Screen Scraping certain websites

#2 dsherohman

Posted 02 June 2009 - 05:30 AM

I haven't looked at Dell's site, but the most likely case is that they're using AJAX (or some other JavaScript-based technique) to dynamically grab the content and insert it into the displayed page rather than including it in the initial HTML document. Turn off JavaScript in your browser, then load the Dell page and you'll see it the way your scraping program does.

As for how to get around this, you have two options. Either find a scraper with JavaScript support (so that it can run the JavaScript which loads the content), or dig through the page source manually to find the request(s) that the JavaScript submits to obtain the content and have your scraper load those instead of the main page's URL. The second option is a PITA and may break whenever Dell updates the site.
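
The "turn off JavaScript" check can also be done on the downloaded HTML itself. The following is only a rough Python sketch using the standard library: the class and function names, and both sample pages, are made up for illustration and are not Dell's markup. It reports whether a piece of text is present in the static HTML outside of any script block.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text found in the static HTML, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that a scraper without JavaScript would actually see.
        if self._skip_depth == 0:
            self.chunks.append(data)


def text_in_static_html(html, needle):
    """Return True if `needle` appears in the page text without running any script."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return needle.lower() in " ".join(parser.chunks).lower()


# Hypothetical sample pages: one ships the data in the HTML, one loads it via script.
static_page = "<html><body><td>Warranty ends 2011</td></body></html>"
dynamic_page = (
    "<html><body><div id='w'></div>"
    "<script>load('Warranty ends 2011')</script></body></html>"
)

print(text_in_static_html(static_page, "warranty ends"))
print(text_in_static_html(dynamic_page, "warranty ends"))
```

If the text only shows up inside a script (or not at all), the content is probably being loaded dynamically and a plain HTTP fetch will not see it.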

#3 gregwhitworth

Posted 02 June 2009 - 08:35 AM

I have never heard of this process - is it simply for inserting Dell news onto your site? Or are you actually trying to steal all of their data and place it on your site?

--

Greg

#4 fremgenc

Posted 02 June 2009 - 10:42 AM

Thanks for the replies. My eventual goal is to extract warranty information from Dell's site based on a given service tag. We have about 150 machines and I would like to write a script to automatically update my database with Dell's information.

This link can be changed to fit every service tag:

"http://supportapj.dell.com/support/topics/topic.aspx/ap/shared/support/my_systems_info/en/details?c=in&cs=inbsd1&l=en&s=bsd&ServiceTag=8gmjt31&~tab=1"

And I can write a parser to extract the warranty information.
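
As an illustration of "changing the link to fit every service tag", here is a small Python sketch that swaps the ServiceTag query parameter in the URL above and leaves everything else alone. The function name and the example tag "ABC1234" are hypothetical; only the template URL comes from this post.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# The URL quoted in the post, used as a template.
TEMPLATE_URL = (
    "http://supportapj.dell.com/support/topics/topic.aspx/ap/shared/support/"
    "my_systems_info/en/details?c=in&cs=inbsd1&l=en&s=bsd&ServiceTag=8gmjt31&~tab=1"
)


def url_for_service_tag(service_tag, template=TEMPLATE_URL):
    """Return the template URL with its ServiceTag parameter replaced."""
    parts = urlsplit(template)
    query = [
        (key, service_tag if key == "ServiceTag" else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    # safe="~" keeps the "~tab" parameter name unescaped on every Python 3 version.
    return urlunsplit(parts._replace(query=urlencode(query, safe="~")))


# "ABC1234" is a made-up tag, used only to show the substitution.
print(url_for_service_tag("ABC1234"))
```

Looping this over the list of service tags from the database would give one URL per machine.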

I am using ColdFusion and the error returned is "Connection Failure"

Dsherohman - I tried disabling Java and JavaScript in my browser and I can still see the warranty information I need, so it's not a problem with JavaScript/AJAX (they use basic JavaScript, by the way).

The problem is, I can't even scrape dell.com or any page in Dell's domain. But I can scrape ANY other site.

Thanks for your help, I'll keep working on it

#5 markhazlett9

Posted 02 June 2009 - 11:36 AM

gregwhitworth, on 2 Jun, 2009 - 07:35 AM, said:

I have never heard of this process - is it simply for inserting Dell news onto your site? Or are you actually trying to steal all of their data and place it on your site?

--

Greg



I have never heard of this either. If that's what you're wanting to do, you will have to contact Dell and ask permission to use their data. At that point, IF they allow it, they will let you know how to access the info.

Cheers

#6 fremgenc

Posted 02 June 2009 - 09:03 PM

It's a common technique, called screen scraping.

And I don't need permission. If that were the case, Google could not exist, for the same legal reasons. Google downloads the content of every webpage in order to search it.

This post has been edited by fremgenc: 02 June 2009 - 09:06 PM


#7 markhazlett9

Posted 02 June 2009 - 10:10 PM

fremgenc, on 2 Jun, 2009 - 08:03 PM, said:

It's a common technique, called screen scraping.

And I don't need permission. If that were the case, Google could not exist, for the same legal reasons. Google downloads the content of every webpage in order to search it.



My apologies, didn't understand the question properly.

#8 gregwhitworth

Posted 03 June 2009 - 08:28 AM

Here's an interesting article, with actual answers from people who have obviously done this before. Sorry for the lack of information on my part:

http://stackoverflow...t-of-javascript

--

Greg

#9 fremgenc

Posted 03 June 2009 - 05:58 PM

Thank you for the replies guys.

However, I know how to screen scrape; I've done it numerous times. The problem lies with Dell.com. So I was wondering if anyone has ever heard of a website not allowing certain robots (as screen scraping software is called) to access their site, and of possible workarounds.

I will try to contact Dell about this, but with their horrible customer service I doubt I'll get anything.

Thanks again!

#10 dsherohman

Posted 04 June 2009 - 03:53 AM

fremgenc, on 4 Jun, 2009 - 12:58 AM, said:

However, I know how to screen scrape; I've done it numerous times. The problem lies with Dell.com. So I was wondering if anyone has ever heard of a website not allowing certain robots (as screen scraping software is called) to access their site, and of possible workarounds.

Oh, most definitely. Read up on "robots.txt" or the "robot exclusion protocol". Dell's is at http://www.dell.com/robots.txt but doesn't appear to state that bots should stay off the site's front page.
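
A well-behaved bot can check robots.txt before fetching anything. The sketch below uses Python's standard urllib.robotparser; the rules in SAMPLE_ROBOTS_TXT, the bot name "InventoryBot" and the example.com URLs are invented for illustration and are not Dell's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "InventoryBot", "http://www.example.com/"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "InventoryBot", "http://www.example.com/private/report"))
```

Against a live site you would normally call set_url() with the site's robots.txt address and then read() instead of passing the text in by hand.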

Aside from that, your scraping bot almost certainly sends a User-Agent header to the server and the server can choose to send different content based on the User-Agent setting. (This is one of the techniques used for creating Internet Explorer, Mozilla, or iPhone-specific versions of pages.) If your User-Agent string is recognized as belonging to an "unwanted" piece of software, sending a blank page or an error back is easy to do.
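
To make the User-Agent point concrete, here is a minimal Python sketch of how a scraper sets that header itself. The bot name, contact address and example.com URL are placeholders. The request is only built and inspected here, not sent.

```python
from urllib.request import Request


def build_request(url, user_agent):
    """Build (but do not send) a request that announces the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


# Placeholder identity: a descriptive name plus a contact address.
req = build_request("http://www.example.com/", "InventoryBot/1.0 (admin@example.com)")

# urllib stores header names capitalised as "User-agent".
print(req.get_header("User-agent"))

# Sending it would be: urllib.request.urlopen(req)
```

If no header is supplied, Python's urllib announces itself as "Python-urllib/" followed by the version number, which is exactly the kind of string a server can single out.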

As for workarounds... Honoring robots.txt is purely voluntary, so writing a bot which ignores it is trivial, and the software chooses its own User-Agent string, so writing something which claims to be Firefox is also very simple. But this is considered extremely bad form, and there are a number of techniques which webmasters can use to try to identify rogue bots. The consequences of getting caught can range from a nasty email, to having your IP address blocked, to legal action for unauthorized access to their servers.

If Dell is actively attempting to prevent bots from accessing the content you're looking for, whether through robots.txt or User-Agent filtering, then I strongly advise you to contact them to identify an approved way of getting to it.

#11 fremgenc

Posted 04 June 2009 - 08:20 PM

Dsherohman, thank you, this is exactly what I was looking for!

I tried sending my browser's UA as my own a while ago but had no luck.

This is definitely possible; I just need to figure it out. There are proxy websites that allow me to access Dell's site through theirs, but maybe they are forwarding my User-Agent as their own.

#12 fremgenc

Posted 15 June 2009 - 12:39 PM

UPDATE:

Hey everyone, thank you again for the help.
I solved my problem by using ASP.NET (C#) to complete my goal. ColdFusion just won't work with Dell's site for some reason!

This post has been edited by fremgenc: 15 June 2009 - 12:39 PM


#13 online

Posted 03 September 2009 - 12:34 AM

fremgenc, on 15 Jun, 2009 - 11:39 AM, said:

UPDATE:

Hey everyone, thank you again for the help.
I solved my problem by using ASP.NET (C#) to complete my goal. ColdFusion just won't work with Dell's site for some reason!


Hi fremgenc,

I have exactly the same requirement. I would appreciate it if you could shed some light on what you did and how.

Thanks

#14 codygman

Posted 04 September 2009 - 07:35 PM

online, on 2 Sep, 2009 - 11:34 PM, said:

fremgenc, on 15 Jun, 2009 - 11:39 AM, said:

UPDATE:

Hey everyone, thank you again for the help.
I solved my problem by using ASP.NET (C#) to complete my goal. ColdFusion just won't work with Dell's site for some reason!


Hi fremgenc,

I have exactly the same requirement. I would appreciate it if you could shed some light on what you did and how.

Thanks


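When screen scraping in C#, here's something you'll find useful on sites that don't like "bots":

// Requires: using System.IO; using System.Net;
// url is the address of the page to fetch (a string defined elsewhere).
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

// Send the same headers a real browser would.
request.Accept = "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
request.ProtocolVersion = HttpVersion.Version10;
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)";
//request.Referer = url;
// Only meaningful when posting form data; harmless on a plain GET.
request.ContentType = "application/x-www-form-urlencoded";

// Execute the request and read the page body, releasing the connection afterwards.
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
	string html = reader.ReadToEnd();
}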



Live HTTP Headers is a great add-on for Firefox for use in screen scraping. Try to emulate everything that the browser does in its headers; sites can't be too strict in detecting bots or they may block regular users.
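
Once the browser's headers have been captured with a tool like that, it helps to compare them with what the scraper sends. The Python sketch below is only an illustration: the function name and both header sets are made-up samples, not captures from Dell's site.

```python
def header_differences(browser_headers, scraper_headers):
    """Return {header name: (browser value, scraper value)} for headers that differ.

    Header names are compared case-insensitively; a missing header shows up as None.
    """
    browser = {name.lower(): value for name, value in browser_headers.items()}
    scraper = {name.lower(): value for name, value in scraper_headers.items()}
    differences = {}
    for name in sorted(set(browser) | set(scraper)):
        if browser.get(name) != scraper.get(name):
            differences[name] = (browser.get(name), scraper.get(name))
    return differences


# Made-up sample data standing in for a real capture.
browser_sample = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-us,en;q=0.5",
}
scraper_sample = {
    "user-agent": "ExampleScraper/0.1",
    "accept": "text/html,application/xhtml+xml",
}

for name, (from_browser, from_scraper) in header_differences(browser_sample, scraper_sample).items():
    print(name, "| browser:", from_browser, "| scraper:", from_scraper)
```

Any header listed in the output is one the server could use to tell the scraper apart from the browser.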