Screen Scraping certain websites
Some sites will not allow me to screen scrape
13 Replies - 3336 Views - Last Post: 04 September 2009 - 07:35 PM
#1
Screen Scraping certain websites
Posted 01 June 2009 - 01:55 PM
I am trying to screen scrape Dell's website because I would like to automatically update my database to match it. However, I don't think Dell allows screen scraping, because with EVERY other site I try I can at least download the site's content.
Any ideas? There must be workarounds, because browsers can access Dell's site (obviously), but I cannot seem to do it programmatically.
Replies To: Screen Scraping certain websites
#2
Re: Screen Scraping certain websites
Posted 02 June 2009 - 05:30 AM
As for how to get around this, you pretty much have to do one of two things: either find a scraper with JavaScript support (so that it can run the JavaScript which loads the content), or dig through the page source manually to find the request(s) that the JavaScript submits to obtain the content and have your scraper load those instead of the main page's URL (which is a PITA to do and may break whenever Dell updates the site).
#3
Re: Screen Scraping certain websites
Posted 02 June 2009 - 08:35 AM
--
Greg
#4
Re: Screen Scraping certain websites
Posted 02 June 2009 - 10:42 AM
This link can be changed to fit every service tag:
"http://supportapj.dell.com/support/topics/topic.aspx/ap/shared/support/my_systems_info/en/details?c=in&cs=inbsd1&l=en&s=bsd&ServiceTag=8gmjt31&~tab=1"
And I can write a parser to extract the warranty information.
I am using ColdFusion and the error returned is "Connection Failure".
Dsherohman - I tried disabling Java and JavaScript in my browser and I can still see the warranty information I need, so it's not a problem with JavaScript/AJAX (they use basic JavaScript, by the way).
The problem is, I can't even scrape dell.com or any page in Dell's domain. But I can scrape ANY other site.
Thanks for your help, I'll keep working on it
#5
Re: Screen Scraping certain websites
Posted 02 June 2009 - 11:36 AM
gregwhitworth, on 2 Jun, 2009 - 07:35 AM, said:
--
Greg
I have never heard of this either. If that's what you want to do, you will have to contact Dell and ask permission to use their data. At that point, IF they allow it, they will let you know how to access the info.
Cheers
#6
Re: Screen Scraping certain websites
Posted 02 June 2009 - 09:03 PM
And I don't need permission. If that were the case Google would not exist for the same legal reasons. Google downloads the content from every webpage to then search from.
This post has been edited by fremgenc: 02 June 2009 - 09:06 PM
#7
Re: Screen Scraping certain websites
Posted 02 June 2009 - 10:10 PM
fremgenc, on 2 Jun, 2009 - 08:03 PM, said:
And I don't need permission. If that were the case Google would not exist for the same legal reasons. Google downloads the content from every webpage to then search from.
My apologies, didn't understand the question properly.
#8
Re: Screen Scraping certain websites
Posted 03 June 2009 - 08:28 AM
http://stackoverflow...t-of-javascript
--
Greg
#9
Re: Screen Scraping certain websites
Posted 03 June 2009 - 05:58 PM
However, I know how to screen scrape - I've done it numerous times. The problem lies with Dell.com. So I was wondering if anyone has ever heard of a website not allowing certain robots (as screen scraping software is called) to access their site, and of possible workarounds.
I will try to contact Dell about this, but with their horrible customer service I doubt I'll get anything.
Thanks again!
#10
Re: Screen Scraping certain websites
Posted 04 June 2009 - 03:53 AM
fremgenc, on 4 Jun, 2009 - 12:58 AM, said:
Oh, most definitely. Read up on "robots.txt" or the "robot exclusion protocol". Dell's is at http://www.dell.com/robots.txt but doesn't appear to state that bots should stay off the site's front page.
Aside from that, your scraping bot almost certainly sends a User-Agent header to the server and the server can choose to send different content based on the User-Agent setting. (This is one of the techniques used for creating Internet Explorer, Mozilla, or iPhone-specific versions of pages.) If your User-Agent string is recognized as belonging to an "unwanted" piece of software, sending a blank page or an error back is easy to do.
As for workarounds... Honoring robots.txt is purely voluntary, so writing a bot which ignores it is trivial, and the software chooses its own User-Agent string, so writing something which claims to be Firefox is also very simple. But this is considered extremely bad form, and there are a number of techniques which webmasters can use to try to identify rogue bots. The consequences of getting caught can range from a nasty email to having your IP address blocked to legal action for unauthorized access to their servers.
If Dell is actively attempting to prevent bots from accessing the content you're looking for, whether through robots.txt or User-Agent filtering, then I strongly advise you to contact them to identify an approved way of getting to it.
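The robots.txt check described above can be sketched in a few lines. This is a minimal illustration in Python (not the ColdFusion or C# used elsewhere in this thread), using the standard library's `urllib.robotparser`. The rules in `SAMPLE_RULES`, the bot name `MyScraper/0.1`, and the `example.com` URLs are all made up for the example; they are not Dell's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs offline.
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules already held in memory
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://www.example.com/",
                "http://www.example.com/private/page"):
        print(url, is_allowed(SAMPLE_RULES, "MyScraper/0.1", url))
```

To check a live site instead, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads the file before `can_fetch()` is called. Note that this only tells you what the site asks of bots; it says nothing about User-Agent filtering on the server.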
#11
Re: Screen Scraping certain websites
Posted 04 June 2009 - 08:20 PM
I've tried sending my UA as my browser's UA a while ago but had no luck.
This is definitely possible - I just need to figure it out. There are proxy websites that allow me to access Dell's site through theirs, but maybe they are sending their own User-Agent instead of mine.
#12
Re: Screen Scraping certain websites
Posted 15 June 2009 - 12:39 PM
Hey everyone thank you for the help again.
I solved my problem by using ASP.NET (C#) to complete my goal. ColdFusion just won't work with Dell's site for some reason!
This post has been edited by fremgenc: 15 June 2009 - 12:39 PM
#13
Re: Screen Scraping certain websites
Posted 03 September 2009 - 12:34 AM
fremgenc, on 15 Jun, 2009 - 11:39 AM, said:
Hey everyone thank you for the help again.
I solved my problem by using ASP.NET (C#) to complete my goal. ColdFusion just won't work with Dell's site for some reason!
Hi fremgenc,
I have exactly the same requirement. I would appreciate it if you could shed some light on what you did and how.
Thanks
#14
Re: Screen Scraping certain websites
Posted 04 September 2009 - 07:35 PM
online, on 2 Sep, 2009 - 11:34 PM, said:
fremgenc, on 15 Jun, 2009 - 11:39 AM, said:
Hey everyone thank you for the help again.
I solved my problem by using ASP.NET (C#) to complete my goal. ColdFusion just won't work with Dell's site for some reason!
Hi fremgenc,
I have exactly the same requirement. I would appreciate it if you could shed some light on what you did and how.
Thanks
When screen scraping in C#, here's something you'll find useful on sites that don't like "bots":

    using System.Net;

    // Build a request whose headers look like Firefox 3.0.10.
    HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url);
    request.Accept = "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    request.ProtocolVersion = HttpVersion.Version10;
    request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)";
    // request.Referer = url;
    request.ContentType = "application/x-www-form-urlencoded";

    // execute the request
    HttpWebResponse response = (HttpWebResponse) request.GetResponse();

Live HTTP Headers is a great add-on for Firefox for use in screen scraping. Try to emulate everything the browser does in its headers; sites can't be too strict in detecting bots or they may block regular users.
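For readers not on .NET, the same header-emulation idea can be sketched in Python with the standard library's `urllib.request`. This is an illustration only, not what fremgenc actually ran: the header values are copied from the C# snippet above, while the function names `build_request` and `fetch`, the optional `referer` parameter, and the 10-second timeout are assumptions made for the example.

```python
from urllib.request import Request, urlopen

# Header values taken from the C# snippet in this thread.
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) "
                   "Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)"),
    "Accept": ("text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,"
               "text/plain;q=0.8,image/png,*/*;q=0.5"),
}


def build_request(url, referer=None):
    """Create a GET request carrying browser-like headers (nothing is sent yet)."""
    headers = dict(BROWSER_HEADERS)
    if referer is not None:
        headers["Referer"] = referer
    return Request(url, headers=headers)


def fetch(url, timeout=10):
    """Send the request and return the response body as text."""
    with urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # Print the headers that would be sent, without touching the network.
    for name, value in build_request("http://www.example.com/").header_items():
        print(f"{name}: {value}")
```

Comparing the printed headers against what Live HTTP Headers shows for a real browser visit is a quick way to spot which header the scraper is missing. Unlike the C# version, this sketch does not force HTTP/1.0.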