Screen Scraping certain websites

Some sites will not allow me to screen scrape

fremgenc

1 Jun, 2009 - 12:55 PM
Post #1

Hello,

I am trying to screen scrape Dell's website because I would like to automatically update my database to match it. However, I don't think Dell allows screen scraping: with every other site I try, I can at least download the site's content.

Any ideas? There must be workarounds, because browsers can obviously access Dell's site, but I cannot seem to do it programmatically.



dsherohman

RE: Screen Scraping Certain Websites

2 Jun, 2009 - 04:30 AM
Post #2

I haven't looked at Dell's site, but the most likely case is that they're using AJAX (or some other JavaScript-based technique) to dynamically fetch the content and insert it into the displayed page rather than including it in the initial HTML document. Turn off JavaScript in your browser, then load the Dell page, and you'll see it the way your scraping program does.

As for how to get around this, you pretty much have two options. Either find a scraper with JavaScript support, so that it can run the scripts which load the content, or dig through the page source manually to find the request(s) that the JavaScript submits to obtain the content, and have your scraper load those instead of the main page's URL. The second option is a PITA and may break whenever Dell updates the site.
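
The "see the page the way your scraper does" check can also be done in code. The Python sketch below is only an illustration under assumptions: it takes HTML you have already downloaded, ignores everything inside script and style tags, and reports whether a marker string is present in the remaining static text. The names VisibleTextExtractor, static_text and is_in_static_html are made up for this example.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text that sits outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def static_text(html):
    """Return the text a client that does not run JavaScript would receive."""
    extractor = VisibleTextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


def is_in_static_html(html, marker):
    """True if marker appears in the static text (case-insensitive)."""
    return marker.lower() in static_text(html).lower()
```

If the marker you can see in the browser is missing from the static text, the content is probably being loaded by a script; if it is present, JavaScript is not the obstacle.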

gregwhitworth

RE: Screen Scraping Certain Websites

2 Jun, 2009 - 07:35 AM
Post #3

I have never heard of this process - is it simply for inserting Dell news onto your site? Or are you actually trying to steal all of their data and place it on your site?

--

Greg

fremgenc

RE: Screen Scraping Certain Websites

2 Jun, 2009 - 09:42 AM
Post #4

Thanks for the replies. My eventual goal is to extract warranty information from Dell's site based on a given service tag. We have about 150 machines, and I would like to write a script to automatically update my database with Dell's information.

This link can be changed to fit every service tag:

"http://supportapj.dell.com/support/topics/topic.aspx/ap/shared/support/my_systems_info/en/details?c=in&cs=inbsd1&l=en&s=bsd&ServiceTag=8gmjt31&~tab=1"

And I can write a parser to extract the warranty information.
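
Swapping the service tag into that link can be sketched as follows. This is a Python illustration, not the ColdFusion code used in the thread; the helper name url_for_service_tag is hypothetical, and the template is simply the URL quoted above. No request is sent here, the function only builds the address.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# URL quoted in the post; only the ServiceTag value is meant to change.
TEMPLATE_URL = (
    "http://supportapj.dell.com/support/topics/topic.aspx/ap/shared/support/"
    "my_systems_info/en/details?c=in&cs=inbsd1&l=en&s=bsd&ServiceTag=8gmjt31&~tab=1"
)


def url_for_service_tag(service_tag, template=TEMPLATE_URL):
    """Return the template URL with its ServiceTag parameter replaced."""
    parts = urlsplit(template)
    params = [
        (key, service_tag if key == "ServiceTag" else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    # safe="~" keeps the "~tab" parameter name exactly as it appears in the post.
    return urlunsplit(parts._replace(query=urlencode(params, safe="~")))
```

Looping over the 150 service tags in the database and calling this helper for each one would give the list of pages to fetch and parse.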

I am using ColdFusion, and the error returned is "Connection Failure".

Dsherohman, I tried disabling Java and JavaScript in my browser and I can still see the warranty information I need, so it's not a problem with JavaScript/AJAX (they use basic JavaScript, by the way).

The problem is that I can't scrape dell.com or any other page in Dell's domain, but I can scrape ANY other site.

Thanks for your help. I'll keep working on it.

markhazlett9

RE: Screen Scraping Certain Websites

2 Jun, 2009 - 10:36 AM
Post #5

QUOTE(gregwhitworth @ 2 Jun, 2009 - 07:35 AM) *

I have never heard of this process - is it simply for inserting Dell news onto your site? Or are you actually trying to steal all of their data and place it on your site?

--

Greg



I have never heard of this either. If that's what you want to do, you will have to contact Dell and ask permission to use their data. At that point, IF they allow it, they will let you know how to access the info.

Cheers


fremgenc

RE: Screen Scraping Certain Websites

2 Jun, 2009 - 08:03 PM
Post #6

It's a common technique called screen scraping.

And I don't need permission. If that were the case, Google could not exist, for the same legal reasons: Google downloads the content of every web page in order to search it.

This post has been edited by fremgenc: 2 Jun, 2009 - 08:06 PM

markhazlett9

RE: Screen Scraping Certain Websites

2 Jun, 2009 - 09:10 PM
Post #7

QUOTE(fremgenc @ 2 Jun, 2009 - 08:03 PM) *

It's a common technique called screen scraping.

And I don't need permission. If that were the case, Google could not exist, for the same legal reasons: Google downloads the content of every web page in order to search it.



My apologies, didn't understand the question properly.

gregwhitworth

RE: Screen Scraping Certain Websites

3 Jun, 2009 - 07:28 AM
Post #8

Here's an interesting article, with actual answers from people who have obviously done this before. Sorry for the lack of information on my part:

http://stackoverflow.com/questions/857515/...t-of-javascript

--

Greg

fremgenc

RE: Screen Scraping Certain Websites

3 Jun, 2009 - 04:58 PM
Post #9

Thank you for the replies guys.

However, I know how to screen scrape; I've done it numerous times. The problem lies with Dell.com. So I was wondering if anyone has ever heard of a website not allowing certain robots (as screen scraping software is called) to access it, and of possible workarounds.

I will try to contact Dell about this, but with their horrible customer service I doubt I'll get anything.

Thanks again!

dsherohman

RE: Screen Scraping Certain Websites

4 Jun, 2009 - 02:53 AM
Post #10

QUOTE(fremgenc @ 4 Jun, 2009 - 12:58 AM) *

However, I know how to screen scrape; I've done it numerous times. The problem lies with Dell.com. So I was wondering if anyone has ever heard of a website not allowing certain robots (as screen scraping software is called) to access it, and of possible workarounds.

Oh, most definitely. Read up on "robots.txt", also known as the "robots exclusion protocol". Dell's is at http://www.dell.com/robots.txt, but it doesn't appear to state that bots should stay off the site's front page.
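
For anyone who would rather check robots.txt rules in code than by eye, here is a minimal Python sketch using the standard library's urllib.robotparser. The rules in SAMPLE_ROBOTS_TXT are invented for illustration and are not Dell's actual file; "BadBot", "MyScraper" and the helper name is_allowed are likewise hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only; this is NOT Dell's robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these sample rules, a bot calling itself "MyScraper" may fetch the front page but not anything under /private/, while "BadBot" is shut out entirely. To test a live site instead, RobotFileParser can download the file itself via set_url() and read().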

Aside from that, your scraping bot almost certainly sends a User-Agent header to the server and the server can choose to send different content based on the User-Agent setting. (This is one of the techniques used for creating Internet Explorer, Mozilla, or iPhone-specific versions of pages.) If your User-Agent string is recognized as belonging to an "unwanted" piece of software, sending a blank page or an error back is easy to do.
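
To make the server side of this concrete, here is a small Python sketch of how a site might pick a response from the User-Agent string. Everything in it is an assumption for illustration: the blocklist entries, the page bodies and the function name respond_for are invented, and nothing here is known to match what Dell's servers actually do.

```python
# Hypothetical blocklist of substrings that identify unwanted clients.
BLOCKED_AGENT_MARKERS = ("python-urllib", "coldfusion", "curl")


def respond_for(user_agent):
    """Return a (status_code, body) pair chosen from the User-Agent header."""
    ua = (user_agent or "").lower()
    # A missing or blocklisted User-Agent gets an error and an empty body.
    if not ua or any(marker in ua for marker in BLOCKED_AGENT_MARKERS):
        return 403, ""
    # Recognized devices can be given their own version of the page.
    if "iphone" in ua:
        return 200, "<html>mobile page</html>"
    return 200, "<html>desktop page</html>"
```

A scraper that announces itself with its library's default User-Agent would land in the first branch, while the same request made from a browser would not.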

As for workarounds... Honoring robots.txt is purely voluntary, so writing a bot which ignores it is trivial, and the software chooses its own User-Agent string, so writing something which claims to be Firefox is also very simple. But this is considered extremely bad form, and there are a number of techniques webmasters can use to try to identify rogue bots. The consequences of getting caught can range from a nasty email to having your IP address blocked to legal action for unauthorized access to their servers.

If Dell is actively attempting to prevent bots from accessing the content you're looking for, whether through robots.txt or User-Agent filtering, then I strongly advise you to contact them to identify an approved way of getting to it.

fremgenc

RE: Screen Scraping Certain Websites

4 Jun, 2009 - 07:20 PM
Post #11

Dsherohman, thank you, this is exactly what I was looking for!

I tried sending my browser's UA as my scraper's UA a while ago but had no luck.

This is definitely possible; I just need to figure it out. There are proxy websites that let me access Dell's site through them, but maybe they are forwarding my User-Agent as their own.

fremgenc

RE: Screen Scraping Certain Websites

15 Jun, 2009 - 11:39 AM
Post #12

UPDATE:

Hey everyone, thank you for the help again.
I solved my problem by using ASP.NET (C#) to complete my goal. ColdFusion just won't work with Dell's site for some reason!

This post has been edited by fremgenc: 15 Jun, 2009 - 11:39 AM

online

RE: Screen Scraping Certain Websites

2 Sep, 2009 - 11:34 PM
Post #13


QUOTE(fremgenc @ 15 Jun, 2009 - 11:39 AM) *

UPDATE:

Hey everyone, thank you for the help again.
I solved my problem by using ASP.NET (C#) to complete my goal. ColdFusion just won't work with Dell's site for some reason!


Hi fremgenc,

I have exactly the same requirement. I would appreciate it if you could shed some light on what you did and how.

Thanks



codygman

RE: Screen Scraping Certain Websites

4 Sep, 2009 - 06:35 PM
Post #14

QUOTE(online @ 2 Sep, 2009 - 11:34 PM) *

QUOTE(fremgenc @ 15 Jun, 2009 - 11:39 AM) *

UPDATE:

Hey everyone, thank you for the help again.
I solved my problem by using ASP.NET (C#) to complete my goal. ColdFusion just won't work with Dell's site for some reason!


Hi fremgenc,

I have exactly the same requirement. I would appreciate it if you could shed some light on what you did and how.

Thanks


When screen scraping in C#, here's something you'll find useful on sites that don't like "bots":

CODE

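using System.IO;
using System.Net;

// Assumes a string variable named url already holds the address to fetch.
// Build a request whose headers look like they came from a desktop browser.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Accept = "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
request.ProtocolVersion = HttpVersion.Version10;
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)";
// request.Referer = url;  // optionally send a Referer header as well
// Kept from the original; normally only meaningful for requests with a body (such as a POST).
request.ContentType = "application/x-www-form-urlencoded";

// Execute the request and read the response body.
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    string html = reader.ReadToEnd();
}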


Live HTTP Headers is a great add-on for Firefox for use in screen scraping. Try to emulate everything the browser does in its headers; sites can't be too strict in detecting bots, or they may block regular users.
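
The same header-emulation idea can be sketched outside C#. The Python version below is a rough equivalent under assumptions, not something posted in the thread: the User-Agent and Accept values are copied from the C# snippet above, the function name build_browser_like_request is hypothetical, and the code only constructs the request without sending it.

```python
from urllib.request import Request

# Header values copied from the C# snippet in this post (Firefox 3.0.10).
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) "
        "Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)"
    ),
    "Accept": (
        "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,"
        "text/plain;q=0.8,image/png,*/*;q=0.5"
    ),
}


def build_browser_like_request(url, referer=None):
    """Build (but do not send) a request carrying browser-style headers."""
    headers = dict(BROWSER_HEADERS)
    if referer:
        # Optional, mirroring the commented-out Referer line in the C# code.
        headers["Referer"] = referer
    return Request(url, headers=headers)
```

Passing the returned object to urllib.request.urlopen() would actually send it. Whether that is appropriate for a given site is the question raised earlier in the thread about robots.txt and bot etiquette.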