3 Replies - 3198 Views - Last Post: 07 March 2011 - 03:17 PM

#1 SleepingInChapel

Simple Screen Scrape

Posted 07 March 2011 - 02:15 PM

Hi, I've been looking around the Internet for a simple example of screen scraping with ColdFusion, but I haven't found one that is easy to understand. All I need to do is scrape a single div container from one page and place it on my page. I would use jQuery.load(), but I've found that cross-domain AJAX requests are blocked by the browser's same-origin policy.

Here's what I have so far:
<cfhttp url="http://www.mysupersecretwebsite.com" method="GET">
<cfset myDocument = trim(cfhttp.fileContent)>
<cfoutput>#myDocument#</cfoutput>



I need to be able to parse out a div with the ID of "p473I" and place it on my page. I'm not very good with regular expressions... any help would be appreciated. Thanks!


Replies To: Simple Screen Scrape

#2 SleepingInChapel

Re: Simple Screen Scrape

Posted 07 March 2011 - 02:35 PM

It wasn't five minutes after I posted this that I found something. Here's what I ended up doing:

<cfhttp url="http://www.mysupersecretwebsite.com" method="GET">
<cfset myDocument = trim(cfhttp.fileContent)>
<cfset myResult = REfindNoCase('<div id="p473I">[\s\S]*?</div>', myDocument, 1, True)>
<cfoutput>#Mid(myDocument,myResult.pos[1],myResult.len[1])#</cfoutput>
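One caveat with the snippet above: when nothing matches, REFindNoCase returns a position of 0, and Mid() raises an error when its start argument is 0. A minimal sketch of a guarded version follows, assuming the page text is already in myDocument. The sample markup and the "not found" message are hypothetical stand-ins so the example runs without a network call.

```cfm
<!--- Hypothetical sample markup standing in for cfhttp.fileContent --->
<cfsavecontent variable="myDocument">
<html><body>
<div id="header">Header</div>
<div id="p473I">Terms and conditions text</div>
</body></html>
</cfsavecontent>

<!--- Same pattern as above; the fourth argument asks for pos/len arrays --->
<cfset myResult = REFindNoCase('<div id="p473I">[\s\S]*?</div>', myDocument, 1, true)>

<!--- pos[1] is 0 when there is no match, so check before calling Mid() --->
<cfif myResult.pos[1] GT 0>
    <cfoutput>#Mid(myDocument, myResult.pos[1], myResult.len[1])#</cfoutput>
<cfelse>
    <cfoutput>Target div not found.</cfoutput>
</cfif>
```

Note that the non-greedy pattern stops at the first closing </div>, so it only captures the whole container when the target div has no nested divs inside it.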



#3 Craig328

Re: Simple Screen Scrape

Posted 07 March 2011 - 02:36 PM

Well, first off...screen scraping...perhaps what you might consider doing is asking the owner of the information you're trying to get if they'll just give it to you in some format you can more easily use. In fact, there may already be an RSS feed for what you're looking for. Screen scraping CAN be quite unethical. For instance, if the info you're scraping is the product of someone else's hard work and you're grabbing it and repackaging it (and especially if you somehow make a profit from that) that's kinda low. Many times, the producer of the information will recognize that you COULD simply scrape it but if you approach him and ask for a data feed or maybe even ask for permission to scrape the info, he may simply give it to you or may ask you to at least mention the source of the info so he can get more traffic to his site. So, first things first: simply ask for it.

Otherwise, what you're doing already is most of what you need to do. You've pulled the text of the page you want and stored it as a string in a variable. From there, all you really need to do is strip off everything that comes before "the marker" (the "p473I" piece), strip off everything that comes after the end of the div (</div>), and work with what's left. I don't know anyone who actually LIKES working with RegEx, so if I were doing something like this, I'd stick to a bunch of FindNoCase and ReplaceNoCase functions until I got what I wanted.
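A minimal sketch of that non-RegEx approach, using FindNoCase and Mid rather than ReplaceNoCase, might look like the following. The sample markup and the variable names startPos, endPos, and divLength are assumptions for illustration; in practice myDocument would hold the trimmed cfhttp.fileContent from the original post.

```cfm
<!--- Hypothetical sample markup standing in for cfhttp.fileContent --->
<cfsavecontent variable="myDocument">
<html><body>
<div id="header">Header</div>
<div id="p473I">Terms and conditions text</div>
</body></html>
</cfsavecontent>

<!--- Locate the opening tag of the target div (FindNoCase returns 0 if absent) --->
<cfset startPos = FindNoCase('<div id="p473I">', myDocument)>
<cfif startPos GT 0>
    <!--- Search for the first closing tag at or after the opening tag --->
    <cfset endPos = FindNoCase('</div>', myDocument, startPos)>
    <cfif endPos GT 0>
        <!--- Length runs from the opening tag through the end of the closing tag --->
        <cfset divLength = endPos + Len('</div>') - startPos>
        <cfoutput>#Mid(myDocument, startPos, divLength)#</cfoutput>
    </cfif>
</cfif>
```

Like the regular expression version, this takes the first </div> after the marker, so it assumes the target div contains no nested divs and that the opening tag is written exactly as shown (same attribute order and quoting).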

But really, simply ask the guy first or see if he has an RSS feed you can tap into. Easier and more ethical than simply taking what you want just because you can.

Good luck!

Edit: In fact, on one of my sites, I ran into someone who was screen scraping a lot of data that I had worked very hard to produce for some time. We're talking 6 months of work involving running queries against public database APIs to produce a merged, calculated product...in other words, my brains and sweat with publicly accessible data. Anyway, this turd decided that rather than even ask me, he'd just set up a constant looping request to each page I had and take what he wanted. This is 900,000+ pages though...and I was bent.

I actually went into my code and made some of the data into on-the-fly images (using a blank .gif and CFIMAGE) as well as dynamically assigning div classes and names. I'd also vary spacing and such and insert totally random HTML comments (<!-- -->) and once I launched all that (luckily all done on one page) the scraping efforts ceased after a day or two when the asshole realized he'd never be able to write an automated algorithm to keep up with my obfuscation efforts.

I also eventually banned his entire IP address range. Nobody I was writing the site for would be in India...so it was no great loss.

This post has been edited by Craig328: 07 March 2011 - 02:43 PM


#4 SleepingInChapel

Re: Simple Screen Scrape

Posted 07 March 2011 - 03:17 PM

Craig328, on 07 March 2011 - 03:36 PM, said:

Edit: In fact, on one of my sites, I ran into someone who was screen scraping a lot of data that I had worked very hard to produce for some time. We're talking 6 months of work involving running queries against public database APIs to produce a merged, calculated product...in other words, my brains and sweat with publicly accessible data. Anyway, this turd decided that rather than even ask me, he'd just set up a constant looping request to each page I had and take what he wanted. This is 900,000+ pages though...and I was bent.

I actually went into my code and made some of the data into on-the-fly images (using a blank .gif and CFIMAGE) as well as dynamically assigning div classes and names. I'd also vary spacing and such and insert totally random HTML comments (<!-- -->) and once I launched all that (luckily all done on one page) the scraping efforts ceased after a day or two when the asshole realized he'd never be able to write an automated algorithm to keep up with my obfuscation efforts.

I also eventually banned his entire IP address range. Nobody I was writing the site for would be in India...so it was no great loss.


Wow, sorry to hear that happened to you! And I definitely hear ya, people should ask before using your stuff. Your obfuscation efforts actually make a pretty funny story... lol...

In this case, all I was doing was cloning the div with our terms and conditions and placing it in another template. The master copy is in a flat HTML page on our portal site (on a different domain, hence the disallowed cross-domain AJAX)... So not to worry, no authors/coders were offended, and no work was plagiarized by the coding of this page!
