14 Replies - 1566 Views - Last Post: 20 January 2010 - 05:09 AM

#1 abhijeet_dighe

Stop automated program hits

Posted 13 January 2010 - 10:12 PM

Hi

I am using robots.txt to stop crawlers.
But how can I stop automated program hits to the website?
I am using classic ASP.



Replies To: Stop automated program hits

#2 no2pencil

Re: Stop automated program hits

Posted 13 January 2010 - 10:13 PM

I don't think you can.

A request to load the site is a request to load the site. I'm not aware of any way to differentiate between a web browser and an automated hit. It sure would be interesting to see others' ideas, but from what I know of web hosting, the site is going to load.

#3 abhijeet_dighe

Re: Stop automated program hits

Posted 13 January 2010 - 10:38 PM

no2pencil, on 13 Jan, 2010 - 09:13 PM, said:

I don't think you can.


I am looking for something like this:
http://geekswithblog.../09/103124.aspx

#4 no2pencil

Re: Stop automated program hits

Posted 13 January 2010 - 10:45 PM

You can bounce traffic with the header() command.

Or do you have a language of preference other than PHP?

#5 abhijeet_dighe

Re: Stop automated program hits

Posted 13 January 2010 - 10:50 PM

no2pencil, on 13 Jan, 2010 - 09:45 PM, said:

You can bounce traffic with the header() command.

Or do you have a language of preference other than PHP?


I am using classic ASP.
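In classic ASP, a rough equivalent of the "bounce the request" idea is to check a request header such as User-Agent at the top of the page and refuse obvious bots. The strings below are purely illustrative, and the header is trivially faked, so treat this as a first filter only:

<%
' Hypothetical user-agent filter for the top of an ASP page.
' The substrings checked for are examples only.
Dim ua
ua = LCase(Request.ServerVariables("HTTP_USER_AGENT"))

If ua = "" Or InStr(ua, "curl") > 0 Or InStr(ua, "libwww") > 0 Then
    Response.Status = "403 Forbidden"
    Response.Write "Automated requests are not allowed."
    Response.End
End If
%>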

#6 dsherohman

Re: Stop automated program hits

Posted 14 January 2010 - 04:25 AM

abhijeet_dighe, on 14 Jan, 2010 - 05:38 AM, said:

I am looking for something like this:
http://geekswithblog.../09/103124.aspx

So many problems with that... Just off the top of my head:
  • As noted in the linked document, it tends to break the "Back" button (which is, IMO, an unforgivable sin)
  • Completely breaks your site for users who have Javascript disabled (7% of users, last I checked) or who are using plugins such as NoScript which only allow Javascript from whitelisted sites to execute (I've never seen stats on how common such plugins are, but I get the distinct impression that their usage is growing quickly)
  • There are plenty of techniques out there for building scripts which control a browser, allowing the script to access Javascript-requiring sites by having the browser execute the Javascript code on the script's behalf


A couple years back, I took on a project to do something along these lines for a major SEO/web hosting company. Although robots.txt and user-agent filtering were a piece of it, the most effective technique was log analysis. Scan the web server access log and examine the activity pattern from each IP address. Software will generally have very different access patterns than human users and, when bot-like patterns are detected, you can create temporary firewall rules to block further access from that IP address. It's still not perfect, but it's the only source of data which can't be falsified by the client.

(Well, OK, it can be falsified to a degree by spoofing the IP address, but that's only really useful for DOS attacks. A client which spoofs the IP won't receive the server's response, so it won't see the returned page.)
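As a rough illustration of the log-analysis approach in the OP's environment, a standalone VBScript (run with cscript) could count requests per client IP in an IIS W3C log and print addresses over a threshold. The log path and threshold below are assumptions, and a real version would also look at request timing and patterns, not just raw volume:

' countips.vbs - sketch only; run with: cscript countips.vbs
Option Explicit

Const LOG_PATH = "C:\inetpub\logs\LogFiles\W3SVC1\u_ex100119.log"  ' hypothetical log file
Const HIT_THRESHOLD = 300  ' hits per log file treated as "bot-like" (illustrative)

Dim fso, stream, logLine, fields, ipIndex, hits, ip, i, key
Set fso = CreateObject("Scripting.FileSystemObject")
Set hits = CreateObject("Scripting.Dictionary")
ipIndex = -1

Set stream = fso.OpenTextFile(LOG_PATH, 1)  ' 1 = ForReading
Do While Not stream.AtEndOfStream
    logLine = stream.ReadLine
    If Left(logLine, 9) = "#Fields: " Then
        ' Find which column holds the client IP (c-ip) from the W3C header line.
        fields = Split(Mid(logLine, 10), " ")
        For i = 0 To UBound(fields)
            If fields(i) = "c-ip" Then ipIndex = i
        Next
    ElseIf Left(logLine, 1) <> "#" And ipIndex >= 0 Then
        fields = Split(logLine, " ")
        If UBound(fields) >= ipIndex Then
            ip = fields(ipIndex)
            If hits.Exists(ip) Then
                hits(ip) = hits(ip) + 1
            Else
                hits.Add ip, 1
            End If
        End If
    End If
Loop
stream.Close

' Print candidates for a temporary block.
For Each key In hits.Keys
    If hits(key) >= HIT_THRESHOLD Then
        WScript.Echo key & vbTab & hits(key) & " requests"
    End If
Next

The output would then feed whatever blocking mechanism you choose (firewall rule, blacklist table, and so on).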

#7 Lemur

Re: Stop automated program hits

Posted 14 January 2010 - 09:49 AM

Make a bot trap. If a certain IP address hits the site too many times, set it to auto-ban that address. Just make a page with that logic in it, but make sure it's not a high-traffic area...
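A rough classic ASP sketch of this idea could count hits per IP in the Application object. The threshold is illustrative, the counters reset whenever the application restarts, and a production version would expire the counts after a time window:

<%
' Hypothetical include for the trap page (or any page you want to protect).
Dim ip, countKey, banKey
ip = Request.ServerVariables("REMOTE_ADDR")
countKey = "hits_" & ip
banKey = "banned_" & ip

Application.Lock
If Application(banKey) = True Then
    Application.UnLock
    Response.Status = "403 Forbidden"
    Response.End
End If

Application(countKey) = Application(countKey) + 1  ' Empty counts as 0 on the first hit
If Application(countKey) > 100 Then                ' illustrative threshold
    Application(banKey) = True
End If
Application.UnLock
%>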

#8 abhijeet_dighe

Re: Stop automated program hits

Posted 14 January 2010 - 09:53 PM

Lemur, on 14 Jan, 2010 - 08:49 AM, said:

Make a bot trap. If a certain IP address hits the site too many times, set it to auto-ban that address. Just make a page with that logic in it, but make sure it's not a high-traffic area...


Hi

I found this:
http://retrowebdev.b...n-asp-with.html
Should I go for this?

When I googled, I found very few links for classic ASP.
Can you suggest any other link that will work for classic ASP?

#9 dsherohman

Re: Stop automated program hits

Posted 15 January 2010 - 03:35 AM

Lemur, on 14 Jan, 2010 - 04:49 PM, said:

Make a bot trap. If a certain IP address hits the site too many times, set it to auto-ban that address. Just make a page with that logic in it, but make sure it's not a high-traffic area...

Also a good idea; it's usually called a "honeypot", so you might have more luck googling it under that name. Honeypots don't have to be done at the system level, either. One common technique for catching bots which use robots.txt as a list of the "most interesting" links on a site is to create a honeypot URI and put it in robots.txt, but not link to it from anywhere - if anyone hits that URI, you know that they got it from robots.txt and decided to deliberately go where they've been explicitly told not to, so firewall 'em.

If you want to use a link from an actual page for trapping spiders, you can use CSS, HTML color attributes, and/or javascript manipulation to make the link invisible (e.g., white-on-white) and/or use a 1x1 pixel image instead of text for the link. Human users won't see it, so they won't click on it, but most bots/spiders/scrapers won't know any better and will follow it, so, again, add anyone following the link to your block list. (Note, though, that this approach may have SEO implications if that's important to you. I've heard many times that google penalizes sites that use 'invisible' text.)
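As a concrete sketch of the robots.txt honeypot in the OP's classic ASP setup, you would list a URI that nothing links to and have that page record whoever requests it. The path, file names, and flat-file storage here are assumptions; a database table would work just as well:

# robots.txt - /trap.asp exists only as bait and is never linked from any page
User-agent: *
Disallow: /trap.asp

<%
' trap.asp - hypothetical honeypot page. Anything requesting it either read
' robots.txt and deliberately ignored it, or followed a link humans can't see,
' so record the client IP for your block list.
Dim fso, logFile, ip
ip = Request.ServerVariables("REMOTE_ADDR")
Set fso = CreateObject("Scripting.FileSystemObject")
Set logFile = fso.OpenTextFile(Server.MapPath("/banned_ips.txt"), 8, True)  ' 8 = ForAppending
logFile.WriteLine ip & vbTab & Now
logFile.Close
Response.Status = "403 Forbidden"
%>

The hidden-link variant works the same way: point an invisible anchor at the same trap URI and treat anything that follows it as a bot.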

#10 abhijeet_dighe

Re: Stop automated program hits

Posted 18 January 2010 - 10:46 PM

dsherohman, on 15 Jan, 2010 - 02:35 AM, said:

Also a good idea; it's usually called a "honeypot", so you might have more luck googling it under that name...



Thank you for the nice explanation.
It is now clear to me how to identify bots.

But I am now facing the problem of how to block them.
I was thinking of using IP addresses to block them.
But I can't rely on IP addresses, because they are often dynamically assigned by the ISP. Another problem with relying on IP addresses is that on some corporate, academic and other private networks, as well as some ISPs, all computers are put behind a single IP address using NAT. This means there could be dozens, hundreds or, in extreme cases, thousands of computers sharing the same IP address.

Another option was cookies, but they can be disabled or cleared.

So what is the best way to block bots/machines once I have identified them as bots?

#11 dsherohman

Re: Stop automated program hits

Posted 19 January 2010 - 03:28 AM

abhijeet_dighe, on 19 Jan, 2010 - 05:46 AM, said:

But I am now facing the problem of how to block them.
I was thinking of using IP addresses to block them.
But I can't rely on IP addresses, because they are often dynamically assigned by the ISP.

That's really only an issue if you're creating permanent IP blocks. I use a program called "fail2ban" on my own servers which watches for potential attacks. When it spots something questionable, it blocks the offending IP address for (by default) 10 minutes. The vast, vast majority of attack bots will go away as soon as they see they've been blocked and never come back (or at least not within the next few days) - I've been running fail2ban for years on several domains and only once have I seen an attack which was still trying when the first 10 minute ban expired.

The bot-blocker I wrote for the hosting provider is a bit more aggressive, banning IP addresses for 3 days initially or for lower-volume bots, and for 30 days for extremely high-volume bots or if an admin blocks the address manually. This blocker is specifically targeted at web spiders/scrapers, which may be more persistent and require longer blocks because of it, but I haven't done the research to prove that; I just used the durations requested by the client. I suspect that 10 minutes would be sufficient to ward off most spiders/scrapers as well.

And, of course, for customers on dynamic IPs, limiting the blocking period to 10 minutes, or even a few hours, means that the ISP's other customers won't be affected, as the block will expire before any other customers are likely to want to visit your site.
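For reference, a fail2ban jail that produces this kind of short, temporary ban looks roughly like the following. The jail name, filter, and log path are placeholders, not anything from that setup:

# /etc/fail2ban/jail.local (sketch)
[bot-blocker]
# 'bot-blocker' refers to a filter you would define in filter.d/bot-blocker.conf
enabled  = true
filter   = bot-blocker
logpath  = /var/log/apache2/access.log
# ban an IP for 10 minutes once it trips the filter 50 times within 60 seconds
maxretry = 50
findtime = 60
bantime  = 600
action   = iptables-allports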

abhijeet_dighe, on 19 Jan, 2010 - 05:46 AM, said:

Another problem with relying on IP addresses is that on some corporate, academic and other private networks, as well as some ISPs, all computers are put behind a single IP address using NAT. This means there could be dozens, hundreds or, in extreme cases, thousands of computers sharing the same IP address.

In my experience, if malware has infected one host on a NATted subnet and is staging an attack, then it's probably infected most of that subnet, so you need to block the whole thing to stop the attack anyhow.

If the issue is someone testing a bot that doesn't quite work right or something like that rather than a malware infection, then it will only be coming from one host on the private net, sure... but only blocking the IP address for 10 minutes will heavily mitigate any issues caused by the entire subnet losing access to your site. In many cases, they'll just think the connection is acting up and hit 'reload' without thinking about it, by which time the 10 minutes are likely to be up and it will work again.

You could work around this by using the combination of IP address and one or more of the HTTP request headers (User-Agent is the most obvious choice), but my opinion is that any block done for security reasons should be done at the firewall level before the content of the incoming packets is known, which pretty much leaves the IP address as the only available piece of information to work with. (Looking at the payload potentially allows for attacks against the software inspecting it. This isn't normally a major concern, but, if I've already blocked someone as a likely attacker, then I get extra paranoid about the possibility that anything they send me could be a new attack, possibly against a different part of my system.)

#12 abhijeet_dighe

Re: Stop automated program hits

Posted 19 January 2010 - 05:45 AM

Thank you once again for the detailed explanation.
I will work on your suggestions.

#14 abhijeet_dighe

Re: Stop automated program hits

Posted 20 January 2010 - 04:54 AM

dsherohman, on 19 Jan, 2010 - 02:28 AM, said:

That's really only an issue if you're creating permanent IP blocks. I use a program called "fail2ban" on my own servers which watches for potential attacks...


Hi

I am now maintaining a blacklist of IPs. Whenever a blacklisted IP accesses an ASP page, code within that page looks the IP up in a blacklist table in the database and redirects it to another page. In ASP pages I can run server-side code to block IPs.
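A minimal sketch of that per-page check, written as an include. The table and column names, connection string, and redirect target are assumptions rather than the actual schema:

<%
' checkip.asp - hypothetical include placed at the top of each .asp page.
Dim conn, rs, sql, clientIp
clientIp = Request.ServerVariables("REMOTE_ADDR")

Set conn = Server.CreateObject("ADODB.Connection")
conn.Open Application("ConnString")  ' hypothetical connection string stored at application start

sql = "SELECT 1 FROM BlacklistedIPs WHERE IpAddress = '" & _
      Replace(clientIp, "'", "''") & "'"
Set rs = conn.Execute(sql)

If Not rs.EOF Then
    rs.Close
    conn.Close
    Response.Redirect "/blocked.asp"  ' hypothetical "access denied" page
    Response.End
End If

rs.Close
conn.Close
%>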

But I also have many static HTML pages that I don't want crawled either. How can I block bots from crawling the static HTML pages, given that my server-side code won't be executed for HTML pages? As it stands, they can easily continue crawling the static HTML pages.

Or will I have to do something at the system / IIS level instead of blocking at the application level, like the .htaccess file in Apache?

#15 dsherohman

Re: Stop automated program hits

Posted 20 January 2010 - 05:09 AM

abhijeet_dighe, on 20 Jan, 2010 - 11:54 AM, said:

But I also have many static HTML pages that I don't want crawled either. How can I block bots from crawling the static HTML pages, given that my server-side code won't be executed for HTML pages? As it stands, they can easily continue crawling the static HTML pages.

Or will I have to do something at the system / IIS level instead of blocking at the application level, like the .htaccess file in Apache?

Since the only code that executes on the server for static HTML page requests is code from the operating system and the web server, those are the only things which can react to enforce blocks.

At the web server level, updating .htaccess (or the IIS equivalent) would be the easiest way to do it. It would also be possible to write a custom apache module to intercept requests in an earlier stage of processing and kill the request if it's in, e.g., a blacklist database; if I were to do this, I would also put the logic for deciding when to add/remove blacklist entries into this module if possible, since that would ensure that it was applied to all URIs on the site. (I assume IIS has a plugin capability similar to apache modules, but I've never worked with IIS, so I don't know the details or even the names.)
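For the "IIS equivalent" part, on IIS 7 and later the deny can go in web.config, roughly like this. It requires the IP and Domain Restrictions feature to be installed and the ipSecurity section unlocked for delegation; on IIS 6 the same setting lives in the site's Directory Security properties instead, and the address below is just an example:

<!-- web.config sketch: deny a single address at the web-server level -->
<configuration>
  <system.webServer>
    <security>
      <ipSecurity allowUnlisted="true">
        <add ipAddress="203.0.113.7" allowed="false" />
      </ipSecurity>
    </security>
  </system.webServer>
</configuration>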

At the operating system level, you could enforce the blacklist with temporary firewall rules, provided you have sufficient access to the server to modify firewall settings.
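On Windows Server 2008 or later, for example, a temporary block of that sort can be added and later removed from the command line with the built-in firewall (the rule name and address are illustrative):

rem Block one address, then remove the rule when the ban expires
netsh advfirewall firewall add rule name="bot-block 203.0.113.7" dir=in action=block remoteip=203.0.113.7
netsh advfirewall firewall delete rule name="bot-block 203.0.113.7"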
