Download an Entire Website including Search Results

22 Replies - 3295 Views - Last Post: 24 June 2014 - 08:51 AM

#1 Poppins586   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 17
  • Joined: 09-June 14

Download an Entire Website including Search Results

Posted 09 June 2014 - 01:43 PM

I wrote some code to scrape through every inmate in two jails and download that information at this point, but I want to know if there's an easier way to do it with little to no troubleshooting. I can't build one application, move on to the next, and then go back to the original to make sure that it's working properly only to see that it's not.

I need something simple, preferably quick, and thorough for downloading an entire website, and I'd like to make it perform that operation twice a day. I'm good with Python so I posted it here in case I need to make a module.

Also, if that is possible, is there a way I can put it into an exe file that works? A lot of the time the applications that I convert to executables lack what they need to function properly (like launching a webdriver using selenium).

I can work on Windows or Linux at this point.

Replies To: Download an Entire Website including Search Results

#2 Poppins586   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 17
  • Joined: 09-June 14

Re: Download an Entire Website including Search Results

Posted 09 June 2014 - 01:54 PM

Oh oops, I forgot the links.

Here are the two websites I need to download the information from.

jil.macombgov.org

http://www.waynecoun...Disclaimer.aspx

Then, I also need to download all the case information from this website-

http://macombcountym...ces/home.page.7

(That link hasn't been working for me all day)

#3 Shadowys   User is offline

  • D.I.C Head
  • member icon

Reputation: 10
  • View blog
  • Posts: 64
  • Joined: 16-May 14

Re: Download an Entire Website including Search Results

Posted 13 June 2014 - 11:56 PM

You didn't use regex to do it, did you? :P Anyways just use urllib.request.urlopen(yoururl).read() to download the whole page.

This post has been edited by andrewsw: 19 June 2014 - 11:38 AM
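
A minimal sketch of fetching a single page with the standard library. It uses `urllib.request.urlopen`, which returns a response object with a `.read()` method; `urlretrieve`, by contrast, saves to a file and returns a `(filename, headers)` tuple. The function name `fetch_page` and the timeout value are assumptions made for illustration.

```python
import urllib.request


def fetch_page(url, timeout=30):
    """Return the raw bytes of one page.

    This is only the HTML the server sends; anything a script loads
    afterwards in the browser is not included.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

Calling `fetch_page(yoururl)` returns bytes, which can be decoded or written to disk as needed.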


#4 alexr1090   User is offline

  • D.I.C Head
  • member icon

Reputation: 44
  • View blog
  • Posts: 126
  • Joined: 08-May 11

Re: Download an Entire Website including Search Results

Posted 14 June 2014 - 07:28 AM

So when you say 'that information', are you referring to the information that comes up when you click 'more info' on an inmate?

I haven't researched much about the sites you want to scrape but I'm assuming they're grabbing this information from a database. It seems like if only you could figure out a way to download the database that it would be much less time consuming than going through each inmate with selenium. I'm not sure how to do that though.

Then there's also doing it that way and having multiple processes going after it. I've done something similar to your project before and never could get multiprocessing to work correctly.

We were going to do something very not pretty to get our results faster. Basically, we were just going to use our own slave computers that would do nothing but scrape the site we were going for. Each slave, which I'll emphasize was owned by us, would have a particular percentage to scrape. So computer 1 would search the first 10% of pages, 2 would do the next 10%, etc. Each slave then sent its results to a master, which would do whatever it is you want to do with the information you scraped.
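
The percentage split described above could be sketched roughly like this. It assumes the pages to scrape are already known as a list of identifiers; `split_work` and its parameters are hypothetical names, not code from the project mentioned in the post.

```python
def split_work(page_ids, n_workers):
    """Split page_ids into n_workers contiguous slices of near-equal size."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    base, extra = divmod(len(page_ids), n_workers)
    slices, start = [], 0
    for worker in range(n_workers):
        # The first `extra` workers take one additional page each.
        end = start + base + (1 if worker < extra else 0)
        slices.append(page_ids[start:end])
        start = end
    return slices
```

With ten workers, each slice holds about 10% of the pages, and each machine would be handed one slice.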

#5 Shadowys   User is offline

  • D.I.C Head
  • member icon

Reputation: 10
  • View blog
  • Posts: 64
  • Joined: 16-May 14

Re: Download an Entire Website including Search Results

Posted 14 June 2014 - 08:10 AM

View PostPoppins586, on 09 June 2014 - 01:43 PM, said:

I wrote some code to scrape through every inmate in two jails and download that information at this point, but I want to know if there's an easier way to do it with little to no troubleshooting. I can't build one application, move on to the next, and then go back to the original to make sure that it's working properly only to see that it's not.

I need something simple, preferably quick, and thorough for downloading an entire website, and I'd like to make it perform that operation twice a day. I'm good with Python so I posted it here in case I need to make a module.

Also, if that is possible, is there a way I can put it into an exe file that works? A lot of the time the applications that I convert to executables lack what they need to function properly (like launching a webdriver using selenium).

I can work on Windows or Linux at this point.

Btw, the cx_Freeze module from PyPI will detect your dependencies when freezing your Python module into an exe. If you want to run them in parallel, use asyncio or subprocess.Popen for that, and Beautiful Soup for general parsing.
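
As a rough illustration of the asyncio suggestion, the sketch below runs a blocking download function for several URLs at once while capping how many run at the same time. The names `fetch_all`, `fetch`, and `max_concurrent` are assumptions; `fetch` stands in for whatever blocking function does the real download.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrent=3):
    """Run the blocking fetch(url) for every URL, at most max_concurrent at a time."""
    limit = asyncio.Semaphore(max_concurrent)
    loop = asyncio.get_running_loop()

    async def fetch_one(url):
        async with limit:
            # Run the blocking call in the default thread pool so the loop stays free.
            return await loop.run_in_executor(None, fetch, url)

    # gather() returns results in the same order as `urls`.
    return await asyncio.gather(*(fetch_one(url) for url in urls))
```

It would be started with `asyncio.run(fetch_all(urls, fetch))`. Keeping `max_concurrent` small limits the load placed on the target site.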

#6 Poppins586   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 17
  • Joined: 09-June 14

Re: Download an Entire Website including Search Results

Posted 19 June 2014 - 09:21 AM

View Postalexr1090, on 14 June 2014 - 07:28 AM, said:

So when you say 'that information', are you referring to the information that comes up when you click 'more info' on an inmate?

I haven't researched much about the sites you want to scrape but I'm assuming they're grabbing this information from a database. It seems like if only you could figure out a way to download the database that it would be much less time consuming than going through each inmate with selenium. I'm not sure how to do that though.

Then there's also doing it that way and having multiple processes going after it. I've done something similar to your project before and never could get multiprocessing to work correctly.

We were going to do something very not pretty to get our results faster. Basically, we were just going to use our own slave computers that would do nothing but scrape the site we were going for. Each slave, which I'll emphasize was owned by us, would have a particular percentage to scrape. So computer 1 would search the first 10% of pages, 2 would do the next 10%, etc. Each slave then sent its results to a master, which would do whatever it is you want to do with the information you scraped.

Well Alex, I've done a lot of research on this and have consulted experts. They say that the method I'm using is probably the most efficient method known at the moment. There is a method that is more efficient, which would be just to hack into the system and download the information directly. If I'm not mistaken, the macombgov one is running on IIS and the waynecounty one is running on Apache. Neither of them use SQL. I know how to hack into SQL and by doing that I can extract the information almost immediately. Obviously there are some legal issues there, so yeah, it's not pretty, but the information is technically public. If only I could do that twice a day I'd be the happiest dude in the world.

At this point, I'm using 2 to 3, and soon up to 5, "computer slaves" to go in there and collect the information. They still need some supervision, not much, but just enough to piss me off (heheh). With that in mind, I'm working on using selenium and only selenium to perform these tasks, so that if we come across a problem with the site loading or Firefox not responding, the program will know how to deal with that. You should take a look at this code, it is EPIC. I can't share it though because of contractual obligations.

When I say "that information" I'm referring to both the list of inmates and then the inmate details with little to no redundancy and capturing all their aliases to make the data searchable.
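
The kind of unattended recovery mentioned above (a page that fails to load, a browser that stops responding) is often handled with a retry wrapper. The sketch below is a generic illustration only, not the poster's code; `with_retries`, `action`, and the wait times are hypothetical, and `action` stands in for whatever call drives the browser.

```python
import time


def with_retries(action, attempts=3, wait_seconds=10.0, sleep=time.sleep):
    """Call action() until it succeeds or `attempts` tries are used up."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of tries: let the caller see the last error
            sleep(wait_seconds * attempt)  # wait a little longer each time
```

In a real scraper it would probably be better to catch only the specific exceptions that signal a timeout or a crashed browser rather than every `Exception`.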

Quote

You didn't use regex to do it, did you? :P Anyways just use urllib.request.urlopen(yoururl).read() to download the whole page.


I believe that I've tried that as well as wget from the terminal. It downloads the page, but not the data, because the data is loaded by JavaScript. Any website that relies on a search form is very difficult to just download and call it a day.
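
One way to check whether a downloaded page really is missing its data is to count what the static HTML contains. The sketch below assumes, purely for illustration, that the results would appear as table rows; `TableProbe` and `probe` are hypothetical names. Scripts present but no rows suggests the rows are filled in by the browser after the page loads.

```python
from html.parser import HTMLParser


class TableProbe(HTMLParser):
    """Count <script> tags and table rows (<tr>) in a page's static HTML."""

    def __init__(self):
        super().__init__()
        self.scripts = 0
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        elif tag == "tr":
            self.rows += 1


def probe(html_text):
    """Return (number of scripts, number of table rows) found in html_text."""
    parser = TableProbe()
    parser.feed(html_text)
    parser.close()
    return parser.scripts, parser.rows
```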

Quote

Btw, the cx_Freeze module from PyPI will detect your dependencies when freezing your Python module into an exe. If you want to run them in parallel, use asyncio or subprocess.Popen for that, and Beautiful Soup for general parsing.


This information is probably going to be very useful for me, thank you.

#7 modi123_1   User is offline

  • Suitor #2
  • member icon



Reputation: 15329
  • View blog
  • Posts: 61,444
  • Joined: 12-June 08

Re: Download an Entire Website including Search Results

Posted 19 June 2014 - 11:09 AM

Quote

There is a method that is more efficient, which would be just to hack into the system and download the information directly. If I'm not mistaken, the macombgov one is running on IIS and the waynecounty one is running on Apache. Neither of them use SQL. I know how to hack into SQL and by doing that I can extract the information almost immediately.

Just a warning to keep it on this side of legal there, l33t. That and verify both sites do not exclude scraping of data.
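
One machine-readable part of that check is the site's robots.txt, which the standard library can parse. This is only a sketch: the helper name `allowed` and the sample rules in the usage note are made up, and robots.txt does not replace reading each site's terms of use.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, with the made-up rules `"User-agent: *\nDisallow: /search/\n"`, a URL under `/search/` is refused while other paths are permitted.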

#8 Poppins586   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 17
  • Joined: 09-June 14

Re: Download an Entire Website including Search Results

Posted 19 June 2014 - 11:17 AM

I would never hack into a system without the permission of the admin. I remember hearing a guy say that data science is half business and half hacking. I don't agree with that statement, but after I heard it I looked into hacking.

#9 modi123_1   User is offline

  • Suitor #2
  • member icon



Reputation: 15329
  • View blog
  • Posts: 61,444
  • Joined: 12-June 08

Re: Download an Entire Website including Search Results

Posted 19 June 2014 - 11:19 AM

Good story, but again - verify both sites are okay with you scraping data. Both have some fuzzy straps defined for 'use', but it certainly helps to check before you apply your "up to 5 'computer slaves'" for hitting a mess of sites.

#10 Poppins586   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 17
  • Joined: 09-June 14

Re: Download an Entire Website including Search Results

Posted 19 June 2014 - 11:34 AM

I'm kind of afraid to bring it up to them. I was hired under the premise that I'm working with data. We have a legal department to deal with this in case it becomes an issue. I hope that doesn't happen to be the case but, cool story huh? Thank you for the heads up mr moderator :D

#11 modi123_1   User is offline

  • Suitor #2
  • member icon



Reputation: 15329
  • View blog
  • Posts: 61,444
  • Joined: 12-June 08

Re: Download an Entire Website including Search Results

Posted 19 June 2014 - 11:38 AM

If it were me on the line I would check and get verification. Sites can get really persnickety when they are slammed with automated requests, especially low-level city or county sites.
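
A common way to avoid slamming a small site is to pause between requests. The sketch below is an illustration under assumptions: `fetch_politely`, `fetch`, and the five-second default are hypothetical, and a suitable delay is something to agree with the site owner rather than guess.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL in turn, pausing between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results
```

The `sleep` parameter exists only so the pause can be replaced in tests; normal use leaves it at `time.sleep`.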

#12 ArtificialSoldier   User is offline

  • D.I.C Lover
  • member icon

Reputation: 2411
  • View blog
  • Posts: 7,382
  • Joined: 15-January 14

Re: Download an Entire Website including Search Results

Posted 19 June 2014 - 12:05 PM

I would make every effort to avoid violating the terms of service on a law enforcement website, especially if you live in that jurisdiction.

Quote

The Wayne County web portal is owned and operated by the Wayne County government and provided as a service to the public. The use of the content, images and logos on the Wayne County web portal on any other web site or networked computer environment is prohibited.


#13 andrewsw   User is offline

  • never lube your breaks
  • member icon

Reputation: 6819
  • View blog
  • Posts: 28,255
  • Joined: 12-December 12

Re: Download an Entire Website including Search Results

Posted 19 June 2014 - 12:08 PM

You don't want to appear on the list of inmates..

This post has been edited by andrewsw: 19 June 2014 - 12:08 PM


#14 Poppins586   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 17
  • Joined: 09-June 14

Re: Download an Entire Website including Search Results

Posted 19 June 2014 - 12:10 PM

We are aware that the information is copyrighted and we are not using it for any commercial purposes. We are only using it for risk analysis and making predictions.

#15 ArtificialSoldier   User is offline

  • D.I.C Lover
  • member icon

Reputation: 2411
  • View blog
  • Posts: 7,382
  • Joined: 15-January 14

Re: Download an Entire Website including Search Results

Posted 19 June 2014 - 12:18 PM

What does your risk analysis team tell you about scraping the entire jail database twice a day? Do you predict that they will notice that traffic?

Seriously, just contact them, tell them what you're doing, and ask if they can provide an API for you to download the entire data set. Maybe they'll do your job for you.
