wget only downloads index.html


18 Replies - 1026 Views - Last Post: 31 May 2012 - 08:38 PM

#1 atraub

Posted 30 May 2012 - 07:19 PM

Hey guys, I'm trying to download all the contents of this site. Googling suggested wget, but I can't get it to download anything except the index.html file. I want to download everything at that directory or lower. What command do I pass to wget?

Replies To: wget only downloads index.html

#2 jimblumberg

Posted 31 May 2012 - 07:58 AM

Have you read the manual? Or did you try the help information: wget --help? What arguments did you supply? What operating system are you using?


Jim

#3 atraub

Posted 31 May 2012 - 08:06 AM

I use Windows, and I have looked in the manual (couldn't get much out of it). I've tried a multitude of arguments, but here's an example of what I figured it should be:
wget -r http://www1.vcrlter...._ES_Lidar_v1_1/

#4 jon.kiparsky

Posted 31 May 2012 - 08:19 AM

Read the manual. "wget respects the Robot Exclusion Standard"


/gisdata is in the robots file


http://www.vcrlter.v....edu/robots.txt
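
For anyone who wants to check this before pointing a crawler at a site, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt text in it is an assumption for illustration (the thread only confirms that /gisdata is listed), and example.org plus both paths are hypothetical placeholders, not the real site.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content; the thread only confirms that /gisdata is listed.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /gisdata
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.org and both paths are hypothetical placeholders.
    for url in ("http://example.org/gisdata/some_file.zip",
                "http://example.org/about.html"):
        verdict = "allowed" if is_allowed(ASSUMED_ROBOTS_TXT, url) else "disallowed"
        print(url, "->", verdict)
```

A well-behaved recursive downloader makes roughly this kind of check before following a link, which would explain why only index.html came down.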

#5 atraub

Posted 31 May 2012 - 08:27 AM

ack, alright, thanks guys.


#6 jon.kiparsky

Posted 31 May 2012 - 08:38 AM

The robots file means that they don't want you to download everything automatically. That's kind of the point. If you don't want to respect that, you can probably find a way, but I'd suggest you try the "send email and ask the site owner" approach BEFORE you go into writing scripts with curl or suchlike tools.

#7 atraub

Posted 31 May 2012 - 09:14 AM

There's an ignore-robots flag :-P

Very few people will actually be using this data, so I think they can handle the bandwidth hit.

#8 jon.kiparsky

Posted 31 May 2012 - 09:23 AM

Nice to know you know their needs better than they do.
May you get the courtesy you give.

#9 atraub

Posted 31 May 2012 - 09:59 AM

Regardless, I need all this information. Does it make a difference whether I go through and hit download a couple hundred times or have wget do it recursively?


#10 baavgai

Posted 31 May 2012 - 12:00 PM

It's more than just a request not to crawl the site; it's also a warning.

Many data-driven sites are little more than databases with wrappers. The URLs are intentionally dynamic and can change from page to page, even for the same location. Amazon does this, for instance. What this means is that a site crawler is likely to get lost, recursively pulling the entire site down forever.
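
To make the "getting lost" risk concrete, here is a rough Python sketch of a crawl loop that guards against it with a visited set, a depth limit, and a page cap. The site is faked in memory: `endless_site` is a hypothetical stand-in in which every page links to two URLs that have never been seen before, which is roughly what dynamic URLs look like to a crawler. None of this is meant to describe how wget works internally.

```python
from collections import deque
from typing import Callable, Iterable, List


def bounded_crawl(start: str,
                  get_links: Callable[[str], Iterable[str]],
                  max_depth: int = 3,
                  max_pages: int = 50) -> List[str]:
    """Breadth-first crawl that stops at max_depth and max_pages."""
    visited = {start}
    order: List[str] = []
    queue = deque([(start, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)  # stands in for actually fetching the page
        if depth >= max_depth:
            continue  # do not follow links past the depth limit
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


def endless_site(url: str) -> List[str]:
    """Hypothetical site: every page links to two never-before-seen URLs."""
    return [url + "a", url + "b"]


if __name__ == "__main__":
    pages = bounded_crawl("/p", endless_site, max_depth=3, max_pages=50)
    print(len(pages), "pages visited before the limits stopped the crawl")
```

The visited set alone is not enough on a site like this, since every URL is new. The depth limit and page cap are what actually end the crawl.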

#11 jon.kiparsky

Posted 31 May 2012 - 12:03 PM

@baavgai - Never argue with someone who warns you that they're going to drag you down to their level and beat you with their experience.

#12 atraub

Posted 31 May 2012 - 05:14 PM

Thanks baavgai

Ahh, guess I'm doing it manually.

Incidentally, this is Lidar data taken of Hog Island, which is located off the coast of Virginia. The island was inhabited until the '20s, when a hurricane hit and the shoreline started rapidly eroding. Since then, the island hasn't had any real inhabitants because it's mostly sand and there's very little holding the shore in place. A researcher has been studying the island extensively for a few years now, gathering all the data he possibly can. He found a small shrub on the island that has a decent root system and represents hope of making the land more firm and thus viable. That researcher was the driving force behind the gathering of this Lidar data. He is also the guy who reached out to my school and has asked that someone help him with analyzing the data using a GIS. I am the only guy helping with the analysis of that data.

Go ahead and make your shitty condescending remarks, I have nothing to apologize for.


#13 baavgai

Posted 31 May 2012 - 05:44 PM

Honestly, you can send the owner an email to that effect. They'll probably say yes or even send you the data zipped up.

I used to have a site on yurts, one of the only ones back in the day. The data made it to oddball places. I once got a request to put it on a Russian education CD.

If you put your info on the web, it's already public. Unless there's a technical or legal issue, site owners are often happy to see their work is making the rounds.

#14 atraub

Posted 31 May 2012 - 06:04 PM

I'll give it a try, thanks.

#15 jon.kiparsky

Posted 31 May 2012 - 06:19 PM

Quote

Go ahead and make your shitty condescending remarks, I have nothing to apologize for


Dude, it's your sig. If you don't like it, change it.

Quote

That researcher was the driving force behind the gathering of this Lidar data. He is also the guy who reached out to my school and has asked that someone help him with analyzing the data using a GIS. I am the only guy helping with the analysis of that data.


In that case, why so resistant to just asking for access?
I mean, if you're the guy this data is for, you should be on the server, not schlepping this stuff manually.
