7 Replies - 270 Views - Last Post: 25 July 2014 - 09:18 AM

#1 dovah

  • New D.I.C Head

Reputation: 1
  • Posts: 35
  • Joined: 05-July 14

Downloading web page source

Posted 23 July 2014 - 06:56 AM

I have several URLs stored in a text file. I'm trying to write a script that downloads each of those pages and creates a text file for each of them in my local home folder, but I'm stuck on how to proceed... Asking for help!

Here's my (pseudo)"code" so far:

import urllib.request
url = 'http://www.uniprot.org/uniprot/APBB1_HUMAN.txt'
req = urllib.request.Request(url)
page = urllib.request.urlopen(req)
src = page.readall()
print(src)
with open("query.txt", "w") as f: #writes a messed up file
    for x in src:
        f.write(str(x))


Thanks in advance!


Replies To: Downloading web page source

#2 f0ssil

  • New D.I.C Head

Reputation: 1
  • Posts: 8
  • Joined: 22-July 14

Re: Downloading web page source

Posted 23 July 2014 - 10:56 AM

You are trying to save the content of URLs, which is essentially HTML, into a txt file... Is there a specific reason for that?

#3 andrewsw

  • Fire giant boob nipple gun!

Reputation: 3220
  • Posts: 10,802
  • Joined: 12-December 12

Re: Downloading web page source

Posted 23 July 2014 - 11:23 AM

It is simpler than that: you don't need to Request() the data or call readall().
import urllib.request
url = 'http://www.uniprot.org/uniprot/APBB1_HUMAN.txt'
#req = urllib.request.Request(url)
page = urllib.request.urlopen(url)
#src = page.readall()
#print(src)
with open("query.txt", "w") as f: #writes a messed up file
    #for x in src:
    #    f.write(str(x))
    for x in page:
        f.write(str(x))
        f.write('\n')

I haven't worked out how to remove the '\n' text at the end of each line (to replace it with an actual newline); I'll leave that as an exercise.

This post has been edited by andrewsw: 23 July 2014 - 11:23 AM


#4 andrewsw

  • Fire giant boob nipple gun!

Reputation: 3220
  • Posts: 10,802
  • Joined: 12-December 12

Re: Downloading web page source

Posted 23 July 2014 - 11:35 AM

Well, this works:
import urllib.request
url = 'http://www.uniprot.org/uniprot/APBB1_HUMAN.txt'
#req = urllib.request.Request(url)
page = urllib.request.urlopen(url)
#src = page.readall()
#print(src)
with open("query.txt", "w") as f:
    #for x in src:
    #    f.write(str(x))
    for x in page:
        f.write(str(x)[:-3])
        f.write('\n')

although there must be an easier way to convert the literal '\n' into an actual newline character, probably by replacing '\\n' with '\n' (?).

Yeah,
    for x in page:
        f.write(str(x).replace('\\n','\n'))


#5 dovah

  • New D.I.C Head

Reputation: 1
  • Posts: 35
  • Joined: 05-July 14

Re: Downloading web page source

Posted 23 July 2014 - 11:22 PM

Thank you andrewsw!! ^_^

#6 dovah

  • New D.I.C Head

Reputation: 1
  • Posts: 35
  • Joined: 05-July 14

Re: Downloading web page source

Posted 24 July 2014 - 12:51 AM

f0ssil, on 23 July 2014 - 05:56 PM, said:

You are trying to save the content of URLs, which is essentially HTML, into a txt file... Is there a specific reason for that?


I have something like 17k webpages like this to look up, so it's better if I can find a programmatic way of doing this. :)

andrewsw, on 23 July 2014 - 06:35 PM, said:

Well, this works:
import urllib.request
url = 'http://www.uniprot.org/uniprot/APBB1_HUMAN.txt'
#req = urllib.request.Request(url)
page = urllib.request.urlopen(url)
#src = page.readall()
#print(src)
with open("query.txt", "w") as f:
    #for x in src:
    #    f.write(str(x))
    for x in page:
        f.write(str(x)[:-3])
        f.write('\n')

although there must be an easier way to convert the literal '\n' into an actual newline character, probably by replacing '\\n' with '\n' (?).

Yeah,
    for x in page:
        f.write(str(x).replace('\\n','\n'))


Thank you for this answer, that's awesome!
But I have a little "problem", which can be fixed using some shell magic... but I wonder where the 'b' at the beginning of each line (or the b' at the very beginning of the file) comes from?

I tried to str(x).replace() them with ''... but I can't get it done! Any hint would be appreciated :D

#7 andrewsw

  • Fire giant boob nipple gun!

Reputation: 3220
  • Posts: 10,802
  • Joined: 12-December 12

Re: Downloading web page source

Posted 24 July 2014 - 04:17 AM

I don't see that b' when I open the resultant file here, but it comes from the bytes type: urlopen() serves up raw bytes, and calling str() on a bytes object produces the literal form b'...', prefix and quotes included.
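
The cleaner way around both the b'' prefix and the '\n' business, assuming the page really is UTF-8 text (just a sketch), is to decode each bytes line instead of calling str() on it:
import urllib.request

url = 'http://www.uniprot.org/uniprot/APBB1_HUMAN.txt'
page = urllib.request.urlopen(url)
with open("query.txt", "w") as f:
    for x in page:
        # x is a bytes object; decode() gives back a str with its
        # real trailing newline, so no replace() trickery is needed
        f.write(x.decode('utf-8'))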

As you will be downloading 17k of these text files you should contact that organisation. First, I doubt that they will appreciate you hitting their server in this way. Second, they may be able to just "give" you a copy of these files, perhaps using the DBFetch service that they mention (even though their public statement says that this is limited to 200 requests).

From their license statement:

Quote

However, if you intend to distribute a modified version of one of our databases, you must ask us for permission first.

By downloading 17k of these files you are creating a version of their database.

In any case, I suspect that there must be an easier way to download entire text files, rather than the line-by-line writing in my example.
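
For instance, something along these lines (a sketch only: "urls.txt" is a made-up name for your file of URLs, one per line, and it assumes each page is plain text that you want saved in your home folder, as in your first post):
import os
import urllib.request

# read the URLs, one per line, from a text file
with open("urls.txt") as url_file:
    urls = [line.strip() for line in url_file if line.strip()]

for url in urls:
    fname = url.rsplit('/', 1)[-1]   # e.g. APBB1_HUMAN.txt
    page = urllib.request.urlopen(url)
    data = page.read()               # the whole response as one bytes object
    # writing the bytes untouched avoids the b'' and '\n' artefacts entirely
    with open(os.path.join(os.path.expanduser("~"), fname), "wb") as out:
        out.write(data)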

This post has been edited by andrewsw: 24 July 2014 - 04:18 AM


#8 f0ssil

  • New D.I.C Head

Reputation: 1
  • Posts: 8
  • Joined: 22-July 14

Re: Downloading web page source

Posted 25 July 2014 - 09:18 AM

This should help with being nice to the server while still downloading the text files quickly (read: asynchronously)...

import grequests
from requests.exceptions import ConnectionError
from io import open as iopen
from urlparse import urlsplit

url_list = ["http://www.uniprot.org/uniprot/O00213.txt"]

def download_files(url_list):
    rs = [grequests.get(u) for u in url_list]

    path = "--Your path for the folder--"
    # the last path component of each URL becomes the local file name
    file_names = [urlsplit(file_url)[2].split('/')[-1] for file_url in url_list]
    print file_names

    # size defines how many requests will be made concurrently
    # try not to swarm the server #BeNice :)
    for req_file, fname in zip(grequests.map(rs, size=50), file_names):
        if req_file is None:
            # grequests.map gives back None for a request that failed
            print "Request for %s failed" % fname
            continue
        try:
            with iopen(path + fname, 'wb') as out_file:
                out_file.write(req_file.content)
            print "%s downloaded successfully" % fname
        except ConnectionError:
            print "Request for %s failed" % fname

download_files(url_list)
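
If you would rather see why a request failed instead of just getting None back from grequests.map(), it also accepts an exception_handler callback (assuming your grequests version is recent enough to have it):

def on_error(request, exception):
    # called by grequests once for each request that raised an exception
    print "%s failed: %s" % (request.url, exception)

results = grequests.map(rs, size=50, exception_handler=on_error)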


