I have a tab-separated values file, data.tsv, that contains three columns (country name, area, and population). I would like to aggregate the data by geographic region (North America, South America, etc.). Since the region information is not in the file, I need to pull it from this webpage, www.indexmundi.com/factbook/regions, and merge it with my data.tsv to produce a file named 'data_withregion.tsv'.
I know I need to use BeautifulSoup4 and urllib2 in my code, and I have already managed to follow the links from the first page to the second, but I'm not sure how to add the region names to my file and merge them in. (Hopefully that makes sense.)
Here's a snippet of my current TSV file:
country	area	population
MACAU	28.2	578025
MONACO	2	30510
SINGAPORE	697	5353494
HONG KONG	1104	7153519
GAZA STRIP	360	1710257
GIBRALTAR	6.5	29034
HOLY SEE (VATICAN CITY)	0.44	836
BAHRAIN	760	1248348
MALDIVES	298	394451
MALTA	316	409836
BERMUDA	54	69080
SINT MAARTEN	34	39088
BANGLADESH	143998	161083804
..........
My code:
import urllib2, re
from bs4 import BeautifulSoup

response = urllib2.urlopen('http://www.indexmundi.com/factbook/regions').read()
soup = BeautifulSoup(response)
row = soup.findAll('li')
for link in row:
    href = link.find('a')['href']
    url = "http://www.indexmundi.com"
    countryurl = url + href
    response = urllib2.urlopen(countryurl).read()
    soup = BeautifulSoup(response)
    data_table = soup.findAll('td')
    for data in data_table:
        region = data.find('a')['href']
        print region
And here is what I want my final TSV file to look like:
country	region	area	population
AFGHANISTAN	Asia	652230	30419928
ALBANIA	Europe	28748	3002859
ALGERIA	Africa	2381741	37367226
AMERICAN SAMOA	Oceania	199	54947
ANDORRA	Europe	468	85082
ANGOLA	Africa	1246700	18056072
ANGUILLA	Central America & the Caribbean	91	15423
ANTIGUA AND BARBUDA	Central America & the Caribbean	442.6	89018
ARGENTINA	South America	2780400	42192494
ARMENIA	Asia	29743	2970495
ARUBA	Central America & the Caribbean	180	107635
AUSTRALIA	Oceania	7741220	22015576
AUSTRIA	Europe	83871	8219743
AZERBAIJAN	Asia	86600	9493600
.............
I don't think I need to keep following links from where I am, right? But I'm not sure how to match each region to the correct country in my file and then order the result like the example above.
I'd appreciate any help!
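For reference, the merge step described above could be sketched roughly as follows. This is only a sketch under assumptions: it is written for Python 3 (unlike the urllib2 code above), the function name add_regions is made up for illustration, and the regions dictionary is a hand-written stand-in for whatever country-to-region mapping the scraping ends up producing. Countries with no match get the placeholder "UNKNOWN", and rows are sorted by country name to mirror the desired output.

```python
import csv
import io


def add_regions(tsv_in, tsv_out, region_by_country):
    """Copy a country/area/population TSV, inserting a region column
    right after the country column and sorting rows by country name.

    region_by_country is assumed to map upper-case country names to
    region names (hypothetical structure, not taken from the site).
    """
    reader = csv.reader(tsv_in, delimiter="\t")
    writer = csv.writer(tsv_out, delimiter="\t", lineterminator="\n")

    header = next(reader)
    writer.writerow([header[0], "region"] + header[1:])

    rows = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        country = row[0]
        # Normalise the key so " Albania " and "ALBANIA" both match.
        region = region_by_country.get(country.strip().upper(), "UNKNOWN")
        rows.append([country, region] + row[1:])

    rows.sort(key=lambda r: r[0])
    writer.writerows(rows)


# Stand-in data: three rows and regions copied from the desired output above.
sample = (
    "country\tarea\tpopulation\n"
    "ALGERIA\t2381741\t37367226\n"
    "AFGHANISTAN\t652230\t30419928\n"
    "ALBANIA\t28748\t3002859\n"
)
regions = {"AFGHANISTAN": "Asia", "ALBANIA": "Europe", "ALGERIA": "Africa"}

out = io.StringIO()
add_regions(io.StringIO(sample), out, regions)
result = out.getvalue()
print(result)
```

With real files, the same function could be called with open('data.tsv', newline='') as the input and open('data_withregion.tsv', 'w', newline='') as the output. Any country that still shows "UNKNOWN" afterwards probably has a spelling difference between the two sources and would need a manual fix in the mapping.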
