Getting and Merging data into tsv file help


19 Replies - 3049 Views - Last Post: 16 March 2013 - 05:44 PM

#1 jellyworms

  • New D.I.C Head

Reputation: 0
  • Posts: 11
  • Joined: 15-March 13

Getting and Merging data into tsv file help

Posted 15 March 2013 - 12:43 AM

Hi, I just started learning Python a few days ago... and I'm already stuck on something easy TT.TT

I have a tab-separated values file, data.tsv, that contains 3 columns (country name, area, and population). I would like to aggregate the data by geographic region, such as North America, South America, etc. Since the region info is not in the file, I need to pull it from this webpage, www.indexmundi.com/factbook/regions, and merge it with my data.tsv to produce a file named 'data_withregion.tsv'.

I know I need to use BeautifulSoup4 and urllib2 in my code, and I've gotten as far as following the links on the first page to the second, but I'm not sure how to add the region names to my file and merge them. (Hopefully that makes sense.)

here's a snippet of my current tsv file
country	area	population 
MACAU	28.2	578025 
MONACO	2	30510 
SINGAPORE	697	5353494 
HONG KONG	1104	7153519 
GAZA STRIP	360	1710257 
GIBRALTAR	6.5	29034 
HOLY SEE (VATICAN CITY)	0.44	836 
BAHRAIN	760	1248348 
MALDIVES	298	394451 
MALTA	316	409836 
BERMUDA	54	69080 
SINT MAARTEN	34	39088 
BANGLADESH	143998	161083804
.......... 

my code:
import urllib2, re
from bs4 import BeautifulSoup


response = urllib2.urlopen('http://www.indexmundi.com/factbook/regions').read()
soup = BeautifulSoup(response)
row = soup.findAll('li')
for link in row:
    href = link.find('a')['href']
    url = "http://www.indexmundi.com"
    countryurl = url + href
    response = urllib2.urlopen(countryurl).read()
    soup = BeautifulSoup(response)
    data_table = soup.findAll('td')
    for data in data_table:
        region = data.find('a')['href']
        print region

and what I want my final tsv file to look like:
country	region	area	population
AFGHANISTAN	Asia	652230	30419928
ALBANIA	Europe	28748	3002859
ALGERIA	Africa	2381741	37367226
AMERICAN SAMOA	Oceania	199	54947
ANDORRA	Europe	468	85082
ANGOLA	Africa	1246700	18056072
ANGUILLA	Central America & the Caribbean	91	15423
ANTIGUA AND BARBUDA	Central America & the Caribbean	442.6	89018
ARGENTINA	South America	2780400	42192494
ARMENIA	Asia	29743	2970495
ARUBA	Central America & the Caribbean	180	107635
AUSTRALIA	Oceania	7741220	22015576
AUSTRIA	Europe	83871	8219743
AZERBAIJAN	Asia	86600	9493600
.............

I don't think I need to keep reading into the links from where I'm at right? But then I'm not sure how to merge the regions into the file with the correct country and order it like the above.

I'd appreciate any help!

Replies To: Getting and Merging data into tsv file help

#2 ajit.nayak87

  • New D.I.C Head

Reputation: -1
  • Posts: 48
  • Joined: 11-March 13

Re: Getting and Merging data into tsv file help

Posted 15 March 2013 - 12:48 AM

I would like to know a few things here.
1) How are you receiving the data?
2) I couldn't find the difference between your current file and the file you're looking for.
3) Please elaborate.


#3 jellyworms

  • New D.I.C Head

Reputation: 0
  • Posts: 11
  • Joined: 15-March 13

Re: Getting and Merging data into tsv file help

Posted 15 March 2013 - 01:00 AM

Ah, sorry about that. In the tsv file I would like to get, a region column (next to country) has been added. It's kinda hard to see if you don't look closely, because the columns aren't exactly aligned.

country region area population
AFGHANISTAN Asia 652230 30419928
ALBANIA Europe 28748 3002859

Basically, it's not there in my current tsv file and I need to insert it.

I'm getting the regions name from reading the links here http://www.indexmund...actbook/regions
So, using urllib2, I'm reading each country name link, and the page it opens has a list of region names (which is what I want).

Hopefully I understood and answered your questions correctly.

Here's my current tsv file attached if that helps.

http://wikisend.com/...448684/data.tsv

#4 ajit.nayak87

  • New D.I.C Head

Reputation: -1
  • Posts: 48
  • Joined: 11-March 13

Re: Getting and Merging data into tsv file help

Posted 15 March 2013 - 01:06 AM

Let me know whether this is the correct format you're looking for or not.


#5 jellyworms

  • New D.I.C Head

Reputation: 0
  • Posts: 11
  • Joined: 15-March 13

Re: Getting and Merging data into tsv file help

Posted 15 March 2013 - 01:12 AM

Yes, this is what I want (below), with the region names taken from the webpage and merged with the right country in my tsv file.

country region  area    population
AFGHANISTAN Asia    652230  30419928
ALBANIA Europe  28748   3002859
ALGERIA Africa  2381741 37367226
AMERICAN SAMOA  Oceania 199 54947
ANDORRA Europe  468 85082
ANGOLA  Africa  1246700 18056072

#6 ajit.nayak87

  • New D.I.C Head

Reputation: -1
  • Posts: 48
  • Joined: 11-March 13

Re: Getting and Merging data into tsv file help

Posted 15 March 2013 - 01:18 AM

OK,
1) Once you've finished writing everything, copy the whole text and paste it into Excel. You will get the data in the format I gave.
2) If you don't want to use that method, I need a snapshot of the output window at each step (for response and so on), and of where you're writing to the text file.

#7 jellyworms

  • New D.I.C Head

Reputation: 0
  • Posts: 11
  • Joined: 15-March 13

Re: Getting and Merging data into tsv file help

Posted 15 March 2013 - 01:24 AM

I'm sorry, I don't think I'm fully understanding what you're saying...?

did you mean snapshot of my python output?

and what did you mean by "data format what I have given"?

#8 ajit.nayak87

  • New D.I.C Head

Reputation: -1
  • Posts: 48
  • Joined: 11-March 13

Re: Getting and Merging data into tsv file help

Posted 15 March 2013 - 01:29 AM

1) Open your tsv file, then copy the whole thing and paste it into an Excel sheet.

You will get it in the required format.
2) Instead of saving it as a tsv file, save the file as csv so you get the data directly in your format.

3) The output I asked for is this: whatever you write to the tsv file, the Python interpreter shows what was written to the file. I need a snapshot of that.
4) How are you saving the file in your program? (What is the name of the file you've been saving the data to?)





#9 jellyworms

  • New D.I.C Head

Reputation: 0
  • Posts: 11
  • Joined: 15-March 13

Re: Getting and Merging data into tsv file help

Posted 15 March 2013 - 01:43 AM

I need to keep it as a tsv file. I know that we use \t to give it the tab-separated value formatting. It's just how it shows up here.
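
For reference, a minimal sketch of writing tab-separated rows with the standard csv module, which takes care of joining the fields with \t. It assumes Python 3 (the code in this thread is Python 2); the write_tsv helper and the in-memory buffer are made up for illustration, and the sample rows are copied from the target layout in the first post.

```python
import csv
import io

# Rows in the target layout: country, region, area, population.
rows = [
    ["country", "region", "area", "population"],
    ["AFGHANISTAN", "Asia", "652230", "30419928"],
    ["ALBANIA", "Europe", "28748", "3002859"],
]


def write_tsv(rows, fh):
    """Write each row as one tab-separated line to the open file object fh."""
    writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


# Demonstrate with an in-memory buffer. A real file would be opened with
# open("data_withregion.tsv", "w", newline="") instead.
buf = io.StringIO()
write_tsv(rows, buf)
print(buf.getvalue())
```

The file only looks misaligned in a viewer because tab stops depend on field width; the data itself is still one tab between each pair of fields.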

I haven't been saving anything yet. I've just been printing it in cmd to first see my output (which is just the list of region names I grabbed from the webpage - see below).

Algeria
Angola
Benin
Botswana
Burkina Faso
Burundi
Cameroon
Cape Verde
Central African Republic
Chad
Comoros
Congo, Democratic Republic of the
Congo, Republic of the
Cote d'Ivoire
Djibouti
Egypt
etc....


and I get the above output through this code:
import urllib2, re
from bs4 import BeautifulSoup

response = urllib2.urlopen('http://www.indexmundi.com/factbook/regions').read()
soup = BeautifulSoup(response)
row = soup.findAll('li')
for link in row:
    href = link.find('a')['href']
    url = "http://www.indexmundi.com"
    countryurl = url + href
    response = urllib2.urlopen(countryurl).read()
    soup = BeautifulSoup(response)
    data_table = soup.findAll('td')
    for data in data_table:
        region = data.find('a').text
        print region


though I think I would somehow need to save the country name first so that when I do write the region names to my current tsv file, it will merge with the correct country it's under.

#10 ajit.nayak87

  • New D.I.C Head

Reputation: -1
  • Posts: 48
  • Joined: 11-March 13

Re: Getting and Merging data into tsv file help

Posted 15 March 2013 - 01:53 AM

I am trying to understand this. The file you shared earlier and the output you're getting don't match.
From your code I would expect country \t region \t population,
but your code is only printing the country. Is the tsv file you shared correct or not?


Have you tried this? Copy all the data from your tsv file and paste it into an Excel file.
I have attached a PDF file showing how it looks after changing the tsv to an Excel file.












Attached File(s)

  • test.pdf (147.4K)

#11 jellyworms

  • New D.I.C Head

Reputation: 0
  • Posts: 11
  • Joined: 15-March 13

Re: Getting and Merging data into tsv file help

Posted 15 March 2013 - 02:00 AM

My python code is not complete. I am stuck where I am so that's why I need help. Right now, I only have it printing out all the region names and that's it. I don't need to get other stuff like country name, area, and population because I have that info (from the tsv file I attached earlier - same as your pdf). I just need to figure out how to write those region names to the tsv file that I have and have it write under the correct country.

#12 ajit.nayak87

  • New D.I.C Head

Reputation: -1
  • Posts: 48
  • Joined: 11-March 13

Re: Getting and Merging data into tsv file help

Posted 15 March 2013 - 02:22 AM

Thanks for clarifying.

First you need to do this: always go line by line. First check whether you're getting a proper response, using a print statement. If you're not getting a response at all, it's a waste of time to continue further.

The code below uses http://www.indexmundi.com/factbook/regions, so I'm only taking the regions.

So first check your response, and post a screenshot after the result is printed.






import urllib2, re
from bs4 import BeautifulSoup

response = urllib2.urlopen('http://www.indexmundi.com/factbook/regions').read()
print response  # check that the page was actually fetched
soup = BeautifulSoup(response)
row = soup.findAll('li')
for link in row:
    href = link.find('a')['href']
    url = "http://www.indexmundi.com"
    countryurl = url + href
    response = urllib2.urlopen(countryurl).read()
    soup = BeautifulSoup(response)
    data_table = soup.findAll('td')
    for data in data_table:
        region = data.find('a').text
        print region





#13 ajit.nayak87

  • New D.I.C Head

Reputation: -1
  • Posts: 48
  • Joined: 11-March 13

Re: Getting and Merging data into tsv file help

Posted 15 March 2013 - 03:37 AM

I am trying to run your program, but it does not work. You should post your full code.

As I told you, first check whether you're getting a proper response or not.





import urllib2, re
from bs4 import BeautifulSoup


#14 baavgai

  • Dreaming Coder

Reputation: 5905
  • Posts: 12,809
  • Joined: 16-October 07

Re: Getting and Merging data into tsv file help

Posted 15 March 2013 - 04:48 AM

First, stop scraping that poor site all the time. Scrape it once and store the data.

Going by what you have:
import urllib2, re
from BeautifulSoup import BeautifulSoup

def getSoup(path):
	BASE_URL='http://www.indexmundi.com'
	response = urllib2.urlopen(BASE_URL + path).read()
	return BeautifulSoup(response)

def getRegionCountry():
	data = [ ]
	for region in getSoup('/factbook/regions').findAll('li'):
		regionName = region.text
		countriesSoup = getSoup(region.find('a')['href'])
		data.extend([ (regionName, e.find('a').text) for e in countriesSoup.findAll('td') ])
	return data

with open('RegionCountryLookup.py', 'w') as fh:
	fh.write('RegionCountryLookup = ' + str(getRegionCountry()))



This produces a file that looks like:
RegionCountryLookup = [(u'Africa', u'Algeria'), (u'Africa', u'Angola'), (u'Africa', u'Benin'), (u'Africa', u'Botswana'), (u'Africa', u'Burkina Faso'), (u'Africa', u'Burundi'), (u'Africa', u'Cameroon'), (u'Africa', u'Cape Verde'), (u'Africa', u'Central African Republic'), (u'Africa', u'Chad'), ...
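
If importing a generated .py file feels awkward, the same "scrape once, store the data" idea can be sketched with a JSON cache instead. This is only an illustration, assuming Python 3; the save_pairs and load_pairs helpers and the cache file name are hypothetical, and the sample pairs are taken from the output shown above.

```python
import json
import os
import tempfile


def save_pairs(pairs, path):
    """Store (region, country) pairs as a JSON list of two-item lists."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump([list(p) for p in pairs], fh)


def load_pairs(path):
    """Read the pairs back as a list of (region, country) tuples."""
    with open(path, encoding="utf-8") as fh:
        return [tuple(p) for p in json.load(fh)]


# A few pairs from the scraped output; the cache location is made up.
sample = [("Africa", "Algeria"), ("Africa", "Angola"), ("Africa", "Benin")]
cache_path = os.path.join(tempfile.gettempdir(), "region_country_cache.json")
save_pairs(sample, cache_path)
restored = load_pairs(cache_path)
print(restored)
```

JSON has no tuple type, so the pairs come back as lists and are converted to tuples on load.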



You can now use that in your code. You can make a dictionary of that and do a lookup, e.g.
import RegionCountryLookup

# ...

# make a dictionary
lookup = dict((c.upper(), r) for r, c in RegionCountryLookup.RegionCountryLookup )

for line in dat:
	country, area, population = line.split('\t')
	if country in lookup:
		print "\t".join([country, lookup[country], area, population])
	else:
		print "\t".join([country, 'UNKNOWN', area, population])
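
The same dictionary lookup can be sketched end to end, including the header row, the trailing whitespace visible in the sample data, and alphabetical ordering. This is a rough Python 3 sketch rather than the Python 2 used above; build_lookup and merge_regions are hypothetical names, and the in-memory files stand in for data.tsv and data_withregion.tsv.

```python
import csv
import io


def build_lookup(pairs):
    """Map upper-cased country name to region, matching the TSV's upper-case names."""
    return {country.upper(): region for region, country in pairs}


def merge_regions(in_fh, out_fh, lookup):
    """Copy a country/area/population TSV to out_fh with a region column inserted."""
    reader = csv.reader(in_fh, delimiter="\t")
    writer = csv.writer(out_fh, delimiter="\t", lineterminator="\n")
    # The first line is the header; put the new column right after "country".
    header = [h.strip() for h in next(reader)]
    writer.writerow([header[0], "region"] + header[1:])
    # Strip stray whitespace and skip blank or short lines.
    rows = [[f.strip() for f in row[:3]] for row in reader if len(row) >= 3]
    # Sorting the rows orders them by country name.
    for country, area, population in sorted(rows):
        writer.writerow([country, lookup.get(country, "UNKNOWN"), area, population])


# Pairs and rows taken from samples in this thread; MACAU is left out of
# the lookup on purpose to show the UNKNOWN fallback.
pairs = [("Asia", "Afghanistan"), ("Europe", "Albania")]
source = io.StringIO(
    "country\tarea\tpopulation \n"
    "ALBANIA\t28748\t3002859\n"
    "MACAU\t28.2\t578025 \n"
    "AFGHANISTAN\t652230\t30419928\n"
)
out = io.StringIO()
merge_regions(source, out, build_lookup(pairs))
print(out.getvalue())
```

Countries whose names are spelled differently on the website and in the file will come out as UNKNOWN and need to be mapped by hand.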


This post has been edited by baavgai: 15 March 2013 - 04:48 AM

#15 jellyworms

  • New D.I.C Head

Reputation: 0
  • Posts: 11
  • Joined: 15-March 13

Re: Getting and Merging data into tsv file help

Posted 16 March 2013 - 03:18 AM

@baavgai - thanks, I played around with your code because I didn't want to save it into another py file or anything like that. But I'm struggling with the dictionary and lookup part. I get either an attribute error or a value error.

This is my code now:
import urllib2, re
from bs4 import BeautifulSoup

def getSoup(path):
	BASE_URL='http://www.indexmundi.com'
	response = urllib2.urlopen(BASE_URL + path).read()
	return BeautifulSoup(response)

data = [ ]
for region in getSoup('/factbook/regions').findAll('li'):
	regionName = region.text
	countriesSoup = getSoup(region.find('a')['href'])
	data.extend([(regionName, e.find('a').text) for e in countriesSoup.findAll('td')])
#print str(data)
	
    
##########The code below doesn't work###########

# make a dictionary
lookup = dict((c.upper(), r) for r, c in data)

for line in open("data.tsv", "r"):
	country, area, population = line.split('\t')
	if country in lookup:
		print "\t".join([country, lookup[country], area, population])
	else:
		print "\t".join([country, 'UNKNOWN', area, population])

################################################


I hope someone can help me out with this part :)
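
Without the traceback it's hard to say which line fails, so treat this as a guess. Unpacking line.split('\t') into three names raises a ValueError whenever a line doesn't have exactly three fields (a blank line at the end of the file, for example), and e.find('a').text raises an AttributeError if some td has no link, because find returns None in that case. A small sketch of a more forgiving line parser, assuming Python 3; parse_line is a hypothetical helper, not something from the thread:

```python
def parse_line(line):
    """Return (country, area, population), or None if the line is not a data row."""
    fields = [f.strip() for f in line.rstrip("\r\n").split("\t")]
    if len(fields) != 3 or not fields[0]:
        return None
    return tuple(fields)


# Sample lines in the shape of data.tsv, including a blank line that
# would break a plain three-way unpack.
lines = [
    "country\tarea\tpopulation \n",
    "MACAU\t28.2\t578025 \n",
    "\n",
    "MONACO\t2\t30510 \n",
]
parsed = [parse_line(line) for line in lines[1:]]  # skip the header row
rows = [p for p in parsed if p is not None]
print(rows)
```

Stripping each field also removes the trailing newline from population, which otherwise ends up inside the joined output line.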