HTML Scraping
We need to import some modules first:
import urllib2
import urllib
import json
from BeautifulSoup import BeautifulSoup
Now let’s define a method that will return translated text:
def fromHtml(self, text, languageFrom, languageTo):
    """
    Returns translated text that is scraped from Google Translate's HTML
    source code.
    """
    #A dictionary mapping language names to the language codes Google uses.
    langCode={
        "arabic":"ar", "bulgarian":"bg", "chinese":"zh-CN",
        "croatian":"hr", "czech":"cs", "danish":"da", "dutch":"nl",
        "english":"en", "finnish":"fi", "french":"fr", "german":"de",
        "greek":"el", "hindi":"hi", "italian":"it", "japanese":"ja",
        "korean":"ko", "norwegian":"no", "polish":"pl", "portuguese":"pt",
        "romanian":"ro", "russian":"ru", "spanish":"es", "swedish":"sv" }
    #Set the user agent.
    urllib.FancyURLopener.version = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008070400 SUSE/3.0.1-0.1 Firefox/3.0.1"
    #Look up both language codes; an unknown language raises KeyError.
    try:
        langpair = "%s|%s" %(langCode[languageFrom.lower()], langCode[languageTo.lower()])
    except KeyError, error:
        print "Currently we do not support %s" %(error.args[0])
        return
    #Encode the parameters we're going to send to the Google servers.
    postParameters = urllib.urlencode({"langpair":langpair, "text":text, "ie":"UTF8", "oe":"UTF8"})
    #Send the request with the above parameters and save to 'page' variable.
    page = urllib.urlopen("http://translate.google.com/translate_t", postParameters)
    #content now contains the HTML source code of the website.
    content = page.read()
    #Don't forget to close the connection!
    page.close()
Let's break this down.
First we create a dictionary that maps language names to the codes Google uses. Then we set a user agent for our scraper and encode the translation parameters into the request body. Finally we send the request and read the returned HTML into a local variable.
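If you want to see what those steps produce before anything goes over the wire, here is a minimal sketch that only builds the request and never sends it. It assumes Python 3 (the tutorial code above is Python 2), and the names `LANG_CODES`, `USER_AGENT` and `build_request` are made up for this example. The user-agent string is a placeholder too.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Trimmed copy of the tutorial's table (assumption: extend it as needed).
LANG_CODES = {"english": "en", "french": "fr", "german": "de", "spanish": "es"}

# Hypothetical user-agent string, for illustration only.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/0.1"


def build_request(text, language_from, language_to):
    """Build (but do not send) the POST request described above."""
    # A KeyError here means the language is not in the table.
    pair = "%s|%s" % (LANG_CODES[language_from.lower()],
                      LANG_CODES[language_to.lower()])
    # A list of tuples keeps the parameter order predictable.
    body = urlencode([("langpair", pair), ("text", text),
                      ("ie", "UTF8"), ("oe", "UTF8")])
    return Request("http://translate.google.com/translate_t",
                   data=body.encode("ascii"),
                   headers={"User-Agent": USER_AGENT})


request = build_request("hello world", "English", "Spanish")
print(request.get_method())          # POST, because a body is attached
print(request.data.decode("ascii"))  # langpair=en%7Ces&text=hello+world&ie=UTF8&oe=UTF8
```

Notice that urlencode turns the "|" into %7C and the space into "+", which is exactly why we don't paste the raw text into the request ourselves.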
So far so good.
Now let's use BeautifulSoup to scrape what we need.
    #content already holds the HTML source code we read in the previous step.
    htmlSource = BeautifulSoup(content)
    #Google creates a span whose title is the same as the text you wanted to translate.
    #So let's find a 'span' whose title is the 'text' we passed to this method.
    translation = htmlSource.find('span', title=text )
    #The renderContents() method returns the body that is inside of the span we found.
    return translation.renderContents()
We use the .find() method to find the span whose title attribute equals the text we submitted. This is specific to Google's markup, so scraping a different page is just a matter of finding a similar pattern in its HTML.
The .renderContents() method returns the inner contents of the tag.
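To make the "find a span by its title" pattern concrete without hitting Google, here is a rough sketch that runs against a hand-written HTML snippet. To keep it dependency-free it swaps BeautifulSoup for the standard library's HTMLParser (Python 3), so this is not the tutorial's method, just the same idea. `SpanByTitle`, `find_span_text` and the sample markup are all invented for the example and are not Google's real page.

```python
from html.parser import HTMLParser


class SpanByTitle(HTMLParser):
    """Collects the text inside the first <span> whose title matches."""

    def __init__(self, title):
        HTMLParser.__init__(self)
        self.title = title
        self.depth = 0      # greater than 0 while inside the matching span
        self.done = False   # True once the matching span has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "span" or self.done:
            return
        if self.depth:
            self.depth += 1  # a nested span inside the match
        elif dict(attrs).get("title") == self.title:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def find_span_text(html, title):
    """Return the text of the matching span, or None if there is none."""
    parser = SpanByTitle(title)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks) if parser.done else None


# Hand-written sample markup (assumption), not a real Google response.
sample = '<div id="result"><span title="hello world">hola mundo</span></div>'
print(find_span_text(sample, "hello world"))  # hola mundo
```

Unlike .renderContents(), this sketch returns only the text and drops any inner tags, which is usually what you want for a translation anyway.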
We're done! The method will return translated text!
Official AJAX Response
Using the AJAX response is better in my opinion because you save bandwidth: the service returns a small JSON document instead of the complete HTML source of the page.
Here’s how you do it:
def fromAjax(self, text, languageFrom, languageTo):
    """
    Returns a simple string translating the text from "languageFrom" to
    "languageTo" using Google Translate AJAX Service.
    """
    LANG={
        "arabic":"ar", "bulgarian":"bg", "chinese":"zh-CN",
        "croatian":"hr", "czech":"cs", "danish":"da", "dutch":"nl",
        "english":"en", "finnish":"fi", "french":"fr", "german":"de",
        "greek":"el", "hindi":"hi", "italian":"it", "japanese":"ja",
        "korean":"ko", "norwegian":"no", "polish":"pl", "portuguese":"pt",
        "romanian":"ro", "russian":"ru", "spanish":"es", "swedish":"sv" }
    base_url='http://ajax.googleapis.com/ajax/services/language/translate?'
    #Unknown names are passed through unchanged, so a raw code like "en" also works.
    langpair='%s|%s'%(LANG.get(languageFrom.lower(),languageFrom),
                      LANG.get(languageTo.lower(),languageTo))
    params=urllib.urlencode( (('v',1.0),
                              ('q',text.encode('utf-8')),
                              ('langpair',langpair),) )
    url=base_url+params
    content=urllib2.urlopen(url).read()
    #Different json modules expose different function names, so try each in turn.
    try:
        trans_dict=json.loads(content)
    except AttributeError:
        try:
            trans_dict=json.load(content)
        except AttributeError:
            trans_dict=json.read(content)
    return trans_dict['responseData']['translatedText']
It's even easier! This AJAX request returns the translated text without the need to scrape and parse HTML using an external library. It Just Works™.
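Here is a small sketch of the two halves of that method, building the URL and unpacking the JSON, that you can run without a network connection. It assumes Python 3, and `build_ajax_url` and `extract_translation` are names I made up. The sample payload is hand-written and only mirrors the responseData/translatedText keys used above, so treat the handling of a missing responseData as an assumption.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://ajax.googleapis.com/ajax/services/language/translate?"

# Trimmed copy of the tutorial's table (assumption: extend it as needed).
LANG = {"english": "en", "french": "fr", "german": "de", "spanish": "es"}


def build_ajax_url(text, language_from, language_to):
    """Build the request URL; unknown names are passed through as-is."""
    pair = "%s|%s" % (LANG.get(language_from.lower(), language_from),
                      LANG.get(language_to.lower(), language_to))
    return BASE_URL + urlencode([("v", "1.0"), ("q", text), ("langpair", pair)])


def extract_translation(raw):
    """Return the translated text from a JSON body, or None if absent."""
    payload = json.loads(raw)
    data = payload.get("responseData") or {}
    return data.get("translatedText")


print(build_ajax_url("hello world", "English", "es"))
# Hand-written sample body (assumption), not a captured response.
sample = '{"responseData": {"translatedText": "hola mundo"}}'
print(extract_translation(sample))  # hola mundo
```

Returning None instead of raising keeps the caller simple: it can check the result once rather than wrapping every call in try/except.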
I hope this helps you learn a bit more about Python.
Thanks for reading and leave some feedback!
