9 Replies - 723 Views - Last Post: 28 April 2013 - 01:45 PM

#1 Orochimaru

Unicode nightmare...

Posted 23 April 2013 - 03:31 AM

Hi,

I'm having trouble making BeautifulSoup4 handle Swedish special characters "åäö".

I tried to fix things by following Ned Batchelder's "Pro tip #1: Unicode sandwich" approach.
http://nedbatchelder...unipain.html#35 - (Tutorial version: http://nedbatchelder...xt/unipain.html )

I've checked that my target website isn't lying to me about what encoding it uses. UTF-8 is what we should be dealing with at the edges of my program's sandwich, and unicode should be the type at its core.

But I'm doing something wrong with my decoding (input) and encoding (output) steps. Help, please! (See the source code at the bottom of this post.)

$ curl -i http://www.polisen.se/Stockholms_lan/Aktuellt/Handelser/ | tee curl.txt

HTTP/1.1 200 OK
Connection: keep-alive
Date: Tue, 23 Apr 2013 10:08:47 GMT
Set-Cookie: ASP.NET_SessionId=0z43ej45e0v0jr45wvmq1t55; path=/; HttpOnly
Cache-Control: no-cache
Pragma: no-cache
Expires: -1
Content-Type: text/html; charset=utf-8
Content-Length: 118733
Vary: Accept-Encoding



<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="sv" lang="sv">

<head id="ctl00_Head1"><title>
	Händelser - Aktuellt - www.polisen.se
</title>
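As a sanity check, the declared charset can also be read straight from the response headers instead of eyeballing the curl output (a minimal sketch, assuming urllib2's usual response object):

# -*- coding: utf-8 -*-
import urllib2

url1 = "http://www.polisen.se/Stockholms_lan/Aktuellt/Handelser/"
page = urllib2.urlopen(url1)

# "Content-Type: text/html; charset=utf-8" -> 'utf-8'
charset = page.info().getparam('charset')
print charset

html = page.read()
enigma = html.decode(charset or 'utf-8')   # fall back to utf-8 if nothing is declared
print type(enigma)

And here is my current script: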


# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib2
import re

# Incidents
url1 = "http://www.polisen.se/Stockholms_lan/Aktuellt/Handelser/"
url2 = "http://www.polisen.se/Stockholms_lan/Aktuellt/Handelser/Handelsearkiv/"
page = urllib2.urlopen(url1)
html = page.read()
enigma = html.decode('utf-8')

soup = BeautifulSoup(enigma)
#print soup.prettify()


#incidents = soup.find_all('a')
#incidents = soup.find_all(href=re.compile("Handelser\/Stockholm"))
#incidents = soup.find_all("a", text=re.compile(".*[Ff]ylleri"))
incidents = soup.find_all("a", text=re.compile(".*Rån"))

incidents_utf8 = incidents.encode('utf-8')


# $ python script | tee ./results.txt
for row in incidents_utf8:
#for row in incidents:
    print "%r" % row





Replies To: Unicode nightmare...

#2 Orochimaru

Re: Unicode nightmare...

Posted 24 April 2013 - 12:45 PM

Since I'm pretty sure what encoding my target website is giving me (utf-8), I tried to untangle things further by following "Pro tip #2: Know what you have" - bytes or unicode?

Output:
$ python Rån_search_01.py 

url1 =

page =
<type 'instance'>

html =
<type 'str'>

enigma =
<type 'unicode'>

incidents =
<class 'bs4.element.ResultSet'>

<a href="/Stockholms_lan/Aktuellt/Handelser/Stockholms-lan/2013-04-24-0520-Rattfylleri-Solna/">2013-04-24 05:20, Rattfylleri, Solna</a>
type(row) =
<class 'bs4.element.Tag'>

<a href="/Stockholms_lan/Aktuellt/Handelser/Stockholms-lan/2013-04-21-0718-Trafikolycka-smitning-fran-Huddinge/">2013-04-21 07:18, Rattfylleri, Huddinge</a>
type(row) =
<class 'bs4.element.Tag'>

<a href="/Stockholms_lan/Aktuellt/Handelser/Stockholms-lan/2013-04-20-0647-Rattfylleri-Stockholms-lan/">2013-04-20 06:47, Rattfylleri, Stockholms län</a>
type(row) =
<class 'bs4.element.Tag'>



What I don't understand is why my output is able to give me the Swedish special character ä when I use the ASCII-only search pattern ".*[Ff]ylleri", but when I use a Swedish special character in my search pattern ".*Rån" I get no output at all.

Test code
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib2
import re

# Incidents
url1 = "http://www.polisen.se/Stockholms_lan/Aktuellt/Handelser/"
print "\nurl1 ="
#print type(url1)
#url2 = "http://www.polisen.se/Stockholms_lan/Aktuellt/Handelser/Handelsearkiv/"

page = urllib2.urlopen(url1)
print "\npage ="
print type(page)

html = page.read()
print "\nhtml ="
print type(html)

enigma = html.decode('utf-8')
print "\nenigma ="
print type(enigma)

soup = BeautifulSoup(enigma)
#print soup.prettify()

#incidents = soup.find_all('a')
#incidents = soup.find_all(href=re.compile("Handelser\/Stockholm"))
incidents = soup.find_all("a", text=re.compile(".*[Ff]ylleri"))
#incidents = soup.find_all("a", text=re.compile(".*Rån"))
print "\nincidents ="
print type(incidents)


#incidents_utf8 = incidents.encode('utf-8')

# $ python script | tee ./results.txt
#for row in incidents_utf8:
for row in incidents:
    print "\n%r\ntype(row) =" % row
    print type(row)
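One side note that may matter here (not from the original thread, just a guess worth testing): in Python 2 a pattern written as a plain str literal in a UTF-8 source file is a byte string, and a byte-string regex generally won't match the decoded unicode text that BeautifulSoup stores, while a u"..." pattern will. A quick sketch with made-up incident text:

# -*- coding: utf-8 -*-
import re

# Hypothetical incident text, stored as unicode (as BeautifulSoup does).
text = u"2013-04-22 18:30, Rån, Stockholm"

print re.search(".*Rån", text)    # str pattern = UTF-8 bytes '.*R\xc3\xa5n' -> no match (None)
print re.search(u".*Rån", text)   # unicode pattern u'.*R\xe5n' -> match object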



#3 alexr1090

Re: Unicode nightmare...

Posted 25 April 2013 - 04:17 PM

Isn't dealing with encodings fun? This type of problem has happened to me before, when I was trying to create a CSV file from unicode data. Anyway, I'm a bit confused about what would solve your issue. If you told me what you're trying to do, I might be able to help more. So, do you want to get rid of unicode?

#4 Orochimaru

Re: Unicode nightmare...

Posted 26 April 2013 - 03:02 AM

Hi Alex,

This is what I want to do.

1.
Start from the Swedish Police website:
http://www.polisen.s...ellt/Handelser/

2.
I want to write a Python script that uses BeautifulSoup4 to filter out specific incidents, so that I get a list of only the incidents I specify.
When I send this search pattern to BeautifulSoup, I get a list of drunkenness incidents:
incidents = soup.find_all("a", text=re.compile(".*[Ff]ylleri"))

But when I send a search pattern that contains the Swedish special character å, I get no output and no exception at all. "Rån" is meant to get a list of robbery incidents:
incidents = soup.find_all("a", text=re.compile(".*Rån"))


3.

Quote

So do you want to get rid of unicode?

I don't know! I just want my script to understand me when I send my "Rån" (= robbery) search pattern to BeautifulSoup.
incidents = soup.find_all("a", text=re.compile(".*Rån"))
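In the spirit of "know what you have", here is one small diagnostic sketch that might help (it reuses the working [Ff]ylleri pattern; it is not part of my original script): print the repr() of the link texts BeautifulSoup actually stores, to see whether they are unicode and which code points the å/ä/ö end up as.

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib2
import re

url1 = "http://www.polisen.se/Stockholms_lan/Aktuellt/Handelser/"
soup = BeautifulSoup(urllib2.urlopen(url1).read().decode('utf-8'))

# Show the exact type and code points of each matching link text.
for link in soup.find_all("a", text=re.compile(".*[Ff]ylleri")):
    print type(link.string), repr(link.string)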

#5 alexr1090

Re: Unicode nightmare...

Posted 26 April 2013 - 02:03 PM

Well, this was tough. I spent a while trying to figure it out. It turns out these links aren't written using the special character: the actual href in each link is spelled with "Ran", so of course your pattern wasn't finding them. Then I got confused trying to work out exactly how that find_all function works, so I rewrote a bit of the code to make it work. Anyway, this is what I came up with. Hopefully it helps.

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib2
import re

print u'\xe5'    # quick check that the terminal can print å at all

# Incidents
url1 = "http://www.polisen.se/Stockholms_lan/Aktuellt/Handelser/"
url2 = "http://www.polisen.se/Stockholms_lan/Aktuellt/Handelser/Handelsearkiv/"
page = urllib2.urlopen(url1)
html = page.read()
# NB: the page declares charset=utf-8; latin-1 only works here because the href check below is pure ASCII.
enigma = html.decode('latin-1')
print type(enigma)

soup = BeautifulSoup(enigma)

incidents = []
print u".*R\xe5n"    # what the unicode version of the search pattern looks like

#incidents = soup.find_all('a')
#incidents = soup.find_all(href=re.compile("Handelser\/Stockholm"))
#incidents = soup.find_all("a", text=re.compile(".*[Ff]ylleri"))
for link in soup.find_all('a'):
    href = link.get('href')
    # The hrefs spell robbery as "Ran" without the special character, so match on that.
    if href is not None and 'Ran' in href:
        incidents.append(href)

print incidents

# $ python script | tee ./results.txt
#for row in incidents_utf8:
#for row in incidents:
#    print "%r" % row




Note that some of this is stuff I was using for testing, so it isn't necessary; that should be obvious enough when you see some of those print statements. Also note that the check if 'Ran' in link.get('href') will cause issues: if a link happens to contain, for instance, "IRan", it will count as a match too. Nevertheless, I think this will be enough to get you started. Good luck!
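If those false positives ever become a problem, a word-boundary regex on the href is one option (a hedged sketch with made-up example hrefs, since the exact robbery URLs aren't shown in this thread):

# -*- coding: utf-8 -*-
import re

# Hypothetical hrefs in the format shown earlier in the thread.
hrefs = [
    "/Stockholms_lan/Aktuellt/Handelser/Stockholms-lan/2013-04-22-1830-Ran-Stockholm/",
    "/Stockholms_lan/Aktuellt/Handelser/Stockholms-lan/2013-04-21-0718-IRan-exempel/",   # should NOT match
]

# \bRan\b only matches "Ran" as its own hyphen-delimited word, so "IRan" is skipped.
ran = re.compile(r"\bRan\b")

for href in hrefs:
    if ran.search(href):
        print href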

#6 Orochimaru

Re: Unicode nightmare...

Posted 28 April 2013 - 10:36 AM

Thanks Alex, that got my project's gears moving again!

#7 Orochimaru

Re: Unicode nightmare...

Posted 28 April 2013 - 11:17 AM

The eagle has landed, thank you!
[SOLVED]

#8 alexr1090

Re: Unicode nightmare...

Posted 28 April 2013 - 11:40 AM

Sweet. Glad I could help. I'm actually doing something slightly similar to this for the court system where I live, so I just had to help you. Now, if you happen to be making CSV files based on all this information, that would be insane, because that's exactly what I'm doing. Anyway, again: glad I could help!

#9 Orochimaru

Re: Unicode nightmare...

Posted 28 April 2013 - 01:44 PM

Learning how to work with Excel and CSV files will be my next project after this one.
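For when that CSV step comes up, here is a minimal Python 2 sketch (the rows are made up): the standard csv module works on byte strings, so a common approach is to keep the data as unicode and encode each field to UTF-8 right before writing.

# -*- coding: utf-8 -*-
import csv

rows = [
    (u"2013-04-22 18:30", u"Rån", u"Stockholm"),      # hypothetical rows
    (u"2013-04-24 05:20", u"Rattfylleri", u"Solna"),
]

with open("incidents.csv", "wb") as f:                # binary mode for csv in Python 2
    writer = csv.writer(f)
    for row in rows:
        writer.writerow([field.encode("utf-8") for field in row])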

#10 alexr1090

Re: Unicode nightmare...

Posted 28 April 2013 - 01:45 PM

Haha, wow. Well, it's not too hard, but if you have a question about that, post it!
