Reading files using python
Page 1 of 1
11 Replies - 5316 Views - Last Post: 07 February 2011 - 09:53 AM
#1
Reading files using python
Posted 24 January 2011 - 09:29 AM
I have a problem reading files that contain Arabic characters using Python.
Can anyone help, please?
Replies To: Reading files using python
#2
Re: Reading files using python
Posted 24 January 2011 - 09:49 AM
#3
Re: Reading files using python
Posted 24 January 2011 - 09:50 AM
#4
Re: Reading files using python
Posted 06 February 2011 - 08:36 AM
Simown, on 24 January 2011 - 09:49 AM, said:
Yes, they are. I tried storing them with ANSI encoding and it worked. The problem now is how to change the encoding using Python rather than the editor.
Thanks for the help
Motoma, on 24 January 2011 - 09:50 AM, said:
input = open('wisam3.txt', 'r')
text=input.read()
Thanks
#5
Re: Reading files using python
Posted 07 February 2011 - 05:17 AM
wisamfile = open('wisam3.txt', 'r')
text = wisamfile.read()
wisamfile.close()
# text is a byte string. If there are non-ASCII characters in it,
# Python cannot know what they mean,
# so you need to tell Python which codec to use.
# Let's turn it into a unicode string,
# assuming the file was encoded in UTF-8:
text_unicode = unicode(text, 'utf-8')
# Now do something with your unicode text
print text_unicode
# Let's save it with a different encoding (iso8859_6),
# just for demonstration. I suggest you stick to UTF-8 when working with Unicode.
text_iso8859_6 = text_unicode.encode('iso8859_6')
wisam_arab = open('wisam3_arabencoding.txt', 'w')
wisam_arab.write(text_iso8859_6)
wisam_arab.close()
#6
Re: Reading files using python
Posted 07 February 2011 - 07:48 AM
Thanks a lot, you were really helpful. But can I find out the text file's encoding automatically using Python, without going back and opening the file to check? I need that in my project.
Thanks again
#7
Re: Reading files using python
Posted 07 February 2011 - 08:03 AM
You could use some heuristics to check the text files and pick the most likely encoding candidate, but that is unreliable and messy (and I am not an expert on this, so don't expect any advice from me).
As for Arabic letters: chances are decent that either iso8859_6 (a one-byte-per-character encoding that extends ASCII) or UTF-8 (the most widely used encoding for Unicode) was used (again, I am not an expert on that).
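A minimal sketch of the "try the likely candidates" heuristic described above. The function name guess_decode and the candidate list are made up for illustration; cp1256 is included on the assumption that the "ANSI" option on an Arabic Windows system means that code page. This is a guess, not real detection: a single-byte code page accepts nearly any byte sequence, so it only makes sense as the last candidate.

```python
def guess_decode(raw, candidates=('utf-8-sig', 'cp1256')):
    """Return (encoding_name, text) for the first candidate that decodes raw.

    Hypothetical helper: the order of candidates is the whole heuristic.
    UTF-8 is strict, so it goes first; the single-byte fallback goes last.
    """
    for name in candidates:
        try:
            return name, raw.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings matched')


# Small demonstration with the Arabic word "salam"
sample = u'\u0633\u0644\u0627\u0645'
print(guess_decode(sample.encode('utf-8'))[0])
print(guess_decode(sample.encode('cp1256'))[0])
```

The result says only that the bytes are valid in that encoding, not that the file was actually saved with it, so treat it as a best guess.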
#8
Re: Reading files using python
Posted 07 February 2011 - 08:30 AM
OK, but the code you wrote raises an error when trying to create the Arabic text file.
It prints the text on the screen but does not create the file.
Here is the error:
Traceback (most recent call last):
  File "C:\Users\one.omary\Desktop\Asummerizer.py", line 36, in <module>
    text_iso8859_6 = text_unicode.encode('iso8859_6')
  File "C:\Python25\lib\encodings\iso8859_6.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>
This post has been edited by atraub: 07 February 2011 - 09:13 AM
Reason for edit: code tags for readability
#9
Re: Reading files using python
Posted 07 February 2011 - 09:13 AM
U+FEFF is the byte order mark (BOM) for encoded Unicode files, but I don't understand why it shows up here.
Can you show me the whole code you used to produce that error? Obviously you merged my snippet into your code, but without seeing it I won't understand what went wrong.
And another question: are you using Python 2.x or 3.x? I wrote my posts assuming Python 2.x, but dealing with strings is one of the few things that changed considerably in Python 3.x.
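One possible explanation, stated as an assumption because the thread does not say how the file was saved: some Windows editors (Notepad among them) write a UTF-8 BOM at the start of the file, and the plain 'utf-8' codec keeps it as the character U+FEFF instead of dropping it. The sketch below builds such bytes by hand rather than reading the poster's file.

```python
import codecs

arabic = u'\u0633\u0644\u0627\u0645'

# Bytes as an editor that writes a BOM might have saved them (assumption)
raw = codecs.BOM_UTF8 + arabic.encode('utf-8')

kept = raw.decode('utf-8')         # plain utf-8 keeps the BOM as U+FEFF
dropped = raw.decode('utf-8-sig')  # utf-8-sig removes a leading BOM if present

print(kept[0] == u'\ufeff')
print(dropped == arabic)

# iso8859_6 has no slot for U+FEFF, which matches the traceback in the thread
try:
    kept.encode('iso8859_6')
except UnicodeEncodeError:
    print('cannot encode the BOM character to iso8859_6')
```

The 'utf-8-sig' codec is available from Python 2.5 onward, so it should work with the C:\Python25 install shown in the traceback.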
#10
Re: Reading files using python
Posted 07 February 2011 - 09:26 AM
Ok, here is the code:
from __future__ import division
import re
import math
from array import array
input = open('wisam4.txt', 'r')
text=input.read()
input.close()
text_unicode = unicode(text, 'utf-8')
print text_unicode
text_iso8859_6 = text_unicode.encode('iso8859_6')
wisam_arab = open('wisam3_arabencoding.txt', 'w')
wisam_arab.write(text_iso8859_6)
wisam_arab.close()
And here is the error:
Traceback (most recent call last):
  File "C:\Users\one.omary\Desktop\2.py", line 14, in <module>
    text_iso8859_6 = text_unicode.encode('iso8859_6')
  File "C:\Python25\lib\encodings\iso8859_6.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>
And I am using Python 2.x
#11
Re: Reading files using python
Posted 07 February 2011 - 09:44 AM
It seems the first bytes of an encoded Unicode text file may or may not contain a byte order mark. If it is there, it indicates which encoding was used. It is not part of the text; it only describes the encoding.
If it is there, you can strip it from the string to avoid the encoding error.
Sorry I am not all that helpful here.
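A hedged sketch of the stripping step suggested above. The helper convert_file and the file names are hypothetical; it relies on the standard codecs module, where reading with 'utf-8-sig' drops a leading BOM if there is one. The demo writes its own throwaway input file instead of using wisam4.txt.

```python
import codecs
import os
import tempfile


def convert_file(src_path, dst_path,
                 src_encoding='utf-8-sig', dst_encoding='iso8859_6'):
    """Re-encode a text file (hypothetical helper, not from the thread)."""
    src = codecs.open(src_path, 'r', encoding=src_encoding)
    try:
        text = src.read()  # unicode text, leading BOM already removed
    finally:
        src.close()
    dst = codecs.open(dst_path, 'w', encoding=dst_encoding)
    try:
        dst.write(text)    # encoded to dst_encoding on the way out
    finally:
        dst.close()
    return text


# Demo: create a UTF-8 file that starts with a BOM, then convert it
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'with_bom.txt')
dst_path = os.path.join(workdir, 'arabic_iso8859_6.txt')

f = open(src_path, 'wb')
f.write(codecs.BOM_UTF8 + u'\u0633\u0644\u0627\u0645'.encode('utf-8'))
f.close()

text = convert_file(src_path, dst_path)

f = open(dst_path, 'rb')
result_bytes = f.read()
f.close()
```

If the text has already been decoded with plain 'utf-8', removing the mark by hand with text_unicode.lstrip(u'\ufeff') before encoding should have the same effect.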
#12
Re: Reading files using python
Posted 07 February 2011 - 09:53 AM
Thanks Nallo, you were really helpful