11 Replies - Last Post: 07 February 2011 - 09:53 AM

#1 wisam abbasi

Reading files using python

Posted 24 January 2011 - 09:29 AM

Hi,
I have a problem reading files that contain Arabic characters using Python.
Can anyone help, please?

Replies To: Reading files using python

#2 Simown

Re: Reading files using python

Posted 24 January 2011 - 09:49 AM

Are the files stored in a Unicode encoding?

If so, I suggest reading:
Reading and Writing Unicode Data
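
A minimal sketch of the kind of approach that documentation describes: let the file object decode while reading and encode while writing, so the program only handles unicode strings. The helper names `read_text` and `write_text` and the UTF-8 default are assumptions for illustration, and `io.open` assumes Python 2.6 or later (on Python 2.5, `codecs.open(path, 'r', encoding=...)` plays the same role).

```python
import io

def read_text(path, encoding='utf-8'):
    """Return the file's contents decoded to a unicode string."""
    # io.open decodes while reading, using the named encoding.
    f = io.open(path, 'r', encoding=encoding)
    try:
        return f.read()
    finally:
        f.close()

def write_text(path, text, encoding='utf-8'):
    """Encode a unicode string and write it to the file."""
    f = io.open(path, 'w', encoding=encoding)
    try:
        f.write(text)
    finally:
        f.close()
```

Changing a file's encoding is then a read with one codec name followed by a write with another.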

#3 Motoma

Re: Reading files using python

Posted 24 January 2011 - 09:50 AM

Could you post your code? You may need to specify that you want to read your file in binary mode: myfile = open('filename.txt', 'rb')
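
Binary mode is also a handy way to see what is really stored in the file before any decoding happens. The sketch below is only an illustration; the helper name `peek_bytes` and the default of 16 bytes are assumptions, not something from the thread.

```python
def peek_bytes(path, count=16):
    """Return the first `count` raw bytes of a file, undecoded."""
    # 'rb' = read binary: no decoding and no newline translation.
    f = open(path, 'rb')
    try:
        return f.read(count)
    finally:
        f.close()
```

Printing `repr(peek_bytes('wisam3.txt'))` shows the leading bytes as escapes. A UTF-8 file that starts with a byte order mark begins with `\xef\xbb\xbf`.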

#4 wisam abbasi

Re: Reading files using python

Posted 06 February 2011 - 08:36 AM

Simown, on 24 January 2011 - 09:49 AM, said:

The files are stored with Unicode encoding?

Yes, they are. I tried saving them with ANSI encoding in the editor, and that worked. The problem now is how to change the encoding from Python rather than from the editor.
Thanks for the help.

Motoma, on 24 January 2011 - 09:50 AM, said:

Could you post your code?

input = open('wisam3.txt', 'r')
text=input.read()



Thanks

#5 Nallo

Re: Reading files using python

Posted 07 February 2011 - 05:17 AM

When you read a text file, you need to tell Python which encoding was used:

wisamfile = open('wisam3.txt', 'r')
text = wisamfile.read()
wisamfile.close()

# text is a byte string. If it contains non-ASCII bytes,
# Python cannot know what they mean,
# so you need to tell Python which codec to use.
# Let's turn it into a unicode string,
# assuming the file was encoded in UTF-8:
text_unicode = unicode(text, 'utf-8')

# Now do something with your unicode text.
print text_unicode

# Let's save it with a different encoding (iso8859_6),
# just for demonstration. I suggest you stick to UTF-8 when working with Unicode.
text_iso8859_6 = text_unicode.encode('iso8859_6')
wisam_arab = open('wisam3_arabencoding.txt', 'w')
wisam_arab.write(text_iso8859_6)
wisam_arab.close()



#6 wisam abbasi

Re: Reading files using python

Posted 07 February 2011 - 07:48 AM


Thanks a lot, that was really helpful. But can I detect a text file's encoding automatically from Python, without having to go back and open the file myself? I need that in my project.

Thanks again

#7 Nallo

Re: Reading files using python

Posted 07 February 2011 - 08:03 AM

Unfortunately, there is no way to be sure which encoding a text file uses. This is not Python-specific; it applies to any programming language. A text file is, after all, just a sequence of bytes. Those bytes have meaning only when you know the encoding.

You could use heuristics to check the text file and pick the most likely candidate encoding, but that is unreliable and messy (and I am not an expert on this, so don't expect any advice from me).

As for Arabic letters, chances are decent that either iso8859_6 (a one-byte-per-character encoding that extends ASCII) or UTF-8 (the most widely used Unicode encoding) was used (again, I am not an expert on that).
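
One simple heuristic of that kind is to try each candidate codec in turn and keep the first one that decodes without error. The sketch below is only an illustration: the function name `guess_decode` and the candidate order are assumptions, and a wrong codec can still decode some files without complaint, so the result is a guess rather than a detection. UTF-8 goes first because it is strict and rejects most text stored in a one-byte encoding.

```python
def guess_decode(raw, candidates=('utf-8-sig', 'iso8859_6')):
    """Return (text, encoding) for the first candidate that decodes raw bytes."""
    for encoding in candidates:
        try:
            # 'utf-8-sig' is UTF-8 that also skips a leading byte order mark.
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings fit')
```

Here `raw` is the byte string read from a file opened in binary mode.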

#8 wisam abbasi

Re: Reading files using python

Posted 07 February 2011 - 08:30 AM


OK, but the code you wrote raises an error when it tries to create the Arabic text file.
It prints the text on the screen but does not create the file.
Here is the error:

Traceback (most recent call last):
  File "C:\Users\one.omary\Desktop\Asummerizer.py", line 36, in <module>
    text_iso8859_6 = text_unicode.encode('iso8859_6')
  File "C:\Python25\lib\encodings\iso8859_6.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>
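
A likely cause, assuming the file was saved by an editor that writes a UTF-8 byte order mark: the plain 'utf-8' codec keeps that mark as the character U+FEFF at position 0, and ISO-8859-6 has no slot for it. The following is a small reproduction under that assumption, using four Arabic letters as stand-in data rather than the real file.

```python
import codecs

# Stand-in for the file's bytes: a UTF-8 byte order mark, then four Arabic letters.
raw = codecs.BOM_UTF8 + u'\u0633\u0644\u0627\u0645'.encode('utf-8')

# The plain 'utf-8' codec keeps the mark as the first character, U+FEFF.
text_unicode = raw.decode('utf-8')
starts_with_bom = text_unicode.startswith(u'\ufeff')

try:
    text_unicode.encode('iso8859_6')
    failed = False
except UnicodeEncodeError:
    failed = True  # U+FEFF cannot be represented in ISO-8859-6

# Without the leading U+FEFF the same letters encode fine.
encoded = text_unicode[1:].encode('iso8859_6')
```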


#9 Nallo

Re: Reading files using python

Posted 07 February 2011 - 09:13 AM

Sorry, I am at a loss here.

U+FEFF is the byte order mark (BOM) for encoded Unicode files, but I don't understand why it shows up here.

Can you show me the whole code you used to produce that error? Obviously you merged my snippet into your code, but without seeing it I won't understand what went wrong.

And another question: are you using Python 2.x or 3.x? I wrote my posts assuming Python 2.x, but string handling is one of the few things that changed considerably in Python 3.x.

#10 wisam abbasi

Re: Reading files using python

Posted 07 February 2011 - 09:26 AM



Ok, here is the code:

from __future__ import division
import re
import math
from array import array


input = open('wisam4.txt', 'r')
text=input.read()
input.close()


text_unicode = unicode(text, 'utf-8')
print text_unicode
text_iso8859_6 = text_unicode.encode('iso8859_6')  

wisam_arab = open('wisam3_arabencoding.txt', 'w')  

wisam_arab.write(text_iso8859_6)  

wisam_arab.close() 



and here is the error:

Traceback (most recent call last):
  File "C:\Users\one.omary\Desktop\2.py", line 14, in <module>
    text_iso8859_6 = text_unicode.encode('iso8859_6')
  File "C:\Python25\lib\encodings\iso8859_6.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>



And I am using Python 2.x.

#11 Nallo

Re: Reading files using python

Posted 07 February 2011 - 09:44 AM

I googled around a bit and found the unicode FAQ.

It seems the first bytes of an encoded Unicode text file may or may not contain a byte order mark (BOM). If it is there, it indicates which encoding was used. It is not part of the text; it only describes the encoding.
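
A rough sketch of checking for that marker, using the BOM constants from the standard `codecs` module. The function name `sniff_bom` and the table layout are assumptions for illustration; the 'utf-32' codec name assumes Python 2.6 or later.

```python
import codecs

# Check the UTF-32 marks first: the UTF-32 LE mark begins with the UTF-16 LE one.
BOMS = (
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
)

def sniff_bom(raw):
    """Return a codec name suggested by a leading BOM, or None if there is none."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return encoding
    return None
```

The codec names returned here ('utf-8-sig', 'utf-16', 'utf-32') all consume the mark while decoding, so it does not end up in the text. A result of `None` says nothing either way: many files carry no BOM at all.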

If it is there, you can strip it from the string to avoid the encoding error.
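
Two ways to do that, sketched with hypothetical helper names (`decode_without_bom` and `strip_bom` are not from the thread): decode with the 'utf-8-sig' codec, which skips a leading mark, or remove a leading U+FEFF after decoding.

```python
def decode_without_bom(raw):
    """Decode UTF-8 bytes, dropping a leading byte order mark if present."""
    # 'utf-8-sig' behaves like 'utf-8' but skips a BOM at the start.
    return raw.decode('utf-8-sig')

def strip_bom(text):
    """Remove a leading U+FEFF from an already decoded unicode string."""
    if text.startswith(u'\ufeff'):
        return text[1:]
    return text
```

In the Python 2 code posted above, the first option amounts to `text_unicode = unicode(text, 'utf-8-sig')`, after which the `encode('iso8859_6')` call no longer meets U+FEFF.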

Sorry I am not all that helpful here.

#12 wisam abbasi

Re: Reading files using python

Posted 07 February 2011 - 09:53 AM



Thanks Nallo, you were really helpful