7 Replies - 530 Views - Last Post: 10 May 2012 - 07:56 PM

#1 Laggy

Reading/Writing XML files encoded in a charset other than default

Posted 10 May 2012 - 11:41 AM

Hi there!

I have an XML file that is encoded in LATIN-1, and I'm trying to read it, manipulate it, and then write it back out in the same encoding. My problem is that the data I read from the file can't be manipulated as expected, and when I write the output to the new file, I get null characters for all the line breaks. If anyone has any ideas, they would be much appreciated!

Example:

File file = new File(filepath);
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(file), "iso-8859-1"));

ArrayList output = new ArrayList();
String line = null;

while ((line = in.readLine()) != null){
     //here's where I get the first problem
     if (line.contains("<?xml")){
          System.out.println("XML Header, not copying");  //this never gets printed, was just a debug test
     }
     else {
          output.add(line);
          System.out.println(line);
     }
}

DoStuff(output);  //NYI because each arraylist node has unexpected values because the read isn't working properly

OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(new File(filepath)), "iso-8859-1");

for (int i=0; i<output.size(); i++){
     out.write((String)output.get(i));
}




Here are the first two lines of output I get in the console from line 14:

ÿþ< ? x m l   v e r s i o n = " 1 . 0 "   e n c o d i n g = " i s o - 8 8 5 9 - 1 " ? > 
 
 < ? x m l - s t y l e s h e e t   h r e f = " R e n d e r i n g / l o g . x s l "   t y p e = " t e x t / x s l " ? > 


It doesn't *really* matter what charset I output the file in. The XML renderer that runs the file seems to be able to handle any charset (I made a backup of the input file and manually changed its encoding via Notepad++ to test), but I just can't figure out why I'm getting bad data from the input.

This post has been edited by Laggy: 10 May 2012 - 11:48 AM



Replies To: Reading/Writing XML files encoded in a charset other than default

#2 g00se

Posted 10 May 2012 - 02:00 PM

Please attach the input file as a text file

#3 Laggy

Posted 10 May 2012 - 02:18 PM

g00se, on 10 May 2012 - 02:00 PM, said:

Please attach the input file as a text file


log.txt is the input.

test.txt is what I'm currently getting as my output (with the null chars).

Attached File(s)

  • log.txt (758 bytes)
  • test.txt (744 bytes)


#4 g00se

Posted 10 May 2012 - 02:50 PM

OK. Firstly, log.txt is not encoded as Latin1. It's also not encoded in the encoding it says it is (Cp1252, which is not the same as Latin1, btw). It's actually encoded as UTF-16, and it has a byte order mark at the beginning.
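
The mismatch described above can be reproduced without the attached file. The following is a minimal sketch, assuming the input is UTF-16 little-endian with an FF FE byte order mark, as the "ÿþ" at the start of the console output suggests. The class name EncodingMismatchDemo and its helper methods are made up for illustration and are not part of the original code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows what happens when UTF-16 bytes are decoded as ISO-8859-1.
public class EncodingMismatchDemo {

    // Builds UTF-16 little-endian bytes for the text, preceded by the FF FE byte order mark.
    public static byte[] utf16LeWithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE);
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFF;
        out[1] = (byte) 0xFE;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    // Decodes the bytes with the given charset.
    public static String decode(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        byte[] data = utf16LeWithBom("<?xml");

        // Wrong charset: every byte becomes one char, so the BOM shows up as "ÿþ"
        // and a NUL char sits between the visible characters.
        String wrong = decode(data, StandardCharsets.ISO_8859_1);

        // Matching charset: "UTF-16" reads the BOM to pick the byte order and drops it.
        String right = decode(data, StandardCharsets.UTF_16);

        System.out.println(wrong.contains("<?xml")); // prints false
        System.out.println(right.contains("<?xml")); // prints true
    }
}
```

In the original reading code, the equivalent change would presumably be to pass "UTF-16" instead of "iso-8859-1" to the InputStreamReader. The NUL characters between the letters are also why the contains("<?xml") check in the first post never matched.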

#5 Laggy

Posted 10 May 2012 - 03:18 PM

Oh yeah, I think I was playing around with something, changed the text to say Cp1252, and accidentally uploaded the one I was messing with instead of the original. The original says "iso-8859-1", which is where I got LATIN-1 from. I actually see now in Notepad++ that it's encoded in UCS-2 Little Endian. So that was just me being inattentive and not knowing that UCS-2 LE wasn't the same as LATIN-1.

My question now: I can't seem to get new lines. I changed my out.write to out.write(output.get(i) + System.getProperty("line.seperator")) and that gives me a null character. I tried hardcoding it to out.write(output.get(i) + "\n\r") and that gives me a null character.

Edit: Nevermind, it works now. Thanks for the help!

This post has been edited by Laggy: 10 May 2012 - 03:19 PM


#6 g00se

Posted 10 May 2012 - 03:22 PM

Quote

System.getProperty("line.seperator")) 


Typo. That should be

System.getProperty("line.separator")


but only call that method once or, better still, use PrintWriter.println.
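
The PrintWriter.println suggestion could look roughly like the sketch below. It assumes the lines are already held in a list and that the target charset is chosen by the caller. LineWriterDemo and encodeLines are hypothetical names, and the sketch writes to an in-memory buffer so it can run on its own; for a real file the ByteArrayOutputStream would be swapped for a FileOutputStream.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class: writes lines with println in an explicit charset.
public class LineWriterDemo {

    // Encodes each line followed by the platform line separator, using the given charset.
    public static byte[] encodeLines(List<String> lines, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // try-with-resources closes (and flushes) the writer when the block ends.
        try (PrintWriter out = new PrintWriter(new OutputStreamWriter(bytes, charset))) {
            for (String line : lines) {
                out.println(line); // println appends the line separator for us
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("<root>", "</root>");
        byte[] encoded = encodeLines(lines, StandardCharsets.ISO_8859_1);
        // Decode again with the same charset to show the round trip.
        System.out.print(new String(encoded, StandardCharsets.ISO_8859_1));
    }
}
```

With this approach there is no property name to mistype, and the separator lookup is not repeated by hand inside the loop.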

This post has been edited by g00se: 10 May 2012 - 03:22 PM


#7 macosxnerd101

Posted 10 May 2012 - 07:18 PM

Also, your ArrayList should be declared using generics. Since you are storing Strings, you should use an ArrayList<String>. Otherwise, the compiler will give you an unchecked (raw type) warning, and you have to cast every element you take out. Generics handle these type-safety issues.
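
As a small illustration of that advice, here is a sketch with a typed list. GenericListDemo and joinLines are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: a typed list needs no casts when elements are read back.
public class GenericListDemo {

    // Concatenates the lines, ending each one with '\n'.
    public static String joinLines(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) { // elements are already Strings, no (String) cast
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> output = new ArrayList<String>();
        output.add("<root>");
        output.add("</root>");
        // output.add(42); // would not compile: the list only accepts Strings
        System.out.print(joinLines(output));
    }
}
```

With the raw ArrayList from the first post, the commented-out add(42) would compile and only fail later, at the (String) cast.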

#8 Laggy

Posted 10 May 2012 - 07:56 PM

Oh yea, I didn't really think about that. I just typecast the nodes to (String) when I wanted to output them.