escape conversion of UTF-8 codes to string in xDocument?

22 Replies - 626 Views - Last Post: 04 November 2017 - 04:59 AM

#1 Whateva_

  • New D.I.C Head

Reputation: 0
  • Posts: 42
  • Joined: 28-August 16

Posted 18 October 2017 - 09:04 AM

How do I tell XDocument not to convert UTF-8 codes to strings while loading/saving the contents of an XML file?
Here is some sample code that modifies the content of a node in one XML file using a value taken from another file:
XDocument doc = XDocument.Load(@"C:\Users\Desktop\pnas_sample.txt");
// Take the text of the first <label> that has a <fig> ancestor.
var node_value = (from x in doc.Descendants("label")
                  where x.Ancestors("fig").Any()
                  select x.Value).First();
XDocument doc2 = XDocument.Load(@"C:\Users\Desktop\pnas_sample2.txt", LoadOptions.PreserveWhitespace);
// Overwrite the first <title> in the second file and save it in place.
doc2.Descendants("title").First().Value = node_value;
doc2.Save(@"C:\Users\Desktop\pnas_sample2.txt");
Console.WriteLine("Done");
Console.ReadLine();

Here is the xml file before modification
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Jos&#x00E9; Eduardo d&#x00F7;s Santos</author>
<title>XML Developer&#x0027;s Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications  with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,  an evil sorceress, and her own childhood to become queen  of the world.</description>
</book>
</catalog>

and this is after modification
<?xml version="1.0" encoding="utf-8"?>
<catalog>
<book id="bk101">
<author>José Eduardo d÷s Santos</author>
<title>Figure 1</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications  with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,  an evil sorceress, and her own childhood to become queen  of the world.</description>
</book>
</catalog>

The codes &#x00E9; and &#x00F7; are converted to é and ÷. How can I prevent this from happening? I don't want those values to change/convert...
I can't find anything in SaveOptions either.
Please help.

Replies To: escape conversion of UTF-8 codes to string in xDocument?

#2 Skydiver

  • Code herder

Reputation: 5886
  • Posts: 20,094
  • Joined: 05-May 12

Posted 18 October 2017 - 09:15 AM

I can't find the reference right now, but I read about this a few days ago while I was preparing my response to your other thread. The trick is to use the XmlReader and XmlWriter classes instead of passing in the filenames to XDocument directly. When you instantiate the XmlReader (and/or XmlWriter) you can set options to not do the XML charset conversions to Unicode.
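As a rough illustration of the XmlWriter idea, here is a minimal sketch that is not taken from the thread. It assumes it is acceptable for the saved file to be declared as US-ASCII: when the writer's target encoding cannot represent a character in text content, XmlWriter falls back to a character reference. The class and method names (AsciiXmlSketch, SaveAsAscii) are made up for this example.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

static class AsciiXmlSketch
{
    // Serializes the document with US-ASCII as the target encoding, so the
    // writer has to emit character references for non-ASCII text content.
    // Note: the encoding setting only takes effect when writing to a Stream
    // (or file), not to a TextWriter such as StringWriter.
    public static string SaveAsAscii(XDocument doc)
    {
        var settings = new XmlWriterSettings { Encoding = Encoding.ASCII };
        using (var stream = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(stream, settings))
            {
                doc.Save(writer);
            }
            // The bytes are pure ASCII at this point, so decoding is lossless.
            return Encoding.ASCII.GetString(stream.ToArray());
        }
    }
}
```

Calling AsciiXmlSketch.SaveAsAscii(new XDocument(new XElement("author", "Jos\u00E9"))) should give text in which é appears as a character reference. The writer chooses the exact spelling of the reference, so do not expect the zero-padded &#x00E9; form from the original file to come back byte for byte.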

#3 Whateva_

Posted 18 October 2017 - 09:35 AM

Can you show me how with some code? I've also searched the internet but could not find the answer I was looking for (maybe because I'm a newbie).
Because of this conversion I'm afraid to use XML parsing and am thinking about using regex for things like this; at least it won't change the character coding in my file, which I do not want changed.

#4 Skydiver

Posted 18 October 2017 - 09:43 AM

But why would you be afraid of the XML CharRefs being replaced with their equivalent UTF-8 encoding? Your XML file is marked as being UTF-8 encoded. The characters é and ÷ are perfectly valid UTF-8 characters. There is no need to store them as CharRefs.

#5 Skydiver

Posted 18 October 2017 - 11:13 AM

Anyway, here is a quick and dirty approach. Please take this as a demonstration of how things could be done. It's not a well polished solution because of the poor choice of just deriving from StringWriter instead of actually implementing a proper TextWriter child class.
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

class CharRefStringWriter : StringWriter
{
    public override void Write(char value)
    {
        if (value > 127)
        {
            int intValue = value;
            this.Write($"&#x{intValue:x};");
        }
        else
        {
            base.Write(value);
        }
    }

    public override void Write(string value)
    {
        foreach (var ch in value)
            this.Write(ch);
    }

    public override void Write(char[] buffer)
    {
        this.Write(buffer, 0, buffer.Length);
    }

    public override void Write(char[] buffer, int index, int count)
    {
        while(count-- > 0)
            this.Write(buffer[index++]);
    }
}

class Program
{
    static void Main(string[] args)
    {
        string test = "Jos\xE9 Eduardo d\xF7s Santos";

        Console.WriteLine(test);

        var doc = new XDocument(new XElement("author", test));

        var charRefWriter = new CharRefStringWriter();
        var xmlTextWriter = new XmlTextWriter(charRefWriter);

        doc.Save(xmlTextWriter);

        Console.WriteLine(charRefWriter.ToString());
    }
}
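One limitation of escaping one char at a time is that characters outside the Basic Multilingual Plane arrive as two surrogate chars, and a reference to a lone surrogate is not valid XML. The following is only a string-level sketch of how surrogate pairs could be combined before writing the reference; it is not part of the code above, and the names CharRefEscaper and EscapeNonAscii are hypothetical.

```csharp
using System;
using System.Text;

static class CharRefEscaper
{
    // Replaces every non-ASCII character in the input with a hexadecimal
    // character reference, combining surrogate pairs into one code point.
    public static string EscapeNonAscii(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char ch = text[i];
            if (ch <= 127)
            {
                // Plain ASCII passes through untouched.
                sb.Append(ch);
            }
            else if (char.IsHighSurrogate(ch) && i + 1 < text.Length
                     && char.IsLowSurrogate(text[i + 1]))
            {
                // Two UTF-16 code units form one code point above U+FFFF.
                int codePoint = char.ConvertToUtf32(ch, text[i + 1]);
                sb.Append("&#x").Append(codePoint.ToString("X")).Append(';');
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                // Ordinary non-ASCII BMP character. A lone surrogate would
                // also land here and produce an invalid reference; this
                // sketch does not try to repair malformed input.
                sb.Append("&#x").Append(((int)ch).ToString("X")).Append(';');
            }
        }
        return sb.ToString();
    }
}
```

This is meant for text content only. Running it over a whole serialized document would also rewrite non-ASCII characters in element or attribute names, where character references are not allowed.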



#6 Whateva_

Posted 20 October 2017 - 11:38 PM

I've found another way to tackle my problem (well, kind of: before loading the XML file, I escape & to &amp;), but it's deleting the XML declaration <?xml version="1.0" encoding="UTF-8"?> and modifying the DOCTYPE declaration. Can anyone help me sort this problem out?
Here's the code
{
    var basePath = textBox1.Text;
    string[] filesindirectory = Directory.GetFiles(basePath, "*.xml", SearchOption.AllDirectories);
    foreach (string fp in filesindirectory)
    {
        // Hide every & from the parser so the character references survive as plain text.
        string file_content = escape_string(File.ReadAllText(fp), 0);

        XDocument doc = XDocument.Load(file_content, LoadOptions.PreserveWhitespace);
        doc.Descendants("name").Elements().Remove();
        // Undo the escaping before writing the file back.
        File.WriteAllText(fp, escape_string(doc.ToString(), 1));
    }
    MessageBox.Show("Done");
}

private static string escape_string(string input_string, int option)
{
    switch (option)
    {
        case 0:
            return input_string.Replace("&", "&amp;");
        case 1:
            return input_string.Replace("&amp;", "&");
        default:
            return null;
    }
}

Sample file before any program launch
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="jats-html.xsl"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD with OASIS Tables v1.0 20120330//EN" "JATS-journalpublishing-oasis-article1.dtd">
<article article-type="proceedings" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://www.niso.org/standards/z39-96/ns/oasis-exchange/table">
<front>
.....
<article-meta>
<article-id pub-id-type="doi">10.1117/12.2053778</article-id>
<title-group>
<article-title>Information content of the space-frequency filtering of blood plasma layers laser images in the diagnosis of pathological changes</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Ushenko</surname>
<given-names>A.G.</given-names>
</name>
<xref ref-type="aff" rid="a1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="cor1"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Boychuk</surname>
<given-names>T.M.</given-names>
</name>
<xref ref-type="aff" rid="a2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Mincer</surname>
<given-names>O.P.</given-names>
</name>
<xref ref-type="aff" rid="a2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Bodnar</surname>
<given-names>G.B.</given-names>
</name>
<xref ref-type="aff" rid="a2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Kushnerick</surname>
<given-names>L.Ya.</given-names>
</name>
<xref ref-type="aff" rid="a1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Savich</surname>
<given-names>V. O.</given-names>
</name>
<xref ref-type="aff" rid="a2"><sup>2</sup></xref>
</contrib>
<aff id="a1"><label><sup>1</sup></label>Optics and Publishing Department, Chernivtsi National University, 2 Kotsyubinsky Str., Chernivtsi, 58012, Ukraine</aff>
<aff id="a2"><label><sup>2</sup></label>Bukovinian State Medical University, Chernivtsi, 58012, Ukraine</aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><email>[email protected]</email></corresp>
</author-notes>
<pub-date>
<year>2013</year>
</pub-date>
....
<self-uri content-type="pdf" xlink:href="00059_psisdg9066_90661O.pdf"/>
<abstract>
<title>ABSTRACT</title>
<p>The bases of method of the space-frequency of the filtering phase allocation of blood plasma pellicle are given here. The model of the optical-anisotropic properties of the albumen chain of blood plasma pellicle with regard to linear and circular double refraction of albumen and globulin crystals is proposed.</p>
<p>Comparative researches of the effectiveness of methods of the direct polarized mapping of the azimuth images of blood plasma pcllicle layers and space-frequency polarimetry of the laser radiation transformed by divaricate and holelikc optical-anisotropic chains of blood plasma pellicles were held.</p>
<p>On the basis of the complex statistic, correlative and fracta.1 analysis of the filtered frcquency-dimensional polarizing azimuth maps of the blood plasma pellicles structure a set of criteria of the change of the double refraction of the albumen chains caused by the prostate cancer was traced and proved.</p>
</abstract>
....
</front>
</article>

The file after program run
<?xml-stylesheet type="text/xsl" href="jats-html.xsl"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD with OASIS Tables v1.0 20120330//EN" "JATS-journalpublishing-oasis-article1.dtd"[]>
<article article-type="proceedings" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://www.niso.org/standards/z39-96/ns/oasis-exchange/table">
<front>
.....
<article-meta>
<article-id pub-id-type="doi">10.1117/12.2053778</article-id>
<title-group>
<article-title>Information content of the space-frequency filtering of blood plasma layers laser images in the diagnosis of pathological changes</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>


</name>
<xref ref-type="aff" rid="a1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="cor1"/>
</contrib>
<contrib contrib-type="author">
<name>


</name>
<xref ref-type="aff" rid="a2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name>


</name>
<xref ref-type="aff" rid="a2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name>


</name>
<xref ref-type="aff" rid="a2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name>


</name>
<xref ref-type="aff" rid="a1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name>


</name>
<xref ref-type="aff" rid="a2"><sup>2</sup></xref>
</contrib>
<aff id="a1"><label><sup>1</sup></label>Optics and Publishing Department, Chernivtsi National University, 2 Kotsyubinsky Str., Chernivtsi, 58012, Ukraine</aff>
<aff id="a2"><label><sup>2</sup></label>Bukovinian State Medical University, Chernivtsi, 58012, Ukraine</aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><email>[email protected]</email></corresp>
</author-notes>
<pub-date>
<year>2013</year>
</pub-date>
....
<self-uri content-type="pdf" xlink:href="00059_psisdg9066_90661O.pdf"/>
<abstract>
<title>ABSTRACT</title>
<p>The bases of method of the space-frequency of the filtering phase allocation of blood plasma pellicle are given here. The model of the optical-anisotropic properties of the albumen chain of blood plasma pellicle with regard to linear and circular double refraction of albumen and globulin crystals is proposed.</p>
<p>Comparative researches of the effectiveness of methods of the direct polarized mapping of the azimuth images of blood plasma pcllicle layers and space-frequency polarimetry of the laser radiation transformed by divaricate and holelikc optical-anisotropic chains of blood plasma pellicles were held.</p>
<p>On the basis of the complex statistic, correlative and fracta.1 analysis of the filtered frcquency-dimensional polarizing azimuth maps of the blood plasma pellicles structure a set of criteria of the change of the double refraction of the albumen chains caused by the prostate cancer was traced and proved.</p>
</abstract>
....
</front>
</article>

Why is <?xml version="1.0" encoding="UTF-8"?> being deleted, and why is [] added to the DOCTYPE declaration?

#7 Skydiver

Posted 21 October 2017 - 05:26 AM

I can't get to a PC right now, but I suspect that the root cause is your call to XDocument.ToString() instead of using XDocument.Save().

#8 Whateva_

Posted 21 October 2017 - 05:50 AM

Small typo in the code: XDocument doc = XDocument.Load(file_content, LoadOptions.PreserveWhitespace); should be XDocument doc = XDocument.Parse(file_content, LoadOptions.PreserveWhitespace); in the above code sample. My bad.

I'm using XDocument.Parse; do I need to do an XDocument.Save()?
So I think using the Parse method might be the reason for the problem, but I can't use XDocument.Load on file_content because it's not the file but its contents in a string (I think), so if I use XDocument doc = XDocument.Load(file_content, LoadOptions.PreserveWhitespace);
it gives an error: System.ArgumentException: Illegal characters in path.

I need this UTF-8 code conversion problem solved pretty badly, as almost every program I'm trying to build for manipulating XML will have this issue, because every XML file contains codes like &#x----; &amp; &lt; etc.

#9 Skydiver

Posted 21 October 2017 - 06:13 AM

That is not a UTF-8 conversion issue. Those are character entities that are being saved as characters that are within the legal set of characters allowed within XML for the particular encoding and context. This is allowed by the XML specification. It sounds to me like whatever is reading your files downstream of your file processing is the part that is not XML compliant.

#10 Skydiver

Posted 21 October 2017 - 06:18 AM

Anyway, to answer your question: yes, you can still use Save() even though you used Parse() to populate the XDocument. The Save() method would be particularly useless if you could only save what you had loaded. How would you create new documents if that were the case? Also notice that in the code I posted above, Save() was called on an XDocument that was just created in memory.
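To make the Parse-then-Save point concrete, here is a small sketch under a few assumptions. XDocument.ToString() does not write the XML declaration, while Save() does. The empty [] after the DOCTYPE is, as far as I can tell, the result of the parsed document type carrying an empty-string internal subset, so the sketch sets it to null before saving. The names RoundTripSketch and ParseAndSave are invented for illustration, and the output goes to a MemoryStream instead of a file only to keep the example self-contained.

```csharp
using System;
using System.IO;
using System.Xml.Linq;

static class RoundTripSketch
{
    // Parses XML held in a string and serializes it again with Save(),
    // which keeps the XML declaration that ToString() would drop.
    public static string ParseAndSave(string xmlText)
    {
        XDocument doc = XDocument.Parse(xmlText, LoadOptions.PreserveWhitespace);

        // An empty internal subset is written out as "[]"; null omits it.
        if (doc.DocumentType != null
            && string.IsNullOrEmpty(doc.DocumentType.InternalSubset))
        {
            doc.DocumentType.InternalSubset = null;
        }

        using (var stream = new MemoryStream())
        {
            doc.Save(stream);
            stream.Position = 0;
            // StreamReader skips the byte order mark if the writer added one.
            using (var reader = new StreamReader(stream))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

In the file-processing loop from the earlier post, the same idea would be doc.Save(fp) in place of File.WriteAllText(fp, doc.ToString()), although the & un-escaping step would then have to run on the saved file contents instead.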

#11 Whateva_

Posted 21 October 2017 - 06:23 AM

So how do I fix this?
Also, how is there not an option to disable converting character entities without user/programmer permission in an advanced method like XDocument?

#12 Skydiver

Posted 21 October 2017 - 06:25 AM

Anyway, it sounds like somebody else on Stack Overflow had the same problem as you do and does not have a solution either.

#13 Whateva_

Posted 21 October 2017 - 06:36 AM

Yes.
That is where I got the method below:
private static string escape_string(string input_string, int option)
{
    switch (option)
    {
        case 0:
            return input_string.Replace("&", "&amp;");
        case 1:
            return input_string.Replace("&amp;", "&");
        default:
            return null;
    }
}

Why can't there be a simple solution to this problem? It is such a basic thing. Why a method would simply change/convert my data without me telling it to do so is beyond me...
I guess I have to go back to regex for this type of thing.

#14 Skydiver

Posted 21 October 2017 - 06:55 AM

In your shoes, I would tell the downstream folks that it is 2017, not 1998 anymore. They need to be XML compliant.

Of course, if that downstream client is a big company like Oracle that still cannot get XML right even after they supposedly started from a brand new code base (ahem, WebLogic Content Server, ahem) and has an army of lawyers that will sue your ass if you publicly disclose their bugs, then I would talk to my manager. I would say that it will take me about a week and a half to write and fully unit test a custom XmlTextWriter. My salary would be cheaper than the lawyers he would have to hire to fight the downstream folks' lawyers. Even if I opened a service request or defect ticket against the downstream components, and we waited for them to fix the issue, he would still have to pay my salary while we waited. Might as well write the custom code while waiting so that you have a backup plan, unless you have guarantees that they will fix the issue on their end.

#15 Skydiver

Posted 21 October 2017 - 07:27 AM

Whateva_, on 21 October 2017 - 09:36 AM, said:

Why would a method simply change/convert my data without me telling it to do so is beyond me...

This is because the XML specification allows it. The specification essentially describes a tree of nodes where the nodes contain Unicode strings (with a small set of non-printing Unicode code points not allowed). Note: Unicode, not ASCII. It prescribes how those nodes should be written into a file, respecting the encoding of that file. Unicode code points in those strings need to be written as character references only when writing them literally would cause parsing issues. Otherwise, when there is no parsing issue, implementations are free to save things as they want.
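A quick way to see that a character reference and the literal character are the same data once parsed is the small check below. It is only an illustrative sketch, with the class and method names (CharRefEquality, SameText) made up for the example.

```csharp
using System;
using System.Xml.Linq;

static class CharRefEquality
{
    // Parses the same author name written two ways and compares the
    // resulting text: once with a character reference, once with the
    // literal character (U+00E9).
    public static bool SameText()
    {
        XElement withReference = XElement.Parse("<author>Jos&#x00E9;</author>");
        XElement withLiteral = XElement.Parse("<author>Jos\u00E9</author>");
        return withReference.Value == withLiteral.Value;
    }
}
```

Both elements end up holding the same string, which is why the parser has no record of which spelling the source file used and the writer is free to pick either.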

XML was meant to be human readable. (I tend to disagree with the designers' definition of human readable. Ever tried to read an XML document that used namespaces fully?)

Imagine for a moment you lived in a country or locale that by default used non-ASCII characters: Chinese, Japanese, Thai, Arabic, Sanskrit, Cyrillic, Cherokee, Elvish, Dwarvish, Klingon, etc. When you open an XML document, wouldn't you want to read the strings in your native writing system instead of a series of character references? Ex. &#x00afe4;&#x0158;