escape conversion of UTF-8 codes to string in xDocument?


22 Replies - Last Post: 04 November 2017 - 04:59 AM

#16 Whateva_

Posted 21 October 2017 - 08:12 AM

Skydiver, on 21 October 2017 - 07:27 AM, said:

Imagine for a moment you lived in a country or locale that by default used non-ASCII characters: Chinese, Japanese, Thai, Arabic, Sanskrit, Cyrillic, Cherokee, Elvish, Dwarvish, Klingon, etc. When you open an XML document, wouldn't you want to read the strings in your native writing system instead of a series of character references? Ex. 꿤Ř

I'm not saying that it should not convert the character codes. All I'm saying is that there should also be an option to disable the conversion, because different people have different requirements and a good program should offer options to cover them.

#17 Skydiver

Posted 21 October 2017 - 09:00 AM

Unfortunately, the class design is such that XDocument relies on an XmlReader and an XmlWriter. The XmlReader returns Unicode strings, and the XmlWriter writes out Unicode strings, only swapping in character references where absolutely required. If you want to roundtrip "safely", you'll likely have to implement a reader that remembers where it did a conversion, and a writer that queries the reader to figure out what to convert back.
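
To make the reader/writer behaviour concrete, here is a minimal sketch. CharRefDemo and RoundTrip are made-up names for illustration, not part of any library. It parses a string containing a character reference and serializes it straight back, which should show the reference being replaced by the character itself:

```csharp
using System.Xml.Linq;

static class CharRefDemo
{
    // Hypothetical helper: parse XML text and serialize it straight back.
    public static string RoundTrip(string xml)
    {
        // The reader resolves a reference such as &#x158; to the character U+0158.
        XDocument doc = XDocument.Parse(xml);

        // The writer emits the character itself, because the output can represent it.
        return doc.ToString();
    }
}
```

Calling CharRefDemo.RoundTrip("<a>&#x158;</a>") should give back <a>Ř</a>. The reference is gone because nothing in the document model remembers that it was ever there.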

#18 Whateva_

Posted 21 October 2017 - 09:19 PM

I think I've finally figured it out. It looks like I was overthinking a simple process :lol:
Anyway, here is the code for the entire process (in case anyone has similar issues):
var basePath=textBox1.Text;
			//get all directories from basepath
			string[] filesindirectory = Directory.GetDirectories(basePath);
			//Loop through each parent directory and get each matching xml file from it
			List<string[]> newList = filesindirectory.Select(folder => (from item in Directory.GetDirectories(folder, "meta", SearchOption.AllDirectories)
			                                                            .Select(item => Directory.GetFiles(item, "*.xml"))
			                                                            .ToList()
			                                                            .SelectMany(x => x)
			                                                            let sx = Directory.GetDirectories(folder, "xml", SearchOption.AllDirectories)
			                                                            .Select(items => Directory.GetFiles(items, "*.xml"))
			                                                            .ToList()
			                                                            .SelectMany(s => s)
			                                                            .Any(s => Path.GetFileName(s) == Path.GetFileName(item))
			                                                            where sx
			                                                            select item).ToArray()
			                                                 .Concat((from xmlItem in Directory.GetDirectories(folder, "xml", SearchOption.AllDirectories)
			                                                          .Select(item => Directory.GetFiles(item, "*.xml"))
			                                                          .ToList()
			                                                          .SelectMany(xs => xs)
			                                                          let sx = Directory.GetDirectories(folder, "meta", SearchOption.AllDirectories)
			                                                          .Select(items => Directory.GetFiles(items, "*.xml"))
			                                                          .ToList()
			                                                          .SelectMany(sc => sc)
			                                                          .Any(sc => Path.GetFileName(sc) == Path.GetFileName(xmlItem))
			                                                          where sx
			                                                          select xmlItem).ToArray()))
				.Select(xmlFiles => xmlFiles.ToArray()).ToList();
			//loop through each element of the jagged array
			foreach (string[] path in newList)
			{
				for (int j = 0; j < path.Length / 2; j++)
				{
					File.WriteAllText(path[(path.Length / 2) + j], File.ReadAllText(path[(path.Length / 2) + j]).Replace("&","&amp;"));
					XDocument doc = XDocument.Load(path[j]);
					string name = doc.Root.Element("Emp").Element("lbl").Value;
					XDocument doc2 = XDocument.Load(path[(path.Length / 2) + j]);
					doc2.Root.Element("Employee").SetElementValue("label", name);
					doc2.Save(path[(path.Length / 2) + j]);
					File.WriteAllText(path[(path.Length / 2) + j], File.ReadAllText(path[(path.Length / 2) + j]).Replace("&amp;","&"));
				}
			}

I simply replaced & with &amp; before XDocument.Load and replaced &amp; with & after XDocument.Save, and it works fine thus far. :bigsmile:
However, I'm not sure if there are better ways to do those string.Replace calls to make this program more efficient :mellow:

#19 Skydiver

Posted 22 October 2017 - 05:14 PM

Congratulations on finding a solution for your specific problem. I have my doubts, but if it covers your specific case, then more power to you.

I don't understand your sudden concern about efficiency. Considering how inefficiently you are determining which meta and data XML files to read from and write to, why does it matter if you are inefficiently reading all of the data XML into memory as a string, replacing all the "&" with "&amp;" to make another copy of the string in memory, and then writing that string back to the file, only to turn around and have the XDocument read the file? You should have just called Parse(), passing in the replacement string, instead of having to hit the disk again.

#20 Whateva_

Posted 22 October 2017 - 06:55 PM

Skydiver, on 22 October 2017 - 05:14 PM, said:

You should have just called Parse() passing in the replacement string instead of having to hit the disk again.

How do I call Parse() passing in the replacement string?

Previously, when I used XDocument.Parse(), <?xml version="1.0" encoding="UTF-8"?> was deleted and [] was added to the DOCTYPE declaration for some reason.

I'm a newbie, so I'm not familiar with different techniques to deal with issues or make my program more efficient, that is why my procedures are unpolished I guess.

But I'm trying my best to learn.

This post has been edited by Whateva_: 22 October 2017 - 06:56 PM


#21 Skydiver

Posted 22 October 2017 - 11:30 PM

Parse() did not take it away. It was your use of ToString() instead of Save() which did.
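
A small sketch of the difference, assuming the XML arrives as a string. DeclarationDemo and its two methods are hypothetical names used only for this illustration:

```csharp
using System.IO;
using System.Xml.Linq;

static class DeclarationDemo
{
    // Hypothetical helper: ToString() serializes without the XML declaration.
    public static string ViaToString(string xml)
    {
        return XDocument.Parse(xml).ToString();
    }

    // Hypothetical helper: Save() writes the XML declaration as well.
    public static string ViaSave(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }
}
```

ViaToString returns the document without the declaration, while ViaSave starts with an <?xml ...?> line. One caveat: with a plain StringWriter the declaration names the writer's encoding (utf-16), so save to a file path or a stream if the declaration has to keep saying UTF-8.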

If you are new to C#, back off on the usage of LINQ until you get a sense of how each of those LINQ functions is implemented. Go review that other thread where the LINQ was converted over into for and foreach loops to see the amount of work being done.

Whateva_, on 22 October 2017 - 09:55 PM, said:

How do I call Parse() passing in the replacement string?

How did you pass the replacement string to File.WriteAllText()? I get the sense that you are just throwing code on screen without understanding what you are writing (or copying and pasting).

#22 Skydiver

Posted 23 October 2017 - 03:02 PM

A demonstration that Parse() works:
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

class Program
{
    static void Main(string[] args)
    {
        string xml = 
 @"<?xml version=""1.0"" encoding=""UTF-8""?>
<?xml-stylesheet type=""text/xsl"" href=""jats-html.xsl"" ?>
<!DOCTYPE root>
<root>
  <child>data</child>
</root>
";

        Console.WriteLine(xml);

        var doc = XDocument.Parse(xml);
        doc.DocumentType.InternalSubset = null;

        var stringWriter = new StringWriter();
        var xmlTextWriter = new XmlTextWriter(stringWriter);

        doc.Save(xmlTextWriter);

        Console.WriteLine(stringWriter.ToString());
    }
}



As for the DOCTYPE declaration getting an empty internal subset added, see this Stack Overflow answer. I've added a quick and dirty implementation of that fix as line 23 above (the doc.DocumentType.InternalSubset = null; assignment).
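
Putting the pieces of this thread together, the escape, Parse(), edit, Save() and un-escape steps can all be done in memory, touching the disk once to read and once to write. The following is only a sketch under assumptions: InMemoryEdit, SetValue and Utf8StringWriter are hypothetical names, the target element is assumed to be a direct child of the root, and the StringWriter subclass is one common way to get a utf-8 declaration out of Save() rather than anything prescribed earlier in this thread.

```csharp
using System.IO;
using System.Text;
using System.Xml.Linq;

// Hypothetical StringWriter that reports UTF-8, so Save() writes a utf-8 declaration.
sealed class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding
    {
        get { return Encoding.UTF8; }
    }
}

static class InMemoryEdit
{
    // Hypothetical helper: sets the value of one child element of the root while
    // keeping character references such as &#x158; exactly as written in the input.
    public static string SetValue(string rawXml, string elementName, string newValue)
    {
        // "&#x158;" becomes "&amp;#x158;", which the parser reads as plain text.
        string escaped = rawXml.Replace("&", "&amp;");
        XDocument doc = XDocument.Parse(escaped, LoadOptions.PreserveWhitespace);

        doc.Root.Element(elementName).Value = newValue;

        using (var writer = new Utf8StringWriter())
        {
            // Save(), unlike ToString(), keeps the XML declaration.
            doc.Save(writer, SaveOptions.DisableFormatting);

            // Undo the escaping so the original references reappear.
            return writer.ToString().Replace("&amp;", "&");
        }
    }
}
```

A caller would do something like File.WriteAllText(path, InMemoryEdit.SetValue(File.ReadAllText(path), "label", name)). Note the limits of the trick: if the new value itself contains an ampersand, the final Replace turns it into a bare & and the output is no longer well-formed, and the sketch does not include the DOCTYPE internal subset fix.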

#23 Whateva_

Posted 04 November 2017 - 04:59 AM

string targetDirectory1 = textBox1.Text;
			string targetDirectory2 = textBox2.Text;
			string[] xmlFilesArray1 = Directory.GetFiles(targetDirectory1, "*.xml", SearchOption.AllDirectories);
			string[] xmlFilesArray2 = Directory.GetFiles(targetDirectory2, "*.xml", SearchOption.AllDirectories);

			
			foreach (string xmlFile in xmlFilesArray1)
			{
				var FileInfo1 = new FileInfo(xmlFile);
				string FileLocation1 = FileInfo1.FullName;
				string file_name = Path.GetFileName(FileLocation1);
				foreach (var xmlFile2 in xmlFilesArray2)
				{
					if (xmlFile2.Contains(file_name))
					{
						string path = Path.GetFullPath(xmlFile2);
						string file_content = escape_string(File.ReadAllText(path), 0);
						XDocument doc = XDocument.Parse(file_content, LoadOptions.PreserveWhitespace);
						var name = doc.Root.Element("publicationinfo").Element("copyrightgroup").Element("copyright").Element("year").Value;
						XDocument doc2 = XDocument.Parse(escape_string(File.ReadAllText(FileLocation1), 0), LoadOptions.PreserveWhitespace);
						doc2.Descendants("copyright-year").First().Value = name;
						File.WriteAllText(FileLocation1, escape_string(doc2.ToString(), 1));
					}
				}
			}
			MessageBox.Show("Process complete");
		}
        private static string escape_string(string input_string, int option)
        {
            switch (option)
            {
                case 0:
                    return input_string.Replace("&", "&amp;").ToString();
                case 1:
                    return input_string.Replace("&amp;", "&").ToString();
                default:
                    return null;

            }
        }


This post has been edited by Skydiver: 04 November 2017 - 05:33 AM
Reason for edit:: Put code in code tags.
