12 Replies - 642 Views - Last Post: 27 September 2017 - 05:52 AM

#1 JapanDave  Icon User is offline

  • D.I.C Regular

Reputation: 29
  • View blog
  • Posts: 366
  • Joined: 01-February 16

There has to be a better way to convert Unicode Characters? (RegEx)

Posted 19 September 2017 - 05:36 PM

I am trying to convert double-byte (fullwidth) digits back to single-byte using RegEx, but the way I am doing it just seems too verbose.

This is the code I am using right now. For 10 digits it is not that bad, but I will also have to convert double-byte letters of the alphabet, etc.

static void Main(string[] args)
{
    // Fullwidth digits U+FF10..U+FF19
    var str = "０１２３４５６７８９";

    Console.WriteLine(ConvertChars(str));
    Console.ReadLine();
}

public static string ConvertChars(string convertChars)
{
    // One Regex.Replace per fullwidth digit found in the input
    foreach (var item in convertChars)
    {
        string result;
        switch (item)
        {
            case '０':
                result = Regex.Replace(convertChars, @"\uFF10", "0");
                convertChars = result;
                break;
            case '１':
                result = Regex.Replace(convertChars, @"\uFF11", "1");
                convertChars = result;
                break;
            case '２':
                result = Regex.Replace(convertChars, @"\uFF12", "2");
                convertChars = result;
                break;
            case '３':
                result = Regex.Replace(convertChars, @"\uFF13", "3");
                convertChars = result;
                break;
            case '４':
                result = Regex.Replace(convertChars, @"\uFF14", "4");
                convertChars = result;
                break;
            case '５':
                result = Regex.Replace(convertChars, @"\uFF15", "5");
                convertChars = result;
                break;
            case '６':
                result = Regex.Replace(convertChars, @"\uFF16", "6");
                convertChars = result;
                break;
            case '７':
                result = Regex.Replace(convertChars, @"\uFF17", "7");
                convertChars = result;
                break;
            case '８':
                result = Regex.Replace(convertChars, @"\uFF18", "8");
                convertChars = result;
                break;
            case '９':
                result = Regex.Replace(convertChars, @"\uFF19", "9");
                convertChars = result;
                break;
        }
    }
    return convertChars;
}


Or am I stuck with something like this?

This post has been edited by JapanDave: 19 September 2017 - 05:40 PM



Replies To: There has to be a better way to convert Unicode Characters? (RegEx)

#2 Damage  Icon User is online

  • Lord of Schwing
  • member icon

Reputation: 284
  • View blog
  • Posts: 1,961
  • Joined: 05-June 08

Re: There has to be a better way to convert Unicode Characters? (RegEx)

Posted 19 September 2017 - 07:36 PM

This may or may not be helpful: have you had a look at the static BitConverter class?

#3 Skydiver  Icon User is online

  • Code herder
  • member icon

Reputation: 5897
  • View blog
  • Posts: 20,136
  • Joined: 05-May 12

Re: There has to be a better way to convert Unicode Characters? (RegEx)

Posted 19 September 2017 - 07:55 PM

The RegEx is overkill for what you need to do.

If you know the relationship of the fullwidth characters to their halfwidth versions, you can just compute the corresponding halfwidth character. For example:
string ConvertChars(string s)
{
    var sb = new StringBuilder(s.Length);
    foreach(var chIn in s)
    {
        char ch = chIn;
        if ('\uFF10' <= ch  && ch <= '\uFF19')
        {
            ch = (char)(ch - '\uFF10' + '0');
        }

        sb.Append(ch);
    }
    return sb.ToString();
}



Hint: if you compare the ASCII table with the Unicode table of fullwidth forms, the code above can handle all the ASCII printing characters simply by changing lines 7 and 9 (the range check and the offset computation).

#4 Skydiver  Icon User is online

  • Code herder
  • member icon

Reputation: 5897
  • View blog
  • Posts: 20,136
  • Joined: 05-May 12

Re: There has to be a better way to convert Unicode Characters? (RegEx)

Posted 19 September 2017 - 08:10 PM

A version with no magic numbers:
string ConvertChars(string s)
{
    const char FullWidthZero = '\uFF10';
    const char AsciiZero = '0';
    const char FullWidthNine = '\uFF19';
    const char MinFullWidth = FullWidthZero;
    const char MaxFullWidth = FullWidthNine;
    const char MinHalfWidth = AsciiZero;

    var sb = new StringBuilder(s.Length);
    foreach(var chIn in s)
    {
        char ch = chIn;
        if (MinFullWidth <= ch  && ch <= MaxFullWidth)
        {
            ch = (char)(ch - MinFullWidth + MinHalfWidth);
        }

        sb.Append(ch);
    }
    return sb.ToString();
}



To handle the full range of ASCII printable characters, just change lines 3-8 (the constants).
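The hint about covering every printable ASCII character can be sketched roughly as below. The fullwidth forms U+FF01 to U+FF5E line up one-to-one with ASCII U+0021 to U+007E, so the same subtraction works for the whole range. The class name `FullWidth`, the method name `ToHalfWidth`, and the extra handling of the ideographic space U+3000 are assumptions added for illustration; they are not from the posts above.

```csharp
using System;
using System.Text;

static class FullWidth
{
    // Fullwidth forms of printable ASCII: U+FF01..U+FF5E map to U+0021..U+007E.
    const char MinFullWidth = '\uFF01';
    const char MaxFullWidth = '\uFF5E';
    const char MinHalfWidth = '!';

    // Assumption: also map the ideographic (fullwidth) space to an ASCII space.
    const char IdeographicSpace = '\u3000';

    public static string ToHalfWidth(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char chIn in s)
        {
            char ch = chIn;
            if (MinFullWidth <= ch && ch <= MaxFullWidth)
            {
                // Same offset trick as the digit-only version, over a wider range.
                ch = (char)(ch - MinFullWidth + MinHalfWidth);
            }
            else if (ch == IdeographicSpace)
            {
                ch = ' ';
            }

            sb.Append(ch);
        }
        return sb.ToString();
    }
}
```

With this sketch, `FullWidth.ToHalfWidth("ＡＢＣ１２３")` should come back as `"ABC123"`, and anything outside the two ranges passes through unchanged.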

#5 tlhIn`toq  Icon User is offline

  • Xamarin Cert. Dev.
  • member icon

Reputation: 6507
  • View blog
  • Posts: 14,373
  • Joined: 02-June 10

Re: There has to be a better way to convert Unicode Characters? (RegEx)

Posted 20 September 2017 - 02:46 PM

JapanDave, on 19 September 2017 - 07:36 PM, said:

I am trying to convert double byte integers back to single byte {...}


In other words, from Unicode to ASCII? I think that's just encoding:
https://stackoverflo...nicode-to-ascii



string text = "your text";
byte[] asciiBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, Encoding.Unicode.GetBytes(text));


#6 Skydiver  Icon User is online

  • Code herder
  • member icon

Reputation: 5897
  • View blog
  • Posts: 20,136
  • Joined: 05-May 12

Re: There has to be a better way to convert Unicode Characters? (RegEx)

Posted 20 September 2017 - 03:10 PM

Actually, I originally took that as my first attack at the problem, but that doesn't work:
using System;
using System.IO;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        var wideString = "０１２３４５６７８９";

        byte[] originalUnicodeBytes = Encoding.Unicode.GetBytes(wideString);
        Console.WriteLine("originalUnicodeBytes: {0}", BitConverter.ToString(originalUnicodeBytes));

        byte[] asciiBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, originalUnicodeBytes);
        Console.WriteLine("asciiBytes: {0}", BitConverter.ToString(asciiBytes));

        byte[] unicodeBytes = Encoding.Convert(Encoding.ASCII, Encoding.Unicode, asciiBytes);
        Console.WriteLine("unicodeBytes: {0}", BitConverter.ToString(unicodeBytes));

        string result = Encoding.Unicode.GetString(unicodeBytes);
        Console.WriteLine(result);
    }
}



Results in:
originalUnicodeBytes: 10-FF-11-FF-12-FF-13-FF-14-FF-15-FF-16-FF-17-FF-18-FF-19-FF
asciiBytes: 3F-3F-3F-3F-3F-3F-3F-3F-3F-3F
unicodeBytes: 3F-00-3F-00-3F-00-3F-00-3F-00-3F-00-3F-00-3F-00-3F-00-3F-00
??????????
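
The 3F bytes are question marks: `Encoding.ASCII` silently substitutes `?` for any character it cannot represent, so the conversion is lossy rather than a width conversion. A minimal sketch of how to make that failure visible instead of silent, assuming a hypothetical helper named `AsciiCheck.IsAsciiEncodable` (not something from this thread), is to ask for an ASCII encoding with exception fallbacks:

```csharp
using System;
using System.Text;

static class AsciiCheck
{
    // Returns true only if every character of s can be represented in ASCII.
    public static bool IsAsciiEncodable(string s)
    {
        // Same ASCII encoding, but it throws instead of substituting '?'.
        Encoding strictAscii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());
        try
        {
            strictAscii.GetBytes(s);
            return true;
        }
        catch (EncoderFallbackException)
        {
            // At least one character (e.g. a fullwidth digit) has no ASCII form.
            return false;
        }
    }
}
```

For the fullwidth string above this returns false, which is the same information the row of 3F bytes is giving.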



#7 Skydiver  Icon User is online

  • Code herder
  • member icon

Reputation: 5897
  • View blog
  • Posts: 20,136
  • Joined: 05-May 12

Re: There has to be a better way to convert Unicode Characters? (RegEx)

Posted 20 September 2017 - 03:20 PM

And UTF8 doesn't work either... :) [Posted image: screenshot of the program output]

#8 tlhIn`toq  Icon User is offline

  • Xamarin Cert. Dev.
  • member icon

Reputation: 6507
  • View blog
  • Posts: 14,373
  • Joined: 02-June 10

Re: There has to be a better way to convert Unicode Characters? (RegEx)

Posted 20 September 2017 - 04:36 PM

What do you mean 'doesn't work'? Looks right to me. What am I missing?

#9 Skydiver  Icon User is online

  • Code herder
  • member icon

Reputation: 5897
  • View blog
  • Posts: 20,136
  • Joined: 05-May 12

Re: There has to be a better way to convert Unicode Characters? (RegEx)

Posted 20 September 2017 - 05:37 PM

Look closely at the glyphs shown in the message box. Fullwidth number glyphs are not the same as the glyphs for the regular numbers.

Or try adding Console.WriteLine(result == "0123456789");

Another way is to look at the originalUnicodeBytes and the unicodeBytes, also in the screenshot. Notice that you end up with the exact same bytes that you started with. The '\uFF10' for the fullwidth '0' shows up as 10-FF in both byte arrays. The objective of the OP was for the 10-FF to become 30-00.
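
For what it's worth, one framework-level way to get from 10-FF to 30-00 is Unicode compatibility normalization: the fullwidth forms have compatibility mappings to their ASCII counterparts, so `string.Normalize` with `NormalizationForm.FormKC` folds them. This is a sketch under assumptions, not something proposed in the thread: the wrapper name `WidthNormalizer.ToCompatibilityForm` is made up, and FormKC also rewrites other compatibility characters (ligatures, halfwidth katakana, and so on), so it may change more than just the width.

```csharp
using System;
using System.Text;

static class WidthNormalizer
{
    // Folds compatibility characters (including fullwidth ASCII forms)
    // to their canonical equivalents using NFKC normalization.
    public static string ToCompatibilityForm(string s)
    {
        return s.Normalize(NormalizationForm.FormKC);
    }
}
```

If only the width should change and nothing else, the explicit range arithmetic from the earlier posts is the more predictable choice.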

#10 JapanDave  Icon User is offline

  • D.I.C Regular

Reputation: 29
  • View blog
  • Posts: 366
  • Joined: 01-February 16

Re: There has to be a better way to convert Unicode Characters? (RegEx)

Posted 22 September 2017 - 06:19 PM

Thanks all for the help. This is such a pain to deal with; they really need to sort something out for narrow and wide characters.

#11 Skydiver  Icon User is online

  • Code herder
  • member icon

Reputation: 5897
  • View blog
  • Posts: 20,136
  • Joined: 05-May 12

Re: There has to be a better way to convert Unicode Characters? (RegEx)

Posted 23 September 2017 - 06:12 AM

<sarcasm>Yes! If English was good enough for Jeezus, it is good enough for me.</sarcasm>

:)

#12 JapanDave  Icon User is offline

  • D.I.C Regular

Reputation: 29
  • View blog
  • Posts: 366
  • Joined: 01-February 16

Re: There has to be a better way to convert Unicode Characters? (RegEx)

Posted 26 September 2017 - 09:55 PM

LOL, Was the Bible even in English.... LOL :surrender:

#13 Skydiver  Icon User is online

  • Code herder
  • member icon

Reputation: 5897
  • View blog
  • Posts: 20,136
  • Joined: 05-May 12

Re: There has to be a better way to convert Unicode Characters? (RegEx)

Posted 27 September 2017 - 05:52 AM

To make things even more interesting, Aramaic was written from right to left. :)
