#1 diegosendra  Icon User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 19
  • Joined: 24-June 13

How to detect UTF-8-based encoded strings

Posted 24 June 2013 - 01:06 PM

Hi

A customer asked us to build a VB6 scraper with multi-language support, so we needed to detect UTF-8 encoded strings and decode them later for proper display in the application UI. This need comes from a VB6 limitation: its controls do not natively support UTF-8, unlike .NET, where you can tell a control to expect UTF-8 encoding. VB6 natively supports only ISO 8859-1 and/or Windows-1252, so textboxes, dropdowns, listview controls and others can't be set to expect UTF-8. Undecoded UTF-8 text therefore shows up as weird symbols such as é, è and others, which makes a mess of the display.

So, the following function contains the UTF-8 encoded forms of punctuation marks and symbols from languages like Spanish, Italian, German, Portuguese, French and others, based on an excellent UTF-8 list we got from this link - Ref. http://home.telfort....06/utf8tbl.html

Basically, the function checks whether any of the listed UTF-8 encoded sequences, separated by | (pipe), is found in the passed string, using a substring search first. If no match is found, it falls back to a search based on ASCII values. For example, a string like "Societ" (Society in English) would return FALSE when calling isUTF8("Societ"), while isUTF8("SocietÈ") would return TRUE, since È is the UTF-8 encoded representation of the word's final accented letter.

Once you know whether the result is TRUE or FALSE, you can decode the string with the DecodeUTF8() function to display it properly. We found that function somewhere else some time ago and have included it in this post as well.


Function isUTF8(ByVal ptstr As String) As Boolean
    Dim tUTFencoded As String
    Dim tUTFencodedaux As Variant
    Dim tUTFencodedASCII As String
    Dim ptstrASCII As String
    Dim iaux As Long, iaux2 As Long
    Dim ffound As Boolean
    
    ffound = False
    
    'character codes of the input, pipe-delimited on both sides: "|83|111|99|"
    ptstrASCII = "|"
    For iaux = 1 To Len(ptstr)
        ptstrASCII = ptstrASCII & Asc(Mid(ptstr, iaux, 1)) & "|"
    Next
        
    tUTFencoded = "Ä|Å|Ç|É|Ñ|Ö|Ì|á||â|ä|ã|å|ç|é|è|ê|ë|í|ì|î|ï|ñ|ó|ò|ô|ö|õ|ú|ù|û|ü||°|¢|£|§|•|¶|ß|®|©|™|´|¨||Æ|Ø|∞|±|≤|≥|¥|µ|∂|∑|∏|π|∫|ª|º|Ω|æ|ø|¿|¡|¬|√|ƒ|≈|∆|«|»|…||À|Ã|Õ|Œ|œ|–|—|“|”|‘|’|÷|◊|ÿ|Ÿ|⁄|€|‹|›|fi|fl|‡|·|‚|„|‰|Â|Ú|Á|Ë|È|Í|Î|Ï|Ì|Ó|Ô||Ò|Ú|Û|Ù|ı|ˆ|˜|¯|˘|˙|˚|¸|˝|˛|ˇ" & _
                "|š|¦|²|³|¹|¼|½|¾|Ð|×|Ý|Þ|ð|ý|þ" & _
                "|∞|≤|≥|∂|∑|∏|π|∫|Ω|√|≈|∆|◊|⁄|fi|fl||ı|˘|˙|˚|˝|˛|ˇ"

    tUTFencodedaux = Split(tUTFencoded, "|")
    If UBound(tUTFencodedaux) > 0 Then
        iaux = 0
        Do While Not ffound And Not iaux > UBound(tUTFencodedaux)
            'skip empty list entries: InStr reports a match for an empty search string
            If Len(tUTFencodedaux(iaux)) > 0 Then
                If InStr(1, ptstr, tUTFencodedaux(iaux), vbTextCompare) > 0 Then
                    ffound = True
                End If
                
                If Not ffound Then
                    'ASCII numeric search
                    'gets ASCII numeric sequence, pipe-delimited on both sides
                    'so that "95|" can no longer match inside "195|"
                    tUTFencodedASCII = "|"
                    For iaux2 = 1 To Len(tUTFencodedaux(iaux))
                        tUTFencodedASCII = tUTFencodedASCII & Asc(Mid(tUTFencodedaux(iaux), iaux2, 1)) & "|"
                    Next
                    
                    'compares numeric sequences
                    If InStr(1, ptstrASCII, tUTFencodedASCII) > 0 Then
                        ffound = True
                    End If
                End If
            End If
            
            iaux = iaux + 1
        Loop
    End If
    
    isUTF8 = ffound
End Function

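Function DecodeUTF8(ByVal s As String) As String
  Dim i As Long
  Dim c As Long
  Dim n As Long
  
  'sentinel space so the look-ahead below never runs past the end of the string
  s = s & " "

  i = 1
  Do While i <= Len(s)
    c = Asc(Mid(s, i, 1))
    If c And &H80 Then
      'count this byte plus the continuation bytes (10xxxxxx) that follow it
      n = 1
      Do While i + n < Len(s)
        If (Asc(Mid(s, i + n, 1)) And &HC0) <> &H80 Then
          Exit Do
        End If
        n = n + 1
      Loop
      If n = 2 And ((c And &HFE) = &HC2) Then
        'two-byte sequence with lead byte &HC2 or &HC3: maps to U+0080-U+00FF
        c = Asc(Mid(s, i + 1, 1)) + &H40 * (c And &H1)
      Else
        'anything else cannot be shown in ISO 8859-1: use "¿" (191)
        c = 191
      End If
      s = Left(s, i - 1) + Chr(c) + Mid(s, i + n)
    End If
    i = i + 1
  Loop
  'drop the sentinel space added at the top
  DecodeUTF8 = Left(s, Len(s) - 1)
End Function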
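
For reference, the two functions above could be combined along these lines. This is only a sketch: ToDisplayText is a hypothetical name that does not appear in the original post, and so are Text1 and scrapedText in the usage note.

```vb
'Hypothetical wrapper, not part of the original post.
'Decodes the text only when isUTF8() reports UTF-8 encoded symbols in it;
'otherwise the text is passed through untouched.
Public Function ToDisplayText(ByVal raw As String) As String
    If isUTF8(raw) Then
        ToDisplayText = DecodeUTF8(raw)
    Else
        ToDisplayText = raw
    End If
End Function
```

Usage might look like Text1.Text = ToDisplayText(scrapedText), where scrapedText holds the scraped string with one character per raw byte. Strings that are already ISO 8859-1 or Windows-1252 skip the decoder, which matters because DecodeUTF8() replaces any high byte it cannot interpret with "¿" (character 191).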




Hope it helps

Regards

Diego Sendra
e-mail: contact@diegosendra.com
http://www.diegosendra.com

Replies To: How to detect UTF-8-based encoded strings

#2 diegosendra  Icon User is offline

Re: How to detect UTF-8-based encoded strings

Posted 24 June 2013 - 05:17 PM

*Please note that you have to download the function from http://www.diegosend.../VB6_isUTF8.txt, because some of the UTF-8 encoded symbols in the tUTFencoded variable were lost or deleted when the code was pasted into this thread.

#3 BobRodes  Icon User is online

  • Your Friendly Local Curmudgeon
  • member icon

Reputation: 572
  • View blog
  • Posts: 2,986
  • Joined: 19-May 09

Re: How to detect UTF-8-based encoded strings

Posted 29 June 2013 - 09:08 PM

What about supporting UTF-8 with the native vb6 rich text box control? Wouldn't that be simpler? (There are some API tricks you can use.)

This post has been edited by BobRodes: 29 June 2013 - 09:12 PM


#4 diegosendra  Icon User is offline

Re: How to detect UTF-8-based encoded strings

Posted 30 June 2013 - 03:47 PM

BobRodes, on 29 June 2013 - 09:08 PM, said:

What about supporting UTF-8 with the native vb6 rich text box control? Wouldn't that be simpler? (There are some API tricks you can use.)


Maybe, but:

a. our approach doesn't rely on an ActiveX control or on API tricks
b. it may be necessary to show UTF-8 decoded strings somewhere other than an RTF control, e.g. dropdowns or a listview
c. I think our function can be very helpful when you don't know the source of the stream you are reading, e.g. a .txt file without a BOM header, an .html file without a UTF-8 declaration in its header, or data coming out of a database that mixes ISO 8859-1, Windows-1252 and UTF-8
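
For point c, a check that needs no symbol table is also possible. The sketch below is not the isUTF8() function from this thread. It is an alternative technique that first looks for the UTF-8 BOM (bytes EF BB BF) and otherwise validates the byte structure that UTF-8 requires. HasUTF8BOM and LooksLikeUTF8 are hypothetical names, and the code assumes, as isUTF8() and DecodeUTF8() do, that the string holds one character per raw byte on a Western (Windows-1252) system.

```vb
'Hypothetical helpers, not part of the original post.

'True when the string starts with the UTF-8 BOM bytes EF BB BF.
Public Function HasUTF8BOM(ByVal s As String) As Boolean
    If Len(s) >= 3 Then
        HasUTF8BOM = (Asc(Mid$(s, 1, 1)) = &HEF And _
                      Asc(Mid$(s, 2, 1)) = &HBB And _
                      Asc(Mid$(s, 3, 1)) = &HBF)
    End If
End Function

'True when every byte >= &H80 belongs to a well-formed UTF-8 multi-byte
'sequence and at least one such sequence is present.
Public Function LooksLikeUTF8(ByVal s As String) As Boolean
    Dim i As Long
    Dim c As Long
    Dim need As Long              'continuation bytes still expected
    Dim sawMultiByte As Boolean

    For i = 1 To Len(s)
        c = Asc(Mid$(s, i, 1))
        If need > 0 Then
            'inside a sequence: the byte must look like 10xxxxxx
            If (c And &HC0) <> &H80 Then Exit Function   'result stays False
            need = need - 1
            If need = 0 Then sawMultiByte = True
        ElseIf c >= &H80 Then
            If (c And &HE0) = &HC0 And c >= &HC2 Then
                need = 1          '110xxxxx: two-byte sequence
            ElseIf (c And &HF0) = &HE0 Then
                need = 2          '1110xxxx: three-byte sequence
            ElseIf (c And &HF8) = &HF0 And c <= &HF4 Then
                need = 3          '11110xxx: four-byte sequence
            Else
                Exit Function     'stray continuation byte or invalid lead byte
            End If
        End If
    Next i

    'a sequence cut off at the end of the string does not count as valid
    LooksLikeUTF8 = (need = 0) And sawMultiByte
End Function
```

Pure ASCII text returns False from LooksLikeUTF8 because there is nothing to decode. ISO 8859-1 text with accented letters normally fails the structural check, since a lone byte such as &HE9 is not followed by the continuation bytes that a lead byte requires. The sketch does not reject every overlong or surrogate encoding, so it is a heuristic rather than a full validator.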

#5 BobRodes  Icon User is online

Re: How to detect UTF-8-based encoded strings

Posted 30 June 2013 - 08:44 PM

Well that makes sense. I didn't notice that you weren't asking for help, and was adding that it's possible to get the RTC to support later versions of the RTF spec than 1.0, which is the highest version that it supports directly.