2 Replies - 4748 Views - Last Post: 14 August 2012 - 08:29 AM

#1 NickDMax

Iterators and UTF8 conversion and std::codecvt

Posted 13 August 2012 - 03:55 PM

I am a little confused on a couple of points with the std:: library.

I have a collection that is accessible via an iterator. It encodes its data in UTF8. It is much too large (or at least potentially so) to convert all at once so I really need to access it "code-point by code-point" via an iterator.

At the moment all I have available is its default char-based iterator.

Before I go and write my own iterator adapter: can the C++11 std::codecvt_utf8 or boost::utf8_codecvt_facet be used to... oh I don't know, magically declare an iterator adapter that does the conversion?

Of course this question is coming from the boost documentation here:

Quote

History
This code was originally written as an iterator adaptor over containers for use with UTF-8 encoded strings in memory. Dietmar Kuehl suggested that it would be better provided as a codecvt facet.
which makes me feel dumb for not understanding how the facet is "better" than the iterator adapter, or whether it is a simple thing to go from one to the other.

Before this gets closed as a "gimme-the-codz" post: I don't really need code (hey, if it were out there I think Google would have found it for me :P ), just some help in understanding how locales, facets and std::codecvt are useful for anything other than streams. There are LOTS of examples with streams, and if I needed the whole collection converted at once that would be awesome, but I need 1 code point at a time. I need iterators, and hopefully not an iterator that is just a wrapper for a stringstream with a UTF-8 facet (edit: or did I just answer my own question?).

This post has been edited by NickDMax: 13 August 2012 - 03:59 PM



Replies To: Iterators and UTF8 conversion and std::codecvt

#2 ishkabible

Re: Iterators and UTF8 conversion and std::codecvt

Posted 14 August 2012 - 05:53 AM

I haven't found anything yet :/ I actually thought about a UTF-8 const iterator myself. I think the problem lies with the '*' operator: for an iterator adaptor to *really* meet the iterator concept (forward iterator or stronger), '*' has to return a constant reference, meaning the '&' operator applied to the result yields a real, unique address.

That said, returning a value probably wouldn't cause too many issues.

I have also wanted something like what you're describing. Basically it would hold an iterator over chars and read the bits of each lead byte to see how many chars it has to advance to move forward by 1 code point. The '*' would return a UTF-32 code point as an integer.

You may have to make your own :/
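
For what it's worth, a minimal sketch of the adapter described above might look like the following. The names utf8_code_point_iterator and decode_all are made up for illustration. It assumes the input is well-formed UTF-8 (no validation, no handling of truncated sequences) and that the wrapped char iterator is at least a forward iterator. Since '*' returns the code point by value, it only claims to be an input iterator.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical adapter: wraps a forward iterator over char and yields one
// code point per increment. Assumes well-formed UTF-8; nothing is validated.
template <typename CharIt>
class utf8_code_point_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type        = char32_t;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const char32_t*;
    using reference         = char32_t;  // by value, not a true reference

    utf8_code_point_iterator() = default;
    explicit utf8_code_point_iterator(CharIt pos) : pos_(pos) {}

    // Decode the code point that starts at the current position.
    char32_t operator*() const {
        CharIt it = pos_;
        const unsigned char lead = static_cast<unsigned char>(*it);
        const int len = sequence_length(lead);
        // Keep only the payload bits of the lead byte.
        char32_t cp = (len == 1) ? lead : (lead & (0x7Fu >> len));
        for (int i = 1; i < len; ++i) {
            ++it;
            cp = (cp << 6) | (static_cast<unsigned char>(*it) & 0x3Fu);
        }
        return cp;
    }

    // Advance by one whole code point (1 to 4 chars).
    utf8_code_point_iterator& operator++() {
        std::advance(pos_, sequence_length(static_cast<unsigned char>(*pos_)));
        return *this;
    }
    utf8_code_point_iterator operator++(int) {
        utf8_code_point_iterator old = *this;
        ++*this;
        return old;
    }

    friend bool operator==(const utf8_code_point_iterator& a,
                           const utf8_code_point_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf8_code_point_iterator& a,
                           const utf8_code_point_iterator& b) {
        return !(a == b);
    }

private:
    // Number of chars in the sequence, read from the lead byte's high bits.
    static int sequence_length(unsigned char lead) {
        if (lead < 0x80) return 1;          // 0xxxxxxx
        if ((lead >> 5) == 0x06) return 2;  // 110xxxxx
        if ((lead >> 4) == 0x0E) return 3;  // 1110xxxx
        if ((lead >> 3) == 0x1E) return 4;  // 11110xxx
        return 1;  // stray continuation or invalid byte: step over it
    }

    CharIt pos_{};
};

// Demo helper only: walks the range one code point at a time.
template <typename CharIt>
std::vector<char32_t> decode_all(CharIt first, CharIt last) {
    utf8_code_point_iterator<CharIt> it(first), end(last);
    std::vector<char32_t> out;
    for (; it != end; ++it) out.push_back(*it);
    return out;
}
```

Real code would want to validate continuation bytes and check for the end of the range inside '*' and '++', which means the adapter would also need to carry the end iterator.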

This post has been edited by ishkabible: 14 August 2012 - 05:55 AM


#3 NickDMax

Re: Iterators and UTF8 conversion and std::codecvt

Posted 14 August 2012 - 08:29 AM

Making the iterator adapter is not much of a concern really (esp. using Boost), but before I did all that I wanted to make sure that I was not missing something with the whole strings/facets/locales/traits bit, which to be honest I have never really understood. I think that whole thing is a mess.

What really bugs me about the std::codecvt classes is that the functions work with raw pointers rather than iterators (unlike the rest of the std library which is in general very generic). This means that if you have a char-buffer the functions are easy to work with, but if you only have iterators not so much.
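
To make the pointer issue concrete, here is a rough sketch of calling std::codecvt_utf8<char32_t>::in directly. The helper name decode_buffer is hypothetical. Every argument to in() besides the state is a raw pointer, so the bytes have to sit in a contiguous buffer first. Note also that <codecvt> was deprecated in C++17, so newer compilers may warn about it.

```cpp
#include <cassert>
#include <codecvt>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper: decode a contiguous UTF-8 buffer in one call.
inline std::u32string decode_buffer(const std::string& bytes) {
    std::codecvt_utf8<char32_t> cvt;
    std::mbstate_t state{};
    // UTF-8 never produces more code points than it has bytes.
    std::u32string out(bytes.size(), U'\0');

    const char* from      = bytes.data();
    const char* from_end  = from + bytes.size();
    const char* from_next = from;
    char32_t*   to        = &out[0];
    char32_t*   to_end    = to + out.size();
    char32_t*   to_next   = to;

    // The return code (ok / partial / error) is ignored in this sketch;
    // we simply keep whatever was converted before the call stopped.
    cvt.in(state, from, from_end, from_next, to, to_end, to_next);

    out.resize(static_cast<std::size_t>(to_next - to));
    return out;
}
```

With only a pair of iterators, the bytes would have to be copied into a small char buffer chunk by chunk before each call, which is exactly the awkward part.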

I think I will just use Boost.Iterators to make an adapter.

I just don't want to go through the exercise only to have someone point out 2 lines of code that would do the same thing, all because I don't understand traits/facets/locales etc.

Speaking of which I also need to figure out if isalpha/isdigit etc. work with UCS4... from what I understand they *should* but who knows.
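
A cautious sketch of one way to try this: the <cctype> functions only accept values that fit in an unsigned char (or EOF), so they cannot take a UCS4 code point directly, and the standard only guarantees std::ctype for char and wchar_t. The helpers below (is_alpha_cp and is_digit_cp are made-up names) route the code point through the <cwctype> functions instead. This assumes wchar_t is 32 bits wide, as on typical Linux toolchains; where it is 16 bits, anything above U+FFFF is simply rejected. Results for non-ASCII characters depend on the current C locale, so treat this as an experiment rather than an answer.

```cpp
#include <cassert>
#include <cwctype>

// Hypothetical helper: true if the code point fits in a single wchar_t.
inline bool fits_in_wchar(char32_t cp) {
    return sizeof(wchar_t) >= 4 || cp <= 0xFFFF;
}

// Classify a UCS4 code point with the wide-character functions.
// Locale-dependent for anything outside ASCII.
inline bool is_alpha_cp(char32_t cp) {
    return fits_in_wchar(cp) &&
           std::iswalpha(static_cast<std::wint_t>(cp)) != 0;
}

inline bool is_digit_cp(char32_t cp) {
    return fits_in_wchar(cp) &&
           std::iswdigit(static_cast<std::wint_t>(cp)) != 0;
}
```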

This post has been edited by NickDMax: 14 August 2012 - 08:31 AM

