Subject: Re: [boost] [string] proposal
From: Chad Nelson (chad.thecomfychair_at_[hidden])
Date: 2011-01-24 16:34:54


On Mon, 24 Jan 2011 19:28:50 +0800
Dean Michael Berris <mikhailberis_at_[hidden]> wrote:

> On Mon, Jan 24, 2011 at 3:04 PM, Patrick Horgan <phorgan1_at_[hidden]>
> wrote:
>
>> [...] I'm with you here, but to be fair to Chad, you could add to
>> that list a string of utf-8 encoded characters.  If a string contains
>> things with a particular encoding there's value in being able to
>> keep track of whether it's validly encoded.  It may very well be
>> that a std::string is part of another type, or that there's some
>> encoding wrapper that lets you see it as utf-8 in the same way an
>> external iterator lets you look at chars.
>
> Sure, however I personally don't see the value of making the encoding
> an intrinsic property of a string object. [...]

Then I think we have different purposes, and I'll absent myself from
this part of the discussion after this reply.

Before I go, I'll note in passing that I've started on the
modifications to the UTF types, and I found that it made sense to omit
many of the mutating functions from utf8_t and utf16_t, at least the
ones that operate on anything other than the end of the string.
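
To illustrate why end-of-string operations are the easy case: in a
variable-width encoding, appending never disturbs existing bytes, while
replacing or erasing in the middle can change the byte length of
everything after it. The following is only a minimal sketch under that
assumption; utf8_sketch, push_back() and append() are hypothetical names
chosen for illustration and are not the actual utf8_t interface.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical append-only UTF-8 wrapper (illustration only, not utf8_t).
class utf8_sketch {
public:
    // Append one code point, encoding it as 1 to 4 UTF-8 bytes.
    void push_back(char32_t cp) {
        // Reject surrogates and values beyond the Unicode range.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");
        if (cp < 0x80) {
            data_ += static_cast<char>(cp);
        } else if (cp < 0x800) {
            data_ += static_cast<char>(0xC0 | (cp >> 6));
            data_ += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            data_ += static_cast<char>(0xE0 | (cp >> 12));
            data_ += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            data_ += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            data_ += static_cast<char>(0xF0 | (cp >> 18));
            data_ += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            data_ += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            data_ += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

    // Appending another already-valid string keeps the whole valid.
    void append(const utf8_sketch& other) { data_ += other.data_; }

    // Read-only access to the underlying encoded bytes.
    const std::string& encoded() const { return data_; }

private:
    std::string data_;  // invariant: always well-formed UTF-8
};
```

Because the only mutators add complete, validated sequences at the end,
the "always validly encoded" invariant holds without ever rescanning
the stored bytes.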

>> Are you saying that you try it as utf-8, it doesn't decode and then
>> you try utf-32 to see if it works?  Cause the same string couldn't
>> be both.  Or are you saying that the string has some underlying
>> encoding but something lets it be viewed in other encodings, for
>> example it might actually be EUC, but external iterators let you
>> view it as utf-8 or utf-16 or utf-32 interpreting on the fly?
>
> I'm saying the string could contain whatever it contains (which is
> largely of little consequence) but that you can give a "view" of the
> string as UTF-8 if it's valid UTF-8, or UTF-32 if it's valid UTF-32.
> [...]

For what it's worth, that's the basic concept that I've adopted for the
utf*_t modifications. The utf*_t types themselves expose only a
code-point iterator (you can also get a char/char16_t/char32_t iterator
from the type returned by the encoded() function). I plan to write a
separate character iterator that will accept code points and return
actual Unicode characters.
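
A rough sketch of that split, for the UTF-8 case only: the wrapper hands
out code-point iterators, and encoded() gives access to the raw bytes.
Everything here other than the encoded() name is an assumption made for
illustration (utf8_view, its iterator, and the std::string storage are
hypothetical), and the decoder assumes the stored bytes are already
well-formed UTF-8.

```cpp
#include <cstddef>
#include <iterator>
#include <string>
#include <utility>

// Hypothetical read-only code-point view over well-formed UTF-8.
class utf8_view {
public:
    class iterator {
    public:
        using iterator_category = std::input_iterator_tag;
        using value_type = char32_t;
        using difference_type = std::ptrdiff_t;
        using pointer = const char32_t*;
        using reference = char32_t;  // decoded on the fly, returned by value

        iterator(const std::string* s, std::size_t pos) : s_(s), pos_(pos) {}

        // Decode the code point starting at the current byte offset.
        char32_t operator*() const {
            unsigned char lead = byte(pos_);
            std::size_t n = length(lead);
            // Keep the payload bits of the lead byte (7, 5, 4 or 3 bits).
            char32_t cp = (n == 1) ? lead : (lead & (0xFF >> (n + 1)));
            for (std::size_t i = 1; i < n; ++i)
                cp = (cp << 6) | (byte(pos_ + i) & 0x3F);
            return cp;
        }

        // Advance by one whole encoded sequence, not by one byte.
        iterator& operator++() { pos_ += length(byte(pos_)); return *this; }
        iterator operator++(int) { iterator t = *this; ++*this; return t; }

        bool operator==(const iterator& o) const { return pos_ == o.pos_; }
        bool operator!=(const iterator& o) const { return pos_ != o.pos_; }

    private:
        unsigned char byte(std::size_t i) const {
            return static_cast<unsigned char>((*s_)[i]);
        }
        // Sequence length implied by the lead byte (input assumed valid).
        static std::size_t length(unsigned char lead) {
            if (lead < 0x80) return 1;
            if ((lead >> 5) == 0x6) return 2;
            if ((lead >> 4) == 0xE) return 3;
            return 4;
        }
        const std::string* s_;
        std::size_t pos_;
    };

    explicit utf8_view(std::string bytes) : data_(std::move(bytes)) {}

    iterator begin() const { return iterator(&data_, 0); }
    iterator end() const { return iterator(&data_, data_.size()); }

    // The underlying storage, which offers the usual char iterators.
    const std::string& encoded() const { return data_; }

private:
    std::string data_;
};
```

Iterating from begin() to end() yields one char32_t per code point,
while encoded().begin() and encoded().end() walk the same data one char
at a time.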

-- 
Chad Nelson
Oak Circle Software, Inc.



Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk