Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Dave Abrahams (dave_at_[hidden])
Date: 2011-01-14 22:16:01


On Fri, Jan 14, 2011 at 9:35 PM, Patrick Horgan <phorgan1_at_[hidden]> wrote:
> On 01/14/2011 02:05 PM, Peter Dimov wrote:
>>
>> John B. Turpish wrote:
>>>
>>> By the way, I disagree with Peter's assessment that, "you rarely, if
>>> ever, need to access the Nth character," but I will gladly cede that this
>>> depends on your problem domain.
>>
>> It obviously depends on the problem domain :-) but, when talking about
>> Unicode, you can't reliably access the Nth character, in general, even with
>> UCS-32. (As far as I know.)
>
> I don't understand.  UCS-32 (I assume you meant encoded as UTF-32) is a
> fixed width encoding so the n-th character is just 4n away from the
> beginning of the string.  Right?

No. The nth code point is 4n bytes from the beginning of the string,
but a single user-perceived character may be made of several adjacent
code points (for example, a base letter followed by a combining accent).
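
To make the distinction concrete, here is a rough C++11 sketch. The
helper names (is_combining_mark, count_characters, nth_character) are
hypothetical, and the "combining" test is a deliberate simplification
that only covers U+0300..U+036F; real segmentation follows the Unicode
grapheme cluster rules and would normally come from a library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplifying assumption: only the Combining Diacritical Marks block
// (U+0300..U+036F) is treated as "combining". This is NOT the full
// Unicode grapheme cluster algorithm.
bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Counts characters as "base code point plus any trailing marks".
std::size_t count_characters(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t cp : s) {
        if (!is_combining_mark(cp)) ++n;
    }
    return n;
}

// Returns the n-th character (0-based) including its trailing marks,
// or an empty string if there is no such character. Needs a linear
// scan, unlike indexing by code point.
std::u32string nth_character(const std::u32string& s, std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_combining_mark(s[i])) continue;
        if (seen++ == n) {
            std::size_t end = i + 1;
            while (end < s.size() && is_combining_mark(s[end])) ++end;
            return s.substr(i, end - i);
        }
    }
    return std::u32string();
}

// Example: 'e' + U+0301 COMBINING ACUTE ACCENT + 'a'.
void check_example() {
    const std::u32string s = U"e\u0301a";
    assert(s.size() == 3);             // three code points (12 bytes)
    assert(s[1] == 0x0301);            // index 1 is the accent, not 'a'
    assert(count_characters(s) == 2);  // but only two characters
    assert(nth_character(s, 1) == U"a");
}
```

So s[n] is constant time but gives you a code point; finding the nth
character, even in UTF-32, means walking the string.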

-- 
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
