Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-01-15 04:25:04
> From: Patrick Horgan <phorgan1_at_[hidden]>
> On 01/14/2011 02:05 PM, Peter Dimov wrote:
> > John B. Turpish wrote:
> > > By the way, I disagree with Peter's assessment that, "you rarely, if ever,
> > > need to access the Nth character," but I will gladly cede that
> > > this depends on your problem domain.
> >
> > It obviously depends on the problem domain :-) but, when
> > talking about Unicode, you can't reliably access the Nth character,
> > in general, even with UCS-32. (As far as I know.)
>
> I don't understand. UCS-32 (I assume you meant encoded as UTF-32)
> is a fixed width encoding so the n-th character is just
> 4n away from the beginning of the string. Right?
No.
The Nth position holds the Nth Unicode code point, not the Nth character.
For example, the word "שָלוֹם" has 4 characters ("שָ", "ל", "וֹ", "ם") and 6
code points: U+05E9 U+05B8 U+05DC U+05D5 U+05B9 U+05DD.
Two of those code points (U+05B8 and U+05B9) are diacritic marks.
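
A minimal sketch of the difference, in standard C++ only. It assumes the
example word is the six code points U+05E9 U+05B8 U+05DC U+05D5 U+05B9 U+05DD,
and it treats only Hebrew combining marks as "part of the previous character".
That is a deliberate simplification for this one word, not the full Unicode
segmentation rules and not Boost.Locale's iterator. The names
is_hebrew_combining_mark, count_code_points, count_characters,
character_offset and shalom are made up for illustration.

```cpp
#include <cstddef>
#include <string>

// ASSUMPTION: only Hebrew points and accents (general category Mn) extend
// the preceding character. Real text needs full grapheme segmentation.
bool is_hebrew_combining_mark(char32_t cp) {
    return (cp >= 0x0591 && cp <= 0x05BD) || cp == 0x05BF ||
           cp == 0x05C1 || cp == 0x05C2 ||
           cp == 0x05C4 || cp == 0x05C5 || cp == 0x05C7;
}

// In UTF-32 each element is exactly one code point.
std::size_t count_code_points(const std::u32string& s) {
    return s.size();
}

// Every code point that is not a combining mark starts a new character.
std::size_t count_characters(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t cp : s) {
        if (!is_hebrew_combining_mark(cp)) ++n;
    }
    return n;
}

// Index (in code points) at which the n-th character (0-based) starts,
// or s.size() if the string has fewer characters. Needs a linear scan.
std::size_t character_offset(const std::u32string& s, std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (!is_hebrew_combining_mark(s[i])) {
            if (seen == n) return i;
            ++seen;
        }
    }
    return s.size();
}

// The example word, written as escapes (assumed reading, see above).
const std::u32string shalom = U"\u05E9\u05B8\u05DC\u05D5\u05B9\u05DD";
```

With this word, count_code_points(shalom) is 6 and count_characters(shalom)
is 4, and the third character starts at code point index 3 rather than 2.
So even in UTF-32 the "4n bytes from the start" rule finds the Nth code
point, while finding the Nth character takes a scan.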
Boost.Locale has a special character iterator for this purpose; it works
on characters, not code points.
See:
http://cppcms.sourceforge.net/boost_locale/html/tutorial.html#8e296a067a37563370ded05f5a3bf3ec
Artyom