Subject: Re: [boost] [unicode] Interest Check / Proof of Concept
From: Kirit Sælensminde (kirit.saelensminde_at_[hidden])
Date: 2008-11-19 23:25:19


Sebastian Redl wrote:
> Phil Endecott wrote:
>> James Porter wrote:
>>> Over the past few months, I've been tinkering with a Unicode string
>>> library. It's still *far* from finished, but it's far enough along
>>> that the overall structure is visible. I've seen a bunch of Unicode
>>> proposals for Boost come and go, so hopefully this one will address
>>> the most common needs people have.
>>
>> Hi Jim,
>>
>> Mine was probably one of those proposals that you looked at; for the
>> record the code is all available at
>>
>> http://svn.chezphil.org/libpbe/trunk/include/charset/
>>
>> and nearby directories.
> While we're throwing out Unicode libraries, I have my own tagged
> strings. They're quite tied up with other code around them, so I won't
> post them at this time, but the basic concepts work like this:

My attempt at Unicode string handling can be seen at:

svn://svn.felspar.com/public/fost-base/stable/

(Username is 'guest', password is blank)

What I did was a bit different again. I'm not concerned about
non-Unicode encodings as I don't deal with random text files -- they're
either created by our code, or they're small config files where we can
say "just use UTF-8" -- generally we're reading from databases which are
Unicode, or writing web pages where we can use Unicode without problem.

What I did was to wrap the std::string/wstring class, using std::wstring
on Windows and std::string on Linux -- so on Linux everything stays as
UTF-8 and on Windows it is UTF-16.
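
As a rough illustration of that platform split (the names native_string,
native_char and code_unit_bytes are made up for this sketch and are not
taken from the fost-base sources), the choice of underlying string could
be expressed like this:

```cpp
#include <cstddef>
#include <string>

// Hypothetical names, for illustration only.
#ifdef _WIN32
typedef std::wstring native_string;   // holds UTF-16 code units in wchar_t
#else
typedef std::string native_string;    // holds UTF-8 code units in char
#endif

// The code unit type of whichever string was selected above.
typedef native_string::value_type native_char;

// Size in bytes of one code unit of the underlying string.
inline std::size_t code_unit_bytes() {
    return sizeof(native_char);
}
```

The wrapper class would then hold a native_string member and never expose
its code units directly.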

In the wrapper I throw away the mutable iterators and the mutating
operator[]. Iterators and operator[] dereference to a UTF-32 code point.
For the most part the interface follows that of std::basic_string<> and
we've added std_str() to get at the underlying std:: string if you need it.
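
To make the shape of that interface concrete, here is a minimal sketch of
the UTF-8 (Linux) case. Only std_str() is a name from the description
above; the class name utf8_text and everything else is an assumption for
illustration, and the decoder assumes the stored bytes are well-formed
UTF-8 (no validation is attempted).

```cpp
#include <cstddef>
#include <string>

// Hypothetical read-only wrapper over a UTF-8 std::string.
class utf8_text {
    std::string data_;  // assumed to be well-formed UTF-8

    // Number of bytes in the sequence that starts with lead byte b.
    static std::size_t width(unsigned char b) {
        if (b < 0x80) return 1;
        if ((b >> 5) == 0x06) return 2;
        if ((b >> 4) == 0x0E) return 3;
        return 4;
    }

public:
    explicit utf8_text(const std::string &s) : data_(s) {}

    // Only a const iterator is provided; there is no mutable access.
    class const_iterator {
        const std::string *s_;
        std::size_t pos_;
    public:
        const_iterator(const std::string *s, std::size_t pos)
            : s_(s), pos_(pos) {}

        // Dereferencing decodes one UTF-32 code point.
        char32_t operator*() const {
            unsigned char lead = (*s_)[pos_];
            std::size_t n = width(lead);
            char32_t cp = (n == 1) ? lead : (lead & (0xFF >> (n + 1)));
            for (std::size_t i = 1; i < n; ++i) {
                unsigned char c = (*s_)[pos_ + i];
                cp = (cp << 6) | (c & 0x3F);
            }
            return cp;
        }
        const_iterator &operator++() {
            pos_ += width((*s_)[pos_]);
            return *this;
        }
        bool operator==(const const_iterator &o) const { return pos_ == o.pos_; }
        bool operator!=(const const_iterator &o) const { return pos_ != o.pos_; }
    };

    const_iterator begin() const { return const_iterator(&data_, 0); }
    const_iterator end() const { return const_iterator(&data_, data_.size()); }

    // Length in code points, found by a linear scan.
    std::size_t length() const {
        std::size_t n = 0;
        for (const_iterator it = begin(); it != end(); ++it) ++n;
        return n;
    }

    // Non-mutating index in code points; also a linear scan.
    char32_t operator[](std::size_t index) const {
        const_iterator it = begin();
        for (std::size_t i = 0; i < index; ++it) ++i;
        return *it;
    }

    // Access to the underlying std:: string.
    const std::string &std_str() const { return data_; }
};
```

Because operator[] returns a value rather than a reference, there is no
way to write a single code unit through it, which is the point of
dropping the mutating overload.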

In the source code on all platforms we assume that L"wide literal"
strings are UTF-16 encoded, even on platforms where wchar_t is 32 bits.
This seems to be a bit more convenient, and means that when the new
Unicode character literals are supported we can easily change.
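
A sketch of what that assumption implies when reading a wide literal: each
wchar_t is treated as one UTF-16 code unit, whatever sizeof(wchar_t) is,
and surrogate pairs are combined. The function name decode_utf16_wide is
hypothetical and this is not the fost-base code; unpaired surrogates are
simply passed through.

```cpp
#include <vector>

// Hypothetical helper: decode a null-terminated wide string whose elements
// are assumed to be UTF-16 code units, even where wchar_t is 32 bits wide.
std::vector<char32_t> decode_utf16_wide(const wchar_t *lit) {
    std::vector<char32_t> out;
    for (; *lit; ++lit) {
        char32_t unit = static_cast<char32_t>(*lit);
        char32_t next = static_cast<char32_t>(lit[1]);
        bool high = unit >= 0xD800 && unit <= 0xDBFF;
        bool low = next >= 0xDC00 && next <= 0xDFFF;
        if (high && low) {
            // Combine a surrogate pair into one code point.
            out.push_back(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
            ++lit;  // skip the low surrogate we just consumed
        } else {
            out.push_back(unit);
        }
    }
    return out;
}
```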

I've been using this style of implementation for more than 5 years (the
code linked is new as the Linux handling is new), since the MSVC 6 days
when we had to wrap Microsoft's std::wstring as it used a non-thread
safe COW implementation. It seems to work pretty well. I've been using
it for so long that I can't tell if it's intuitive or not -- I'm just
used to it.

The linear scans from the start of the string to calculate string
boundaries (given that offsets are always in UTF-32 code points) do
introduce some performance penalty. On the previous Windows
implementation we cached the UTF-16 size and the UTF-32 size -- if
they're the same (as they are nearly all the time) then you know you can
safely use an offset without decoding everything. We don't have the
experience on Linux where the underlying string is UTF-8 to know how
this would work out there. What I'm now thinking is that it would
probably be worth it to use a "rope" implementation built on top of this
underlying string which also allows us to address some other use cases
which could be faster.
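
The cached-size trick described for the Windows implementation could look
roughly like the following. This is a sketch under assumptions: it uses
std::u16string so that it is portable, the class and member names are
invented, and the input is assumed to be well-formed UTF-16.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: cache the code point count next to the code unit
// count so that offsets can skip decoding when the two are equal.
class cached_u16_text {
    std::u16string data_;
    std::size_t code_points_;

    // Code units occupied by the code point that starts at index i.
    std::size_t step(std::size_t i) const {
        bool high = data_[i] >= 0xD800 && data_[i] <= 0xDBFF;
        return (high && i + 1 < data_.size()) ? 2 : 1;
    }

public:
    explicit cached_u16_text(const std::u16string &s)
        : data_(s), code_points_(0) {
        // One scan at construction time to count code points.
        for (std::size_t i = 0; i < data_.size(); i += step(i))
            ++code_points_;
    }

    std::size_t length() const { return code_points_; }

    // True when there are no surrogate pairs, which is the common case.
    bool offsets_match() const { return code_points_ == data_.size(); }

    // Convert an offset in code points to an offset in code units.
    std::size_t unit_offset(std::size_t cp_index) const {
        if (offsets_match()) return cp_index;  // fast path, no decoding
        std::size_t i = 0;
        for (std::size_t n = 0; n < cp_index && i < data_.size(); ++n)
            i += step(i);                      // slow path, linear scan
        return i;
    }
};
```

For UTF-8 the same comparison would only succeed for pure ASCII content,
so how often the fast path applies there would depend on the data.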

Kirit

