Subject: Re: [boost] [unicode] Interest Check / Proof of Concept
From: Sebastian Redl (sebastian.redl_at_[hidden])
Date: 2008-11-19 16:18:27
Phil Endecott wrote:
> James Porter wrote:
>> Over the past few months, I've been tinkering with a Unicode string
>> library. It's still *far* from finished, but it's far enough along
>> that the overall structure is visible. I've seen a bunch of Unicode
>> proposals for Boost come and go, so hopefully this one will address
>> the most common needs people have.
>
> Hi Jim,
>
> Mine was probably one of those proposals that you looked at; for the
> record the code is all available at
>
> http://svn.chezphil.org/libpbe/trunk/include/charset/
>
> and nearby directories.
While we're throwing out Unicode libraries, I have my own tagged
strings. They're quite tied up with other code around them, so I won't
post them at this time, but the basic concepts work like this:
1) There are two string templates, templated on the character set. One
is a replace-only, reference-counted string (i.e. you can do s = other,
but it's otherwise immutable) with shared storage even for substrings
(no zero-termination guarantee). The other is a non-shared, mutable
string with the zero-termination guarantee (and small-string
optimization, but that's a detail).
2) They implement the same basic concept for non-mutating access:
bidirectional iterators, a data() function, a unit_count() (the length
in encoding base units) and a char_count() (the number of codepoints),
substr() (with iterator arguments), and a free compare() with strcmp
interface, as well as relational operators. Assignment and explicit
construction from a string with another encoding are possible;
transcoding is done automatically.
The mutable string additionally has a replace() with an iterator range
to replace and various ways of specifying the string to insert, as well
as a lot of functions that can be built upon this (insert(), append(),
erase()). Because the positions are always specified with iterators,
it's not possible to split a multi-unit codepoint.
3) The error handling policy is a runtime parameter. Three fixed
(non-extensible) policies are defined: skip bad characters, replace bad
characters with an encoding-specific replacement character, or throw an
exception. The policy is enforced when the data is stored, so a string
object can never hold invalid data.
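A minimal sketch of how such a runtime policy could be enforced at storage time. The names error_policy and store_ascii are hypothetical and do not come from the library; 7-bit ASCII stands in for a real character set to keep the example short, and '?' is an assumed replacement character.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical policy type: the three fixed policies described above.
enum class error_policy { skip, replace, raise };

// Copies `input` into a string that is guaranteed to hold only 7-bit ASCII.
// Invalid units are handled according to `policy` while storing, so the
// result can never contain invalid data.
inline std::string store_ascii(const std::string& input, error_policy policy)
{
    std::string out;
    for (unsigned char unit : input) {
        if (unit < 0x80) {
            out.push_back(static_cast<char>(unit));
            continue;
        }
        switch (policy) {
        case error_policy::skip:
            break;                      // drop the bad unit
        case error_policy::replace:
            out.push_back('?');         // assumed replacement character
            break;
        case error_policy::raise:
            throw std::invalid_argument("unit not valid in ASCII");
        }
    }
    return out;
}

// Usage sketch: the same input stored under two different policies.
inline void demo_policies()
{
    const std::string raw = std::string("a\xFF") + "b";
    assert(store_ascii(raw, error_policy::skip) == "ab");
    assert(store_ascii(raw, error_policy::replace) == "a?b");
}
```

Because the policy is an ordinary function argument, the caller can pick it per operation without changing the string's type.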
The low-level machinery is based on overloaded free functions that take
a tag parameter identifying the character set and define the
characteristics of that character set. Built on that is an iterator
adapter that uses this interface to adapt any bidirectional range whose
value type is convertible to the encoding base type into a character sequence.
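The tag-overloading idea might look roughly like the following sketch. The tag types, is_lead_unit and this free char_count are illustrative names only, not the actual interface, and the code assumes its input is already valid for the tagged encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical tag types, one per character set.
struct utf8_tag {};
struct latin1_tag {};

// Overloaded free functions selected by the tag: does `unit` start a codepoint?
inline bool is_lead_unit(utf8_tag, unsigned char unit)
{
    return (unit & 0xC0) != 0x80;   // 10xxxxxx marks a UTF-8 continuation unit
}
inline bool is_lead_unit(latin1_tag, unsigned char)
{
    return true;                    // single-byte set: every unit is a codepoint
}

// Generic algorithm built only on the tag-dispatched interface.
// Counts codepoints in a range of already-valid units.
template <typename Tag, typename Iterator>
std::size_t char_count(Tag tag, Iterator first, Iterator last)
{
    std::size_t count = 0;
    for (; first != last; ++first) {
        if (is_lead_unit(tag, static_cast<unsigned char>(*first))) {
            ++count;
        }
    }
    return count;
}

// Usage sketch: the same 14 bytes counted under two different tags.
inline void demo_char_count()
{
    const std::string units = "Gr\xC3\xBC\xC3\x9F" "e, Welt!";
    assert(char_count(utf8_tag(), units.begin(), units.end()) == 12u);
    assert(char_count(latin1_tag(), units.begin(), units.end()) == 14u);
}
```

Adding a new character set in this style means adding a tag and its overloads; the generic algorithm stays untouched.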
These string classes are actually part of an I/O framework, and the
intended usage is that character sets are converted upon I/O, so that
the internally used character set is always statically known. Thus,
there is no way to get a string with a runtime-determined encoding.
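To illustrate a statically known encoding with transcoding on explicit construction, here is a rough sketch. tagged_string, the latin1 and utf8 tags and transcode are hypothetical stand-ins, not the real classes, and only the ISO-8859-1 to UTF-8 direction is shown.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical encoding tags.
struct latin1 {};
struct utf8 {};

// ISO-8859-1 to UTF-8: each unit >= 0x80 becomes a two-unit sequence.
inline std::string transcode(latin1, utf8, const std::string& in)
{
    std::string out;
    for (unsigned char unit : in) {
        if (unit < 0x80) {
            out.push_back(static_cast<char>(unit));
        } else {
            out.push_back(static_cast<char>(0xC0 | (unit >> 6)));
            out.push_back(static_cast<char>(0x80 | (unit & 0x3F)));
        }
    }
    return out;
}

// The encoding is part of the type, so it is always known at compile time.
template <typename Encoding>
class tagged_string {
public:
    // Takes units already in `Encoding` (this sketch does not validate them).
    explicit tagged_string(std::string units) : units_(std::move(units)) {}

    // Explicit construction from another encoding transcodes automatically.
    template <typename Other>
    explicit tagged_string(const tagged_string<Other>& other)
        : units_(transcode(Other(), Encoding(), other.units())) {}

    const std::string& units() const { return units_; }

private:
    std::string units_;
};

// Usage sketch: "Grüße" in ISO-8859-1 (5 units) becomes 7 units in UTF-8.
inline void demo_transcode()
{
    const tagged_string<latin1> source(std::string("Gr\xFC\xDF" "e"));
    const tagged_string<utf8> converted(source);
    assert(converted.units().size() == 7u);
}
```

A conversion for which no transcode overload exists fails to compile, which is one way of keeping every encoding decision visible in the types.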
Some examples from the test cases:
BOOST_AUTO_TEST_CASE( roundtrip )
{
    fast_substring<native_narrow_encoding> start =
        string_literal("Grüße, W€lt!");
    xstring<utf8> as_utf8(start);
    fast_substring<iso_8859_15> as_iso_8859_15(as_utf8);
    fast_substring<utf16le> as_utf16le(as_iso_8859_15);
    xstring<utf16> as_utf16(as_utf16le);
    xstring<windows_1252> as_windows_1252(as_utf16);
    fast_substring<utf16be> as_utf16be(as_windows_1252);
    fast_substring<native_narrow_encoding> finish(as_utf16be);
    BOOST_CHECK(start == finish);
}
BOOST_AUTO_TEST_CASE( string_literal )
{
    str_t s(str::string_literal("Gr\u00FC\u00DFe, Welt!"));
    BOOST_CHECK_EQUAL(s.unit_count(), 14u);
    BOOST_CHECK_EQUAL(s.char_count(), 12u);
}
I've implemented UTF-8; UTF-16 based on uint16_t; big- and little-endian
UTF-16 based on bytes; and UTF-32 based on uint32_t. I've also
implemented ISO-8859-1, ISO-8859-15 and Windows-1252.
If you're interested, I can try extracting the code, or post the whole
thing.
Sebastian