Subject: Re: [boost] [string] Realistic API proposal
From: Anders Dalvander (boost_at_[hidden])
Date: 2011-01-29 06:05:35


On 2011-01-28 20:12, Joe Mucchiello <jmucchiello_at_[hidden]> wrote:
> // conversion for Windows API
> std::vector<wchar_t> vec;
> vec.resize(count_codepoints<utf8>(mystring.begin(), mystring.end()));
> convert<utf8,utf16>(mystring.begin(), mystring.end(), vec.begin());

I spy with my little eye a potential crash waiting to happen.
Code points != code units.
vec has room for N code units (one per code point), but up to 2*N code
units may be written to it if mystring contains non-BMP characters, since
each of those needs a surrogate pair in UTF-16.

"Corrected" code:

    std::vector<wchar_t> vec;
    vec.resize(count_codeunits<wchar_encoding>(mystring.begin(),
                                               mystring.end()));
    convert<wchar_encoding>(mystring.begin(), mystring.end(), vec.begin());
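
To make the mismatch concrete, here is a minimal sketch of the two counts
for UTF-8 input. The helper names are hypothetical (they are not part of
the proposed API), and the code assumes well-formed UTF-8 and a 16-bit
wchar_t as on Windows.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-8 by counting
// every byte that is not a continuation byte (10xxxxxx).
std::size_t count_codepoints_utf8(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical helper: counts the UTF-16 code units needed to hold the same
// text. A 4-byte lead byte (11110xxx) starts a non-BMP code point, which
// needs a surrogate pair (2 units); everything else needs 1 unit.
std::size_t count_utf16_units_for_utf8(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) == 0x80) continue;   // skip continuation bytes
        n += ((c & 0xF8) == 0xF0) ? 2 : 1;
    }
    return n;
}
```

For a string holding "a", U+00E9 and U+1F600, the first count is 3 and the
second is 4, so a buffer sized by code points is one unit too small.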

I think a lot of these potential crashes could be prevented if the
iterator of the new string-type (chain,text,tier,yarn) would only expose
(const) code-points. Actual code-units would be hidden, and only
accessed using a facade/adapter view/iterator.

    auto u8v = make_view<utf8_encoding>(mystring);
    auto u16v = make_view<utf16_encoding>(mystring);

    for (auto codepoint : mystring) {...}
    for (auto u8codeunit : u8v) {...}
    for (auto u16codeunit : u16v) {...}
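
As a rough illustration of the code-point-only iteration idea, the sketch
below wraps a UTF-8 std::string in a read-only view whose iterator yields
char32_t code points. The class name codepoint_view is hypothetical, the
storage is assumed to be well-formed UTF-8, and no validation is done.

```cpp
#include <cstddef>
#include <string>

// Hypothetical read-only view: iterating yields code points (char32_t),
// never raw UTF-8 code units. Assumes well-formed UTF-8 in the string.
class codepoint_view {
public:
    class iterator {
    public:
        explicit iterator(const char* p) : p_(p) {}

        // Decode the code point starting at the current position.
        char32_t operator*() const {
            const unsigned char* u = reinterpret_cast<const unsigned char*>(p_);
            unsigned char lead = u[0];
            if (lead < 0x80) return lead;
            if ((lead & 0xE0) == 0xC0)
                return ((lead & 0x1Fu) << 6) | (u[1] & 0x3Fu);
            if ((lead & 0xF0) == 0xE0)
                return ((lead & 0x0Fu) << 12) | ((u[1] & 0x3Fu) << 6)
                     | (u[2] & 0x3Fu);
            return ((lead & 0x07u) << 18) | ((u[1] & 0x3Fu) << 12)
                 | ((u[2] & 0x3Fu) << 6) | (u[3] & 0x3Fu);
        }

        // Advance by the length of the current sequence (1 to 4 bytes).
        iterator& operator++() {
            unsigned char lead = static_cast<unsigned char>(*p_);
            p_ += lead < 0x80 ? 1
                : (lead & 0xE0) == 0xC0 ? 2
                : (lead & 0xF0) == 0xE0 ? 3 : 4;
            return *this;
        }

        bool operator!=(const iterator& o) const { return p_ != o.p_; }

    private:
        const char* p_;
    };

    explicit codepoint_view(const std::string& s) : s_(s) {}
    iterator begin() const { return iterator(s_.data()); }
    iterator end() const { return iterator(s_.data() + s_.size()); }

private:
    const std::string& s_;  // the view does not own the string
};
```

With this, `for (char32_t cp : codepoint_view(s))` never sees a partial
sequence, so user code cannot confuse code points with code units.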

I also think there is no reason that the new string-type *has* to be
UTF-8 internally. It could be UTF-16, UTF-32, SCSU, or CESU-8 internally
for that matter. Making a view from the internal encoding to an external
encoding when both encodings are the same should be a no-op.

Regards,
Anders Dalvander

-- 
WWFSMD?
