Subject: Re: [boost] [string] Realistic API proposal
From: Anders Dalvander (boost_at_[hidden])
Date: 2011-01-29 06:05:35
On 2011-01-28 20:12, Joe Mucchiello<jmucchiello_at_[hidden]> wrote:
> // conversion for Windows API
> std::vector<wchar_t> vec;
> vec.resize(count_codepoints<utf8>(mystring.begin(), mystring.end()));
> convert<utf8,utf16>(mystring.begin(), mystring.end(), vec.begin());
I spy with my little eye a potential crash waiting to happen.
Code-points != Code-units.
vec has room for N code-units, but 2*N code-units may be written to it
if mystring contains non-BMP characters.
"Corrected" code:
std::vector<wchar_t> vec;
vec.resize(count_codeunits<wchar_encoding>(mystring.begin(), mystring.end()));
convert<wchar_encoding>(mystring.begin(), mystring.end(), vec.begin());
I think a lot of these potential crashes could be prevented if the
iterator of the new string-type (chain, text, tier, yarn) would only
expose (const) code-points. Actual code-units would be hidden, and only
accessible through a facade/adapter view/iterator.
auto u8v = make_view<utf8_encoding>(mystring);
auto u16v = make_view<utf16_encoding>(mystring);
for (auto codepoint : mystring) {...}
for (auto u8codeunit : u8v) {...}
for (auto u16codeunit : u16v) {...}
I also think there isn't a reason that the new string-type *has* to be
UTF-8 internally. It could be UTF-16, UTF-32, SCSU, or CESU-8 internally
for that matter. Making a view from the internal encoding to an external
encoding when both encodings are the same should be a no-op.
Regards,
Anders Dalvander
-- WWFSMD?
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk