Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Chad Nelson (chad.thecomfychair_at_[hidden])
Date: 2011-01-19 08:02:25


On Wed, 19 Jan 2011 00:44:39 -0800 (PST)
Artyom <artyomtnk_at_[hidden]> wrote:

>> From: Chad Nelson <chad.thecomfychair_at_[hidden]>
>>> 3. There is no special type char8_t distinct from char, so you
>>> can't use it.
>>
>> That's why I wrote the utf8_t type. I'd have been quite happy to
>> just use an std::basic_string<utf8_byte_t>, and I looked into the
>> C++0x "opaque typedef" idea to see if it was possible.
>
> Even if opaque typedefs were included in C++0x, they would still
> not be feasible for use as strings. [...] I faced this problem when
> testing char16_t/char32_t under gcc with a partial standard library
> implementation that hadn't specialized classes for them, and I
> couldn't get many things to work.
>
> This is real problem.
>
> So it would just not work even if C++0x had opaque typedefs.

I think there would have been ways around the problem. For the example
you quoted, the most logical solution would probably be to just use
basic_stringstream<wchar_t> and convert the string afterward. Not a
very satisfying solution, but it would have worked.

In any case, the point is moot, since opaque typedefs won't be in C++0x.

>>> Ok...
>>>
>>> The paragraph above is inherently wrong
>>
>> Oh?
>>
>
> I hope you are not offended, but I have seen so many things go
> wrong because of such assumptions that I'm a little bit frustrated
> that such things come up again and again.

'Fraid they'll continue to come up, because there are always new
developers and there isn't a lot of information on the subject
available where a developer would stumble into it by accident. Having a
set of UTF string types with three different kinds of iterators would
at least make some C++ programmers realize that the problem exists,
when they wouldn't have before.

>> Or if your program allows the user to edit a file, you want
>> something that gives you single characters, regardless of how many
>> bytes or code-points they're encoded in.
>>
>
> That is what great Unicode aware toolkits like Qt, GtkMM and others
> with hundreds and thousands of lines of code do for you. [...]

Which is great, if you happen to be using a Qt-based or Gtk-based
interface in your program, but useless if you're not. I'd prefer a
solution that's not tied to monolithic libraries that try to deliver
everything and the kitchen sink.

>> I'm trying to understand your point, but with no success so far. If
>> you want something that gives you characters or code-points, then an
>> std::string has no chance of working in any multi-byte encoding -- a
>> UTF-whatever-specific type does.
>>
>
> It works perfectly well. However for text analysis you either:
>
> 1. Use a library like Boost.Locale
> 2. Relate to the ASCII subset of the text allowing to handle 99% of
> various formats there - you don't need code point iterators for
> this.

Why would you want to do either of those, when something like a utf8_t
class could make the Boost.Locale interface easier and more intuitively
obvious to use, and eliminate the ASCII restriction too?

>> Your point seems to be that the utf*_t classes are actively harmful
>> in some way that I don't see, and using std::string somehow
>> mitigates that by making you do more work. Or am I misunderstanding
>> you?
>>
>
> My statement is the following:
>
> - utf*_t would not give any real added value - that is what I was
> trying to show, and if you want to iterate over codepoints you can do
> it with an external iterator over std::string perfectly well.

You can also handle strings perfectly well the C way, with manually
allocated memory, strcpy, strlen, and the like. But you still see the
benefits of using an std::string class.
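
For reference, here is a minimal sketch of what such an external code-point walk over a plain std::string might look like. The names next_code_point and code_points are made up for illustration and come from no library, and the decoder is deliberately simplified: it replaces structurally broken input with U+FFFD but does not reject overlong forms or encoded surrogates.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from any library): decodes the code point that
// starts at s[i] and advances i past it. Structurally malformed input
// yields U+FFFD. Overlong forms and surrogates are NOT rejected.
inline char32_t next_code_point(const std::string& s, std::size_t& i) {
    const auto byte = [&](std::size_t k) {
        return static_cast<unsigned char>(s[k]);
    };
    const unsigned char lead = byte(i++);
    if (lead < 0x80) return lead;  // plain ASCII

    int extra;     // number of continuation bytes expected
    char32_t cp;   // payload bits taken from the lead byte
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else return 0xFFFD;  // stray continuation byte or invalid lead

    for (int n = 0; n < extra; ++n) {
        // Each continuation byte must look like 10xxxxxx.
        if (i >= s.size() || (byte(i) & 0xC0) != 0x80) return 0xFFFD;
        cp = (cp << 6) | (byte(i++) & 0x3F);
    }
    return cp;
}

// Collects every code point in a UTF-8 encoded std::string.
inline std::vector<char32_t> code_points(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();)
        out.push_back(next_code_point(s, i));
    return out;
}
```

It works, but every caller has to remember that the std::string is UTF-8 and has to drag the helper along. It also yields code points only, not user-visible characters.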

> But in most cases you don't want to iterate over codepoints but
> rather characters, words and other text entities, and codepoints
> would not help you with this.

An explicit *character* iterator, over a UTF-type, would solve that
problem.

> - utf*_t would create trouble, as it would require instant conversions
> between utf*_t types and 1001 other libraries.
>
> And what is even more important it can't be simply integrated into
> existing C++ string framework.

Oh? :-)

The way I'm envisioning it, you could do something like this...

    utf8_t foo = some_utf8_text;
    cout << *native_t(foo);

...to send a string to stdout. It would be automatically transcoded to
the system's current code page (probably using Boost.Locale) if the
code page isn't already UTF-8, and the asterisk would provide an
std::string in that encoding. Though of course, the utf*_t classes
would be provided with an output operator of their own that would take
care of that for you, so you wouldn't have to.

If you needed to interface with a Windows API function...

    utf16_t bar = foo; // Automatic conversion
    DrawTextW(dc, bar->c_str(), bar->length(), rect, flags);

...that would do the trick, and would probably get buried in a library
function of some sort that takes a utf16_t type. If you fed it an
std::string, std::wstring, or utf32_t type, it would be automatically
converted when the function is called. And if you fed it a utf16_t, of
course, no conversion would be done; it would be used as-is.
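
To make the idea concrete, here is a rough sketch of how the automatic utf8_t to utf16_t conversion could be wired up with a converting constructor. Everything in it is an assumption for illustration, not an existing implementation: the class layout, the use of std::u16string as storage (on Windows you would more likely store wchar_t so that c_str() can go straight to a W function), and the decoder, which trusts its input to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch of the utf8_t idea: a thin wrapper around std::string
// that records, in the type, that the bytes are UTF-8 (validity is assumed).
class utf8_t {
public:
    utf8_t(const std::string& bytes) : data_(bytes) {}
    const std::string& operator*() const { return data_; }
    const std::string* operator->() const { return &data_; }
private:
    std::string data_;
};

// Hypothetical UTF-16 counterpart. The non-explicit constructor taking a
// utf8_t is what makes "utf16_t bar = foo;" transcode automatically.
class utf16_t {
public:
    utf16_t(const std::u16string& units) : data_(units) {}

    utf16_t(const utf8_t& src) {
        const std::string& s = *src;
        for (std::size_t i = 0; i < s.size();) {
            const unsigned char lead = static_cast<unsigned char>(s[i++]);
            // Continuation bytes implied by the lead byte (valid input assumed).
            const int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
            char32_t cp = extra == 0 ? lead : lead & (0x3F >> extra);
            for (int n = 0; n < extra && i < s.size(); ++n)
                cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);

            if (cp < 0x10000) {
                data_.push_back(static_cast<char16_t>(cp));
            } else {
                // Outside the BMP: encode as a surrogate pair.
                cp -= 0x10000;
                data_.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
                data_.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
            }
        }
    }

    const std::u16string& operator*() const { return data_; }
    const std::u16string* operator->() const { return &data_; }
private:
    std::u16string data_;
};
```

With that in place, "utf16_t bar = foo;" compiles as written above, and bar->c_str() and bar->length() reach the underlying string through operator->.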

So while you might have to do some conversion to other string types to
interface with different existing libraries (like Qt), the process is
very simple and can probably be automated. *If* you decided to use the
utf*_t types at all. And as I've said before, you can simply use
std::string for any functions that are encoding-agnostic.

> - All I suggest is that when you work on Windows, don't use ANSI
> encodings; assume that std::string is UTF-8 encoded and convert it to
> UTF-16 on system call boundaries.

Assumptions like that will cause problems for existing codebases, which
are probably using std::strings in ways that would break. Can something
as widely used as Boost afford to make a breaking change like that?

On the other hand, with a set of UTF types, you could provide two
overloads, one that blindly operates on std::strings as it does now, and
one that works on the most convenient UTF form, which would
automatically provide some guarantees about the content (such as that
it's valid).
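
As a sketch of what I mean by two overloads, assume a utf8_t whose constructor checks its input. The truncate function, the well_formed helper, and the decision to throw on bad input are all invented for this example, and the check is structural only (it does not reject overlong forms or surrogates).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical validating wrapper. The name follows this thread; the
// details are assumptions made for the example.
class utf8_t {
public:
    // explicit, so a plain std::string never silently picks the UTF overload.
    explicit utf8_t(const std::string& bytes) : data_(bytes) {
        if (!well_formed(data_)) throw std::invalid_argument("not UTF-8");
    }
    const std::string& operator*() const { return data_; }

    // Structural check only: each lead byte must be followed by the right
    // number of continuation bytes.
    static bool well_formed(const std::string& s) {
        for (std::size_t i = 0; i < s.size();) {
            const unsigned char lead = static_cast<unsigned char>(s[i++]);
            int extra;
            if (lead < 0x80) extra = 0;
            else if ((lead & 0xE0) == 0xC0) extra = 1;
            else if ((lead & 0xF0) == 0xE0) extra = 2;
            else if ((lead & 0xF8) == 0xF0) extra = 3;
            else return false;
            for (; extra > 0; --extra, ++i)
                if (i >= s.size() ||
                    (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
                    return false;
        }
        return true;
    }
private:
    std::string data_;
};

// Existing behaviour: cuts at a byte count, whatever the encoding is.
inline std::string truncate(const std::string& s, std::size_t max_bytes) {
    return s.substr(0, max_bytes);
}

// UTF-aware overload: never cuts in the middle of a code point.
inline utf8_t truncate(const utf8_t& s, std::size_t max_bytes) {
    const std::string& bytes = *s;
    if (bytes.size() <= max_bytes) return s;
    std::size_t end = max_bytes;
    // Back up while the byte at the cut is a continuation byte (10xxxxxx).
    while (end > 0 && (static_cast<unsigned char>(bytes[end]) & 0xC0) == 0x80)
        --end;
    return utf8_t(bytes.substr(0, end));
}
```

Existing callers that pass std::string keep getting exactly the old behaviour, while a caller holding a utf8_t gets the encoding-aware version without asking for it.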

If, of course, the function you're using cares about the encoding. As
you pointed out, most won't, and can be left using std::string with no
problem.

And if you, the function's author, want to move away from the
std::string form, you just mark it deprecated and leave it there with a
warning about when it will go away. The company using the library can
make its own decision about whether to upgrade beyond that point or
not. I don't foresee many authors with a need for that kind of thing,
but for those that do, it would be nice if it were there.

> Basically - don't reinvent things, try to make current code work well
> - it has some design flaws, but overall C++/STL is totally fine for
> Unicode handling; it needs some things to improve, but providing
> utf*_t classes is not the way to go.
>
> This is **My** point of view.

Thanks for making it clear. I have to disagree though. Most programmers
don't want to delve into Unicode and learn about the intricacies of
code-points and the like. They just want to use it. The UTF string
types should let them do so, in most cases, with a much gentler
learning curve than using ICU (or even Boost.Locale) directly.

-- 
Chad Nelson
Oak Circle Software, Inc.