Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-01-19 03:44:39
> From: Chad Nelson <chad.thecomfychair_at_[hidden]>
> On Tue, 18 Jan 2011 11:01:10 -0800 (PST)
> Artyom <artyomtnk_at_[hidden]> wrote:
> >>> 2. Reinvent standard library to use new string
> >> Not entirely necessary, for the same reason that very few changes to
> >> the standard library are needed when you switch from char strings to
> >> char16_t strings to char32_t strings -- the standard library,
> >> designed around the idea of iterators, is mostly type-agnostic.
> > Ok... Few things:
> > 1. UTF-32 is waste of space - don't use it unless it is something
> > like handling code points (char32_t)
> > 2. UTF-16 is too error prone (See: UTF-16 considered harmful)
> No argument with either assertion.
> > 3. There is not special type char8_t distinct from char, so you
> > can't use it.
> That's why I wrote the utf8_t type. I'd have been quite happy to just
> use an std::basic_string<utf8_byte_t>, and I looked into the C++0x
> "opaque typedef" idea to see if it was possible.
Even if opaque typedefs had been included in C++0x, they would still not be feasible for use as a string type.
You see, with a character type come many other goodies the standard library provides.
Why does this work:

    std::stringstream ss;
    ss << 10.4;

while this does not:

    std::basic_stringstream<utf8_byte_t> ss;
    ss << 10.4;
This is not only because you would have problems overloading << for unsigned int both as a number and as a "character"; it is because when you try to write a double into the stream it would call a facet, and:
1. That facet is not defined and not installed for the new character type.
2. It may not even be possible to create, because some facets are
explicitly specialized for character types - for example, the
codecvt facet is specialized for char and wchar_t (and in C++0x char16_t).
I had faced this problem when I tested char16_t/char32_t under gcc with a partial
standard library implementation that hadn't specialized these classes for them, and
I couldn't get many things to work.
This is a real problem.
So it would just not work, even if C++0x had opaque typedefs.
> >> so they'll work
> >> fine with most library functions, so long as those functions don't
> >> care that some characters are encoded as multiple bytes. It's just
> >> the ones that assume that a single byte represents all characters
> >> that you have to replace, and you'd have to replace those
> >> regardless of whether you're using a new string type or not, if
> >> you're using any multi-byte encoding.
> > Ok...
> > The paragraph above is inherently wrong
I hope you are not offended, but I have seen so many
things go wrong because of such assumptions that I'm a little
bit frustrated that they come up again and again.
> > Once again - when you work with string you don't work with them as
> > series of characters you want with them and text entities - text
> > chunks.
> That depends on what you're doing with them. If you're using them as
> translations for messages your program is sending out, then your
> statement is correct -- you treat them as opaque blobs. But if for
> instance you're parsing a file, you want tokens, which *are* merely an
> arbitrary series of characters.
What are tokens constructed of? A series of characters, right?
And series of characters are represented as text chunks, which
can be searched easily.
In fact, I have written some JSON and HTML parsers that are
fully encoding- and UTF-8-aware without any need to access specific
code points.
Note: I did validate the text - that it is valid UTF-8 - but that
is a separate stage, and it does not require me to iterate
over each code point.
> Or if your program allows the user to
> edit a file, you want something that gives you single characters,
> regardless of how many bytes or code-points they're encoded in.
That is what great Unicode-aware toolkits like Qt, GtkMM and others
do for you, with hundreds of thousands of lines of code.
Of course, you may use Boost.Locale, which provides character, word,
sentence and line break iterators over plain strings very well.
> I'm trying to understand your point, but with no success so far. If you
> want something that gives you characters or code-points, then an
> std::string has no chance of working in any multi-byte encoding -- a
> UTF-whatever-specific type does.
It works perfectly well. However, for text analysis you either:
1. Use a library like Boost.Locale, or
2. Restrict yourself to the ASCII subset of the text, which allows you to handle 99% of
the various formats out there - you don't need code point iterators for this.
> Your point seems to be that the utf*_t classes are actively harmful in
> some way that I don't see, and using std::string somehow mitigates that
> by making you do more work. Or am I misunderstanding you?
My statement is the following:
- utf*_t would not give any real added value - that is what I was trying to show,
and if you want to iterate over code points you can do it with an external iterator
over std::string perfectly well.
But in most cases you don't want to iterate over code points, but rather over
words and other text entities, and code points would not help you with this.
- utf*_t would create trouble, as it would require constant conversions
between utf*_t types and 1001 other libraries.
And, what is even more important, it can't be simply integrated
into the existing C++ string framework.
- All I suggest is: when you work on Windows, don't use ANSI encodings; assume
that std::string is UTF-8 encoded and convert it to UTF-16 at the system call boundary.
Basically - don't reinvent things; try to make the current code
work well. It has some design flaws, but overall C++/STL is
totally fine for Unicode handling. It needs some things improved,
but providing utf*_t classes is not the way to go.
This is **My** point of view.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk