From: Miro Jurisic (macdev_at_[hidden])
Date: 2005-03-16 14:53:06


In article <d19pdf$jhu$1_at_[hidden]>, Erik Wien <wien_at_[hidden]> wrote:

> Thorsten Ottosen wrote:
> > | Current design:
> > | The current design is based around the concept of «encoding traits».
> >
> > Is it entirely improper to make Unicode strings a typedef
> > for std::basic_string<...> ?
>
> Not entirely, but certainly less than optimal. basic_string (and the
> iostreams) make assumptions that don't necessarily apply to Unicode text.
> One of them is that strings can be represented as a sequence of equally
> sized characters. Unicode can be represented that way, but that would
> mean you'd have to use 32 bits per character to be able to represent all
> the code points assigned in the Unicode standard. In most cases, that is
> way too much overhead for a string, and usually also a waste, since
> Unicode code points rarely require more than 16 bits to be encoded. You
> could of course implement Unicode for 16-bit characters in basic_string,
> but that would require that the user know about things like surrogate
> pairs, and also know how to correctly handle them. An unlikely scenario.

I completely agree with Erik on this. std::string makes assumptions that do not
hold for Unicode characters, and it provides interfaces that are misleading (or
outright wrong) for Unicode strings. For example, basic_string lets you erase a
single element, which can make the string no longer be a valid Unicode string
(unless the elements are represented in UTF-32). The same problem exists with every
other mutating operation on basic_string, including operator[].
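
To make the erase problem concrete, here is a minimal sketch. It assumes UTF-16
stored in std::u16string (a C++11 type used purely for illustration), and the
helper names is_valid_utf16 and erase_keeps_valid_utf16 are hypothetical, not
part of any proposed library.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-16: every high surrogate
// (0xD800-0xDBFF) is immediately followed by a low surrogate
// (0xDC00-0xDFFF), and no low surrogate appears on its own.
bool is_valid_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF) {
            if (i + 1 >= s.size()) return false;          // truncated pair
            const char16_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i;                                          // skip the low half
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            return false;                                 // stray low surrogate
        }
    }
    return true;
}

// Erases exactly one element (one 16-bit code unit) at `index`, which is
// what basic_string::erase permits, and reports whether the result is
// still well-formed UTF-16. Takes `s` by value so the caller's copy is
// left untouched.
bool erase_keeps_valid_utf16(std::u16string s, std::size_t index) {
    s.erase(index, 1);
    return is_valid_utf16(s);
}
```

With the string u"a\U0001D11Eb" (four elements, because U+1D11E needs a
surrogate pair), erasing element 0 leaves a valid string, but erasing element 1
or 2 splits the pair and leaves an unpaired surrogate behind.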

> > and what is the benefit of having a function vs a function template?
> > surely a function template will look the same to the client as an ordinary
> > function; Is it often used that people must change encoding on the fly?
>
> Normally I would not think so, and my first implementation did not work
> this way. That one was implemented with the entire string class being
> templated on encoding, and thereby eliminating the whole implementation
> inheritance tree in this implementation.
>
> There was however (as far as I could tell at least) some concern about
> this approach in the other thread. (Mostly related to code size and
> being locked into an encoding at compile time.) Some thought that could
> be a problem for XML parsers and related technology that needs to
> establish encoding at run-time. (When reading files for example) This
> new implementation was simply a test to see if an alternate solution
> could be found, without those drawbacks. (It has a plethora of new ones
> though.)

Here I also agree. Having multiple string classes would just force everyone to
pick one for, in most cases, no good reason whatsoever. If I am writing code
that uses C++ strings, which encoding should I choose? Why should I care?
Particularly, if I don't care, why would I have to choose anyway? More than
likely, I would just choose the same thing 99% of the time anyway.

I believe that the ability to force a Unicode string to be in a particular
encoding has some value -- especially for people doing low-level work such as
serializing Unicode strings to XML, and for people who need to understand time
and space complexity of various Unicode encodings -- but I do not believe that
this justifiable demand for complexity means we should make the interface harder
for everyone else.

I do, however, think that some people are going to feel that they need to
eliminate the runtime overhead of generalized strings and explicitly instantiate
strings in a particular encoding, and I don't know whether the library currently
provides a facility to accomplish this.
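
For illustration only, a string fixed to one encoding at compile time might look
roughly like the sketch below. Every name in it (utf16_traits, utf32_traits,
encoded_string) is hypothetical and is not taken from Erik's implementation; it
also assumes C++11 character types and does no validation of the code points it
is given.

```cpp
#include <cstddef>
#include <string>

// Hypothetical encoding traits: each names its code unit type and knows
// how to append one code point to a buffer of those units.
struct utf16_traits {
    using unit_type = char16_t;
    static void encode(char32_t cp, std::basic_string<unit_type>& out) {
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
};

struct utf32_traits {
    using unit_type = char32_t;
    static void encode(char32_t cp, std::basic_string<unit_type>& out) {
        out.push_back(cp);  // one unit per code point
    }
};

// A string whose encoding is chosen at compile time. The interface is
// expressed in code points; the storage cost depends on the traits.
// Note: this sketch does not reject surrogates or values above 0x10FFFF.
template <class Traits>
class encoded_string {
public:
    void push_back(char32_t cp) {
        Traits::encode(cp, units_);
        ++code_points_;
    }
    std::size_t code_points() const { return code_points_; }
    std::size_t code_units() const { return units_.size(); }

private:
    std::basic_string<typename Traits::unit_type> units_;
    std::size_t code_points_ = 0;
};
```

Appending 'a' and U+1D11E to encoded_string<utf16_traits> yields two code points
in three code units, while encoded_string<utf32_traits> stores the same text in
two units. That is the kind of space trade-off someone pinning the encoding
would be choosing deliberately.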

meeroh

