From: Erik Wien (wien_at_[hidden])
Date: 2005-03-16 12:13:36


Thorsten Ottosen wrote:
> Hi Erik,

Hi! Thanks for your reply.

> Let me first say that it's good to see that progress is happening
> on this important topic.
>
> Here are just some small comments; I didn't follow the first discussion,
> so maybe these things have already been answered.
>
> | Current design:
> | The current design is based around the concept of «encoding_traits».
>
> Is it entirely improper to make unicode strings a typedef
> for std::basic_string<...> ?

Not entirely, but certainly less than optimal. basic_string (and the
iostreams) make assumptions that don't necessarily apply to Unicode text.
One of them is that strings can be represented as a sequence of equally
sized characters. Unicode can be represented that way, but that would
mean you'd have to use 32 bits per character to be able to represent all
the code points assigned in the Unicode standard. In most cases, that is
way too much overhead for a string, and usually also a waste, since
Unicode code points rarely require more than 16 bits to be encoded. You
could of course implement Unicode with 16-bit characters in basic_string,
but that would require that the user know about things like surrogate
pairs, and also know how to correctly handle them. An unlikely scenario.
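
To make the surrogate pair issue concrete, here is a minimal sketch of
how a single code point maps to UTF-16 code units. The function name
encode_utf16 is only an illustration and is not taken from the posted
code; it assumes the caller passes a valid code point (at most
U+10FFFF and not itself a surrogate).

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// Assumption: cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        // Everything up to U+FFFF fits in a single 16-bit unit.
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Above U+FFFF, 20 bits remain and are split across two units.
        std::uint32_t v = cp - 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));   // high surrogate
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return units;
}
```

With this, U+0041 comes out as one unit, while U+1D11E comes out as the
two units 0xD834 0xDD1E. Code that indexes a 16-bit basic_string by
position would see those as two separate "characters".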

By using encoding_traits however, we are able to make a string class
that internally works with 8, 16 or 32 bit code units (UTF-8, 16 and 32
respectively), but that has an external interface that uses 32 bit code
points, abstracting away the underlying encoding. Compared with storing
every character in 32 bits, doing it that way easily halves the
effective size of a string for most users (when using UTF-16, for
example).
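
As a rough illustration of that idea (and not the interface of the
posted code), the sketch below stores UTF-16 code units internally and
hands out 32-bit code points. The names utf16_traits, encoded_string,
decode, code_points and storage_bytes are all hypothetical, and the
decoder assumes well-formed input.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical traits type describing UTF-16 storage.
struct utf16_traits {
    typedef std::uint16_t code_unit;

    // Decode the code point starting at units[i] and advance i past it.
    // Assumption: the input is well-formed UTF-16.
    static std::uint32_t decode(const std::vector<code_unit>& units,
                                std::size_t& i) {
        std::uint32_t u = units[i++];
        if (u >= 0xD800 && u <= 0xDBFF) {  // high surrogate: a pair follows
            std::uint32_t lo = units[i++];
            return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        }
        return u;
    }
};

// Hypothetical string: code units inside, code points outside.
template <typename Traits>
class encoded_string {
public:
    explicit encoded_string(std::vector<typename Traits::code_unit> units)
        : units_(std::move(units)) {}

    // External view: one 32-bit value per code point.
    std::vector<std::uint32_t> code_points() const {
        std::vector<std::uint32_t> out;
        for (std::size_t i = 0; i < units_.size();)
            out.push_back(Traits::decode(units_, i));
        return out;
    }

    // Internal cost: bytes actually stored.
    std::size_t storage_bytes() const {
        return units_.size() * sizeof(typename Traits::code_unit);
    }

private:
    std::vector<typename Traits::code_unit> units_;
};
```

A string holding the units 0x48 0x69 0xD834 0xDD1E would report three
code points (U+0048, U+0069, U+1D11E) while storing 8 bytes, where a
32-bit representation of the same text needs 12.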

> and what is the benefit of having a function vs a function template?
> surely a function template will look the same to the client as an ordinary
> function; Is it often used that people must change encoding on the fly?

Normally I would not think so, and my first implementation did not work
this way. In that one the entire string class was templated on encoding,
which eliminated the whole implementation inheritance tree that the
current implementation has.

There was however (as far as I could tell at least) some concern about
this approach in the other thread, mostly related to code size and
being locked into an encoding at compile time. Some thought that could
be a problem for XML parsers and related technologies that need to
establish the encoding at run-time (when reading files, for example).
This new implementation was simply a test to see if an alternate
solution could be found, without those drawbacks. (It has a plethora of
new ones though.)
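
To show why a file reader cannot always fix the encoding at compile
time, here is a small sketch that guesses the encoding from a byte
order mark at the start of a buffer. The enum and the function name
detect_encoding are made up for this example. It deliberately ignores
UTF-32 marks and the XML declaration, so a real parser would need more
than this.

```cpp
#include <cstddef>

// Hypothetical set of encodings a reader might have to choose between.
enum class encoding { utf8, utf16_le, utf16_be, unknown };

// Guess the encoding from a byte order mark, if one is present.
// Simplification: UTF-32 marks are not recognised here.
encoding detect_encoding(const unsigned char* data, std::size_t size) {
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return encoding::utf8;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return encoding::utf16_le;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return encoding::utf16_be;
    return encoding::unknown;
}
```

The result is only known once the file has been opened, so a string
type templated on encoding would have to be chosen through some kind of
run-time dispatch on this value.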

I am more than willing to change this if the current design is no good.
Starting a discussion on this is one of my main reasons for posting the
code in the first place.

> |You do however gain speed (I would assume), since you
> |wouldn't have the overhead of virtual function-calls, as well as a less
> |complex implementation.
>
> It would be good to see some real data on how much slower it gets. If the
> slowdown is high, then you should consider a two-layered approach
> (implementing the virtual functions in terms of the non-virtual) or
> to remove the virtual functions altogether.

Yep. Some profiling of the different designs would be a good idea, and
will probably be done in the near future.
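
For reference, the two-layered approach could look roughly like the
sketch below: a non-virtual class templated on the encoding does the
work, and a thin virtual wrapper forwards to it for code that only
learns the encoding at run-time. All names here (utf8_policy,
utf16_policy, basic_text, any_text, text_adapter) are invented for the
example, and well-formed input is assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical per-encoding policies: count code points in code units.
struct utf8_policy {
    typedef std::uint8_t code_unit;
    static std::size_t count(const std::vector<code_unit>& u) {
        std::size_t n = 0;
        for (std::size_t i = 0; i < u.size(); ++i)
            if ((u[i] & 0xC0) != 0x80) ++n;  // skip continuation bytes
        return n;
    }
};

struct utf16_policy {
    typedef std::uint16_t code_unit;
    static std::size_t count(const std::vector<code_unit>& u) {
        std::size_t n = 0;
        for (std::size_t i = 0; i < u.size(); ++i)
            if (u[i] < 0xDC00 || u[i] > 0xDFFF) ++n;  // skip low surrogates
        return n;
    }
};

// Layer 1: non-virtual, encoding fixed at compile time.
template <typename Policy>
class basic_text {
public:
    explicit basic_text(std::vector<typename Policy::code_unit> units)
        : units_(std::move(units)) {}
    std::size_t length() const { return Policy::count(units_); }
private:
    std::vector<typename Policy::code_unit> units_;
};

// Layer 2: virtual interface for callers that pick the encoding at run-time.
class any_text {
public:
    virtual ~any_text() {}
    virtual std::size_t length() const = 0;
};

// The virtual function is implemented in terms of the non-virtual one.
template <typename Policy>
class text_adapter : public any_text {
public:
    explicit text_adapter(basic_text<Policy> text) : impl_(std::move(text)) {}
    std::size_t length() const override { return impl_.length(); }
private:
    basic_text<Policy> impl_;
};
```

Code that knows its encoding up front can use basic_text directly and
pay no virtual call, while something like an XML parser can hold an
any_text pointer. Whether the difference matters in practice is exactly
what the profiling would have to show.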

> -Thorsten

- Erik

