From: Vladimir Prus (ghost_at_[hidden])
Date: 2004-10-21 01:05:23


Erik Wien wrote:

>> - Why would the user want to change the encoding? Especially between
>> UTF-16 and UTF-32?
>
> Well... Different people have different needs. If you are mostly using
> ASCII characters, and require small size, UTF-8 would fit your bill. If
> you need the best general performance on most operations, use UTF-16. If
> you need fast iteration over code points and size doesn't matter, use
> UTF-32.

Ok, since everybody agreed characters outside 16 bits are very rare, UTF-32
seems to never be needed. As for UTF-8 vs. UTF-16: yes, the need for a choice
seems present. However, a UTF-16 string class would be better than no string
class at all, and extra genericity will cost you development time.
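
To put rough numbers on the size trade-off, here is a minimal sketch that
counts the bytes a piece of text needs in each encoding form. The names
(utf8_bytes, utf16_bytes, utf32_bytes, measure, encoded_sizes) are made up
for illustration and are not part of any proposed interface; the input is
assumed to hold valid code points.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes needed to store one code point in UTF-8.
// Assumes cp is a valid code point (<= 0x10FFFF).
std::size_t utf8_bytes(char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) return 1;      // ASCII
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;   // rest of the 16-bit range
    return 4;                     // outside 16 bits
}

// Bytes needed in UTF-16: one 16-bit unit, or a surrogate pair.
std::size_t utf16_bytes(char32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4;
}

// Bytes needed in UTF-32: always one 32-bit unit.
std::size_t utf32_bytes(char32_t cp) {
    assert(cp <= 0x10FFFF);
    return 4;
}

// Hypothetical aggregate holding the total size in each encoding form.
struct encoded_sizes {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// Sum the per-code-point sizes over a whole text.
encoded_sizes measure(const std::u32string& text) {
    encoded_sizes total{0, 0, 0};
    for (char32_t cp : text) {
        total.utf8 += utf8_bytes(cp);
        total.utf16 += utf16_bytes(cp);
        total.utf32 += utf32_bytes(cp);
    }
    return total;
}
```

For pure ASCII text UTF-8 is half the size of UTF-16, while for text made
mostly of other 16-bit characters UTF-16 is never larger than UTF-8.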
  
>> - Why would the user want to specify encoding at compile time? Are there
>> performance benefits to that? Basically, if we agree that UTF-32 is not
>> needed, then UTF-16 is the only encoding which does not require complex
>> handling. Maybe, for other encodings, using virtual functions in the
>> character iterator is OK? And if iterators have "abstract characters" as
>> value_type, maybe the overhead of that is much larger than a virtual
>> function call, even for UTF-16.
>
> Though I haven't confirmed this by testing, I would assume templating the
> encoding and thus specifying it at compile time would result in better
> performance since you don't have the overhead of virtual function calls.
> (Polymorphism would probably be needed if templates were scrapped.)

It would. The question is by how much.

> Avoiding
> virtual calls also enables the compiler to optimize (inline) more
> thoroughly, something that is very beneficial in this case because of the
> number of small, specialized functions that are needed in string
> manipulation.

This is a bit abstract. A virtual function is an inlining barrier, but it
would be placed only at character access. On both sides of the barrier, the
compiler can freely optimize everything.
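
A rough sketch of what "a barrier only at character access" could look
like. Everything here (code_point_source, utf16_source, count_code_points)
is a hypothetical name invented for illustration, not a proposed design.
The decoder body and the calling algorithm can each be optimized normally;
only the call to next() is opaque to the compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical run-time interface: one virtual call per code point.
struct code_point_source {
    virtual ~code_point_source() = default;
    // Stores the next code point in out; returns false when exhausted.
    virtual bool next(char32_t& out) = 0;
};

// Hypothetical UTF-16 decoder behind that interface.
// Keeps a reference, so the string must outlive the source.
class utf16_source : public code_point_source {
public:
    explicit utf16_source(const std::u16string& s) : s_(s), pos_(0) {}

    bool next(char32_t& out) override {
        if (pos_ >= s_.size()) return false;
        char16_t u = s_[pos_++];
        // Combine a well-formed surrogate pair into one code point.
        if (u >= 0xD800 && u <= 0xDBFF && pos_ < s_.size()) {
            char16_t lo = s_[pos_];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                ++pos_;
                out = 0x10000 + ((char32_t(u) - 0xD800) << 10)
                              + (char32_t(lo) - 0xDC00);
                return true;
            }
        }
        out = u;  // ordinary unit (unpaired surrogates pass through as-is)
        return true;
    }

private:
    const std::u16string& s_;
    std::size_t pos_;
};

// Algorithm on the caller's side of the barrier.
std::size_t count_code_points(code_point_source& src) {
    std::size_t n = 0;
    char32_t cp = 0;
    while (src.next(cp)) {
        assert(cp <= 0x10FFFF);  // decoder never produces larger values
        ++n;
    }
    return n;
}
```

How much the one indirect call per character costs relative to the rest of
the work is exactly the open question; the sketch only shows where the call
would sit.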

>> - What if the user wants to specify encoding at run time? For example, XML
>> files specify encoding explicitly. I'd want to use ASCII/UTF-8 encoding if
>> the XML document is 8-bit, and UTF-16 when it's Unicode.
>
> That is one problem with templating the encoding. You would have to
> either template all file scanning functions in the XML parser on encoding
> as well, or you would need to do some run-time checks and use the correct
> template depending on the encoding used in the file. This is of course not
> ideal, but it only matters where the encoding is specified at run time.
> Which scenario is the most common needs to be determined before a final
> design is decided on.
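
The second option, a run-time check that selects the right template
instantiation, might look roughly like this. The tag types, the encoding
enum, and the scan_document functions are all hypothetical and stand in
for real parser code; counting code points is only a placeholder for
"scanning".

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical encoding tag: counts code points in raw UTF-8 bytes.
struct utf8_tag {
    static std::size_t count(const std::string& bytes) {
        std::size_t n = 0;
        for (unsigned char b : bytes)
            if ((b & 0xC0) != 0x80) ++n;  // skip continuation bytes
        return n;
    }
};

// Hypothetical encoding tag: counts code points in little-endian UTF-16.
struct utf16le_tag {
    static std::size_t count(const std::string& bytes) {
        assert(bytes.size() % 2 == 0);  // whole 16-bit units only
        std::size_t n = 0;
        for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
            unsigned unit = static_cast<unsigned char>(bytes[i])
                          | (static_cast<unsigned char>(bytes[i + 1]) << 8);
            if (unit < 0xDC00 || unit > 0xDFFF) ++n;  // skip low surrogates
        }
        return n;
    }
};

// Parser code templated on the encoding, fixed at compile time.
template <class Encoding>
std::size_t scan_document_as(const std::string& bytes) {
    return Encoding::count(bytes);
}

enum class encoding { utf8, utf16le };

// The single run-time check, made once per document.
std::size_t scan_document(encoding e, const std::string& bytes) {
    switch (e) {
        case encoding::utf8:    return scan_document_as<utf8_tag>(bytes);
        case encoding::utf16le: return scan_document_as<utf16le_tag>(bytes);
    }
    return 0;  // unreachable for valid enum values
}
```

The cost of this approach is that every templated function is instantiated
once per supported encoding, even though only one branch runs for a given
file.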

Another possibility is that you can decide dynamically whether UTF-8 or
UTF-16 should be used -- just by counting the number of non-ASCII characters.
That would mean that only really advanced users need to make the decision
themselves.
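
A minimal sketch of such a heuristic, with a threshold that is purely an
assumption. ASCII costs 1 byte in UTF-8 and 2 in UTF-16. If we assume
pessimistically that every non-ASCII character costs 3 bytes in UTF-8 and
2 in UTF-16, then with a ASCII and n non-ASCII characters UTF-8 takes
a + 3n bytes and UTF-16 takes 2a + 2n, so UTF-16 is smaller when n > a.
The names storage and choose_storage are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class storage { utf8, utf16 };

// Pick the internal encoding by counting non-ASCII characters.
// Assumed cost model: non-ASCII = 3 bytes in UTF-8, 2 bytes in UTF-16,
// so UTF-16 is chosen only when non-ASCII characters outnumber ASCII ones.
storage choose_storage(const std::u32string& text) {
    std::size_t ascii = 0;
    std::size_t non_ascii = 0;
    for (char32_t cp : text) {
        assert(cp <= 0x10FFFF);  // input is assumed to be valid code points
        if (cp < 0x80) {
            ++ascii;
        } else {
            ++non_ascii;
        }
    }
    return non_ascii > ascii ? storage::utf16 : storage::utf8;
}
```

A real implementation would have to decide when to count (at construction,
or after edits) and whether re-encoding an existing string is worth it.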

I think I'm starting to like Peter's idea that advanced users need
vector<char_xxx> together with a set of algorithms.
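
As a rough illustration of the "plain container plus algorithms" style,
here is a sketch using std::vector<char16_t> as a stand-in for
vector<char_xxx>. The alias and both free functions are hypothetical
examples, not anything Peter actually proposed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for vector<char_xxx>: raw UTF-16 code units, no string class.
using utf16_units = std::vector<char16_t>;

// Free algorithm: true if every surrogate is part of a high+low pair.
bool is_well_formed(const utf16_units& v) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        char16_t u = v[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // A high surrogate must be followed by a low surrogate.
            if (i + 1 == v.size()) return false;
            if (v[i + 1] < 0xDC00 || v[i + 1] > 0xDFFF) return false;
            ++i;  // skip the low half of the pair
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return false;  // low surrogate without a preceding high one
        }
    }
    return true;
}

// Free algorithm: number of code points; requires well-formed input.
std::size_t code_point_count(const utf16_units& v) {
    assert(is_well_formed(v));
    std::size_t n = 0;
    for (char16_t u : v)
        if (u < 0xDC00 || u > 0xDFFF) ++n;  // count everything but low halves
    return n;
}
```

The container stays a dumb sequence of code units; all Unicode knowledge
lives in the algorithms, and the user picks the element type explicitly.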

- Volodya

