From: Miro Jurisic (macdev_at_[hidden])
Date: 2005-03-17 18:34:47


In article <d1cam8$sdq$1_at_[hidden]>, Erik Wien <wien_at_[hidden]> wrote:

> Miro Jurisic wrote:
> > Here I also agree. Having multiple string classes would just force everyone
> > to pick one for, in most cases, no good reason whatsoever. If I am writing
> > code that uses C++ strings, which encoding should I choose? Why should I
> > care? Particularly, if I don't care, why would I have to choose anyway?
> > More than likely, I would just choose the same thing 99% of the time
> > anyway.
>
> If we went with an implementation templated on encoding, I would suggest
> simply having a typedef like today's std::string, let's say "typedef
> encoded_string<utf16_tag> unicode_string;", and market that as "the unicode
> string class". Users that don't care would use that and be happy, possibly
> not even knowing they are using some template instantiation. Advanced users
> could still easily use one of the other encodings, or even template their
> code to use all of them if found necessary. But then, like I have said, you
> wouldn't have functions/classes that are encoding-independent without
> templating them.
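
For concreteness, here is roughly what I picture that proposal looking like.
This is only a sketch: encoded_string, the encoding tags, and the
encoding_traits classes are names from your messages, and everything else is
my guess rather than an actual interface:

#include <cstddef>
#include <vector>

// placeholder encoding tags
struct utf8_tag  {};
struct utf16_tag {};
struct utf32_tag {};

// hypothetical traits mapping an encoding tag to its code unit type
template <typename EncodingTag> struct encoding_traits;
template <> struct encoding_traits<utf8_tag>  { typedef char           code_unit; };
template <> struct encoding_traits<utf16_tag> { typedef unsigned short code_unit; };
template <> struct encoding_traits<utf32_tag> { typedef unsigned int   code_unit; };

// a string of code units in one compile-time-fixed encoding; the real class
// would add code point iteration, comparison, concatenation, and so on
template <typename EncodingTag>
class encoded_string
{
public:
    typedef typename encoding_traits<EncodingTag>::code_unit code_unit;

    encoded_string() {}
    encoded_string(const code_unit* p, std::size_t n) : units_(p, p + n) {}

    const code_unit* data() const { return units_.empty() ? 0 : &units_[0]; }
    std::size_t size() const { return units_.size(); }

private:
    std::vector<code_unit> units_;
};

// "the unicode string class" that casual users would see
typedef encoded_string<utf16_tag> unicode_string;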

Well, here's what I think -- and this is based entirely on my experience, so I
know it's biased:

1. How much of my code has to deal with strings (manipulation, creation, or
use)? Almost all of it.

2. How much of that code has to know about the encoding? Almost none of it.

Because of this, I really think that for my purposes the right answer is an
encoding-agnostic abstraction.

Now, based on my understanding of where knowledge of encodings is necessary, I
think that my use cases are similar to those of most C++ users. I could be wrong
on that point, of course.

> > I believe that the ability to force a Unicode string to be in a particular
> > encoding has some value -- especially for people doing low-level work such
> > as serializing Unicode strings to XML, and for people who need to
> > understand time and space complexity of various Unicode encodings -- but I
> > do not believe that this justifiable demand for complexity means we should
> > make the interface harder for everyone else.
>
> I agree. But having a templated implementation would not mean a complex
> interface for the end user. It would probably be simpler than the current
> implementation, since you could lose all the encoding setting and getting.
> Especially if we go for the above-mentioned typedef, to remove the template
> syntax for the casual user.

I am not sure that's really true. Let's consider this:

1. When you are passing a boost::unicode_string to an API that uses a different
kind of string, you are going to have to perform some conversion (even if it's
as simple as extracting a wchar_t* from the unicode_string) one way or another.
Therefore, the relative complexity of the two possible interfaces in this use
case depends on how easy it is to perform the required conversion.

I think that they can be equally easy to use for this use case; the first
sketch after point 3 below shows what I mean.

2. When you are manipulating a boost::unicode_string with boost APIs, I believe
that the two proposed designs would have the same ease of use.

3. When you need to mix and match encodings, then I don't think that the two
APIs can be equally easy to use, primarily because implicit conversions in C++
lead to difficulties; the second sketch below shows the kind of thing I mean.
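
On (1): every name below is a stand-in I made up for illustration -- neither
design has a settled interface -- but the call site ends up being one
conversion either way:

#include <string>

// stand-ins for the two designs; all member functions are assumptions
struct utf16_string        { const wchar_t* c_str() const; };   // typedef design, UTF-16 inside
struct any_encoding_string { std::wstring utf16() const; };     // encoding-agnostic design

// a typical external API that wants wchar_t / UTF-16 data
void set_window_title(const wchar_t* title);

void with_typedef_design(const utf16_string& s)
{
    set_window_title(s.c_str());          // already UTF-16: pointer extraction
}

void with_agnostic_design(const any_encoding_string& s)
{
    set_window_title(s.utf16().c_str());  // one call, possibly a transcode inside
}

On (3): the implicit converting constructor below is an assumption about how a
templated design might allow mixing encodings, not something anyone has
proposed, but it shows where the trouble comes from:

struct utf8_tag {};
struct utf16_tag {};

// a templated string with an implicit cross-encoding converting constructor
template <typename EncodingTag>
class encoded_string
{
public:
    encoded_string() {}

    template <typename OtherTag>
    encoded_string(const encoded_string<OtherTag>&)
    {
        // a real implementation would transcode here, invisibly at the call site
    }
};

encoded_string<utf8_tag> name_from_network()      // e.g. an XML or network layer
{
    return encoded_string<utf8_tag>();
}

void store(const encoded_string<utf16_tag>&) {}   // e.g. an OS or database layer

void example()
{
    // compiles without complaint, and silently transcodes UTF-8 -> UTF-16;
    // store() may well transcode back if its backend wants UTF-8 after all
    store(name_from_network());
}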

I think that the end result of the "typedef encoded_string" design would be that
either I would have to turn every function that uses a string into a template
(which is annoying), or I would have to choose one encoding to use throughout my
code, and this seems unnecessary to me.
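
In other words, with the typedef design my own code ends up looking like one
of these two (the function and the forward declarations are made up for
illustration):

struct utf16_tag {};
template <typename EncodingTag> class encoded_string;

// either every string-consuming function becomes a template...
template <typename EncodingTag>
void log_user_name(const encoded_string<EncodingTag>& name);

// ...or the whole codebase commits to one instantiation, even where the
// encoding is irrelevant to what the function actually does
void log_user_name(const encoded_string<utf16_tag>& name);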

Finally, it doesn't make sense to me to pay for the transcoding cost any earlier
than necessary. Consider this code:

unicode_string foo()
{
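   // with a UTF-16 unicode_string typedef, the UTF-8 result is transcoded
   // right here, whatever the caller actually wants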
   return function_that_returns_utf8();
}

With the "typedef encoded_string" design, I am forced to pay for the cost of
transcoding right there, even if the caller of this code actually needs UTF-8.

So, to summarize, my opinion is that in applications that use one encoding
throughout (and note that this really means "the application and all
boost::unicode_string-savvy libraries it uses") the typedef approach is
probably as easy as the class approach (and faster, because it eliminates
vtable dispatch), whereas in applications that mix encodings, the benefit of
avoiding the vtable dispatch will be offset by having to pay transcoding costs
up front.

In my opinion, having boost::unicode_string_utfN for the situations in which
encoding is important, plus a boost::unicode_string that can hold any encoding,
is better than not having a string that can hold any encoding at all. (I am
sure that if we decide to accept this library with typedef unicode_string_utfM
unicode_string, the first thing I'll need is my own encoding-agnostic
unicode_string...)
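
Very roughly, the kind of encoding-agnostic string I would end up writing for
myself would look like this -- none of it is the proposed library interface;
the named constructors and the internals are just one way to do it:

#include <cstddef>

// a sketch of an encoding-agnostic string: the encoding is a runtime property
// hidden behind a small virtual interface, which is where the vtable dispatch
// mentioned above comes from
class unicode_string
{
public:
    // named constructors for the places where the encoding does matter
    static unicode_string from_utf8(const char* bytes, std::size_t n);
    static unicode_string from_utf16(const unsigned short* units, std::size_t n);

    // encoding-agnostic operations; callers never have to name an encoding
    std::size_t code_point_count() const;
    unicode_string& append(const unicode_string& other);

private:
    struct impl                          // one concrete impl per stored encoding
    {
        virtual ~impl() {}
        virtual std::size_t code_point_count() const = 0;
        // ... per-encoding operations ...
    };
    impl* pimpl_;
};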

> > I do, however, think that some people are going to feel that they need to
> > eliminate the runtime overhead of generalized strings and explicitly
> > instantiate strings in a particular encoding, and I don't know whether the
> > library currently provides a facility to accomplish this.
>
> It doesn't currently. But it would be pretty simple to create an
> implementation that allows that through use of the encoding_traits classes. I
> have done that before, and could probably use most of that code again if we
> were to include that.

I think that it should provide this, but I don't demand that it provide it right
away.
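
For what it's worth, here is roughly what I would expect that to look like --
encoding_traits is the name you used, but its members here are pure guesswork
on my part:

#include <cstddef>
#include <vector>

template <typename EncodingTag> struct encoding_traits;   // per-encoding code_unit,
                                                          // counting, etc. (guessed)

// a fixed-encoding string as a thin compile-time wrapper over the traits:
// no virtual dispatch and no stored encoding identifier to check at runtime
template <typename EncodingTag>
class fixed_encoding_string
{
public:
    typedef typename encoding_traits<EncodingTag>::code_unit code_unit;

    std::size_t code_point_count() const
    {
        return encoding_traits<EncodingTag>::count_code_points(
            units_.empty() ? 0 : &units_[0], units_.size());
    }

private:
    std::vector<code_unit> units_;
};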

meeroh

