From: Felipe Magno de Almeida (felipe.m.almeida_at_[hidden])
Date: 2008-02-25 16:59:55


On Mon, Feb 25, 2008 at 6:06 PM, Phil Endecott
<spam_from_boost_dev_at_[hidden]> wrote:
> Felipe Magno de Almeida wrote:
>
> > On Mon, Feb 25, 2008 at 8:09 AM, Sebastian Redl
> > <sebastian.redl_at_[hidden]> wrote:

[snip]

> >> I think emulating std::string doesn't work. It has a naive design based
> >> on the assumption of fixed-width encodings. I think that a tagged string
> >> is the best place to really start over with a string design and produce
> >> a string that is lean, rather than bloated.
> >
> > I agree.
>
> Hmmm. I hear what you're saying, but things that are too revolutionary
> don't get used because they're too different from what people are used
> to. I'd like to offer something that's close to a drop-in replacement
> for std::string that will let people painlessly upgrade their code to
> proper character set support.

I would very much like to use it, and I am not very concerned if some
algorithms have to change. I'm currently using ICU directly, and it is
quite a PITA.

> However, most of the work that I have done has been at a lower level
> and can be easily built upon to enable a new class with a different
> interface as well. So you can have your cake and eat it! Comments
> about both are welcome.

You could create a bloated_utf8 as a drop-in replacement for
std::string while at the same time discouraging its use. :P

> >> I think the string type should offer minimal manipulation facilities -
> >> either completely read-only or append as the only manipulation function.
> >
> > I would like to have at least a modifiable string. But only through
> > iterators (insert and erase).
> > That should suffice for all my algorithm needs.
>
> Try this: temporarily replace all your strings with list<character> and
> see what's missing.

I did (not *all*, but in very significant places).
The first problem I hit was code that unnecessarily required
RandomAccessIterators, for example by using operator+ instead of
std::advance. Other places use std::string::size_type and operator[].
But I can say these are easily corrected.
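
To illustrate the kind of correction meant here, a minimal sketch
follows. The helper names nth and char_at are hypothetical and are not
part of any proposed library; the only point is that std::advance
accepts any iterator category, where operator+ and operator[] need
random access.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>

// Hypothetical helper: iterator to the n-th character after `first`.
// `first + n` would compile only for random-access iterators;
// std::advance works for any iterator category.
template <typename Iterator>
Iterator nth(Iterator first,
             typename std::iterator_traits<Iterator>::difference_type n)
{
  assert(n >= 0);  // forward-only iterators cannot step backwards
  std::advance(first, n);
  return first;
}

// Hypothetical replacement for `s[n]` that does not need operator[].
template <typename Sequence>
typename Sequence::value_type
char_at(Sequence const& s, typename Sequence::difference_type n)
{
  return *nth(s.begin(), n);
}
```

Both helpers accept std::string as well as std::list<char>; the cost is
constant for random-access iterators and linear otherwise.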

> >> A string buffer type could be written as a mutable alternative, as is
> >> the design in Java and C#. However, I'm not sure how much of that
> >> interface is needed, either.
>
> I'm unfamiliar with what Java and C# do, but my lower-level code (e.g.
> character_output_iterator) makes it simple to write e.g. UTF-8 into
> arbitrary memory.

Good.
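
For readers who have not seen this style, here is a rough sketch of an
output iterator that encodes code points as UTF-8 into caller-supplied
memory. This is NOT Phil's character_output_iterator, whose interface is
not shown in this thread; the class name utf8_output_iterator and its
by-reference cursor are assumptions made for illustration only, and it
does no validation of its input.

```cpp
#include <cassert>
#include <iterator>

// Hypothetical output iterator: assigning a code point writes its
// UTF-8 bytes through a cursor owned by the caller. Copies share the
// cursor, so `*it++ = cp` advances it correctly.
template <typename ByteIterator>
class utf8_output_iterator
{
public:
  typedef std::output_iterator_tag iterator_category;
  typedef void value_type;
  typedef void difference_type;
  typedef void pointer;
  typedef void reference;

  explicit utf8_output_iterator(ByteIterator& out) : out_(&out) {}

  utf8_output_iterator& operator=(char32_t cp)
  {
    if (cp < 0x80) {
      put(cp);
    } else if (cp < 0x800) {
      put(0xC0 | (cp >> 6));
      put(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      put(0xE0 | (cp >> 12));
      put(0x80 | ((cp >> 6) & 0x3F));
      put(0x80 | (cp & 0x3F));
    } else {  // no check for values above U+10FFFF in this sketch
      put(0xF0 | (cp >> 18));
      put(0x80 | ((cp >> 12) & 0x3F));
      put(0x80 | ((cp >> 6) & 0x3F));
      put(0x80 | (cp & 0x3F));
    }
    return *this;
  }

  utf8_output_iterator& operator*() { return *this; }
  utf8_output_iterator& operator++() { return *this; }
  utf8_output_iterator operator++(int) { return *this; }

private:
  void put(char32_t byte)
  {
    **out_ = static_cast<unsigned char>(byte);
    ++*out_;
  }

  ByteIterator* out_;  // points at the caller's cursor
};
```

Because it is templated on the underlying byte iterator, the same class
can write into a raw buffer, a std::vector, or anything else that
accepts bytes.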

> > A modifiable iterator interface (with insert and erase) is, IMO, as
> > concise and extensible as possible.
> >
> >> I'd love to have some empirical data on string usage.
> >
> > I do some string manipulations on email. And it is usually better to
> > do all manipulations in the codepage received, instead of converting
> > back and forth.
>
> One issue that I'm currently thinking about with this sort of usage is
> compile-time character set tagging vs. run-time character set tagging.
> In fact, I've been wondering whether there is some general pattern for
> providing both e.g.
>
> template <charset_t cset> void foo(int x);
> and
> void foo(charset_t cset, int x);

I can say I won't be using compile-time tagged strings much.
But I guess you could do:

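// Compile-time tagged: the charset is part of the type.
template <typename Char, typename Charset> struct compiletime_string;

// Run-time tagged: constructible from any compile-time tagged string.
template <typename Char> struct string
{
  template <typename Charset>
  string(compiletime_string<Char, Charset> const& s);
};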

And then you can have compile-time tagged strings and runtime tagged
strings work together seamlessly.
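
A slightly fuller sketch of that idea, under some assumptions: the tag
types utf8_tag and latin1_tag, their name() members, and the data
members are all invented here for illustration, and the run-time tagged
class is called runtime_string rather than string to avoid confusion
with std::string.

```cpp
#include <cassert>
#include <string>

// Hypothetical charset tags; each one knows its run-time name.
struct utf8_tag   { static const char* name() { return "UTF-8"; } };
struct latin1_tag { static const char* name() { return "ISO-8859-1"; } };

// Compile-time tagged string: the charset is part of the type.
template <typename Char, typename Charset>
struct compiletime_string
{
  std::basic_string<Char> bytes;
};

// Run-time tagged string: the charset is stored as data.
template <typename Char>
struct runtime_string
{
  std::basic_string<Char> bytes;
  std::string charset;

  // Implicit conversion: copies the bytes and records the tag's name.
  template <typename Charset>
  runtime_string(compiletime_string<Char, Charset> const& s)
    : bytes(s.bytes), charset(Charset::name()) {}
};

// Usage: a compile-time tagged value converts without a cast.
inline void example()
{
  compiletime_string<char, utf8_tag> c = { "caf\xC3\xA9" };
  runtime_string<char> r = c;
  assert(r.charset == "UTF-8");
  assert(r.bytes == c.bytes);
}
```

The conversion only goes one way here; going from a run-time tag back
to a compile-time one would need a check that the stored charset
matches.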

> You can obviously forward from the first to the second but that may
> lose some compile-time-constant optimisations; forwarding from the
> second to the first needs a horrible case statement. I was wondering
> about a macro that would define both.... any ideas anyone?

I guess a macro wouldn't be a very good idea.
You can just do a few ifs in the run-time tagged function and forward
to the compile-time function for the charsets where you have an
optimized compile-time version. For all others, just execute a common
function (based on iconv, maybe), passing it the character set name.
You could have a map from each compile-time character set to its
C-string character set name.
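
Roughly, and only as a sketch: the enumerators, the length function and
generic_length below are all hypothetical names, and generic_length is
a stand-in for the iconv-based routine, not a real one (it assumes a
single-byte charset).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum charset_t { cs_utf8, cs_latin1, cs_koi8r };  // hypothetical tags

// Map from charset tag to the C-string name a generic converter expects.
inline const char* charset_name(charset_t cset)
{
  static const char* const names[] = { "UTF-8", "ISO-8859-1", "KOI8-R" };
  return names[cset];
}

// Compile-time tagged version; specialised only where it pays off.
template <charset_t cset> std::size_t length(std::string const& s);

template <> inline std::size_t length<cs_utf8>(std::string const& s)
{
  std::size_t n = 0;
  for (std::size_t i = 0; i != s.size(); ++i)
    if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
      ++n;  // count every byte that is not a continuation byte
  return n;
}

// Stand-in for the common fallback (an iconv-based routine in real
// code). ASSUMPTION: this sketch only handles single-byte charsets.
inline std::size_t generic_length(const char* name, std::string const& s)
{
  assert(name != 0);
  return s.size();
}

// Run-time tagged version: a few ifs, then the common fallback.
inline std::size_t length(charset_t cset, std::string const& s)
{
  if (cset == cs_utf8)
    return length<cs_utf8>(s);
  return generic_length(charset_name(cset), s);
}
```

Only the charsets with an optimized version need an if; everything else
goes through the one fallback, so the "horrible case statement" stays
short.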

> >> > - What character sets are people interested in using (a) at the "edges"
> >> > of their programs,
> >> As many as possible. Theoretically, a program might have to deal with
> >> any and all encodings out there. Realistically, there's probably a dozen
> >> or two that are relevant. You'd need empirical data.
>
> I have looked at the charsets in all my email, but the results are
> thrown by the spam.
>
>
> > Unfortunately I need all supported by MIME.
>
> Falling back using e.g. iconv() for the otherwise-unsupported ones is
> my plan.

That's good enough for me.

[snip]

> Cheers,
>
> Phil.

Regards,

-- 
Felipe Magno de Almeida
