From: Felipe Magno de Almeida (felipe.m.almeida_at_[hidden])
Date: 2008-02-25 16:59:55
On Mon, Feb 25, 2008 at 6:06 PM, Phil Endecott wrote:
> Felipe Magno de Almeida wrote:
> > On Mon, Feb 25, 2008 at 8:09 AM, Sebastian Redl
> > <sebastian.redl_at_[hidden]> wrote:
> >> I think emulating std::string doesn't work. It has a naive design based
> >> on the assumption of fixed-width encodings. I think that a tagged string
> >> is the best place to really start over with a string design and produce
> >> a string that is lean, rather than bloated.
> > I agree.
> Hmmm. I hear what you're saying, but things that are too revolutionary
> don't get used because they're too different from what people are used
> to. I'd like to offer something that's close to a drop-in replacement
> for std::string that will let people painlessly upgrade their code to
> proper character set support.
I would very much like to use it, and I am not very concerned if some
algorithms would have to change. I'm now using ICU directly, and it is
quite a PITA.
> However, most of the work that I have done has been at a lower level
> and can be easily built upon to enable a new class with a different
> interface as well. So you can have your cake and eat it! Comments
> about both are welcome.
You could create a bloated_utf8 as a drop-in replacement for
std::string, while at the same time discouraging its use. :P
> >> I think the string type should offer minimal manipulation facilities -
> >> either completely read-only or append as the only manipulation function.
> > I would like to have at least a modifiable string. But only through
> > iterators (insert and erase).
> > That should suffice all my algorithm needs.
> Try this: temporarily replace all your strings with list<character> and
> see what's missing.
I did (not *all*, but in very significant places).
The first problem I hit was code that unnecessarily required
RandomAccessIterators, such as using operator+ instead of std::advance.
Other places use std::string::size_type and operator[].
But I can say these are easily corrected.
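As a minimal sketch of that kind of correction (the helper name skip_chars is made up for illustration and is not from any library): writing "first + n" only compiles for random-access iterators, while std::advance works with list iterators too.

```cpp
#include <iterator>
#include <list>
#include <string>

// Hypothetical helper: returns an iterator n positions past first.
// "first + n" would require a RandomAccessIterator; std::advance
// only requires an InputIterator, so std::list iterators work too.
template <typename Iterator>
Iterator skip_chars(Iterator first,
                    typename std::iterator_traits<Iterator>::difference_type n)
{
    std::advance(first, n);  // constant time for random access, linear otherwise
    return first;
}
```

The same call then works whether the container is a std::string or a std::list<char>, at the cost of a linear walk in the list case.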
> >> A string buffer type could be written as a mutable alternative, as is
> >> the design in Java and C#. However, I'm not sure how much of that
> >> interface is needed, either.
> I'm unfamiliar with what Java and C# do, but my lower-level code (e.g.
> character_output_iterator) makes it simple to write e.g. UTF-8 into
> arbitrary memory.
> > A modifiable iterator interface (with insert and erase) is, IMO, as
> > concise and extensible as possible.
> >> I'd love to have some empirical data on string usage.
> > I do some string manipulations on email. And it is usually better to
> > do all manipulations in the codepage received, instead of converting
> > back and forth.
> One issue that I'm currently thinking about with this sort of usage is
> compile-time character set tagging vs. run-time character set tagging.
> In fact, I've been wondering whether there is some general pattern for
> providing both e.g.
> template <charset_t cset> void foo(int x);
> void foo(charset_t cset, int x);
I can say I won't be using compile-time tagged strings much.
But, I guess you could do:
template <typename Char, typename Charset> struct compiletime_string;
template <typename Char> struct string
{
  template <typename Charset>
  string(compiletime_string<Char, Charset> const& s);
};
And then you can have compile-time tagged strings and runtime tagged
strings work together seamlessly.
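To make that concrete, here is a rough sketch of the idea, not a finished design. The tag types utf8_tag and latin1_tag, their name() members, the data and charset members and the charset_of function are all invented for illustration, and the runtime-tagged class is called runtime_string here only to avoid confusion with std::string.

```cpp
#include <string>

// Hypothetical charset tags; each one knows its run-time name.
struct utf8_tag   { static const char* name() { return "UTF-8"; } };
struct latin1_tag { static const char* name() { return "ISO-8859-1"; } };

// String whose character set is part of its type.
template <typename Char, typename Charset>
struct compiletime_string
{
    std::basic_string<Char> data;
    explicit compiletime_string(std::basic_string<Char> const& d) : data(d) {}
};

// String whose character set is stored as a run-time value.
template <typename Char>
struct runtime_string
{
    std::basic_string<Char> data;
    std::string charset;

    runtime_string(std::basic_string<Char> const& d, std::string const& cs)
        : data(d), charset(cs) {}

    // Implicit conversion: the static tag becomes a run-time name.
    template <typename Charset>
    runtime_string(compiletime_string<Char, Charset> const& s)
        : data(s.data), charset(Charset::name()) {}
};

// Example consumer that only knows the charset at run time.
inline std::string charset_of(runtime_string<char> const& s)
{
    return s.charset;
}
```

Because the converting constructor is not explicit, a compiletime_string<char, utf8_tag> can be passed straight to charset_of. The conversion only goes one way; going back from a run-time tag to a static one still needs a dispatch somewhere.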
> You can obviously forward from the first to the second but that may
> lose some compile-time-constant optimisations; forwarding from the
> second to the first needs a horrible case statement. I was wondering
> about a macro that would define both.... any ideas anyone?
I guess a macro wouldn't be a very good idea.
You can just do some ifs in the runtime-tagged function and forward to the
compile-time function for the charsets where you have an optimized
compile-time version. For all others, just
execute a common function (based on iconv maybe), passing the
character set name.
You could have a map from each compile-time character set to its C-string
character set name.
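Roughly, and only as a sketch under my own assumptions: the tags, their name() members, the char_count operation and generic_char_count are all hypothetical names. The generic function here is a placeholder that assumes one byte per character; a real fallback might hand the name to iconv() instead.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical compile-time tags mapped to their C-string charset names.
struct utf8_tag   { static const char* name() { return "UTF-8"; } };
struct latin1_tag { static const char* name() { return "ISO-8859-1"; } };

// Compile-time versions: count the characters in a byte sequence.
template <typename Charset>
std::size_t char_count(std::string const& bytes);

// Latin-1: one byte per character.
template <>
inline std::size_t char_count<latin1_tag>(std::string const& bytes)
{
    return bytes.size();
}

// UTF-8: count every byte that is not a continuation byte (10xxxxxx).
template <>
inline std::size_t char_count<utf8_tag>(std::string const& bytes)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < bytes.size(); ++i)
        if ((static_cast<unsigned char>(bytes[i]) & 0xC0) != 0x80)
            ++n;
    return n;
}

// Placeholder for the common path (assumes a single-byte charset).
inline std::size_t generic_char_count(char const* /*charset_name*/,
                                      std::string const& bytes)
{
    return bytes.size();
}

// Run-time version: a few ifs forward to the optimized compile-time
// versions; every other charset goes to the common function by name.
inline std::size_t char_count(char const* charset_name, std::string const& bytes)
{
    if (std::strcmp(charset_name, utf8_tag::name()) == 0)
        return char_count<utf8_tag>(bytes);
    if (std::strcmp(charset_name, latin1_tag::name()) == 0)
        return char_count<latin1_tag>(bytes);
    return generic_char_count(charset_name, bytes);
}
```

The exact string comparison is a simplification: real charset names are case-insensitive and have aliases, so the name lookup would need normalising first.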
> >> > - What character sets are people interested in using (a) at the "edges"
> >> > of their programs,
> >> As many as possible. Theoretically, a program might have to deal with
> >> any and all encodings out there. Realistically, there's probably a dozen
> >> or two that are relevant. You'd need empirical data.
> I have looked at the charsets in all my email, but the results are
> thrown by the spam.
> > Unfortunately I need all supported by MIME.
> Falling back using e.g. iconv() for the otherwise-unsupported ones is
> my plan.
That's good enough for me.
-- Felipe Magno de Almeida