From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2008-02-25 16:06:39


Felipe Magno de Almeida wrote:
> On Mon, Feb 25, 2008 at 8:09 AM, Sebastian Redl
> <sebastian.redl_at_[hidden]> wrote:
>> Phil Endecott wrote:
>> > Things I'd appreciate feedback on:
>> > - What should the cs_string look like? Basically everywhere that
>> > std::string uses an integer position I have the choice of a character
>> > position, a unit position, or an iterator - or not providing that function.
>>
>> I think emulating std::string doesn't work. It has a naive design based
>> on the assumption of fixed-width encodings. I think that a tagged string
>> is the best place to really start over with a string design and produce
>> a string that is lean, rather than bloated.
>
> I agree.

Hmmm. I hear what you're saying, but things that are too revolutionary
don't get used because they're too different from what people are used
to. I'd like to offer something that's close to a drop-in replacement
for std::string that will let people painlessly upgrade their code to
proper character set support.

However, most of the work that I have done has been at a lower level
and can be easily built upon to enable a new class with a different
interface as well. So you can have your cake and eat it! Comments
about both are welcome.

>> I think the string type should offer minimal manipulation facilities -
>> either completely read-only or append as the only manipulation function.
>
> I would like to have at least a modifiable string. But only through
> iterators (insert and erase).
> That should suffice all my algorithm needs.

Try this: temporarily replace all your strings with list<character> and
see what's missing.

>> A string buffer type could be written as a mutable alternative, as is
>> the design in Java and C#. However, I'm not sure how much of that
>> interface is needed, either.

I'm unfamiliar with what Java and C# do, but my lower-level code (e.g.
character_output_iterator) makes it simple to write, for example, UTF-8
into arbitrary memory.
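
To make "writing UTF-8 into arbitrary memory" concrete, here is a minimal sketch of the encoding step behind any such output iterator. The name encode_utf8 is hypothetical and is not the character_output_iterator interface; it also assumes the input is a valid Unicode scalar value (no surrogates, nothing above U+10FFFF).

```cpp
#include <cstdint>

// Hypothetical helper, for illustration only: encodes one code point as
// UTF-8 through any output iterator (a raw char*, a back_inserter, ...)
// and returns the advanced iterator.
// Assumption: c is a valid Unicode scalar value; no validation is done.
template <typename OutIt>
OutIt encode_utf8(std::uint32_t c, OutIt out) {
    if (c < 0x80) {                       // 1 byte: 0xxxxxxx
        *out++ = static_cast<char>(c);
    } else if (c < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        *out++ = static_cast<char>(0xC0 | (c >> 6));
        *out++ = static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xE0 | (c >> 12));
        *out++ = static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (c & 0x3F));
    } else {                              // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xF0 | (c >> 18));
        *out++ = static_cast<char>(0x80 | ((c >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (c & 0x3F));
    }
    return out;
}
```

Because the destination is only an output iterator, the same function can target a fixed buffer, a memory-mapped region, or a std::string via std::back_inserter.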

> A modifiable iterator interface (with insert and erase) is, IMO, as
> concise and extensible as possible.
>
>> I'd love to have some empirical data on string usage.
>
> I do some string manipulations on email. And it is usually better to
> do all manipulations in the codepage received, instead of converting
> back and forth.

One issue that I'm currently thinking about with this sort of usage is
compile-time character set tagging vs. run-time character set tagging.
In fact, I've been wondering whether there is some general pattern for
providing both, e.g.

template <charset_t cset> void foo(int x);
and
void foo(charset_t cset, int x);

You can obviously forward from the first to the second, but that may
lose some compile-time-constant optimisations; forwarding from the
second to the first needs a horrible case statement. I was wondering
about a macro that would define both. Any ideas, anyone?
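
One possible shape for this, as a sketch only: keep the template as the real implementation and let a macro generate the case statement for the run-time overload. The enumerator names, the FOR_EACH_CHARSET list, and the body of foo (it returns the number of code units to reserve for x characters) are all assumptions made up for illustration.

```cpp
#include <stdexcept>

// Hypothetical charset tags; a real list would be much longer.
enum charset_t { cs_ascii, cs_latin1, cs_utf8, cs_utf16 };

// Single list of charsets; every dispatcher is generated from it.
#define FOR_EACH_CHARSET(X) X(cs_ascii) X(cs_latin1) X(cs_utf8) X(cs_utf16)

// Compile-time-tagged version: cset is a constant here, so the
// conditional below can be folded away by the compiler.
template <charset_t cset>
int foo(int x) {
    const int max_units = (cset == cs_utf8) ? 4 : (cset == cs_utf16) ? 2 : 1;
    return x * max_units;
}

// Run-time-tagged version: the "horrible case statement", written once
// by the macro and forwarding to the matching instantiation.
#define FOO_CASE(cs) case cs: return foo<cs>(x);
inline int foo(charset_t cset, int x) {
    switch (cset) {
        FOR_EACH_CHARSET(FOO_CASE)
    }
    throw std::invalid_argument("unknown charset");
}
#undef FOO_CASE
```

With this, foo<cs_utf8>(3) and foo(cs_utf8, 3) run the same code. Forwarding in this direction keeps the compile-time optimisations, at the cost of instantiating the template once per charset.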

>> > - What character sets are people interested in using (a) at the "edges"
>> > of their programs,
>> As many as possible. Theoretically, a program might have to deal with
>> any and all encodings out there. Realistically, there's probably a dozen
>> or two that are relevant. You'd need empirical data.

I have looked at the charsets in all my email, but the results are
skewed by the spam.

> Unfortunately I need all supported by MIME.

Falling back using e.g. iconv() for the otherwise-unsupported ones is
my plan.
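
A fallback of that kind might look roughly like the following. This is a sketch under assumptions: the helper name to_utf8_via_iconv is hypothetical, it assumes a POSIX system whose iconv() takes char** for the input buffer (as glibc does), and it does not flush shift state for stateful encodings.

```cpp
#include <iconv.h>
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical fallback: convert bytes in charset `from` to UTF-8 by
// delegating to POSIX iconv(). Not part of any proposed library.
std::string to_utf8_via_iconv(const std::string& in, const char* from) {
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open: unsupported charset");

    std::string out;
    char buf[256];
    char* src = const_cast<char*>(in.data());  // iconv does not modify the input
    std::size_t src_left = in.size();

    while (src_left > 0) {
        char* dst = buf;
        std::size_t dst_left = sizeof buf;
        std::size_t r = iconv(cd, &src, &src_left, &dst, &dst_left);
        int err = errno;                 // save before anything can clobber it
        out.append(buf, dst - buf);      // keep whatever was converted
        if (r == (std::size_t)-1 && err != E2BIG) {  // E2BIG: buffer full, loop again
            iconv_close(cd);
            throw std::runtime_error("iconv: invalid or incomplete input");
        }
    }
    iconv_close(cd);
    return out;
}
```

For example, to_utf8_via_iconv("\xE9", "ISO-8859-1") should yield the two bytes C3 A9, provided the local iconv knows that charset name.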

I'm unlikely to have the energy to write code for more than a couple of
the exotic sets myself. If anyone would like to help, please get in touch.

>> > and (b) in the "core"?
>> >
>> ASCII, UTF-8 and UTF-16.
>
> ISO-8859-1 ?

Cheers,

Phil.

