From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2007-09-24 13:08:03


Sebastian Redl wrote:
> Phil Endecott wrote:
>> Dear All,
>>
>> Something that I have been thinking about for a while is storing
>> strings tagged with their character set. Since I now have a practical
>> need for this I plan to try to implement something. Your feedback
>> would be appreciated.
>>
> Hi,
>
> I've played around with this concept a lot already. I basically think
> that encoding-bound strings are a MUST for proper, safe,
> internationalized string handling. Everything else, in particular the
> current situation, is a mess.
>
> If you want, I can package up what I've done so far (not really much,
> but a lot of comments containing concepts) and put it somewhere.

Yes please.

> One thing: I think runtime-tagged strings are useless. Programming
> should happen with one or at most two fixed encodings, known at compile
> time. Because of the differences in behaviour in encodings (base unit 8,
> 16 or 32 bits, or 8 with various endians, fixed-length encodings vs
> variable-length encodings, ...), it is not good to write a type handling
> them all at runtime. I think that runtime-specified string conversion
> should be an I/O question. In other words, when character data enters
> your program, you convert it to the encoding you use internally, when it
> leaves the program, you convert it to an external encoding. In-between,
> you use whatever your program uses, and you specify it at compile time.

Consider processing a MIME email. It may have several parts each with
a different character set. I would imagine a flow something like this:

read in message as a sequence-of-bytes
for each message part {
   find the character set
   put the body in a run-time-tagged string
   do something with the body
}

Now, "do something with the body" might be "save it in a file", i.e.

f << "content-type: text/plain; charset=\"" << body.charset << "\"\n"
   << "\n"
   << body.data;

In this case, it would be wasteful to convert to and from a
compile-time-fixed character set.
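
A minimal sketch of this pass-through case, assuming a hypothetical `tagged_string` struct that simply pairs the raw bytes with the charset name read from the part's headers. The names `tagged_string`, `save_part` and `render_part` are illustrative only and are not part of any existing Boost library:

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical run-time-tagged string: the bytes exactly as received,
// plus the charset name taken from the MIME part's headers.
struct tagged_string {
    std::string charset;  // e.g. "iso-8859-1" or "utf-8"
    std::string data;     // raw bytes, never transcoded
};

// Write one part back out. The body bytes are copied untouched, so no
// conversion to or from an internal character set takes place.
void save_part(std::ostream& out, const tagged_string& body) {
    out << "content-type: text/plain; charset=\"" << body.charset << "\"\n"
        << "\n"
        << body.data;
}

// Convenience wrapper that renders the part into a std::string.
std::string render_part(const tagged_string& body) {
    std::ostringstream out;
    save_part(out, body);
    return out.str();
}
```

The only thing the code needs to know about the charset here is its name, which is why a run-time tag is sufficient for this use.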

On the other hand, "do something with the body" might be "search for
<string>". In this case, converting to a compile-time-fixed character
set, preferably a universal one, would be best:

ucs4string body_ucs4 = body.data; // if we have implicit conversion...
body_ucs4.find("hello");
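
As a rough illustration of the convert-then-search case, the sketch below uses an explicit conversion function rather than the implicit conversion shown above. It assumes the same hypothetical `tagged_string` struct, uses C++11's `std::u32string` as a stand-in for `ucs4string`, and handles only ISO-8859-1, where each byte equals its code point. A real library would dispatch to one converter per charset; `to_ucs4` and `contains` are made-up names:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical run-time-tagged string (raw bytes plus charset name).
struct tagged_string {
    std::string charset;
    std::string data;
};

// Convert to UCS-4. Only ISO-8859-1 is handled in this sketch, because
// each of its bytes maps to the code point with the same numeric value.
// Any other charset is rejected with an exception.
std::u32string to_ucs4(const tagged_string& s) {
    if (s.charset != "iso-8859-1")
        throw std::invalid_argument("unsupported charset: " + s.charset);
    std::u32string out;
    out.reserve(s.data.size());
    for (unsigned char byte : s.data)
        out.push_back(static_cast<char32_t>(byte));
    return out;
}

// Search in the fixed, universal character set after conversion.
bool contains(const tagged_string& body, const std::u32string& needle) {
    return to_ucs4(body).find(needle) != std::u32string::npos;
}
```

With this shape, the run-time tag lives only as long as it takes to pick the converter; everything after `to_ucs4` works on a compile-time-fixed encoding.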

What I'm saying is: yes, good practice is very often to convert to a
fixed character set before doing anything to the data; but no, I don't
think that that can happen exclusively inside an I/O layer. So some
method of representing run-time-tagged data - if only temporarily,
before conversion - is needed.

> I'd be willing to cooperate on this project, too. I'm mostly busy with
> my new I/O stuff, but the tagged strings form the foundation of the text
> I/O part, so I need the character library sooner or later anyway.

I have a small project in progress which needs a subset of this
functionality, and I'm planning to use it as a testbed for these
ideas. I'll post again when I have something more concrete. The area
where I would most appreciate some input is in how to provide a
"user-extensible enum or type tag" for character sets.

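One possible shape for a user-extensible type tag, offered only as a sketch and not as a settled design: each character set is an empty tag struct, and a traits template specialized per tag carries its properties. All names here (`utf8_tag`, `latin1_tag`, `koi8r_tag`, `charset_traits`, `tagged`) are hypothetical:

```cpp
#include <string>

// Tags that a library might ship: empty structs used only as
// compile-time labels.
struct utf8_tag {};
struct latin1_tag {};

// The primary template is declared but not defined, so using a tag that
// has no specialization fails at compile time.
template <typename Tag> struct charset_traits;

template <> struct charset_traits<utf8_tag> {
    typedef char unit_type;
    static const char* name() { return "utf-8"; }
    static constexpr bool fixed_width() { return false; }
};

template <> struct charset_traits<latin1_tag> {
    typedef char unit_type;
    static const char* name() { return "iso-8859-1"; }
    static constexpr bool fixed_width() { return true; }
};

// A string whose character set is part of its type.
template <typename Tag>
class tagged {
public:
    typedef typename charset_traits<Tag>::unit_type unit_type;
    typedef std::basic_string<unit_type> storage_type;

    explicit tagged(const storage_type& units) : units_(units) {}
    const storage_type& units() const { return units_; }
    static const char* charset_name() { return charset_traits<Tag>::name(); }

private:
    storage_type units_;
};

// User code extends the set without touching the library: declare a new
// tag and specialize the traits for it.
struct koi8r_tag {};
template <> struct charset_traits<koi8r_tag> {
    typedef char unit_type;
    static const char* name() { return "koi8-r"; }
    static constexpr bool fixed_width() { return true; }
};
```

An enum cannot be extended from outside the header that defines it, whereas anyone can add a tag type and its traits specialization. The `name()` member also gives a natural bridge to a run-time tag, since it yields the string that would appear in a `charset=` parameter.
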
Regards,

Phil.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk