From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2007-09-24 13:08:03
Sebastian Redl wrote:
> Phil Endecott wrote:
>> Dear All,
>>
>> Something that I have been thinking about for a while is storing
>> strings tagged with their character set. Since I now have a practical
>> need for this I plan to try to implement something. Your feedback
>> would be appreciated.
>>
> Hi,
>
> I've played around with this concept a lot already. I basically think
> that encoding-bound strings are a MUST for proper, safe,
> internationalized string handling. Everything else, in particular the
> current situation, is a mess.
>
> If you want, I can package up what I've done so far (not really much,
> but a lot of comments containing concepts) and put it somewhere.
Yes please.
> One thing: I think runtime-tagged strings are useless. Programming
> should happen with one or at most two fixed encodings, known at compile
> time. Because of the differences in behaviour in encodings (base unit 8,
> 16 or 32 bits, or 8 with various endians, fixed-length encodings vs
> variable-length encodings, ...), it is not good to write a type handling
> them all at runtime. I think that runtime-specified string conversion
> should be an I/O question. In other words, when character data enters
> your program, you convert it to the encoding you use internally, when it
> leaves the program, you convert it to an external encoding. In-between,
> you use whatever your program uses, and you specify it at compile time.
Consider processing a MIME email. It may have several parts each with
a different character set. I would imagine a flow something like this:
    read in message as a sequence-of-bytes
    for each message part {
        find the character set
        put the body in a run-time-tagged string
        do something with the body
    }
Now, "do something with the body" might be "save it in a file", i.e.
    f << "content-type: text/plain; charset=\"" << body.charset << "\"\n"
      << "\n"
      << body.data;
In this case, it would be wasteful to convert to and from a
compile-time-fixed character set.
On the other hand, "do something with the body" might be "search for
<string>". In this case, converting to a compile-time-fixed character
set, preferably a universal one, would be best:
    ucs4string body_ucs4 = body.data; // if we have implicit conversion...
    body_ucs4.find("hello");
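The search case might be sketched as below, with an explicit conversion rather than an implicit one. Everything here is an assumption made for illustration: std::u32string (C++11) stands in for ucs4string, to_ucs4 and contains are hypothetical helpers, and only ISO-8859-1 is handled because each of its bytes maps directly to the code point of the same value. A real converter would dispatch on the charset name to a proper table.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical run-time-tagged string, as in the flow described above.
struct tagged_string {
    std::string charset;  // assumed already lower-cased, e.g. "iso-8859-1"
    std::string data;     // raw body bytes
};

// Convert run-time-tagged bytes to UCS-4. Only ISO-8859-1 is supported
// in this sketch; any other charset name is rejected with an exception.
std::u32string to_ucs4(const tagged_string& s) {
    if (s.charset != "iso-8859-1")
        throw std::runtime_error("unsupported charset: " + s.charset);
    std::u32string out;
    out.reserve(s.data.size());
    for (unsigned char byte : s.data)          // go through unsigned char so
        out.push_back(static_cast<char32_t>(byte));  // 0x80..0xFF stay positive
    return out;
}

// Search after conversion to the fixed, universal character set.
bool contains(const tagged_string& body, const std::u32string& needle) {
    return to_ucs4(body).find(needle) != std::u32string::npos;
}
```

With this, contains(body, U"hello") performs the search on code points, so the needle does not have to be re-encoded for each part's charset.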
What I'm saying is: yes, good practice is very often to convert to a
fixed character set before doing anything to the data; but no, I don't
think that can happen exclusively inside an I/O layer. So some
method of representing run-time-tagged data - if only temporarily,
before conversion - is needed.
> I'd be willing to cooperate on this project, too. I'm mostly busy with
> my new I/O stuff, but the tagged strings form the foundation of the text
> I/O part, so I need the character library sooner or later anyway.
I have a small project in progress which needs a subset of this
functionality, and I'm planning to use it as a testbed for these
ideas. I'll post again when I have something more concrete. The area
where I would most appreciate some input is in how to provide a
"user-extensible enum or type tag" for character sets.
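One possible shape for the "type tag" half of that question is sketched below, purely as a starting point for discussion. All names (utf8_tag, latin1_tag, koi8r_tag, charset_traits, tagged_string) are hypothetical. Each character set is an empty struct, and its properties live in a traits specialization, so a user can add a character set without touching library code. It does not by itself answer the run-time-tagged case, although the name() hook gives a link to the MIME charset label.

```cpp
#include <string>

// Tags a library might ship: one empty struct per character set.
struct utf8_tag {};
struct latin1_tag {};

// The primary template is declared but not defined, so using a tag
// that has no traits specialization fails at compile time.
template <typename Tag> struct charset_traits;

template <> struct charset_traits<utf8_tag> {
    typedef char unit_type;                 // 8-bit base unit
    static const bool fixed_width = false;  // variable-length encoding
    static const char* name() { return "utf-8"; }
};

template <> struct charset_traits<latin1_tag> {
    typedef char unit_type;
    static const bool fixed_width = true;   // one byte per character
    static const char* name() { return "iso-8859-1"; }
};

// User extension: a new tag plus a specialization, written entirely
// outside the library.
struct koi8r_tag {};
template <> struct charset_traits<koi8r_tag> {
    typedef char unit_type;
    static const bool fixed_width = true;
    static const char* name() { return "koi8-r"; }
};

// A compile-time-tagged string whose storage type follows the traits.
template <typename Tag>
struct tagged_string {
    std::basic_string<typename charset_traits<Tag>::unit_type> data;
    static const char* charset_name() { return charset_traits<Tag>::name(); }
};
```

Here tagged_string<koi8r_tag> works exactly like the built-in tags, while tagged_string<int> is rejected by the compiler because no traits exist for it.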
Regards,
Phil.