From: Sean Parent (sparent_at_[hidden])
Date: 2002-04-07 00:17:47


If lexical_cast always stays confined to 7-bit ASCII, then its output can be
considered UTF-8 encoded, and there isn't a need for a
lexical_cast<wstring>() for Unicode support.
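To make the point concrete, here is a minimal sketch (the function name is
made up for illustration and is not part of lexical_cast). It only checks
that every byte is below 0x80; any string that passes is byte-for-byte
identical whether you call it ASCII or UTF-8.

```cpp
#include <string>

// Hypothetical helper, not part of Boost: true when every byte is in the
// 7-bit ASCII range (0x00-0x7F). Such a string is already valid UTF-8.
bool is_seven_bit_ascii(const std::string& s) {
    for (unsigned char c : s) {
        if (c > 0x7F) {
            return false;  // high bit set: not plain ASCII
        }
    }
    return true;
}
```

The digits, sign, and decimal point that a numeric conversion typically
emits all fall in this range.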

On this topic, the text handling in the entire C++ standard is seriously
broken for international text. Most compilers treat wchar_t as 16 bits, which
isn't sufficient for full Unicode text handling (just the UCS-2 subset).
Even Unicode is a bit insufficient these days, given China's threat of
denying sale to applications that don't support GB-18030, which hasn't been
fully incorporated into the Unicode standard and which also requires 32 bits
for full support.

What I would like to see happen (this is just a rough idea; I haven't taken
the time to write this up as any kind of proposal) is a new string class
template that can handle multi-word encoded text. Untagged strings
would be assumed to be encoded in UTF-8 (for string, a multi-word encoding
that's a superset of 7-bit ASCII) or UTF-16 (for wstring, a multi-word
encoding that's a superset of UCS-2). Processing could then be encoding-aware,
dealing with linguistic characters instead of bytes.
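As a rough illustration of the difference between bytes and characters, the
sketch below counts code points in a UTF-8 string. The function name is
hypothetical, the input is assumed to be well-formed UTF-8, and a code point
is only an approximation of a "linguistic character" (combining sequences
are not merged here).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string.
// Continuation bytes have the bit pattern 10xxxxxx, so every byte that
// does not match that pattern starts a new code point.
// Assumes well-formed UTF-8; no validation is attempted.
std::size_t utf8_code_point_count(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For pure ASCII text this agrees with s.size(); for anything else the two
numbers differ, which is exactly what byte-oriented processing gets wrong.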

An encoding cast would handle the translations, so that a processor could
either work in the same encoding as the string or in one reachable by an
encoding cast.
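To give a feel for what one such cast might do, here is a sketch of a
UTF-8 to UTF-16 translation. The name utf8_to_utf16 and the use of
std::u16string are my assumptions for illustration, not a proposed
interface, and the validation is deliberately minimal (overlong encodings
are not rejected).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical encoding cast: decodes UTF-8 into code points and
// re-encodes them as UTF-16, using surrogate pairs above U+FFFF.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= in.size()) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw std::invalid_argument("code point outside Unicode range");
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));  // one 16-bit unit
        } else {
            cp -= 0x10000;  // split the remaining 20 bits across a pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Note that a code point above U+FFFF comes out as two 16-bit units, which is
why a 16-bit wchar_t treated as "one element per character" only covers the
UCS-2 subset.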

This would be sufficient for handling the basic processing of text but would
still require additions if you wanted to handle things like context-based
text substitution.

Sean

on 4/5/02 11:54 PM, Mattias Flodin at flodin_at_[hidden] wrote:

> When you do, please try not to leave the Unicode question out of the
> discussion. A separate name such as to_wstring seems insufficient to me
> because there are times when you cannot know beforehand whether the string
> type you're casting to is single-byte or Unicode (e.g. when writing generic
> code). I have posted a plausible solution (using partial specialization on
> the contained character type) before in
> http://lists.boost.org/MailArchives/boost/msg27288.php
>
> In fact, functionality for directly converting between ASCII and Unicode might
> also be something that people would consider to be a "lexical cast." Perhaps
> this is not something that everyone feels would be of use to the common
> programmer, but to programmers working with international applications I think
> it would. The standard already has half-hearted support for these people. Why
> not aim for making it whole-hearted (certainly, that would take much more work
> than just fiddling with lexical_cast, but it's a start...).
>
> On Fri, Apr 05, 2002 at 09:19:50PM -0500, Beman Dawes wrote:
>> At 05:52 PM 4/3/2002, Jan Langer wrote:
>>
>>>> I also added this to the Wiki page on Lexical Cast for the next time
>>>> this comes up....
>>>
>>> although the archive seems to handle some threads incorrectly, i read
>>> through the mentioned messages and there are several requests for this
>>> simple change and there are no serious concerns. the only question is
>>> whether to make it a new function to_string or to specialize
>>> lexical_cast for std::string.
>>> so why isn't it done?
>>>
>>> is there more need for discussion or is there no one who wants to do
>>> the actual work? or does it need a formal review ;-) ?
>>
>> I've started to pester Kevlin Henney. If he isn't going to work on
>> lexical_cast issues, we ought to let someone else have a go at them.
>>
>> But there are tricky issues involved, so I hope anyone working on a
>> proposal will run it by Kevlin for his opinion.
>>
>> --Beman

-- 
Sean Parent
Sr. Computer Scientist II
Advanced Technology Group
Adobe Systems Incorporated
sparent_at_[hidden]

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk