Boost :


From: James Porter (porterj_at_[hidden])
Date: 2007-09-27 11:36:25

Next message: Markus Schöpflin: "Re: [boost] [Report] 2992 failures on trunk (2007-09-26)"
Previous message: Robert Ramey: "Re: [boost] [serialization] Proposal for an extension API to the Archive concept"
In reply to: Sebastian Redl: "Re: [boost] Strings tagged with their character set"
Next in thread: Sebastian Redl: "Re: [boost] Strings tagged with their character set"
Reply: Sebastian Redl: "Re: [boost] Strings tagged with their character set"
Reply: Joseph Gauterin: "Re: [boost] Strings tagged with their character set"
Reply: Jeremy Maitin-Shepard: "Re: [boost] Strings tagged with their character set"

I see what you mean. Still, fixed-width-encoded strings are a lot easier to
code, and I think we should focus on them first just to get something
working and to have a platform to test code conversion on, which in my
opinion is the most important part. Without code conversion, it would be
difficult to read in non-ASCII strings in the first place, since
std::wfstream with the default locale effectively just widens ASCII
characters to wchar_t (UTF-16 on Windows).
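
To make the code-conversion step concrete, here is a minimal sketch of
decoding UTF-8 input into a fixed-width UTF-32 string. The function name
utf8_to_utf32 is hypothetical and not part of Boost or the standard library,
and std::u32string assumes C++11 or later, which postdates this thread.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode a UTF-8 byte sequence into a fixed-width UTF-32 string.
// Hypothetical helper for illustration only.
// Throws std::runtime_error on truncated or malformed input.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        static const char32_t min_for_len[] = {0x0, 0x80, 0x800, 0x10000};
        if (cp < min_for_len[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point");

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Once decoded, every element of the result is one code point, so indexing
and length are O(1), which is the property that makes fixed-width strings
easier to code against.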

Variable-width-encoded strings should be fairly straightforward when they
are immutable, but will probably get hairy when they can be modified.
Converting a variable-width-encoded (VWE) string would probably be no
harder than converting a fixed-width-encoded (FWE) one.
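
A rough sketch of why mutation is the hairy part, assuming well-formed UTF-8
stored in a std::string. All function names here are hypothetical. Reading
only needs a linear scan, but replacing one code point can change the byte
length and shift everything after it.

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Count code points in a well-formed UTF-8 string: O(n), not O(1).
std::size_t code_point_count(const std::string& s) {
    std::size_t n = 0;
    for (char c : s)
        if (!is_continuation(c)) ++n;
    return n;
}

// Byte offset of the index-th code point, or s.size() if out of range.
std::size_t byte_offset(const std::string& s, std::size_t index) {
    std::size_t i = 0;
    while (i < s.size()) {
        if (!is_continuation(s[i])) {
            if (index == 0) return i;
            --index;
        }
        ++i;
    }
    return s.size();
}

// Replace the index-th code point with repl (itself UTF-8 encoded).
// The byte length of the string may grow or shrink, so any byte offsets
// or iterators held by callers past this point become stale.
void replace_code_point(std::string& s, std::size_t index,
                        const std::string& repl) {
    const std::size_t begin = byte_offset(s, index);
    if (begin == s.size()) return;  // out of range: nothing to do
    std::size_t end = begin + 1;
    while (end < s.size() && is_continuation(s[end])) ++end;
    s.replace(begin, end - begin, repl);
}
```

For example, replacing the two-byte "é" in "aéb" with a plain "e" shrinks
the string from four bytes to three while the code point count stays at
three.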

That said, I think a good (general) roadmap for this project would be:
1) Extend std::basic_string to store UCS-2 / UCS-4 (should be easy, though
string constants may pose a problem)
2) Add code conversion to move between encodings, especially for I/O
3) Create VWE string class (fairly easy if immutable, hard if mutable)
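
As a rough illustration of steps 1 and 2, the sketch below assumes C++11 or
later, where char32_t and U"..." literals exist (neither was available when
this was written, which is the string-constant problem mentioned in step 1).
The names ucs4_string and utf32_to_utf8 are hypothetical.

```cpp
#include <cassert>
#include <string>

// Step 1: a fixed-width string is basic_string over a 32-bit code unit.
// (Equivalent to std::u32string in C++11 and later.)
using ucs4_string = std::basic_string<char32_t>;

// Step 2: code conversion for output, here UTF-32 to UTF-8.
// Precondition: every element is a valid Unicode scalar value.
std::string utf32_to_utf8(const ucs4_string& in) {
    std::string out;
    for (char32_t cp : in) {
        assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this in place, a string such as U"h\u00E9" holds two elements in memory
and is written out as three bytes of UTF-8.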

- James

On 9/27/07, Sebastian Redl <sebastian.redl_at_[hidden]> wrote:
>
> James Porter wrote:
> > For certain special purposes (like the one above), a variable-width
> > string class would be useful, but I think we should focus on storing
> > strings in fixed-width encodings and then converting them appropriately
> > during I/O.
> Actually, I disagree with this. The only general-purpose fixed-width
> encoding available is UTF-32, and hardly anyone actually uses it. For
> good reason: for English text, it wastes 75% of the used space. In
> general, it wastes about 10 bits (30%) in everything, because Unicode
> only has about, what, 2^21 code points?
>
> [snip]
>
> I think the problem of UTF-8 and UTF-16 strings is important and must be
> addressed.
>
> Sebastian Redl


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk