Subject: Re: [boost] GSoC Proposal Preparation For Encoding Awared String
From: Chad Nelson (chad.thecomfychair_at_[hidden])
Date: 2011-03-24 11:03:52
On Thu, 24 Mar 2011 11:14:08 +0800
Soares Chen <crf_at_[hidden]> wrote:
> [...] What should be the type for a single Unicode combining character
> and grapheme? Unicode combining characters and graphemes (aka abstract
> characters) can consist of an arbitrary number of code points. This means
> that unlike basic types such as char that can be placed on the stack,
> the value for even a single abstract character must stay on the heap due
> to its variable size. [...]
Maybe not. The "Stream-Safe Text Format" is designed specifically for
this. From
<http://www.unicode.org/reports/tr15/index.html#Stream_Safe_Text_Format>:
    A Unicode string is said to be in Stream-Safe Text Format if it
    would not contain any sequences of non-starters longer than 30
    characters in length when normalized to NFKD.

    Such a string can be normalized in buffered serialization with a
    buffer size of 32 characters, which would require no more than 128
    bytes in any Unicode Encoding Form.
It might be feasible to require graphemes to be in this format. I was
planning to do so if I ever wrote a grapheme iterator. Of course, it
still might not be feasible to use a fixed-size structure for
graphemes, depending on how many you need to store at once, but for an
iterator it would be reasonable.
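
To make the size argument concrete, here is a rough sketch of what a
fixed-size grapheme value could look like if input is required to be
Stream-Safe. It is only an illustration, assuming a C++11 compiler and
UTF-32 storage; the name fixed_grapheme and its members are hypothetical
and not part of any existing Boost library. It does no grapheme
segmentation or normalization itself, it only shows that 32 code points
(128 bytes as UTF-32) fit in a value that can live on the stack.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-capacity holder for one grapheme's code points.
// The capacity of 32 mirrors the Stream-Safe buffer size quoted above.
class fixed_grapheme {
public:
    static constexpr std::size_t max_code_points = 32;

    // Appends one code point; returns false and stores nothing when full.
    bool push_back(char32_t cp) {
        if (size_ == max_code_points) return false;
        data_[size_++] = cp;
        return true;
    }

    std::size_t size() const { return size_; }
    bool empty() const { return size_ == 0; }
    void clear() { size_ = 0; }

    // Unchecked in release builds; asserts on out-of-range in debug builds.
    char32_t operator[](std::size_t i) const {
        assert(i < size_);
        return data_[i];
    }

    const char32_t* begin() const { return data_.data(); }
    const char32_t* end() const { return data_.data() + size_; }

private:
    std::array<char32_t, max_code_points> data_{};  // no heap allocation
    std::uint8_t size_ = 0;                         // 0..32 fits in one byte
};

// Holds on platforms where char32_t occupies exactly 4 bytes.
static_assert(sizeof(char32_t) * fixed_grapheme::max_code_points == 128,
              "32 UTF-32 code points should occupy 128 bytes");
```

A grapheme iterator could keep one such object as a member and refill it
on each increment, which is the case where the fixed size costs the
least. Storing many of them at once would waste most of the 128 bytes
per element, since typical graphemes are only one or two code points.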
-- 
Chad Nelson
Oak Circle Software, Inc.