Subject: Re: [boost] GSoC Proposal Preparation For Encoding Awared String
From: Chad Nelson (chad.thecomfychair_at_[hidden])
Date: 2011-03-24 11:03:52


On Thu, 24 Mar 2011 11:14:08 +0800
Soares Chen <crf_at_[hidden]> wrote:

> [...] What should be the type for single Unicode combine character and
> grapheme? Unicode combine characters and graphemes (aka the abstract
> characters) can consist of arbitrary number of code points. This means
> that unlike basic types such as char that can be placed on the stack,
> the value for even single abstract character must stay at the heap due
> to its variable size. [...]

Maybe not. The "Stream-Safe Text Format" is designed specifically for
this. From
<http://www.unicode.org/reports/tr15/index.html#Stream_Safe_Text_Format>:

    A Unicode string is said to be in Stream-Safe Text Format if it
    would not contain any sequences of non-starters longer than 30
    characters in length when normalized to NFKD.

    Such a string can be normalized in buffered serialization with a
    buffer size of 32 characters, which would require no more than 128
    bytes in any Unicode Encoding Form.

It might be feasible to require graphemes to be in this format. I was
planning to do so if I ever wrote a grapheme iterator. Of course, it
still might not be feasible to use a fixed-size structure for
graphemes, depending on how many you need to store at once, but for an
iterator it would be reasonable.
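
To make the iterator case concrete, here is a minimal C++ sketch of a fixed-size grapheme holder. It assumes every grapheme the iterator yields fits within the 32-character buffer quoted above, which is an assumption about the input, not something the sketch checks against the Unicode data. The names grapheme_buffer, capacity, code_points, length and push_back are hypothetical and do not come from any existing Boost library.

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed-size holder for a single grapheme.
// Assumption: the text is in Stream-Safe Text Format and each grapheme
// fits in the 32-character buffer mentioned in UAX #15.
struct grapheme_buffer {
    // 32 characters, stored here as UTF-32 code units.
    static constexpr std::size_t capacity = 32;

    std::array<char32_t, capacity> code_points{};
    std::size_t length = 0;

    // Appends one code point. Returns false when the buffer is full,
    // so the caller can reject the input rather than allocate.
    bool push_back(char32_t cp) {
        if (length == capacity) {
            return false;
        }
        code_points[length++] = cp;
        return true;
    }
};

// 32 UTF-32 code units occupy 128 bytes, matching the bound quoted above
// (this holds where char32_t is 4 bytes wide).
static_assert(sizeof(char32_t) * grapheme_buffer::capacity == 128,
              "expected a 128-byte code point buffer");
```

A grapheme iterator could keep one such object as a member and refill it on each increment, so nothing touches the heap. Storing many of them at once costs 128 bytes plus the length field per grapheme, which is the trade-off noted above.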

-- 
Chad Nelson
Oak Circle Software, Inc.