Subject: Re: [boost] GSoC Proposal Preparation For Encoding Awared String
From: Chad Nelson (chad.thecomfychair_at_[hidden])
Date: 2011-03-24 11:03:52

On Thu, 24 Mar 2011 11:14:08 +0800
Soares Chen <crf_at_[hidden]> wrote:

> [...] What should be the type for a single Unicode combining character
> and grapheme? Unicode combining characters and graphemes (aka the
> abstract characters) can consist of an arbitrary number of code points.
> This means that unlike basic types such as char that can be placed on
> the stack, the value for even a single abstract character must live on
> the heap due to its variable size. [...]

Maybe not. The "Stream-Safe Text Format" is designed specifically for
this. From

    A Unicode string is said to be in Stream-Safe Text Format if it
    would not contain any sequences of non-starters longer than 30
    characters in length when normalized to NFKD.

    Such a string can be normalized in buffered serialization with a
    buffer size of 32 characters, which would require no more than 128
    bytes in any Unicode Encoding Form.

It might be feasible to require graphemes to be in this format. I was
planning to do so if I ever wrote a grapheme iterator. Of course, it
still might not be feasible to use a fixed-size structure for
graphemes, depending on how many you need to store at once, but for an
iterator it would be reasonable.
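
As a rough sketch of what such a fixed-size structure might look like,
assuming the input has already been put into Stream-Safe Text Format so
that no grapheme exceeds the 32-code-point buffer described above. The
name fixed_grapheme and its members are hypothetical, not taken from any
existing library, and the sketch does no grapheme segmentation or
normalization itself; the caller supplies the code points.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-capacity holder for one grapheme's code points.
// Assumes the source text is in Stream-Safe Text Format, so at most
// 32 code points are ever needed for a single grapheme.
struct fixed_grapheme {
    static constexpr std::size_t max_code_points = 32;

    std::array<char32_t, max_code_points> code_points{};
    std::uint8_t length = 0;

    // Appends one code point. Returns false (and stores nothing)
    // if the buffer is already full.
    bool push_back(char32_t cp) {
        if (length == max_code_points) return false;
        code_points[length++] = cp;
        return true;
    }

    std::size_t size() const { return length; }
    const char32_t* begin() const { return code_points.data(); }
    const char32_t* end() const { return code_points.data() + length; }
};

// 32 code points at 4 bytes each gives the 128-byte bound quoted above
// (assumes a platform where char32_t is exactly 4 bytes).
static_assert(sizeof(char32_t) * fixed_grapheme::max_code_points == 128,
              "buffer should hold 128 bytes of code points");
```

An object like this can live on the stack or inside an iterator with no
heap allocation, at the cost of roughly 130 bytes per grapheme, which is
why it suits an iterator better than bulk storage.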

Chad Nelson
Oak Circle Software, Inc.
