
From: Rogier van Dalen (rogiervd_at_[hidden])
Date: 2005-07-30 07:29:03


Hi,

On 7/28/05, Graham <Graham_at_[hidden]> wrote:
> ...
> Notable Changes From Unicode 4.0.1 to Unicode 4.1.0
>
>
> * Addition of 1273 new characters to the standard, including those
> to complete roundtrip mapping of the HKSCS and GB 18030 standards, five
> new currency signs, some characters for Indic and Korean, and eight new
> scripts. (The exact list of additions can be seen in DerivedAge.txt, in
> the age=4.1 section.)
> * Change in the end of the CJK Unified Ideographs range from
> U+9FA5 to U+9FBB, with the addition of some Han characters. The
> boundaries of such ranges are sometimes hardcoded in software, in which
> case the hardcoded value needs to be changed.

I see. Things like these may indeed be problematic when, say, the
operating system uses another version than the library. However, do we
have any alternative but to accept that this may lead to problems?
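
To make the concern concrete, here is a minimal sketch of keeping the range boundary as versioned data instead of a hardcoded constant. The names (unicode_version, code_point_range, cjk_unified_ideographs, is_cjk_unified_ideograph) are made up for illustration and are not part of any proposed interface. The end points U+9FA5 and U+9FBB are the ones quoted above; the start of the range, U+4E00, is the usual start of the CJK Unified Ideographs block.

```cpp
// Hypothetical enum: the two Unicode versions discussed in this thread.
enum class unicode_version { v4_0_1, v4_1_0 };

// Hypothetical helper type: an inclusive range of code points.
struct code_point_range {
    char32_t first;
    char32_t last;
};

// The CJK Unified Ideographs range ends at U+9FA5 in Unicode 4.0.1
// and at U+9FBB in Unicode 4.1.0.
constexpr code_point_range cjk_unified_ideographs(unicode_version v) {
    return v == unicode_version::v4_0_1
        ? code_point_range{0x4E00, 0x9FA5}
        : code_point_range{0x4E00, 0x9FBB};
}

// The answer depends on which version of the data is in use, which is
// exactly why an OS and a library on different versions can disagree.
constexpr bool is_cjk_unified_ideograph(char32_t cp, unicode_version v) {
    return cp >= cjk_unified_ideographs(v).first
        && cp <= cjk_unified_ideographs(v).last;
}
```

With this, U+9FBB is classified differently depending on the version passed in, so the mismatch is at least visible in one place rather than scattered through the code.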

> ...
> >I fear I don't understand what you mean. It sounds to me like you're
> >suggesting defining a new font format for the Boost Unicode library.
>
> Please accept that there are fundamental issues with the way in which
> scaling happens on a Windows platform. They mean that anybody who wants
> to ensure that the same number of lines is shown in a document
> regardless of the 'zoom' setting used to display that document, so that
> the lines appear to have the same relative lengths at any zoom setting,
> has to work very hard.

Having done a fair share of font development, I know. However, this is
not a Windows-specific problem. Computer screens' resolutions just
aren't high enough to both display readable text and position
characters at an unrounded position. I'm not sure there is much to be
done about it though, in a Unicode implementation.

> This might be an example of where a particular implementation might
> store an array of wrapped lines rather than an array of characters.
>
> By moving the Unicode specific rules out of, and separate to, the
> storage, the storage method becomes an implementation issue.
>
> There can then be many different storage methods based on the same
> Unicode rules, e.g. implementations based on vector<char32_t> and
> vector<char16_t> to give two examples.

This sounds reasonable, but I'm having trouble seeing what you mean to
suggest. Is it that you'd want to give users the choice of an
encoding?
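
If I read the suggestion correctly, it could look roughly like the sketch below: the "rule" is written once against code points, and each storage type only has to say how to produce them. Everything here (to_code_points, count_code_points) is a hypothetical name for illustration, and replacing unpaired surrogates with U+FFFD is an assumption of this sketch, not something that has been agreed on.

```cpp
#include <cstddef>
#include <vector>

// UTF-16 storage: decode surrogate pairs into code points.
// Assumption of this sketch: unpaired surrogates become U+FFFD.
inline std::vector<char32_t> to_code_points(const std::vector<char16_t>& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < s.size()
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            // High surrogate followed by low surrogate: combine them.
            u = 0x10000 + ((u - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i;
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            u = 0xFFFD;
        }
        out.push_back(u);
    }
    return out;
}

// UTF-32 storage: the code units already are the code points.
inline const std::vector<char32_t>& to_code_points(const std::vector<char32_t>& s) {
    return s;
}

// A storage-independent "rule": it only ever sees code points.
template <class Storage>
std::size_t count_code_points(const Storage& s) {
    return to_code_points(s).size();
}
```

Both vector<char16_t> and vector<char32_t> can then be passed to count_code_points, and a real implementation would presumably use iterators rather than copying into a new vector.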

> I would envision that the string storage classes would support UTF16,
> UTF32 and grapheme based iterators. It is important in some
> circumstances to be able to process UTF32 characters rather than
> graphemes - for example when drawing the string !
>
> As graphemes can be several UTF32 characters long and require
> calculation to determine their data length it is often not practical to
> make these the default processing scheme.

Have you read the discussion on this list some months ago? I suggested
a grapheme-based string containing a codepoint string. A codepoints()
method should give the user the codepoint string, so that it can be
used for display, other I/O, or whatever else.
It's just that you don't want to force people to deal with
normalisation forms and whatnot if it can be helped. Equivalent
sequences should work the same as far as the user is concerned,
independent of their normalisation form or encoding. If you really want
to process UTF-32 codepoints, or UTF-8 bytes, for that matter, you
should be able to get at them, but how many people do you think would
need that? A very small fraction, I think. Thus, the default interface
should use graphemes.
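
A minimal sketch of the shape I have in mind follows. The class name grapheme_string and its members other than codepoints() are hypothetical. The grapheme rule is deliberately oversimplified: a grapheme here is a code point followed by any combining marks from U+0300..U+036F, which is nowhere near the real Unicode segmentation rules, and no normalisation is done at all.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class grapheme_string {
public:
    explicit grapheme_string(std::u32string cps) : cps_(std::move(cps)) {
        // Record the index at which each grapheme starts.
        for (std::size_t i = 0; i < cps_.size(); ++i)
            if (i == 0 || !is_combining(cps_[i]))
                starts_.push_back(i);
    }

    // Default interface: size and indexing are in graphemes.
    std::size_t size() const { return starts_.size(); }

    std::u32string operator[](std::size_t i) const {
        const std::size_t begin = starts_[i];
        const std::size_t end =
            i + 1 < starts_.size() ? starts_[i + 1] : cps_.size();
        return cps_.substr(begin, end - begin);
    }

    // Escape hatch for display, I/O and so on: the underlying code points.
    const std::u32string& codepoints() const { return cps_; }

private:
    // Simplification for this sketch: only the Combining Diacritical
    // Marks block counts as "combining".
    static bool is_combining(char32_t c) { return c >= 0x0300 && c <= 0x036F; }

    std::u32string cps_;
    std::vector<std::size_t> starts_;
};
```

For the code points 'e', U+0301, 'a', this gives two graphemes through the default interface and three code points through codepoints().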

> >> This private use character data would NOT be published or distributed -
> >> the facility to merge them in during usage allows each developer the
> >> access to add their own private use data for their own system only.
> >
> >But surely this means every app would have to come with a different DLL?
> >I'm not so sure about this. For many cases other markup (XML or
> >something) would do. Maybe other people have opinions about this?
>
> I propose making it so that the Unicode data could be either placed in a
> DLL or an application.
> ...

OK, this may be possible technically. From the point of view of a
possible standardisation, however, it would seem quite impossible. Especially
since the fraction of people wanting to introduce their own characters
will be small, I think it would be best to focus on what the interface
should look like first, and then to see how extra characters can be
added.

(I'll reply to your new header mail in a minute.)

Regards,
Rogier

