Boost logo

Boost :

From: Graham (Graham_at_[hidden])
Date: 2005-07-30 11:05:26


>From: Rogier van Dalen <rogiervd_at_[hidden]>

>Subject: Re: [boost] Call for interest for native unicode character and
string support in boost

 

Dear Rogier,

 

>> DWORD is actually uint32_t.

>I know, but - correct me if I'm wrong - I think the sort keys consist
of 16-bit integers.

This depends on how we implement it - and if we implement or allow for
more than 4 level comparison - if so the 5+ comparison if coded may
require uint32_t.

What do you think? I am happy to go to uint16_t and limit to 4 levels as
this is a reasonable performance/ implementation trade-off that I have
previously used with success.

 

>> I believe collation must be here as there will be probably be several

>> containers with Unicode characteristics and this is a good level for

>> them to work on.

>I'm not sure - seeing as I'd expect us to come up with at least the
option to keep strings in some Normalisation Form, the

>first step of the collation algorithm should be skipped in some cases.
Furthermore, individual comparisons should probably be

>done without producing sort keys.

String equality [=] would be handled directly using on the fly
comparison of canonically decomposed forms as I previously mentioned,
and as you correctly mention.

String equivalence [<] would be handled using sort level 4+.

 

>Not being a native speaker of English I don't think I understand what a
"coding hit" is.

Reduction in performance - your English is better than a lot of the
British !

 

>I'm assuming that the caller usually knows the start and end. (I'm
obviously assuming a C++-style string and not a C-style

>zero-terminated one.) In this case, the caller must check whether the
end has been reached and whether it's at begin,

>pass in 0 in those cases, and then start_of_grapheme has to check
whether 0 was passed in. To me it

>seem this would be slower, both to code and to execute.

Performance can be argued back and forth, but when I looked at
optimising the performance I needed three iterators as iterating
graphemes is not a 'cheap' operation - and I moved the iterators along
in series [step forward is curr->prev, next->curr, calculate next]. In
64 bit code iterating a uint_32t container, iterators can introduce high
costs.

The real problem though is that iterators do not work across DLL
boundaries unless you have the code for the DLL and the caller. Have a
look at my recently posted implementation. This implementation proposal
would mean you can produce a DLL for sale that has a custom Unicode
control in it - without having to have the source code for the iterators
that it needs and providing an interface for each iterator type, or
placing the whole of the Unicode data in that DLL and preventing the use
of custom characters.

 

>Furthermore, one'd better abstract away from the implementation as much
as possible, and I don't think it is relevant to the caller

>how far the functions want to look ahead. (I don't think "next" is
needed for the grapheme_cluster case anyway.)

A good example of next grapheme and start of grapheme is for default
moving a cursor left/ right - the cursor may be on a grapheme [or
optionally inside a grapheme if the editor supports this, as is
optionally recommended in the standard] and you need to find the
previous/next grapheme.

 

>Ermmm... what's this doing then on p. 137 of Unicode standard 4.0.0?

>"Characters may have case mappings that depend on the locale. The
principal example is Turkish, ..."

Ouch - I had not seen that before.

I believe that this would mean that the upper/ lower case conversions
will need to take a locale to allow for these special extensions at a
later date.

I have not seen any written rules for this - I guess because I had never
worked with Turkish - but it should be allowed for.

Does this mean that sorting will also require a locale?

Please let me know what you think and if you can spot any other
assumptions I have made in my attachment posted on the 28th.

 

Yours,

 

Graham


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk