
From: Rogier van Dalen (rogiervd_at_[hidden])
Date: 2005-07-30 17:45:04


Hi Graham,

There should be a message on your most recent header proposal on the
list; if you don't see it, please let me know.

On 7/30/05, Graham <Graham_at_[hidden]> wrote:
> ...
> >I know, but - correct me if I'm wrong - I think the sort keys consist
> of 16-bit integers.
>
> This depends on how we implement it, and on whether we implement or
> allow for more than a 4-level comparison; if so, the level 5+
> comparison, if coded, may require uint32_t.
>
> What do you think? I am happy to go to uint16_t and limit to 4 levels,
> as this is a reasonable performance/implementation trade-off that I
> have previously used with success.

I hadn't thought of that, actually. But as for the sort keys: I was
thinking of packing the key *with* the string, so that the string
itself can be used as a tie-breaker (because that's what you need the
32 bits for, isn't it?). Something like this:
#include <stdint.h>
#include <string>
#include <vector>

// std::string stands in for whatever Unicode string type the library uses.
typedef std::string string;

class string_with_sort_key
{
    const string s;
    std::vector <uint16_t> sort_key;
public:
    string_with_sort_key (const string &);
};
This would make a good key for, say, std::map: it would implicitly
convert from a string when required, e.g. when calling "find" with a
string. I think the string itself would be needed in case another
locale is required, and as a tie-breaker, but I haven't fully thought
this through yet. What do you think about this?
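
To make that concrete, a rough usage sketch; the operator< below is my
assumption (comparing the sort key first and the string as a tie-breaker)
and isn't part of anything proposed yet:

#include <map>
#include <string>

// Assumed for this sketch only: compares sort_key first, then the
// string as a tie-breaker (definition omitted; it needs member access).
bool operator< (const string_with_sort_key &, const string_with_sort_key &);

void example ()
{
    std::map <string_with_sort_key, int> counts;
    // The implicit conversion builds the sort key on the fly:
    counts [std::string ("apple")] = 1;
    counts.find (std::string ("apple"));
}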

> >done without producing sort keys.
>
> String equality [=] would be handled directly using on the fly
> comparison of canonically decomposed forms as I previously mentioned,
> and as you correctly mention.
>
> String equivalence [<] would be handled using sort level 4+.

I'd rather use binary comparison for the normal operator< and
operator==: that's what the Unicode standard recommends for speed
(5.18). For UTF-8 and UTF-32 it's as fast as normal string comparison,
and for UTF-16 it's only slightly slower. C++ has this habit of
passing comparison objects (like std::less) to algorithms and
containers, which would fit in nicely with the need for locales.
Basically, a "collate" object would contain the information needed to
compare strings for a given locale at a certain level.
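
For example, something along these lines (only a sketch of the direction;
the class name and members are made up, not an actual interface):

#include <map>
#include <string>

// Made-up names, for illustration only: a comparison object carrying
// the collation data for one locale and one comparison level.
class collate
{
public:
    // A real implementation would be built from a locale and a strength
    // level and would consult its collation tables here; this
    // placeholder just falls back to binary comparison.
    bool operator() (const std::string & a, const std::string & b) const
    {
        return a < b;
    }
};

// A map sorted in collation order rather than binary order:
// std::map <std::string, int, collate> dictionary;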

> >Not being a native speaker of English I don't think I understand what a
> "coding hit" is.
>
> Reduction in performance - your English is better than a lot of the
> British!

Being a linguist (here meaning someone who knows things about
linguistics) I should be of the opinion that written language isn't
really language, so that doesn't really count - but thank you anyway!
:-)

> >I'm assuming that the caller usually knows the start and end. (I'm
> obviously assuming a C++-style string and not a C-style
> >zero-terminated one.) In this case, the caller must check whether the
> end has been reached and whether it's at begin,
> >pass in 0 in those cases, and then start_of_grapheme has to check
> whether 0 was passed in. To me it
> >seem this would be slower, both to code and to execute.
>
> Performance can be argued back and forth, but when I looked at
> optimising the performance I needed three iterators as iterating
> graphemes is not a 'cheap' operation - and I moved the iterators along
> in series [step forward is curr->prev, next->curr, calculate next]. In
> 64-bit code iterating a uint32_t container, iterators can introduce
> high costs.

I'm not sure I see what you mean, sorry. I'd probably have to see the
code. If I am correct, however, for advancing to the next grapheme
boundary you don't need a "previous" iterator, because it only has a
look-back (if that's the antonym of lookahead) of one character max.

> The real problem though is that iterators do not work across DLL
> boundaries unless you have the code for the DLL and the caller. Have a
> look at my recently posted implementation. This implementation proposal
> would mean you can produce a DLL for sale that has a custom Unicode
> control in it - without having to have the source code for the iterators
> that it needs and providing an interface for each iterator type, or
> placing the whole of the Unicode data in that DLL and preventing the use
> of custom characters.

Whoops... I'm only now starting to see that we probably have slightly
different headers in mind: I'm thinking of what the user needs, you're
thinking of what the "database" needs... sorry I didn't realise that.
Obviously, the DLL shouldn't use iterators, and should probably have,
well ermmm, just about the interface you propose. Hopefully this
realisation will get us somewhere, because I'm sure any iterator type
can be used for processing, even if the database returns plain
pointers. It'll probably mean the database interface should be as
low-level as possible and any smarts should be outside it. These
should definitely be different header files, too...
For example, I think the database (DLL) should provide the
Grapheme_Cluster_Break property for every character; the actual
algorithm could be templated, allowing any iterator to be used.
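
To sketch that split (the names and the rules below are simplified
illustrations, not the real UAX #29 algorithm or your proposed interface):

#include <stdint.h>

// Simplified property values, for illustration only.
enum grapheme_cluster_break { gcb_other, gcb_cr, gcb_lf, gcb_extend };

// Exported by the database DLL as a plain function: no templates or
// iterators cross the DLL boundary; the DLL supplies the definition.
grapheme_cluster_break get_grapheme_cluster_break (uint32_t code_point);

// Lives in a header: works with any iterator over code points, and
// needs only one character of look-back (the 'previous' value).
template <typename Iterator>
Iterator next_grapheme_boundary (Iterator it, Iterator end)
{
    if (it == end)
        return it;
    grapheme_cluster_break previous = get_grapheme_cluster_break (*it);
    ++it;
    while (it != end)
    {
        grapheme_cluster_break current = get_grapheme_cluster_break (*it);
        // Toy subset of the real rules: don't break before Extend, and
        // don't break a CR LF pair.
        if (current != gcb_extend
            && ! (previous == gcb_cr && current == gcb_lf))
            break;
        previous = current;
        ++it;
    }
    return it;
}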

> ...
> > (I don't think "next" is
> needed for the grapheme_cluster case anyway.)

Correcting myself: I meant to say "previous" rather than "next", but see above.

> ...
> >Ermmm... what's this doing then on p. 137 of Unicode standard 4.0.0?
>
> >"Characters may have case mappings that depend on the locale. The
> principal example is Turkish, ..."
>
> Ouch - I had not seen that before.
>
> I believe that this would mean that the upper/lower case conversions
> will need to take a locale to allow for these special extensions at a
> later date.
>
> I have not seen any written rules for this - I guess because I had never
> worked with Turkish - but it should be allowed for.
>
> Does this mean that sorting will also require a locale?

In my latest header proposal, I've grouped both functions under one
"locale" object. I think this would be fast, would allow tailoring for
a language to be added later, and would yield a reasonable syntax.
What do you think?
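
Roughly, I'm imagining something along these lines (names and signatures
made up just to show the grouping; the actual header may differ):

#include <stdint.h>
#include <string>
#include <vector>

// code_points stands in for whatever UTF-32 string type the library uses.
typedef std::vector <uint32_t> code_points;

class unicode_locale
{
public:
    explicit unicode_locale (const std::string & language_tag);

    // Case conversion can depend on the locale (e.g. Turkish dotted and
    // dotless i).
    code_points to_upper (const code_points &) const;
    code_points to_lower (const code_points &) const;

    // Collation-level comparison for sorting, also locale-dependent.
    bool less (const code_points &, const code_points &) const;
};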

Regards,
Rogier

