From: Graham (Graham_at_[hidden])
Date: 2005-07-30 18:14:27


>From: Rogier van Dalen <rogiervd_at_[hidden]>

>Subject: Re: [boost] Call for interest for native unicode character and
string support in boost - updated definition

 

>Hi Graham,

 

>I fear we might be heading off in a new direction here. I think
enabling third parties to add codepoints,

>if desirable at all (I still have my sincere doubts) should not be a
primary concern.

>I propose we focus on the interface first, and leave out the
implementation bits.

I have been working with Unicode for many years and cannot envisage any
serious Unicode implementation where private characters are not
required.

I have given examples: private characters used as buttons, wrapping
markers for Thai, streaming format changes, and so on.

Each developer will need to be able to customise it. After all, how can
you support Unicode but decide to ignore private characters that are
part of the Unicode spec?

 

>Discussing the implementation, I'm afraid, will just make the
discussion unclear, as I'll show by pointing out some things once.

>- as for the "functions" struct, do you realise that C++ has a built-in
feature called "virtual functions" to do things like this?

Interesting you raise this point.

I started off with an interface but hit a major problem - performance. I
tried wrapping 100,000 words and the interface took a hell of a hit due
to the way in which the compiled code takes the interface pointer and
then offsets to the member in order to call a function on the interface.

So I looked at being able to access the function on the interface
directly. This proved to be impossible via a static function, because
you cannot have statics implemented on an interface, and it is no better
if you wrap an interface call using mem_fun.

Then I considered how I would modify one function, like sort, without
ANY performance hit - and ended up with what I had.

 

How the hell can you do it more neatly without a performance hit? I am
open to suggestions.

Do not forget we must be able to optimise this for times when this
interface will be hit millions of times per second - and yes this is
something I have done a number of times in real life - e.g. formatting
of live data feeds - and this must be a primary concern in the design.
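To make the idea concrete, here is a minimal sketch of a table of plain
function pointers in which a developer replaces a single entry. All the
names (functions, ascii_to_upper, custom_to_upper, upper_in_place) are
hypothetical and this is not the code from the earlier post; the default
upper-casing is ASCII only, purely for illustration. Whether this
actually beats virtual dispatch depends on the compiler and platform and
has to be measured.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical table of plain function pointers, one entry per operation.
struct functions {
    std::uint32_t (*to_upper)(std::uint32_t);
    bool (*is_private_use)(std::uint32_t);
};

// Default upper-casing (ASCII only, for illustration).
inline std::uint32_t ascii_to_upper(std::uint32_t ch) {
    return (ch >= 0x61u && ch <= 0x7Au) ? ch - 0x20u : ch;
}

// The three Private Use ranges defined by the Unicode Standard.
inline bool default_is_private_use(std::uint32_t ch) {
    return (ch >= 0xE000u && ch <= 0xF8FFu) ||
           (ch >= 0xF0000u && ch <= 0xFFFFDu) ||
           (ch >= 0x100000u && ch <= 0x10FFFDu);
}

inline functions default_functions() {
    return functions{ &ascii_to_upper, &default_is_private_use };
}

// A developer overrides one entry; the mapping of U+E000 to U+E001 is an
// invented private-use example. Everything else falls back to the default.
inline std::uint32_t custom_to_upper(std::uint32_t ch) {
    if (ch == 0xE000u) return 0xE001u;
    return ascii_to_upper(ch);
}

// Hot loop: one indirect call per character through the table.
inline void upper_in_place(const functions& fns,
                           std::uint32_t* first, std::uint32_t* last) {
    assert(fns.to_upper != nullptr);
    for (; first != last; ++first) *first = fns.to_upper(*first);
}
```

Replacing one function is then a single assignment, for example
fns.to_upper = &custom_to_upper, and the other entries are untouched.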

 

>- Have you realised what happens when deque<>::iterators, or any other
iterators to containers with non-contiguous elements, are fed into the
get_uppercase method with the current implementation? What about
iterators that process UTF-8 and pretend it to be a UTF-32 sequence?

The implementation will work on any non-contiguous iterator as written;
after all, the single iterator version increments or decrements the
iterator and then dereferences it.

Of course, from bitter experience I can tell you that working with a
single iterator and not caching the values is a massive performance hit.

 

>- Do you realise that inline functions are not macros and thus need no
backslashes at the end of lines?

Sorry - I went through a number of iterations to try and get the code
you saw - in one of the iterations I tried to inline the functions.

 

>The implementation won't help discussing the interface, so I think we'd
better leave them out for now

I am actually finding that the snapshots I have included are important
as they can help to reveal why the implementation might not work and
what we can do about it.

 

>

>BTW, now looking at the code: do you fully realise how iterators work?

>It seems to me that

>StartOfGrapheme(functions* pFns, inputIterator chPrev, inputIterator
ch, inputIterator chNext) is not really needed,

>because it is quite easy to find an iterator to the next and previous
position given the current one.

That is correct(ish) - but processor intensive - I have had real
problems when you process a document from a data feed and wrap it for
storage in a database. As an example on a ".4 Gig Pentium 4 I have seen
this take up to six seconds [depending on the text - and we were
handling some large text documents] when fully optimised - and that
required a lot of profiling to get the best performance. Wrapping a full
document for the first time requires a lot of processing and you cannot
just increment and decrement iterators - as I said in my previous e-mail
[the one today, after the e-mail to which you have replied] the only way
of doing this with reasonable performance is to have three iterators
that get shuffled along. Remember these are really low level functions
and get hit very heavily and often.
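As an illustration of shuffling three positions along, here is a minimal
sketch that caches the previous, current and next code points so that
each element is dereferenced only once. The names start_of_grapheme and
count_graphemes are hypothetical, and the boundary rule is a deliberately
crude stand-in (it only treats U+0300..U+036F as combining marks), not
the real Unicode grapheme rules.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical stand-in for the real rule: a position starts a grapheme
// unless the current code point is a combining mark in U+0300..U+036F.
// Real rules also look at the neighbours, so they are passed in as well.
inline bool start_of_grapheme(std::uint32_t /*prev*/, std::uint32_t cur,
                              std::uint32_t /*next*/) {
    return !(cur >= 0x0300u && cur <= 0x036Fu);
}

// Counts grapheme starts in [first, last). The three cached values are
// shuffled along, so the iterator only ever moves forward and each
// element is dereferenced exactly once.
template <typename InputIterator>
std::size_t count_graphemes(InputIterator first, InputIterator last) {
    const std::uint32_t none = 0xFFFFFFFFu;  // sentinel: no neighbour
    if (first == last) return 0;
    std::uint32_t prev = none;
    std::uint32_t cur = static_cast<std::uint32_t>(*first);
    ++first;
    std::size_t count = 0;
    for (;;) {
        const bool at_end = (first == last);
        const std::uint32_t next =
            at_end ? none : static_cast<std::uint32_t>(*first);
        if (start_of_grapheme(prev, cur, next)) ++count;
        if (at_end) break;
        prev = cur;   // shuffle the window along by one position
        cur = next;
        ++first;
    }
    return count;
}

// Usage with a non-contiguous container: "e" + combining acute, then "a".
inline std::size_t example() {
    std::deque<std::uint32_t> text = {0x65, 0x0301, 0x61};
    return count_graphemes(text.begin(), text.end());
}
```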

 

>Please find attached a modified version of the header. Changes are:

>- It now looks like C++: superfluous semicolons are deleted and
identifiers are lowercase with underscores;

This will mean altering the automatic conversion of the Unicode names to
C++ names of this form, but that should not be a problem: it is just a
matter of converting upper case to lower case and replacing spaces and
'-' with '_'.
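A minimal sketch of that conversion, assuming the input names are plain
ASCII. The function name to_cpp_name is hypothetical, and the sketch does
not check for two different Unicode names collapsing to the same
identifier.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Converts a Unicode name such as "LATIN SMALL LETTER A" to a lowercase
// identifier with underscores. Assumes the name is plain ASCII.
inline std::string to_cpp_name(const std::string& unicode_name) {
    std::string result;
    result.reserve(unicode_name.size());
    for (unsigned char c : unicode_name) {
        if (c == ' ' || c == '-') {
            result += '_';
        } else {
            result += static_cast<char>(std::tolower(c));
        }
    }
    return result;
}
```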

 

>- the last "uni" prefixes have been removed;

>- it does not show the implementation any more;

Unfortunately you seem to have fallen into the great trap of iterators.
The code can no longer be called to or from third party DLLs; please see
my post of earlier today.

Iterators do not work unless you have all the source code - which means
that third party DLLs containing custom controls will not work with your
changes.

This would be a major step backwards as grid controls are a major
business and many are shipped without source, just as DLLs with a
defined interface.

You cannot use an iterator across such a boundary - please see my
previous code for an example of how I have ensured that it WILL work
with third party DLLs such as grid controls without every DLL from a
different company having to have its own Unicode data. The
implementation shown demonstrated how to do this and was included for
that reason.
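For readers without the earlier post to hand, here is a minimal sketch of
the general shape of a boundary that avoids templates: the DLL exports a
function taking only a pointer and sizes, and any iterator handling stays
in a header on the caller's side. The names unicode_next_grapheme and
next_grapheme are hypothetical, the grapheme rule is a placeholder, and
this is not the code from that post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// What the DLL would export: no templates, only plain pointers and sizes.
// In a real build it would carry the platform's export attribute and live
// inside the DLL; it is defined here so the sketch is self-contained.
// Placeholder rule: skip combining marks in U+0300..U+036F.
extern "C" std::size_t unicode_next_grapheme(const std::uint32_t* text,
                                              std::size_t length,
                                              std::size_t pos) {
    if (pos >= length) return length;
    ++pos;
    while (pos < length && text[pos] >= 0x0300u && text[pos] <= 0x036Fu) {
        ++pos;
    }
    return pos;
}

// Header-only wrapper on the caller's side: copies any iterator range
// into a contiguous buffer, then calls the exported function. Only the
// pointer and the sizes cross the DLL boundary.
template <typename InputIterator>
std::size_t next_grapheme(InputIterator first, InputIterator last,
                          std::size_t pos) {
    const std::vector<std::uint32_t> buffer(first, last);
    return unicode_next_grapheme(buffer.data(), buffer.size(), pos);
}
```

The copy in the wrapper is the price of accepting arbitrary iterators; a
caller that already holds contiguous storage can call the exported
function directly.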

 

>- I corrected some spelling errors, possibly introduced new ones
(seperable is spelled sepArable, isn't it? I'm not 100% sure);

>- a get_category() function is defined to get the general General
Category (i.e. letter, mark, number, etc.)

I have just realised that all the category enums need to be placed in a
single enum as they are in fact in a single data space so your
get_category becomes the single category call.

 

>- I deleted the page0() function because I didn't see why it should be
there (feel free to move it back in if I missed something).

OK

 

>- I provided only iterator-based grapheme, word and sentence skipping
functions;

This will not work across third party code when you don't have the
source. It means you can't have any more grid control DLLs etc. in
Unicode unless they contain all the Unicode data separately, or you
insist that all suppliers sell their DLLs as source code.

 

>- I provided a locale with a few lines sketching an idea of what a
collation object should look like. Did you realise that string
comparison should likely be passed to a function or

>container as a comparison object? For example, my specification would
allow:

>

>// Some container c of strings

>// Some string s

>std::lower_bound (c.begin(), c.end(), s,
unicode::default_locale().collate_accents());

>

>Any comments are, of course, most welcome!

We need to make a collation object that is container independent as
there are going to be several Unicode containers - not just one.

We must therefore separate collation from container.

It must work with third party DLLs where you don't have the source code
so iterators are out.

The collate_accents call is strange - I don't understand why you don't
just pass a locale in. For example some languages sort <ae> differently
and this is not an accent.

I would therefore just pass in a uint32_t for the locale.
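A minimal sketch of what that could look like: a comparison entry point
that takes pointers, lengths and a uint32_t locale, plus a small
comparison object on the caller's side so it still plugs into
std::lower_bound for any container. The names unicode_compare and
collate_less and the locale value 0 are hypothetical, and the comparison
itself is a placeholder (plain code point order) that ignores the locale.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical exported entry point: returns <0, 0 or >0.
// Placeholder logic: plain code point order. A real implementation would
// select tailored collation rules from locale_id.
extern "C" int unicode_compare(const std::uint32_t* a, std::size_t a_len,
                               const std::uint32_t* b, std::size_t b_len,
                               std::uint32_t locale_id) {
    (void)locale_id;  // unused by the placeholder
    const std::size_t n = std::min(a_len, b_len);
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) return a[i] < b[i] ? -1 : 1;
    }
    if (a_len == b_len) return 0;
    return a_len < b_len ? -1 : 1;
}

// Caller-side comparison object. It works with any string type that
// offers data() and size() over std::uint32_t, so collation stays
// separate from the container.
struct collate_less {
    std::uint32_t locale_id;

    template <typename String>
    bool operator()(const String& a, const String& b) const {
        return unicode_compare(a.data(), a.size(),
                               b.data(), b.size(), locale_id) < 0;
    }
};
```

With a sorted container c of strings and a string s, the call would then
be std::lower_bound(c.begin(), c.end(), s, collate_less{0}).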

 

Any thoughts?

 

Yours,

 

Graham