From: Graham (Graham_at_[hidden])
Date: 2005-07-28 13:28:58


>From: Rogier van Dalen <rogiervd_at_[hidden]>
>Subject: Re: [boost] Call for interest for native unicode character
> and string support in boost
>
>Great, this seems a good first step. Glad to see things moving. I'll
>give my comments, but I hope Erik will step in so we can see what he's
>got.
>
>> I welcome comments.
>>
>I agree with the general idea.
>First, http://www.boost.org/more/lib_guide.htm#Guidelines has coding
>guidelines. In general, your code looks slightly C-ish. The Boost
>habit is to use the ".hpp" extension for C++ headers. You attached a
>file "unicode.hpp" but talk about "Unicode.hpp": note that these are
>different names.
>I suggest we make a namespace "unicode" rather than prepending
>everything with "uni". The enums had probably better be put in
>structs.
>
>namespace unicode {
> struct range {
> enum type {
> latin1_supplement,
> latin_extended_a,
> latin_extended_b,
> ipa_extensions,
> // ...
> };
> };
>}
Yes - it should be namespaced - I had omitted it for clarity.
I still think that the uni prefix might be useful to remind those
programmers using 'using unicode' that these are Unicode functions - but
I am happy to lose that argument.
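
A minimal sketch of the enum-in-struct idiom quoted above, to show how the enumerators end up scoped without a "uni" prefix. The helper first_codepoint and its return type are hypothetical and only there to show the call-site spelling; they are not part of any posted header.

```cpp
#include <cassert>

namespace unicode {

// Wrapping the enum in a struct scopes the enumerators: callers write
// unicode::range::latin_extended_a, so no "uni" prefix is required.
struct range {
    enum type {
        latin1_supplement,
        latin_extended_a,
        latin_extended_b,
        ipa_extensions
        // ... remaining blocks elided
    };
};

// Hypothetical helper (illustration only): first code point of a block.
inline unsigned long first_codepoint(range::type r) {
    switch (r) {
        case range::latin1_supplement: return 0x0080;
        case range::latin_extended_a:  return 0x0100;
        case range::latin_extended_b:  return 0x0180;
        case range::ipa_extensions:    return 0x0250;
    }
    return 0; // not reached for the enumerators listed above
}

// Usage: the enumerator is always qualified by the struct name.
inline void range_example() {
    assert(first_codepoint(range::ipa_extensions) == 0x0250);
}

} // namespace unicode
```

Even with a using-directive for the namespace, the call site still reads range::ipa_extensions, which gives some of the reminder the prefix was meant to provide.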

>The fact that I find "Hungarian notation" ugly and meaningless is
>probably irrelevant, but it's not the way it's generally done in
>Boost.
>char32_t is not yet a part of the C++ standard, I believe. I'm not
>sure, maybe we'd better call it "codepoint" anyway, and use #ifdef'ed
>typedef's.
>BOOL is not C++; it is spelled "bool". DWORD doesn't exist either; I
>believe you mean uint16_t (sic) for the collation data, if I
>understand correctly what the methods are doing. But I think collation
>should not be in this header yet, but rather be inserted later, when
>the string classes are defined.

Oops - caught - I was attempting to write it in such a way that it could
be used from C as well as C++ - hence BOOL not bool.
DWORD is actually uint32_t.
I believe collation must be here as there will probably be several
containers with Unicode characteristics and this is a good level for
them to work on.
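
The "codepoint" typedef with #ifdef'ed definitions suggested in the quoted text might look roughly like this. This is only a sketch: the choice of macro test, the name is_scalar_value, and the namespace layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <climits>

namespace unicode {

// Pick a built-in unsigned type that is at least 32 bits wide.
// (Assumption: selecting on UINT_MAX is sufficient for this sketch.)
#if UINT_MAX >= 0xFFFFFFFF
typedef unsigned int codepoint;
#else
typedef unsigned long codepoint;
#endif

// Hypothetical helper: true for Unicode scalar values, i.e. anything in
// 0..0x10FFFF that is not a surrogate (0xD800..0xDFFF).
inline bool is_scalar_value(codepoint c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// Usage: the typedef is wide enough for the whole Unicode code space.
inline void codepoint_example() {
    assert(sizeof(codepoint) * CHAR_BIT >= 32);
    assert(is_scalar_value(0x10FFFF));
    assert(!is_scalar_value(0xD800));
}

} // namespace unicode
```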

>Case conversion should probably take output iterators. That'll get rid
>of the complex/simple division. The methods should probably be
>templated as well, and take ranges rather than counts.
>template <class InputIterator, class OutputIterator>
> OutputIterator lowercase (InputIterator first, InputIterator last,
> OutputIterator result);
>template <class InputIterator, class OutputIterator>
> OutputIterator uppercase (InputIterator first, InputIterator last,
> OutputIterator result);
I like this but we will still need to have a complex/simple division.
However using iterators the complex can do both, and the simple then
becomes GetSimpleLowercase for case conversion without changing length,
but it can again take an output iterator.
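
To make the complex/simple distinction concrete, here is a rough sketch using output iterators. The mapping is a toy: only ASCII a-z and U+00DF (sharp s, whose full uppercase is "SS") are handled, and the names uppercase, simple_uppercase and codepoint are assumptions, not the proposed interface.

```cpp
#include <cassert>
#include <iterator>
#include <vector>

namespace unicode {

typedef unsigned long codepoint; // assumption: any type of 32 bits or more

// Toy single-code-point mapping: ASCII letters only.
inline codepoint toy_simple_upper(codepoint c) {
    return (c >= 'a' && c <= 'z') ? c - ('a' - 'A') : c;
}

// "Complex" (full) conversion: may write more than one code point per
// input code point, so the output length can differ from the input length.
template <class InputIterator, class OutputIterator>
OutputIterator uppercase(InputIterator first, InputIterator last,
                         OutputIterator result) {
    for (; first != last; ++first) {
        codepoint c = *first;
        if (c == 0x00DF) {            // sharp s expands to "SS"
            *result++ = 'S';
            *result++ = 'S';
        } else {
            *result++ = toy_simple_upper(c);
        }
    }
    return result;
}

// "Simple" conversion: exactly one code point out per code point in.
template <class InputIterator, class OutputIterator>
OutputIterator simple_uppercase(InputIterator first, InputIterator last,
                                OutputIterator result) {
    for (; first != last; ++first)
        *result++ = toy_simple_upper(*first);
    return result;
}

// Usage: the same input gives different lengths under the two functions.
inline void case_example() {
    std::vector<codepoint> in, full, simple;
    in.push_back('a');
    in.push_back(0x00DF);
    uppercase(in.begin(), in.end(), std::back_inserter(full));
    simple_uppercase(in.begin(), in.end(), std::back_inserter(simple));
    assert(full.size() == 3);   // 'A', 'S', 'S'
    assert(simple.size() == 2); // 'A', U+00DF unchanged
}

} // namespace unicode
```

Both functions take the same iterator arguments; the only difference a caller sees is whether the output length is guaranteed to match the input length.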

>The break functions:
>Couldn't these take iterators as well? For all use cases I can think
>of, this would be a much easier version to use:
>template <class InputIterator>
> InputIterator advance_grapheme (InputIterator position,
> InputIterator last);
>(etc.)
When I did my original coding I coded each of the following:
GetStartOfGrapheme
GetPreviousGrapheme
GetNextGrapheme
I found that just by having IsStartOfGrapheme all these became really
simple routines.
I therefore believe extremely strongly that it is necessary to have
IsStartOfGrapheme and that the others like GetNextGrapheme or
advance_grapheme will then be simple/inline wrappers that use
IsStartOfGrapheme.
I also found that there was a coding hit if you have to test start and
end iterator positions when processing the grapheme, hence I was passing
in three DWORDs.
Having said that, allowing inline versions that take iterators to call
the core uint32_t [DWORD] functions would be a good thing, and I would
expect this to happen.
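
As a rough sketch of that layering: one core test on raw code points, with the iterator versions as thin wrappers around it. The break rule here is a deliberate simplification (a code point in U+0300..U+036F attaches to what precedes it, and CR LF stays together), not the real Unicode grapheme rules, and the two-argument core signature is an assumption rather than the three-DWORD form mentioned above.

```cpp
#include <cassert>
#include <vector>

namespace unicode {

typedef unsigned long codepoint; // assumption: any type of 32 bits or more

// Core test on raw code points: does a grapheme start at 'current',
// given the code point before it? Toy rule only.
inline bool is_start_of_grapheme(codepoint previous, codepoint current) {
    if (previous == 0x000D && current == 0x000A) return false; // CR LF
    return !(current >= 0x0300 && current <= 0x036F); // combining marks
}

// Wrapper: step forward to the start of the next grapheme (or to last).
template <class ForwardIterator>
ForwardIterator advance_grapheme(ForwardIterator position,
                                 ForwardIterator last) {
    if (position == last) return last;
    codepoint previous = *position;
    ++position;
    while (position != last && !is_start_of_grapheme(previous, *position)) {
        previous = *position;
        ++position;
    }
    return position;
}

// Wrapper: step backward to the start of the preceding grapheme.
template <class BidirectionalIterator>
BidirectionalIterator previous_grapheme(BidirectionalIterator first,
                                        BidirectionalIterator position) {
    while (position != first) {
        --position;
        if (position == first) break;
        BidirectionalIterator before = position;
        --before;
        if (is_start_of_grapheme(*before, *position)) break;
    }
    return position;
}

// Usage: 'e' + U+0301 (combining acute) is one grapheme, 'x' is another.
inline void grapheme_example() {
    std::vector<codepoint> text;
    text.push_back('e');
    text.push_back(0x0301);
    text.push_back('x');
    assert(advance_grapheme(text.begin(), text.end()) == text.begin() + 2);
    assert(previous_grapheme(text.begin(), text.end()) == text.begin() + 2);
}

} // namespace unicode
```

Both wrappers contain no break logic of their own, so replacing the toy rule with the full one would only touch is_start_of_grapheme.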

>Finally, just thinking out loud: both the case mappings and collation
>have default (non-locale-specific) and tailored modes. Shouldn't those
>best be represented by classes rather than free functions, and
>shouldn't there thus be a global variable "default" that provides
>default operations, and other objects for locale-specific operations?

Unicode case mappings are not locale-specific.

I do not intend to handle any code page conversions at this stage - that
can be added later and should be handled in a separate discussion.
Those conversions would not be Unicode conversions, and I believe that
discussion should be postponed to a later date.

Yours,

Graham

