From: Graham (Graham_at_[hidden])
Date: 2008-08-28 17:59:53


>> Martin Lutken wrote:
>>> Anyone who knows how this could be made possible?
>>> I suppose I need a locale facet like the std::ctype, but which works for
>>> UTF-8, and not just for ASCII a-z,A-Z. I guess the information in a table
>>> like this (http://www.unicode.org/Public/UNIDATA/CaseFolding.txt)
>>> could be used.
>>>
>>
>> This might not work out-of-the-box. The StringAlgo lib is designed around
>> sequences of characters. Since UTF-8 is a variable-width character encoding,
>> the algorithms in the library would not work as expected.
>>
>> To make it work, you will need a container with iterators that
>> iterate over meta-characters, not bytes.
>>
>>> If it's better/easier just to convert the string to UTF-32 before doing
>>> case insensitive compares and replaces, I could live with that.
>>
>> If you meant UTS-32 and you have a corresponding locale implementation,
>> then this approach is a viable solution.
>>
>> Sorry, what is UTS-32 ? I tried to Google it: 351 results, with none of
>> them looking like char encoding related.
>>
>> I found this article on Wikipedia on UTF-32/UCS-4:
>> http://en.wikipedia.org/wiki/UTF-32
>>
>> Is it not what I need ?
>> I suspect that many people must have run into similar problems. Perhaps
>> we should add a 32 bit string class to Boost. And until I get a better
>> understanding, I will keep calling it UTF-32 :-)
>>
>
>Sorry, I mixed it up a little. I meant UCS-4, a.k.a. fixed-width encoding.
>I was not aware that UTF-32 is de facto the same.
>
>Anyway, the statement about usability with StringAlgo still holds. It can
>work with any fixed-size encoding, as long as you have the corresponding
>locales.
>
>It could theoretically also work with variable-width characters, provided
>you have a container/locale framework that allows operating on
>metacharacters.
>I'm not sure how efficient it will be, though.
>
>Best regards,
>Pavol.

Martin,

The Unicode library I posted in the vault will do what you want for
arbitrary characters. It allows you to take two UTF-32 Unicode strings
and compare them at different comparison levels (e.g. exact, case
insensitive, etc.). It also includes the calls necessary for iterating
characters (which may be more than a single 4-byte value, e.g.
surrogates can be 3 x 'UTF-32' numbers), and then of course the ability
to iterate graphemes.
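
Since the comparison works on UTF-32 strings, UTF-8 input like yours would
have to be decoded first. Below is a minimal sketch of that step. It is not
part of the library in the vault: the name utf8_to_utf32 is made up for this
example, and it assumes a C++11 compiler (char32_t, std::u32string).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the vault library): decode a UTF-8 byte
// string into UTF-32 code points. Throws std::runtime_error on bad input.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte tells us how many bytes the sequence occupies.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (in.size() - i < len)
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, UTF-16 surrogates and out-of-range values.
        static const char32_t min_for_len[] = { 0, 0, 0x80, 0x800, 0x10000 };
        if (cp < min_for_len[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point in UTF-8 input");

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

After this step every element of the result is one code point, so code that
expects fixed-width elements can index it directly. A user-perceived
character can still span several code points (a base letter followed by
combining marks, for instance), which is why grapheme iteration is a
separate step.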

It will allow you to do a case insensitive comparison using the full
Unicode character library specification.
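
To give an idea of what a case insensitive comparison on UTF-32 involves,
here is a rough sketch based on case folding. It is not the API of the
library in the vault: fold_table, fold_case and iequals_utf32 are names
invented for this example, and the table holds only a handful of entries
copied from CaseFolding.txt, where a real implementation would load the
whole file. It also assumes both strings are already in the same
normalization form.

```cpp
#include <map>
#include <string>

// Tiny hand-copied excerpt of CaseFolding.txt (status C and F entries),
// for illustration only. A real table covers every cased code point.
const std::map<char32_t, std::u32string>& fold_table()
{
    static const std::map<char32_t, std::u32string> table = {
        { U'\u00C4', U"\u00E4" },  // 00C4; C; 00E4       A with diaeresis
        { U'\u00DF', U"ss" },      // 00DF; F; 0073 0073  sharp s
        { U'\u0391', U"\u03B1" },  // 0391; C; 03B1       Greek alpha
        { U'\u03A3', U"\u03C3" },  // 03A3; C; 03C3       Greek sigma
        { U'\u03C2', U"\u03C3" },  // 03C2; C; 03C3       Greek final sigma
    };
    return table;
}

// Map every code point to its case-folded form. One code point may fold
// to several (sharp s becomes "ss"), so the result can be longer.
std::u32string fold_case(const std::u32string& s)
{
    std::u32string out;
    for (char32_t c : s) {
        if (c >= U'A' && c <= U'Z') {
            // ASCII letters: handled arithmetically instead of via the table.
            out.push_back(static_cast<char32_t>(c + (U'a' - U'A')));
            continue;
        }
        const auto it = fold_table().find(c);
        if (it != fold_table().end())
            out += it->second;
        else
            out.push_back(c);      // no folding known: keep as is
    }
    return out;
}

// Caseless match: two strings are equal if their folded forms are equal.
bool iequals_utf32(const std::u32string& a, const std::u32string& b)
{
    return fold_case(a) == fold_case(b);
}
```

With this table, iequals_utf32(U"STRASSE", U"stra\u00DFe") is true because
both sides fold to "strasse". That is the kind of case a per-character
toupper/tolower comparison cannot handle, since the two strings differ in
length.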

The only thing we did not have time to do was the string wrapper
class.

Feel free to work on that !

Thanks.

Yours,

Graham

