
Boost Users :

From: Darren Cook (darren_at_[hidden])
Date: 2003-12-14 18:27:09


>>Are the existing character-classes following a standard, or are you open
>> to patches to extend them?
>
> Yes, they follow the POSIX and ECMA script standards to give:
>...

>>It might be nice to have at least:
>> [:hiragana:]
>> [:katakana:]
>> [:hankaku_katakana:]
>
> isn't that just [[:hiragana:][:katakana:]] ?

"Hankaku" means half-width characters. In Shift-JIS they are encoded with a
single byte. In Unicode they are encoded in a different block from normal
katakana.
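
To make the block distinction concrete, here is a minimal sketch that classifies a code point by Unicode range. The ranges come from the Unicode code charts; the names `KanaKind` and `classify_kana` are made up for illustration and are not part of Boost.Regex.

```cpp
// Hypothetical helper, not a Boost API: classify a code point by the
// Unicode range it falls in.
enum class KanaKind { None, Hiragana, Katakana, HalfwidthKatakana };

inline KanaKind classify_kana(char32_t c) {
    if (c >= 0x3040 && c <= 0x309F) return KanaKind::Hiragana;           // Hiragana block
    if (c >= 0x30A0 && c <= 0x30FF) return KanaKind::Katakana;           // Katakana block
    if (c >= 0xFF65 && c <= 0xFF9F) return KanaKind::HalfwidthKatakana;  // inside Halfwidth and Fullwidth Forms
    return KanaKind::None;
}
```

Because the half-width forms sit far away from the Katakana block, a class built only from U+30A0..U+30FF would not match them.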

>> [:wide_alpha:]
>> [:wide_num:]
>> [:wide_alphanum:]
>
> There should be no need for those - [[:alpha:]] will detect wide character
> alphabetic characters perfectly well (provided the locale isn't "C").

That sounds like it has potential for problems, so I wonder if we're talking
about the same thing; by wide I mean the character occupies the same amount
of screen space as a kanji, which is twice as wide as an ASCII character.

E.g. "Ａ１" rather than "A1".

However, my main use case is not so much detecting them with a regex as
converting them to ASCII; e.g. given a list of email addresses, some of which
the user has typed in with their Japanese IME still switched on. I'll convert
to ASCII and then run the email address through a regex.
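
A minimal sketch of that conversion step, assuming the input has already been decoded to UTF-32. The full-width ASCII variants occupy U+FF01..U+FF5E in the same order as U+0021..U+007E, so a fixed offset of 0xFEE0 maps one onto the other. The function name `fold_fullwidth` is hypothetical.

```cpp
#include <string>

// Hypothetical helper: fold full-width ASCII variants (U+FF01..U+FF5E) and
// the ideographic space (U+3000) to plain ASCII; other code points pass
// through unchanged.
inline std::u32string fold_fullwidth(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (char32_t c : in) {
        if (c >= 0xFF01 && c <= 0xFF5E) {
            c -= 0xFEE0;          // e.g. U+FF21 (full-width A) -> U+0041
        } else if (c == 0x3000) {
            c = U' ';             // ideographic space -> ASCII space
        }
        out.push_back(c);
    }
    return out;
}
```

The folded string can then be narrowed and passed to an ordinary ASCII email regex.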

>>Defining the set of Japanese kanji would be harder.
>
> How are they defined?

The Japanese kanji can be thought of as a subset of Chinese characters, so
the issue is where the subset ends, and there are various definitions. The
Joyo kanji are approx 2000 common ones, but people's names often use others,
and academics use more. I've not looked, but it is possible the Joyo kanji
are scattered around Unicode.

Simplest would be to define [:kanji:] as all Chinese characters, and anyone
needing to distinguish Japanese from Chinese could then use a lookup table.
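
As a rough sketch of that two-level idea: a broad range test for Chinese characters, plus a caller-supplied lookup table for the narrower subset. The ranges below cover only the main CJK Unified Ideographs block and Extension A, which is an assumption about what "all Chinese characters" should include. Both function names are hypothetical, and the table contents are left to the caller.

```cpp
#include <unordered_set>

// Broad test: CJK Unified Ideographs (U+4E00..U+9FFF) and
// Extension A (U+3400..U+4DBF). Other extension blocks are omitted here.
inline bool is_han(char32_t c) {
    return (c >= 0x4E00 && c <= 0x9FFF) || (c >= 0x3400 && c <= 0x4DBF);
}

// Narrow test: membership in a caller-supplied table, e.g. a Joyo kanji list
// loaded from elsewhere. No such list is provided by this sketch.
inline bool is_in_subset(char32_t c, const std::unordered_set<char32_t>& table) {
    return is_han(c) && table.count(c) != 0;
}
```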

> It might be best to add a facility to add new character classes as a list of
> characters and ranges to include, something like:
>
> register_character_class("myname", "d-f");
>
> Then we add all the Unicode block ranges as standard for wide character
> regexes.

Sounds good. Do you mean in an extra include file, e.g.
"regex/unicode_classes.h" ?
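
To illustrate how such a facility might behave, here is a standalone sketch. It is not Boost.Regex code; the `class_registry` type and its `matches` member are invented for the example, and only the `register_character_class` name and the "d-f" style of specification come from the suggestion quoted above. The spec is read as a sequence of single characters and `a-b` ranges.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical registry of named character classes, each stored as a list
// of inclusive code point ranges.
class class_registry {
public:
    // spec is a list of single characters and "a-b" ranges, e.g. U"d-fxz".
    void register_character_class(const std::string& name,
                                  const std::u32string& spec) {
        std::vector<std::pair<char32_t, char32_t>> ranges;
        for (std::size_t i = 0; i < spec.size(); ++i) {
            char32_t lo = spec[i];
            char32_t hi = lo;
            // A '-' with a character on each side denotes a range.
            if (i + 2 < spec.size() && spec[i + 1] == U'-') {
                hi = spec[i + 2];
                i += 2;
            }
            if (hi < lo) throw std::invalid_argument("reversed range in spec");
            ranges.emplace_back(lo, hi);
        }
        classes_[name] = std::move(ranges);
    }

    // True if c falls in any range registered under name.
    bool matches(const std::string& name, char32_t c) const {
        auto it = classes_.find(name);
        if (it == classes_.end()) return false;
        for (const auto& r : it->second) {
            if (c >= r.first && c <= r.second) return true;
        }
        return false;
    }

private:
    std::map<std::string, std::vector<std::pair<char32_t, char32_t>>> classes_;
};
```

With something like this, the Unicode block classes could be shipped as a table of names and range specs, for example registering "hiragana" with the spec `U"\u3041-\u3096"`.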

Darren


Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net