Boost Users
From: John Maddock (john_at_[hidden])
Date: 2003-12-15 06:53:11
> >>It might be nice to have at least:
> >> [:hiragana:]
> >> [:katakana:]
> >> [:hankaku_katakana:]
> >
> > isn't that just [[:hiragana:][:katakana:]] ?
>
> "hankaku" is half-width characters. In Shift-JIS they are encoded with a
> single byte. In unicode they are encoded in a different block to normal
> katakana.
Sorry I misread your original.
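For reference, the three proposed classes map onto distinct Unicode ranges, which is why a combined hiragana-plus-katakana set would miss the half-width forms. A minimal sketch, assuming whole-block ranges are an acceptable approximation (the hiragana and katakana blocks contain a few unassigned code points); the function names are hypothetical and not part of Boost.Regex:

```cpp
// Hiragana block: U+3040..U+309F.
constexpr bool is_hiragana(char32_t c) {
    return c >= 0x3040 && c <= 0x309F;
}

// Katakana block (full-width forms): U+30A0..U+30FF.
constexpr bool is_katakana(char32_t c) {
    return c >= 0x30A0 && c <= 0x30FF;
}

// Half-width ("hankaku") katakana live in a separate block,
// Halfwidth and Fullwidth Forms, at U+FF65..U+FF9F.
constexpr bool is_hankaku_katakana(char32_t c) {
    return c >= 0xFF65 && c <= 0xFF9F;
}
```

Because the half-width range sits far away from U+30A0..U+30FF, it needs its own class or an explicit extra range.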
> >> [:wide_alpha:]
> >> [:wide_num:]
> >> [:wide_alphanum:]
> >
> > There should be no need for those - [[:alpha:]] will detect wide character
> > alphabetic characters perfectly well (provided the locale isn't "C").
>
> That sounds like it has potential for problems so I wonder if we're talking
> about the same thing; by wide I mean the character occupies the same amount
> of screen space as a kanji, which is twice as wide as the ascii character.
>
> E.g. "Ａ１" rather than "A1".
OK, do you mean what Unicode calls "Full Width" rather than "Half Width"?
> However my main use case is not so much detecting with regex as converting
> them to ascii; e.g. given a list of email addresses, some of which the user
> has typed in with their Japanese IME still switched on. I'll convert to
> ascii then run the email address through a regex.
>
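The conversion described here is a fixed offset in Unicode: the fullwidth forms U+FF01..U+FF5E mirror ASCII U+0021..U+007E at a distance of 0xFEE0, and U+3000 is the ideographic (full-width) space. A minimal sketch, assuming the input has already been decoded to UTF-32 (for example from Shift-JIS); the function names are hypothetical:

```cpp
#include <string>

// Map one fullwidth code point to its ASCII counterpart; anything
// else is returned unchanged.
constexpr char32_t to_halfwidth(char32_t c) {
    return (c >= 0xFF01 && c <= 0xFF5E) ? static_cast<char32_t>(c - 0xFEE0)
         : (c == 0x3000)                ? U' '
                                        : c;
}

// Apply the mapping to a whole string, e.g. an email address typed
// with the IME still switched on.
inline std::u32string normalize_width(std::u32string s) {
    for (char32_t& c : s) {
        c = to_halfwidth(c);
    }
    return s;
}
```

After this step the address contains plain ASCII wherever a fullwidth variant was typed, so an ordinary ASCII regex can validate it.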
> >>Defining the set of Japanese kanji would be harder.
> >
> > How are they defined?
>
> The Japanese kanji can be thought of as a subset of Chinese, so the issue is
> where the subset ends, and there are various definitions. Joyo kanji is
> approx 2000 common ones, but people's names often use others, and academics
> use more. I've not looked but it is possible Joyo kanji may be scattered
> around unicode.
>
> Simplest would be to define [:kanji:] as all Chinese characters, and anyone
> needing to distinguish Japanese from Chinese could then use a lookup table.
OK, I'm beginning to regret asking :-)
> > It might be best to add a facility to add new character classes as a list of
> > characters and ranges to include, something like:
> >
> > register_character_class("myname", "d-f");
> >
> > Then we add all the Unicode block ranges as standard for wide character
> > regexes.
>
> Sounds good. Do you mean in an extra include file, e.g.
> "regex/unicode_classes.h" ?
To be honest I haven't decided; I guess in the spirit of "only pay for what
you use" that would be the best way.
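One way the proposed register_character_class could behave, as a sketch only: the call shape follows the quoted message, but the registry, the UTF-32 spec string, the parsing rules and the helper names are all assumptions, not an existing Boost.Regex interface.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Inclusive code point ranges making up one named class.
using range_list = std::vector<std::pair<char32_t, char32_t>>;

// Hypothetical global registry (an assumption, not a Boost.Regex facility).
inline std::map<std::string, range_list>& class_registry() {
    static std::map<std::string, range_list> registry;
    return registry;
}

// Parse a spec such as U"d-f" or U"a-cx" into ranges. A '-' only forms
// a range when it sits between two characters; otherwise it is literal.
inline range_list parse_class_spec(const std::u32string& spec) {
    range_list ranges;
    for (std::size_t i = 0; i < spec.size(); ++i) {
        char32_t first = spec[i];
        char32_t last = first;
        if (i + 2 < spec.size() && spec[i + 1] == U'-') {
            last = spec[i + 2];
            i += 2;  // consume the '-' and the range end
        }
        if (last < first) std::swap(first, last);  // tolerate reversed ranges
        ranges.emplace_back(first, last);
    }
    return ranges;
}

// Register (or replace) a named class.
inline void register_character_class(const std::string& name,
                                     const std::u32string& spec) {
    assert(!name.empty());
    class_registry()[name] = parse_class_spec(spec);
}

// Test membership; unknown class names match nothing.
inline bool in_character_class(const std::string& name, char32_t c) {
    auto it = class_registry().find(name);
    if (it == class_registry().end()) return false;
    for (const auto& r : it->second) {
        if (c >= r.first && c <= r.second) return true;
    }
    return false;
}
```

Under these assumptions, register_character_class("myname", U"d-f") makes in_character_class("myname", U'e') true, and the Unicode block ranges could be registered the same way from an optional header so that only users who include it pay for them.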
John.