Subject: Re: [boost] [locale] Review results for Boost.Locale library
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-04-25 11:43:20


> From: Ryo IGARASHI <rigarash_at_[hidden]>
> On Mon, Apr 25, 2011 at 11:06 PM, Gevorg Voskanyan <v_gevorg_at_[hidden]> wrote:
> > OK, that tells case conversions and normalization don't apply to Japanese.
> > But what about collation? Isn't there any "dictionary order" defined for
> > Japanese words? Just curious.
>
> "Dictionary order" depends on what kind of information in the dictionary.
> For example, we use complex sorting algorithm for 'Kanji' letter dictionary.
>
> However, for language dictionary (Japanese-Japanese dictionary),
> we use pronunciation order. But this is impossible to decide
> by program since each 'Kanji' letter have usually
> 3-4 (sometimes more) completely different pronunciation only to be
> decided by the context in principle.
>
> Just FYI.

These are the sizes in bytes of the collation rule files for
different languages in ICU 4.4, largest first (top 5):

630641 2010-04-28 18:28 zh.txt
439431 2010-04-28 18:28 ko.txt
438456 2010-04-28 18:28 ja.txt
 23851 2010-04-28 18:28 kn.txt
 23594 2010-04-28 18:28 bn.txt

I've looked into the ja.txt file, and it includes
a huge dictionary of Kanji characters listed
in their collation order.

I can't verify it on my own, but I assume that
the collation rules for Japanese are not that
simple.

There are also customization parameters
for collation in locale names, like

ja_JP.UTF-8@collation=unihan

These keywords are taken from:

http://www.unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers

  "big5han"
      Pinyin ordering for Latin, big5 charset ordering for CJK characters.
      (used in Chinese)
  "dict" (dictionary)
      For a dictionary-style ordering (such as in Sinhala)
  "direct"
      Hindi variant
  "gb2312" (gb2312han)
      Pinyin ordering for Latin, gb2312han charset ordering for CJK
      characters. (used in Chinese)
  "phonebk" (phonebook)
      For a phonebook-style ordering (such as in German)
  "phonetic"
      Requests a phonetic variant if available, where text is sorted based
      on pronunciation. It may interleave different scripts, if multiple
      scripts are in common use.
  "pinyin"
      Pinyin ordering for Latin and for CJK characters; that is, an ordering
      for CJK characters based on a character-by-character transliteration
      into a pinyin. (used in Chinese)
  "reformed"
      Reformed collation (such as in Swedish)
  "search"
      A special collation type dedicated for string search.
  "stroke"
      Pinyin ordering for Latin, stroke order for CJK characters
      (used in Chinese)
  "trad" (traditional)
      For a traditional-style ordering (such as in Spanish)
  "unihan"
      Pinyin ordering for Latin, Unihan radical-stroke ordering for
      CJK characters. (used in Chinese)
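
To make the structure of such a locale name concrete, here is a minimal
sketch that splits a name of the form language_COUNTRY.encoding@key=value
into its parts and picks out the collation keyword. It uses only the
standard library. LocaleName and parse_locale_name are hypothetical names
made up for illustration and are not part of Boost.Locale or ICU, and the
use of ';' between several keywords is an assumption based on ICU-style
locale identifiers.

```cpp
#include <string>

// Hypothetical helper type for illustration; not a Boost.Locale or ICU type.
struct LocaleName {
    std::string language;   // e.g. "ja"
    std::string country;    // e.g. "JP"
    std::string encoding;   // e.g. "UTF-8"
    std::string collation;  // value of the "collation" keyword, empty if absent
};

// Split "language_COUNTRY.encoding@key=value;key=value" into its parts.
// Assumption: keywords follow '@' and are separated by ';'.
LocaleName parse_locale_name(const std::string& name) {
    LocaleName out;
    std::string base = name;
    std::string keywords;

    // Everything after '@' is the keyword list.
    const std::string::size_type at = name.find('@');
    if (at != std::string::npos) {
        base = name.substr(0, at);
        keywords = name.substr(at + 1);
    }

    // Everything after '.' in the remaining part is the encoding.
    const std::string::size_type dot = base.find('.');
    if (dot != std::string::npos) {
        out.encoding = base.substr(dot + 1);
        base = base.substr(0, dot);
    }

    // What is left is "language" or "language_COUNTRY".
    const std::string::size_type underscore = base.find('_');
    if (underscore != std::string::npos) {
        out.country = base.substr(underscore + 1);
        base = base.substr(0, underscore);
    }
    out.language = base;

    // Scan the keyword list and keep the value of "collation", if any.
    std::string::size_type pos = 0;
    while (pos < keywords.size()) {
        std::string::size_type end = keywords.find(';', pos);
        if (end == std::string::npos) {
            end = keywords.size();
        }
        const std::string item = keywords.substr(pos, end - pos);
        const std::string::size_type eq = item.find('=');
        if (eq != std::string::npos && item.substr(0, eq) == "collation") {
            out.collation = item.substr(eq + 1);
        }
        pos = end + 1;
    }
    return out;
}
```

Calling parse_locale_name("ja_JP.UTF-8@collation=unihan") yields language
"ja", country "JP", encoding "UTF-8" and collation "unihan". Whether a
given backend honours a particular keyword value is a separate question
that this sketch does not address.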

So I can't verify it myself, but I assume ICU
does something reasonable here...

Artyom

