
Subject: Re: [boost] [locale] Review results for Boost.Locale library
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2011-04-26 07:31:12


On 24/04/2011 22:01, Ryou Ezoe wrote:

> Collation and Conversions:
> Japanese doesn't have the concepts of case and accent.
> Since we don't have these concepts, we never need them.

I believe all CJK characters can be decomposed into radicals, and the
decomposed forms are equivalent to the composed ones, so you may still
want to perform normalization.

Also, converting between halfwidth and fullwidth katakana could have
some uses.
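
For instance, NFKC normalization folds halfwidth katakana into the
fullwidth forms. A minimal sketch, assuming Boost.Locale built with the
ICU backend and a UTF-8 environment (the locale name is just an
example):

    #include <boost/locale.hpp>
    #include <iostream>
    #include <string>

    int main() {
        boost::locale::generator gen;
        std::locale loc = gen("ja_JP.UTF-8");

        // Halfwidth katakana "katakana": U+FF76 U+FF80 U+FF76 U+FF85
        std::string halfwidth = "\xEF\xBD\xB6\xEF\xBE\x80"
                                "\xEF\xBD\xB6\xEF\xBE\x85";

        // NFKC compatibility normalization maps these code points to
        // their fullwidth equivalents
        std::string fullwidth =
            boost::locale::normalize(halfwidth,
                                     boost::locale::norm_nfkc, loc);

        std::cout << fullwidth << "\n";
    }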

> Boundary analysis:
> What is the definition of boundary and how does it analyse?
> It sounds too smart for the small things it actually does.

It uses the boundary analysis algorithms defined by the Unicode
standard, which don't use heuristics or anything like that.

Remember that Boost.Locale is just a wrapper around ICU, which is where
the real smarts live.
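
For what it's worth, a minimal sketch of how the word boundary rules
are exposed (assuming Boost.Locale with the ICU backend; the text and
locale are just examples):

    #include <boost/locale.hpp>
    #include <iostream>
    #include <string>

    int main() {
        namespace ba = boost::locale::boundary;
        boost::locale::generator gen;
        std::locale loc = gen("en_US.UTF-8");

        std::string text = "Boost.Locale just wraps ICU.";

        // Index of word segments computed with the Unicode (UAX #29)
        // boundary rules, via ICU
        ba::ssegment_index words(ba::word, text.begin(), text.end(), loc);

        for (ba::ssegment_index::iterator it = words.begin();
             it != words.end(); ++it)
            std::cout << "[" << it->str() << "] ";
        std::cout << "\n";
    }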

> I'd rather call it strtok with hard-coded delimiters.
> Japanese doesn't separate words with spaces.
> So unless we perform really complicated natural language
> processing (which can never be perfect, since we will never have a
> complete Japanese dictionary),
> we can't split Japanese text into words.
> Also, Japanese doesn't have a concept of word wrap.
> So "find appropriate places for line breaks" is unnecessary.
> Actually, there are some rules for line breaks in Japanese.

You can still break at punctuation marks, and there are places where you
should definitely not break.

Thai, Lao, Chinese and Japanese do require dictionaries or heuristics
to correctly distinguish words. However, the default algorithm defined
by Unicode still provides a best-effort implementation without them.
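
A rough sketch of the line-break boundaries (same assumptions as above;
the Japanese sample sentence is illustrative):

    #include <boost/locale.hpp>
    #include <iostream>
    #include <string>

    int main() {
        namespace ba = boost::locale::boundary;
        boost::locale::generator gen;
        std::locale loc = gen("ja_JP.UTF-8");

        // "This is a Japanese sentence." -- no spaces between words
        std::string text = "これは日本語の文章です。";

        // Each segment ends where a line break is permitted
        ba::ssegment_index lines(ba::line, text.begin(), text.end(), loc);

        for (ba::ssegment_index::iterator it = lines.begin();
             it != lines.end(); ++it)
            std::cout << "[" << it->str() << "]\n";
    }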

