Subject: Re: [boost] [locale] Review
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-04-18 08:24:25


> From: Mathias Gaunard <mathias.gaunard_at_[hidden]>
>
> Let us start with my decision: I vote against inclusion
> of Boost.Locale into Boost in its current state,
> but strongly encourage the author to reconsider
> resubmitting it after some design issues have
> been reconsidered. In which case, I would be in favour
> of putting it on the fast-track for a mini-review.

Can you please state explicitly which things are
conditions for the library to get to a mini-review?

There are many points below, but it is not always
clear whether you consider them critical or not.

I assume that not all of them can or will be solved,
but I'd rather see more specific notes like
(a) missing XYZ interface for function ABC.

It would help me and the review manager.

>
> Since I am the author of another Unicode library that
> I'd like to see included in Boost, I hope that
> my review will not be considered biased.
> I believe Boost.Locale is a valuable and useful
> library, and I believe both libraries can coexist.
>

I can only say that I'm looking forward to
Boost.Unicode, and I know that the way you look
at interfaces is very different.

Boost.Locale is built from a very practical
point of view - usability first. I don't
try to solve all possible corner cases, especially
when they are not a central part of the
localization task.

I'd prefer to spend more time on adding
specific missing features useful for localization
over providing things like iterator-based interfaces
for operations that are usually executed
on small chunks of text.

The localization topic is almost endless, and
I have different priorities that are linguistics-
and localization-centric.
 
> The main reason for my decision is the
> interface that the library provides:
> albeit it aims at abstracting the
> backend used to implement the
> facilities, it exposes too much of
> how those facilities are implemented,
> which prevents benefiting from future
> backends that may not have the same
> limitations as the current backends do.
>
> Indeed, the Boost.Locale rationale clearly
> says that it only takes std::basic_string
> and const char* as input because that's what
> the backend libraries expect anyway.
>

Not only that; there are more reasons for this:

  
http://cppcms.sourceforge.net/boost_locale/html/appendix.html#why_linear_chunks
 
> Yet there are various scenarios where this is useful:
> - I might want to apply some operation to data in a file
> without putting it all to memory first; Boost.Locale
> could copy that memory by chunks to apply that operation
> for backends that have no native support for this.
> Copying the whole file to memory is a good enough approximation
> for a first version.
> - The input I have may not be directly available in the
> form of a contiguous buffer, but I wouldn't mind if
> Boost.Locale copied it.
>

The point is that, apart from Boost.Unicode, I don't
see any candidate backend, now or in the near future,
that would require such things.

Apart from charset conversion (which already supports
streaming via codecvt - see the example in the tutorial),
all Unicode/localization APIs known to me work on a
single chunk of memory.
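
For reference, the codecvt-based streaming from the
tutorial looks roughly like this (the file name and
locale name are illustrative):

    #include <boost/locale.hpp>
    #include <fstream>
    #include <string>

    int main()
    {
        // Locale whose codecvt facet converts UTF-8 <-> wchar_t
        boost::locale::generator gen;
        std::locale loc = gen("en_US.UTF-8");

        std::wifstream in;
        in.imbue(loc);          // imbue before opening the file
        in.open("input.txt");   // the file is UTF-8 on disk

        std::wstring line;
        while(std::getline(in, line)) {
            // 'line' holds decoded text; conversion happens
            // chunk by chunk while streaming, so the whole
            // file is never loaded into memory at once
        }
    }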

I don't see myself implementing it any other way.

When Boost.Unicode is ready, I'm sure its users
will be glad to use its own native interface where
it is needed.

So do we need a more generic API that would never
be used?

I really doubt it.

> If we consider the functions from_utf, to_utf and between,
> they can either take a const char*, two const char*, or a
> std::basic_string. Clearly this would be better if those functions
> took a range, which covers all three of these cases.
> It's not just a matter of being generic, it makes the
> library easier to use since the range is defined by a
> single argument which can be anything I want, I do not need
> to convert to what the library needs. If the library needs to
> do conversion behind the scenes, it can do it just like
> that, behind the scenes.
>

Actually, it is a convenience API.
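
To make this concrete, here is how the conversion
functions are called today (the encoding names and
literals are just examples):

    #include <boost/locale/encoding.hpp>
    #include <string>

    void demo()
    {
        using namespace boost::locale::conv;

        // Latin1 -> UTF-8
        std::string utf8 = to_utf<char>("Gr\xFC\xDF", "Latin1");

        // UTF-8 -> Latin1
        std::string latin1 = from_utf(utf8, "Latin1");

        // UTF-8 -> wide string
        std::wstring wide = to_utf<wchar_t>(utf8, "UTF-8");

        // Between two 8-bit encodings: UTF-8 -> KOI8-R
        std::string koi8 = between(utf8, "KOI8-R", "UTF-8");
    }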

> Taking a range causes the problem however of
> what the return type should be: I think there
> should be two variants, one which returns a
> string, and one which appends to a container
> (like Boost.StringAlgo) or write to an output
> iterator (like STL algorithms). Obviously, the
> former would be defined in terms of the latter.
>
> This causes however a big problem: all backends
> would not be interchangeable with a stable
> unique ABI. Maybe some template specializations
> could be injected at link-time, but they would
> have to be full concrete specializations.
>

To be honest, I don't think that iterators,
especially over non-UTF charsets, are a good
idea.

I'd rather adopt the concept of a stream
that you read from and write to:

1. It is buffered.
2. It is fast and can work with
   big chunks.
3. It can be passed transparently,
   and ABI-stable, to any backend.

So I would rather see an implementation of
generic algorithms over I/O streams than
over iterators.
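
A purely hypothetical sketch of what I mean, built
on top of the existing single-chunk API (a real
implementation would have to carry incomplete
multi-byte sequences over from one chunk to the
next; this one does not):

    #include <boost/locale/encoding.hpp>
    #include <istream>
    #include <ostream>
    #include <string>

    // Convert a whole stream using big buffered chunks
    void convert_stream(std::istream &in, std::ostream &out,
                        std::string const &to_enc,
                        std::string const &from_enc)
    {
        char buf[16384];
        std::string chunk;
        while(in.read(buf, sizeof(buf)), in.gcount() > 0) {
            chunk.assign(buf, static_cast<std::size_t>(in.gcount()));
            out << boost::locale::conv::between(chunk, to_enc, from_enc);
        }
    }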

> Maybe the idea of backends being interchangeable
> at the ABI level needs to be rethought.
>
> After all, isn't being interchangeable at
> the API level enough, since you can easily
> mangle the configuration in a symbol to cause
> an error if linking with the non-matching backend?
>

It is actually a central part of the library
design, and I think it has had a very good effect
on it.

I work with localization, I actually use the
library for localization purposes, and I know
how valuable it is to have a generic, stable,
and independent API with different backends,
especially when ICU may be a heavy dependency.

Without it I wouldn't have developed the library
in the first place.

 
> Still talking about conversion between encodings,
> I would like it if it was possible to statically
> choose the encodings we're converting between,
> as some backends could directly expose a specialized
> and optimized some_encoding_to_some_other_encoding function.
>

Actually, almost all backends are optimized for UTF-8
where possible.

So without bringing charset tables into Boost.Locale,
I can see only very few encoding specializations
that could be used:

1. UTF-XYZ to/from UTF-ABC
2. Latin1/ISO-8859-1 to/from UTF-XYZ

But I think Boost.Unicode would do this better, as
it specializes in handling UTF-XYZ.
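
For illustration only - the tag types and the
converter template below are hypothetical, not part
of Boost.Locale - this is what such a compile-time
specialization could look like for the Latin1 case:

    #include <string>

    // Hypothetical compile-time encoding tags
    struct latin1 {};
    struct utf8 {};

    template<typename From, typename To>
    struct converter; // generic case falls back to the runtime API

    // Latin-1 maps 1:1 onto U+0000..U+00FF, so this
    // pair needs no charset tables at all
    template<>
    struct converter<latin1, utf8> {
        static std::string convert(std::string const &s)
        {
            std::string out;
            out.reserve(s.size());
            for(std::string::size_type i = 0; i < s.size(); ++i) {
                unsigned char c = static_cast<unsigned char>(s[i]);
                if(c < 0x80) {
                    out += static_cast<char>(c);
                }
                else { // two-byte UTF-8 sequence
                    out += static_cast<char>(0xC0 | (c >> 6));
                    out += static_cast<char>(0x80 | (c & 0x3F));
                }
            }
            return out;
        }
    };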

Note that you will almost never see anything in
the Boost.Locale API that relates to code points,
as a code point is meaningless in the context of
localization.

It is too low-level an abstraction - one that I
actually prefer developers not even deal with.

I don't try to implement tasks that a UTF-8/16/32
library should do.
 
> Note the remarks about conversion between encodings
> also applies to case folding, collation etc.
> Being able to specify statically the normalization
> form you're normalizing to and things like that could
> be useful just as for the character encodings.
>

I know this (we talked about it before).

However, I don't address it - just like almost
all Unicode APIs around don't.

Such low-level optimizations are good,
but IMHO a corner case.

They are a good fit for a lower-level library.

 
> Perhaps not made clear from the docs,
> but Boost.Locale embeds another mechanism
> than the functions in the conv namespace
> to convert between character encodings:
> using the classes derived from
> boost::locale::util::base_converter.
> This is the only interface that can be
> passed incomplete data, as obtained through buffering.
> That interface was designed to implement codecvt facets,

Exactly, and this is its purpose.

> and unfortunately, it's not much better.
> The library implements its own UTF-8/UTF-16/UTF-32 conversion
> logic there regardless of which backend is used; it is a shame
> it cannot reuse the to_utf or from_utf functions in all cases.
> This implementation is also quite tricky since the state management
> prevents the code from being simple and self-contained.
>

Actually, the code reuses them in many places, but
there is a subtle point.

Unlike normal to_utf/from_utf calls, the facets
know their encoding and can do things faster; for
example, the ICU backend uses this knowledge to
perform some tasks better.

> That interface does not allow optimized conversion
> functions to be used. Compatibility with codecvt facets
> is a nice feature, but it shouldn't impair the interface
> of the component that could have other usages.
>

It is not just a nice feature; it is the feature
that allows using Boost.Filesystem and
Boost.ProgramOptions correctly with UTF-8 or other
encodings on the Windows platform, with its total
encoding mess.

Using Boost.Locale and a UTF-8 codecvt facet opens
the door for all other libraries to adopt it.

What is important is that std::codecvt is part
of the C++ standard, and thus any library that uses
this standard facility enjoys the full power
of Boost.Locale without even depending on it.
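
For example, two lines are enough to make
Boost.Filesystem treat narrow strings as UTF-8 on
Windows (this follows the pattern from the
Boost.Locale documentation; the file name is
illustrative):

    #include <boost/locale.hpp>
    #include <boost/filesystem/path.hpp>
    #include <boost/filesystem/fstream.hpp>

    int main()
    {
        // Global locale with a UTF-8 codecvt facet
        boost::locale::generator gen;
        std::locale::global(gen(""));

        // Boost.Filesystem now interprets narrow paths as UTF-8
        boost::filesystem::path::imbue(std::locale());

        // Works even when the name is not representable
        // in the local ANSI codepage
        boost::filesystem::ofstream out(
            boost::filesystem::path("r\xC3\xA9sum\xC3\xA9.txt"));
    }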

> Furthermore, I do not find the boundary analysis
> interface satisfying at all. It is not made clear
> in the docs that doing boundary analysis actually
> goes through the whole string and returns a vector
> of positions corresponding to the boundaries.
> You can use that vector to adapt any range into a range
> of strings. Each dereferencing copies the subrange
> defined by the vector into a new temporary string.
> This is pretty terrible. I want to be able to find the
> boundaries as I advance through a string, or be able to
> find the closest boundary from an arbitrary position
> without having to process the whole string.
> I do not want to have to copy my whole input range
> into little strings for no reason.

I may agree that it is not perfect, but given the
limitations of the current state-of-the-art
implementation in ICU, it is still a reasonable one.

AFAIK Boost.Unicode implements its own
interface for text segmentation,
but without ICU's custom locale
rules and dictionaries, which are important
for the text segmentation of some languages.
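
For completeness, this is roughly how the interface
is used today, assuming the segment_index API as it
appears in the current docs:

    #include <boost/locale.hpp>
    #include <iostream>
    #include <string>

    int main()
    {
        boost::locale::generator gen;
        std::locale loc = gen("en_US.UTF-8");

        std::string text = "Hello, World!";

        using namespace boost::locale::boundary;
        // A single pass indexes all word boundaries up
        // front; iteration then yields one segment per word
        ssegment_index words(word, text.begin(), text.end(), loc);
        for(ssegment_index::iterator it = words.begin();
            it != words.end(); ++it)
            std::cout << "\"" << *it << "\"\n";
    }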

> Usage of this feature is also unnecessarily verbose.
> It should not be more than a function call to turn a
> range of characters into a range of tagged ranges of characters.
> This part of the library is the one that needs the more work.
> Ironically enough, it's also the only one that deals a bit
> with iterators and ranges, and I just asked for more iterators
> and ranges at the beginning of this review ;).
>

> Finally, I will not comment on the implementation too much,
> since I think other Boosters already did that
> (though I did not read all of it). There is some
> unnecessary or non-optimal code

Can you point these out? There were some
duplications, but what was non-optimal?

> some code more complicated than it should be

Can you be more specific?

> some compile-time values tested at runtime,
> some things that could have reused existing
> utilities from Boost,

More specific points?

> and some inappropriate
> use of reinterpret_cast.
>

In two places, and it is going to be fixed.

> While most of the backends are separated,
> it is not the case for encoding/codepage.cpp,
> which is shared by all backends and contains ifdefs.
> This is not a very nice extension point.
>

Actually, it is not only codepage handling that is shared:

- Formatting (locale::format)
- Locale information
- Message Formatting (gettext)
- codecvt facets (not details)

And more...

So there are lots of shared parts.

Now, unlike most locale-related operations,
codepage handling is a basic facility that is
used all over the code.

So I don't really understand how you envision
it should actually look.

> I'm also a bit concerned about the generalized
> usage of virtual functions anywhere;
> it seems unnecessary in quite a few places,
> please prefer using templates when dynamic binding is not required.

More specific points?

> A lot of new and vectors too, I'd prefer if ideally
> the library never allocated anything.

I'm sorry, but that is just something
that can't and will never happen.

This request has no reasonable basis, especially
for such a complex topic.

>
> I hope my review will be well received
> and that you will consider all of my "feature requests".
> Your library is practical and can get things done for sure,
> but maybe it's not quite as good as it could be in terms of
> interface yet.
>

The interface is part of the design, and it has
been designed to be simple and convenient rather
than fully generic.

But I understand that we have very different points
of view.

> Also, if I want Boost.Unicode to ever be a
> possible backend, Boost.Locale needs to be
> able to make the most of it ;).
>

Actually, I was thinking about it.

Boost.Locale and Boost.Unicode are very different:

- Boost.Locale is locale-centric.
- Boost.Unicode is locale-agnostic.

A few points from Boost.Unicode that could actually
serve Boost.Locale:

- Normalization

  But I assume that it would be better to use Boost.Unicode
  directly, as normalization is a truly locale-independent
  procedure.

  It is part of Boost.Locale for convenience.

- Collation, if it is parametrized so that it can be
  changed for each locale separately

- Boundary analysis for a subset of languages

  Not Japanese, not Thai, and some others - at least
  judging by how Boost.Unicode looks at this point.

I'd rather prefer to look at Boost.Unicode as a
library orthogonal to Boost.Locale, as it does very
different things.

It can't be Boost.Locale's backend because, at least
in its current state, it is locale-agnostic and
misses the most important thing localization
actually requires: CLDR, the Common Locale Data
Repository.

Thank you for the review.

  Artyom

 