Boost logo

Boost :

Subject: Re: [boost] [locale] review part 2.1: source
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-04-10 05:12:09

> Okay, I managed to get through the encoding, icu, and posix
> subdirectories today.
> encoding/codepage.cpp:
> line 28: In C++ you should use <cstring> (and std::strlen)


> encoding/conv.hpp:
> encoding/iconv_codepage.hpp
> I think the convention for headers like
> this which are really source files in disguise
> is to give them the extension .ipp.

Noted. I'll update

> line 125: You don't actually care how the vector's
> contents are initialized, do you? You can just
> specify the size.

I'm not remember exactly but I think I do. I'll review

> line 150: I think sresult.append(out_start, out_ptr);
> would be slightly clearer.

I should look on it carefully whether I can change

iconv API is very gentle in how it handles its pointers
and states so I need to do everything carefully.

Also I assume that append(Begin,End) may be less
efficient then append(pointer,size) as it may be
naively implemented as

    s += *begin++;

> line 168: What are the possible errno values
> besides EILSEQ, EINVAL, and E2BIG? If
> you get an error you don't expect, is
> silently stopping the correct behavior?

According to opengroup's specifications, Linux
man pages and others these are the error states that
can occur.

Maybe I should throw conversion_error if the method is stop().

> iconv_between: Inheritance from iconv_from_utf is
> a little odd.

But it makes a code much simpler.

> iconv_to_utf: The comments from iconv_from_utf apply.
> Also, the implementation is almost the same as
> iconv_from_utf. I think you could define a single
> function template to implement both.

Not exactly. I had found that it is much
safer and readable to have two implementations
then to create quite complex meta programming
to make sure it works.

It is much easier to assume that at least
one side is "char"

> encoding/uconv_codepage.hpp:
> line 52: Do you really intend to translate any
> std::exception? Including std::bad_alloc?
> That isn't consistant with what iconv_codepage does.

The problem is that ICU does not throw any exceptions
(yes... even new) so any exceptions that should be
thrown there are caused by the handling of conversions.

So I thought it is more then reasonable.

> encoding/wconv_codepage.hpp:
> Missing #include <cstring>, <string>


> lines 86,101: The STL is your friend:
> size_t remove_substitutions(std::vector<wchar_t> &v)
> {
> v.erase(std::remove(v.begin(), v.end(), wchar_t(0xFFFD)), v.end());
> return v.size();
> }

After reading it carefully I can see how it works but
generally I see how it works but it isn't so readable.

In any case interesting and it may save some allocations.


> icu/all_generator.hpp:
> Why <vector>? You're using locale and string, but not vector.

Leftovers of old code.


> icu/boundary.cpp:
> icu/cdata.hpp:
> Missing #include <string>


> icu/uconv.hpp:
> Needs #include <string> and #include <memory>.
> line 8: This #include guard is excessively generic.

Yes... you are right. Missed this.

> icu/codecvt.cpp:
> lines 44, 48: If this throws, cvt_ needs to be closed.

You are right... I don't understand how did I miss this.

> icu/codecvt.hpp:
> Missing #include <memory>
> The #include guard is wrong.

Ohh... Yes you right thanks.

> icu/collator.cpp:
> line 115: Does do_basic_transform guarantee that
> the string contains no embedded null bytes?
> It seems that this is required for hash to
> work correctly.

Yes, the string returned by ICU transform explicitly
states this. It can be compared using standard strcmp.

> icu/conversion.cpp:
> icu/date_time.cpp:
> In calendar_impl: I can't figure out why you're locking
> in some functions and not others. For the most part,
> you seem to lock in non-mutating functions and not
> in mutating functions, so I was guessing that you
> are trying to guarantee that a calender can be read
> concurrently, but writes must be exclusive and that
> ICU does not guarantee that concurrent reads are
> okay. However, get_timezone and is_same don't lock.
> Does ICU handle these methods differently from the
> others W.R.T. thread-safety?

Ok I'll explain.

Some of the icu::Calendar member functions are only "sematically"
const but actually recalculate dates on request.

This is not really nice feature of ICU calendar, and after
rising such issue to ICU team they included in their
work plan making some of classes frizable. So they
would be explicitly safe for cuncurrent read only
access. (Not sure it would happen for calendar
but it will for icu::Collator)

Fortunatelly the "semtantic" constness is explicitly
documented and for member function that access to
time or some ICU fields - that actually perform
recalculation I do perform locking so it would
be thread safe to access to same const instance
from different threads.

> line 236: Do you need a lock here? You're only using
> self which is a different object.

Yes you are right... It was accidentially remained from
old buggy code (back then I missed that fieldDifference
acutally updates calendar and now I anyway clone so no
need for lock)

> icu/formatter.cpp:
> line 69: This seems like a really bad idea. It means
> that integers larger than 0x7FFFFFFFFFFFFFFF will
> be formatted incorrectly. If this is an unavoidable
> limitation of ICU, I think you shoud probably throw
> an exception and document the behavior. And don't
> think that people won't try to print unsigned numbers
> this big. I know I have.

Yes it is limitation of ICU.

ICU supports int32_t, int64_t and double (much like Java)

So I don't know whether it is better to abort the application
or maybe display incorrect number.

I'll explain.

The translator or programmer can use:

   format(translate("The distance to the Sun is {1,number} millimeters"));

However if it does not seems right translator or programmer can
always use

   format(translate("The distance to the Sun is {1} millimeters")); // POSIX

Which handled correctly but does not provides nice output
as "1,445,452,345,353,435" but rather "1445452345353435"

Because it is finally the output for human I prefer that
the output would be strange but program does not crash.

It is always better that user would report a bug when he
sees negative number and the programmer would change
{1,num} to {1,posix} rather then program accidentially
exists in some unknown situation.

I think I'll document this behavior (of the range).

Also I can check if it is possible to fallback to POSIX
format if the range is not supported by ICU.

> line 263: This comparison is subtly wrong for int64_t.
> Consider what happens when date is exactly 2^63.
> limits_type::max() will be 0x7FFFFFFFFFFFFFFF.
> According to [expr] this will be converted to
> double. For most rounding modes, the result
> will be 2^63, which is equal to date. Then,
> when we get to line 268 and cast back to ValueType,
> the result will be 0x8000000000000000, and the
> behavior is undefined according to [conv.fpint].
> (Note: It's obviously not easy to trigger this,
> but I'm guessing that a specially crafted test
> case could)

I'll review it.

> icu/formatter.hpp:
> line 44: What about float and long double. long double
> may have more precision than double. Also, even for
> float, the double rounding caused by working with double
> can produce slightly different results from when using
> float directly. Does ICU only allow you to use double
> and [u]int[32|64]_t?

It only allows double, int32_t, and int64_t so it is best
what I can do.

Also I select most suitable integer type on the level of the
facet (numeric.cpp)

> line 93: I find this comment a bit misleading. It wasn't
> clear until I looked at the implementation that what
> was being cached was the ICU formatter. The
> boost::locale::impl_icu::formatter object is not
> cached.

Yes. You are right... Also the comment seems to be little
bit old as all formatters ara cached at once using TLS.

> icu/icu_backend.cpp:
> line 69: Ideally you should not mark the object
> as valid until you know that this function will
> not fail.

Good point.

> line 91: I'd kind of like install to be thread-safe...

No localization backend is not expected to be thread

It is used in this way:

1. It is cloned from the original one
2. All options are set
3. The facets are installed.

So its thread safety implied by its clonability

Also it allows me to install facets faster if underlying
implementation is not thread safe.

In any case localization_backend is usually not
used directly by the user but rather by boost::locale::generator

> icu/icu_backend.hpp:
> icu/icu_util.hpp:
> icu/numeric.cpp:
> In num_format: You don't handle bool?

ICU does not handle it... So I can't do something
better for this - I pass it to std::locale::classic

> line 290: Alas, the standard doesn't guarantee that
> you can put_back an arbitrary number of characters.
> This isn't going to work if you're reading from a
> non-seekable stream and the line is sufficiently long.

The problem is that it is the best I can do from-inside
std::locale's num_fmt facet.

It has many major limitations. For exaple if you would
try to parse 12345YX as

  std::cin >> number >> character

then character would be X.

This is specially critical when I parse spellout format
to be able to putback correct part.

Maybe I should try to see what is the size of the
buffer and/or document that the stream
should be able to handle putback well.

It is actually fairly simple as streambuf has a buffer
so it can actually receive putback characters without problem.

Unfortunatelly it is the best solution I could
reach so far with the limiteation of std::num_get

> icu/predefined_formatters.hpp:
> icu/time_zone.hpp:
> icu/time_zone.cpp:
> icu/uconv.hpp:
> posix/all_generater.hpp:
> posix/codecvt.cpp:
> mb2_iconv_converter seems to be doing a lot
> of the same work as encoding/iconv_codepage.
> Is there any way to avoid the duplication?

No it does very different things.

1. iconv is used for double byte encodings only like Shift-JIS or GBK
   as for single byte and UTF-8 better solutions implemented.

2. It does not perform stream conversion but rather conversion of single
   byte or two bytes of sources.

So finally it has more differences then common stuff.

> posix/codecvt.hpp:
> Missing #include <memory> and <string>.
> posix/collate.cpp:
> posix/converter.cpp:
> I noticed that you convert UTF8 to wchar_t.
> For the posix backend, are there other multibyte
> encodings that would need this? Or would
> converting everything to UTF32 cause other
> problems?

- For most encodings (single byte) toupper/tolower
  does better job (faster simpler smaller)

- For UTF-8 is is very suitable but still it is major
  approximation is towupper/towlower should not work
  on single code points but rather on entire text.

  So it is poor-man's approximation that works almost
  correctly for most western languages.

- Now we remained with East-Asian double width
  encodings like Shift-JIS.

  This is quite different story:

  1. The East-Asian scripts do not have upper and lower
     case thus actually to upper and lower do nothing
     for almost all but latin part.
  2. In any case it remains approximation.

  So I think using normal toupper_l to tolower_l
  would be good enough as it is still approximation
  and because East-Asian scripts does not have
  case I think it is ok.

  I can change it for MB encodings but not sure if
  it would have added value.

> posix/numeric.cpp:
> line 54: sizeof(buf) is 64, but you pass 256?

Oppps... I'll fix this

> line 57: is 1024 absolutely guaranteed to be
> enough? Otherwise you should use a loop
> with exponential reallocation. Line 97 too.

Ok... I actually has other even wother bug.
I don't handle errors correctly.

I should definitelly fix this. And it seems
that I can add reallocation (I assumed
that 0 returns on error and I have no
way to distinguish between error and
allocation. I assume I had written this
code in too late hours and misread the docs)

> lines 68, 75: std::copy?

I have limit by size not by iterator, so it is

> posix/posix_backend.cpp:
> install: The same comment W.R.T. thread-safety applies
> as for icu.

Same as above.

> line 85: If this fails, don't you still need to call freelocale?
> I think this would be simplified if you created a simple RAII
> wrapper that holds a locale_t by value.

Ok... I see missed this.

I prefer to have it shared as every facet has its own
copy (2 characters * 4-5 facets)

I'll fix this.

> posix/posix_backend.hpp:
> (To be continued)
> In Christ,
> Steven Watanabe

Thank you very much...

This is very-good code review, it is very helpful!


Boost list run by bdawes at, gregod at, cpdaniel at, john at