
From: Alberto Barbati (abarbati_at_[hidden])
Date: 2003-01-07 16:53:14


First of all, thanks to everybody for your feedback. I realize that my
previous message came across as a bit arrogant about the lack of interest...
I apologize for that, and I promise I'll do my best to bring this library
up to Boost standards!

Dietmar Kuehl wrote:
> - The 'state' argument for the various facet functions is normally accessed
> as a plain integer in your code. However, eg. 'std::mbstate_t' is not an
> integer at all. In fact, it is implementation defined and can thus be
> something entirely different on each platform. I'd suggest to provide some
> form of accessor functions to read or write objects of this type, possibly
> in the form of some traits class. This could then be used to cope with
> differences and/or more complex organization of the state type.

That's a good point. On my platform (Win32) mbstate_t is indeed a
typedef for "int" and thus satisfies all requirements (i.e.: being an
integer type capable of holding an unsigned 21-bit value). I was
aware that this is not the case on other platforms, but I made the
assumption anyway, simply because the user is not really required to use
mbstate_t. One can use a char traits class different from
std::char_traits<T>, one that defines a suitable state type. For example:

#include <string>              // std::char_traits
#include <boost/cstdint.hpp>   // boost::uint32_t

template <typename T>
struct MyTraits : public std::char_traits<T>
{
     // replace the state type with a plain integer wide enough for 21 bits
     typedef boost::uint32_t state_type;
};

then use basic_fstream<T, MyTraits<T> > instead of basic_fstream<T>.
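
For instance, the instantiation would look something like this (the typedef
name is just one I made up for the example):

     #include <fstream>

     typedef std::basic_fstream<wchar_t, MyTraits<wchar_t> > my_wfstream;
     // my_wfstream now uses boost::uint32_t as its conversion state type
     // instead of std::mbstate_t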

[I just tried that and found a typo in correctness_test.hpp that
prevents the test suite from compiling, but nothing too serious]

Of course I have to document this ;)

If you believe that working through the char traits class is
undesirable, or simply "too much", I see no problem in adding those
accessor functions as you suggest.
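
Just to make sure we are talking about the same thing, such accessors could
look more or less like this (purely a sketch with names of my own invention,
not something that is in the library yet):

     // Reads/writes the conversion state as an integer value.  The primary
     // template covers state types that already behave like plain integers;
     // a specialization for std::mbstate_t would have to be written per
     // platform, since its layout is implementation defined.
     template <typename StateT>
     struct state_access
     {
         static unsigned long get(const StateT& state)
         { return static_cast<unsigned long>(state); }

         static void set(StateT& state, unsigned long value)
         { state = static_cast<StateT>(value); }
     };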

> - Falling back to UTF-16 in cases where the Unicode characters may not
> fit into the words is one possible approach to deal with the characters.
> Another approach is to just indicate an error. For example, I know that
> certain XML-files I'm using exclusively store ASCII characters and there
> is no need to use a different internal character type than 'char'. If I ever
> come across a non-ASCII character I don't really want to have it encoded
> in something like UTF-8 but I want to fail reading the file (and possibly
> retry reading using a bigger character type). I would appreciate if such a
> possibility would be incorporated, too (this is, however, nothing I feel too
> strongly about).

I forgot to say in my previous post that this version of the library
only supports platforms where type char is exactly 8 bits. This
assumption is required because I have to work at the octet level while
reading from/writing to a stream.

That said, I have deliberately decided not to allow "char" as the
internal type. The internal type must be an integral type able to
represent an unsigned 16-bit value (for UTF-16) or an unsigned 21-bit
value (UTF-32). The choice between UTF-16 and UTF-32 as the internal
encoding is done at compile-time based on the sizeof of the internal
char type, which must be either 2 or 4.
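
To give an idea of the mechanism (this is only a sketch with made-up names,
not the actual library code), the selection can be driven by a trait that is
specialized for sizes 2 and 4 only, so that any other size fails to compile:

     #include <cstddef>

     template <std::size_t Size>
     struct select_internal_encoding;             // no definition: compile error

     template <> struct select_internal_encoding<2>
     { static const bool use_utf16 = true; };     // 16-bit internal type -> UTF-16

     template <> struct select_internal_encoding<4>
     { static const bool use_utf16 = false; };    // 32-bit internal type -> UTF-32

     template <typename InternT>
     struct internal_encoding
         : select_internal_encoding<sizeof(InternT)> {};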

Such a decision is very restrictive, I know. Yet one of the main problems with
the acceptance of Unicode as a standard is that there are too many
applications around that use only a subset of it. For example, one of
the first pieces of feedback I got, at the beginning of this work, was "I don't
need to handle surrogates, could you provide an optimized facet for that
case?". The answer was "Yes, I could, but I won't".

Yours is a very special case. Frankly, I'd rather not support it. In
fact, it would be very simple to do: you just take the facet
declared in the file detail/degenerate.hpp and change a few lines. However,
it would be outside the intent of this library, which is to provide UTF
conversion according to Unicode 3.2 requirements, no more, no less.

> - In the context I want to use the facets [at least those I'm implementing]
> I don't really want to bother about the details of the encoding. That is, I
> just want to 'imbue()' an appropriate 'std::locale' object to the stream and
> have the facet figure out what encoding is used, especially when reading
> from some file. The basic idea is that each file is either started by a byte
> order mark from which UTF-16BE or UTF-16LE can be deduced or it is in
> UTF-8. The encoding could eg. be stored in typical 'std::mbstate_t'
> objects (these are often a struct with a 'wchar_t' and a count or something
> like this). To enable something like this, it would be helpful if the actual
> conversion functions were actually separated from the facets: The facets
> would just call the corresponding conversion functions. The actual
> conversion is, apart from the state argument, entirely stateless and there
> is no need to bind it to any facet.

There already exists a facility to select the correct facet according to
the byte order mark. It's very simple to use:

     std::wifstream file("MyFile", std::ios_base::binary);
     boost::utf::imbue_detect_from_bom(file);

that's it.

I considered separating the facet classes from the conversion
implementation code, but I decided to keep the design of one facet == one
encoding. There are a couple of minor advantages:

1) the user can link just one or two encodings into the executable
without pulling in all of them. (Of course, if you use
imbue_detect_from_bom you will need all encodings anyway... ;)

2) I avoid one indirection. The cost of this should not be underestimated.
The VS.NET implementation is quite brain-dead about codecvt: it will
call the do_in function (which is virtual!) up to 4 times *per
character*. Three times the function will return "partial", the fourth
time it will succeed. For every character. The less work I do in the
failing case, the better.

> - The conversion functions seem to assume that there are input sequences
> consisting of more than one character passed to it. I don't think this is
> guaranteed at all. At the very minimum, the facets should defend against
> being called with too small buffers (at least in 'do_out()' of 'utf16_utf32'
> the constant 3 is subtracted from a pointer without prior check that the
> input area consists of at least three characters, thereby possibly invoking
> undefined behavior). I think that it is acceptable behavior of a facet user
> (like eg. a file stream buffer) to feed individual characters to the
> 'do_*()' functions or at least to increase the buffers char by char (I think
> that at least one of the file stream buffer implementations actually does
> just that in certain cases!).

This is not correct. No conversion function makes assumptions about the
size of the buffers (not intentionally, I mean!)

In the specific case of utf16_utf32 you have:

     for(const char* limit = to_limit - 3;
         from < from_end && to < limit;
         ++from)
     {
        // accesses buffer using *to
     }

if the buffer is shorter than 4 characters, the test "to < limit" fails
immediately and the loop is never entered. The buffer is never accessed
before or after the loop, so I don't see any undefined behaviour here.

Looking more closely, the only case that can go wrong is if the function is
called with to == to_limit == 0, but I don't think a stream
implementation could be so insane as to do that. I can add a check for this
case, though. Maybe under assert()?
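
Something along these lines at the top of the conversion functions (just a
sketch of what I mean, not code that is in the library):

     assert(from <= from_end && to <= to_limit);  // sanity check on the ranges
     if(to == to_limit)
          return std::codecvt_base::partial;      // no room to write anything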

> - My understanding of 'encoding()' is that '-1' indicates that "shift states"
> are used and that a special termination sequence returning the state into
> some initial state is possibly needed. That is, although the 'state'
> variable is used, the UTF-* encodings are considered to be stateless and
> there is no need to 'unshift()' anything because the 'do_out()' function
> will eventually write all what is necessary if it gets a big enough output
> buffer. That is, the 'encoding()' members should probably return a '0' or a
> positive integer for all UTF-* encodings since the encodings are essentially
> stateless. My personal feeling is that UTF-16 is, however, a variable width
> encoding, even if the internal encoding is also UTF-16: A perverted user of
> the facet might legally present the indivual words out of order to the
> facet of a fixed width encoding, eg. in a first pass all odd words and in a
> second all even words. This would rip surrogate sequences apart causing
> unexpected errors.

You are right, UTF-* encodings are in fact stateless. However, for a
reason too complex to describe here (it will be written in the
documentation for sure!) the facets that use UTF-16 internally need to
be state-dependent in order to support surrogates. The UTF-32 facets
don't need any special treatment and are indeed stateless.

If you can't wait for docs to be written, the root of all evil is the
C++ Library issue #76:

http://anubis.dkuug.dk/jtc1/sc22/wg21/docs/lwg-defects.html#76
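
In a nutshell: issue #76 means do_in() may be called so that it can store as
little as one internal character per call, so when the internal encoding is
UTF-16 the low half of a surrogate pair has to be parked in the state object
and emitted on the next call. A sketch of the idea (simplified, with names of
my own invention; the real facet code is different):

     #include <boost/cstdint.hpp>

     // 'state' is assumed to behave like an integer (see the mbstate_t
     // discussion above); 0 means "no pending surrogate".  Returns true if
     // 'code_point' was consumed, false if only the pending low surrogate
     // from the previous call was written.
     bool put_utf16(boost::uint32_t& state, boost::uint32_t code_point,
                    boost::uint16_t*& to)
     {
         if(state != 0)                           // a low surrogate is pending
         {
             *to++ = static_cast<boost::uint16_t>(state);
             state = 0;
             return false;                        // new code point not consumed
         }
         if(code_point >= 0x10000)                // outside the BMP: surrogate pair
         {
             code_point -= 0x10000;
             *to++ = static_cast<boost::uint16_t>(0xD800 + (code_point >> 10));
             state = 0xDC00 + (code_point & 0x3FF);   // park the low surrogate
             return true;
         }
         *to++ = static_cast<boost::uint16_t>(code_point);  // plain BMP character
         return true;
     }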

> The code itself is pretty readable. However, I haven't made a thorough
> review of it [yet].

Thanks! Then I have to prepare the docs quickly! I put more comments in
the code than I usually do, but I feel they are not enough to explain
everything ;)

What would be very helpful at this stage is having people compile and
run the test suite on their favorite platform. That will surely stress
the library and give some precious feedback.

Next step for me: prepare a test suite that overcomes the mbstate_t
problem, ASAP.

Alberto

