From: Dietmar Kuehl (dietmar_kuehl_at_[hidden])
Date: 2003-01-07 21:32:05


Alberto Barbati wrote:
> One can use a char traits class different from
> std::char_traits<T>, that defines a suitable state type.

This is not really viable due to 27.8.1.1 paragraph 4:

  An instance of basic_filebuf behaves as described in lib.filebuf
  provided traits::pos_type is fpos<traits::state_type>. Otherwise the
  behavior is undefined.

It would be possible to create a conversion stream buffer (which is
probably a good idea anyway) which removes this requirement, but even
then things don't really work out: streams using different character
traits are not compatible with the normal streams. I haven't worked
much with wide characters and don't know how important it is, e.g.,
for a wide character file stream to be interchangeable with one of
the standard wide character streams (e.g. 'std::wcout'). I think it
is crucial that the library supports 'std::mbstate_t' although this
will require platform-specific stuff. It should be factored out and
documented such that porting to a new platform consists basically of
looking up how 'std::mbstate_t' is defined.
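
To sketch what I mean by factoring it out (all names below are mine
and purely hypothetical): something like the following could be the
only place a port has to touch after looking up the platform's
definition of 'std::mbstate_t':

  #include <cwchar>

  // Hypothetical sketch: all knowledge about the platform's definition
  // of std::mbstate_t lives in this one struct. Porting means looking
  // up that definition and adjusting these two functions.
  struct mbstate_access
  {
      // Assumption: on this platform mbstate_t is at least as large as
      // an unsigned int and may be used to hold a small shift state.
      static void put(std::mbstate_t& s, unsigned value)
      {
          *reinterpret_cast<unsigned*>(&s) = value;  // platform-specific!
      }
      static unsigned get(std::mbstate_t const& s)
      {
          return *reinterpret_cast<unsigned const*>(&s);
      }
  };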

> I forgot to say in my previous post that this version of the library
> only supports platforms where type char is exactly 8 bits. This
> assumption is required because I have to work at the octet level while
> reading from/writing to a stream.

I don't see why this would be required, however. It would only be
necessary if you tried to cast a sequence of 'char's into a sequence
of 'wchar_t's. Converting between the two is also possible in a
portable way (well, at least portable across platforms with the same
size of 'wchar_t', even if 'char' has different sizes).
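
For example, a big-endian UCS-4 unit can be assembled from four
externally read 'char's with plain arithmetic; this sketch (mine, not
taken from the library) only assumes that each 'char' carries at
least 8 significant bits:

  // Assemble one UCS-4 value from four octets read as 'char', using
  // shifts and masks instead of casting the buffer to 'wchar_t*'.
  // Masking with 0xFF keeps only the low octet, so this also works
  // where 'char' is wider than 8 bits.
  inline unsigned long assemble_ucs4_be(char const* octets)
  {
      unsigned long result = 0;
      for (int i = 0; i != 4; ++i)
          result = (result << 8)
                 | (static_cast<unsigned long>(octets[i]) & 0xFFul);
      return result;
  }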

> Such decision is very strong, I know. Yet, one of the main problems with
> the acceptance of Unicode as a standard is that there are too many
> applications around that uses only a subset of it. For example, one of
> the first feedback I got, at the beginning of this work, was "I don't
> need to handle surrogates, could you provide an optimized facet for that
> case?". The answer was "Yes, I could, but I won't".

As I said, I don't have strong feelings about this (and I have
implemented such a facet myself already anyway...). However, note that
I requested something quite different: I definitely want to detect if
a character cannot be represented in the internally used character
type. In fact, I would like to see this happen even for a 16-bit
internal type, because UTF-16 processing is considerably more complex
than UCS-2 processing and I can see people falling into the trap of
testing only cases where UCS-2 suffices. That is, the implicit choice
of using UTF-16 is actually a pretty dangerous one, IMO.

> There already exist a facility to select the correct facet according to
> the byte order mark. It's very simple to use:
>
> std::wifstream file("MyFile", std::ios_base::binary);
> boost::utf::imbue_detect_from_bom(file);
>
> that's it.

I have seen this possibility and I disagree that it is very simple to use
for several reasons:

- There is at least one implementation which does not allow changing
  the locale after the file was opened. This is a reasonable
  restriction which seems to be covered by the standard (I thought
  otherwise myself but haven't found any statement supporting a
  different view). Thus, changing the code conversion facet without
  closing the file may or may not be possible. Closing and reopening
  a file may also be impossible for certain kinds of files.

- Your approach assumes a seekable stream, which is not necessarily
  the case: at least on UNIXes I can open a file stream to read from
  a named pipe, which is definitely non-seekable. Adjusting the state
  internally can avoid the need to do any seeking, although admittedly
  at the cost of some complexity encapsulated by the facet (see the
  sketch below).
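
To illustrate the last point (purely a sketch of mine; 'get_state',
'set_state', and 'decode_utf8' are invented helpers): the facet's
'do_in' can simply consume the BOM on its first call and record in
the conversion state that detection is done, so no seek is needed:

  // Sketch: consume a UTF-8 BOM inside do_in instead of seeking.
  // State value 0 means "BOM not checked yet", 1 means "checked".
  std::codecvt_base::result
  do_in(std::mbstate_t& state,
        char const* from, char const* from_end, char const*& from_next,
        wchar_t* to, wchar_t* to_end, wchar_t*& to_next) const
  {
      if (get_state(state) == 0)
      {
          if (from_end - from < 3)
          {   // not enough input to decide yet (a real implementation
              // would also have to handle end-of-stream here)
              from_next = from; to_next = to;
              return std::codecvt_base::partial;
          }
          if (from[0] == '\xEF' && from[1] == '\xBB' && from[2] == '\xBF')
              from += 3;                // skip the BOM
          set_state(state, 1);          // remember that we checked
      }
      // ... the ordinary UTF-8 decoding loop follows ...
      return decode_utf8(from, from_end, from_next, to, to_end, to_next);
  }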

> I considered separating the facets class from the conversion
> implementation code, but I decided to keep the design one facet == one
> encoding. There are a couple of minor advantages:
>
> 1) the user can have just one or two encondings in your executable
> without taking all of them. (Of course if you use imbue_detect_from_bom
> you will need all encodings, anyway... ;)

I'm not saying that all encodings have to be folded into only one
facet! I would envision a UTF-8 facet, a UTF-16BE facet, etc. However,
I would also envision a Unicode facet which internally uses whichever
of these encodings it detects.
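
Roughly like this (my sketch, with invented names): the detecting
facet determines the encoding once and then forwards every conversion
to the matching encoding-specific converter:

  // Sketch: detect once, then dispatch. The encoding-specific
  // converters are assumed to exist elsewhere; only the dispatch is
  // shown here.
  enum encoding { enc_unknown, enc_utf8, enc_utf16be, enc_utf16le };
  enum result   { ok, partial, error };

  result in_utf8   (char const*&, char const*, wchar_t*&, wchar_t*);
  result in_utf16be(char const*&, char const*, wchar_t*&, wchar_t*);
  result in_utf16le(char const*&, char const*, wchar_t*&, wchar_t*);
  encoding detect_from_bom(char const*& from, char const* from_end);

  result convert(encoding& which,
                 char const*& from, char const* from_end,
                 wchar_t*& to, wchar_t* to_end)
  {
      if (which == enc_unknown)
          which = detect_from_bom(from, from_end);  // may consume a BOM
      switch (which)
      {
      case enc_utf8:    return in_utf8(from, from_end, to, to_end);
      case enc_utf16be: return in_utf16be(from, from_end, to, to_end);
      case enc_utf16le: return in_utf16le(from, from_end, to, to_end);
      default:          return error;
      }
  }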

> 2) I avoid one indirection. The cost of this cannot be underestimated.

The respective conversion function can be a specific inline function.
I doubt that there would be any cost incurred by calling such inline
functions.
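
For instance, a per-encoding helper like this one (my example) is
trivially inlinable, so separating the conversion code from the facet
class need not add any call overhead:

  // One small inline function per encoding step; a facet calling it
  // directly gives the compiler every opportunity to inline it.
  inline unsigned long decode_utf16be_unit(unsigned char const* p)
  {
      return (static_cast<unsigned long>(p[0]) << 8) | p[1];
  }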

> The VS.NET implementation is quite brain-dead about codecvt: it will
> call the do_in function (which is virtual!) up to 4 times *per
> character*. Three times the function will return "partial", the fourth
> times it will succeed. For every character.

Yes, I know. It is, BTW, not that brain-dead: doing it differently
does not necessarily work because a 'std::basic_filebuf' has to be
able to restore a position again. Especially with stateful
multi-character encodings I'm not sure whether there is indeed a
different approach which guarantees that the position is recoverable.
Of course, if you don't need to recover the position (which is a
likely case, IMO), this overhead is entirely wasted.

> This is not correct. No conversion function makes assumptions on the
> size of the buffers (not intentionally, I mean!)
>
> In the specific case of utf16_utf32 you have:
>
> for (const char* limit = to_limit - 3; from < from_end && to < limit; ++from)
> {
>     // accesses buffer using *to
> }
>
> if the buffer is shorter than 4 characters, the test "to < limit" fails
> immediately and the loop is never entered.

  char* array = new char[10];
  char* crash = array - 3;

The above two lines cause undefined behavior according to the C++
standard. Correspondingly, 'ptr1 < ptr2' is defined if and only if
'ptr1' and 'ptr2' are pointers of the same type (not counting
cv-qualification) and point *into* the same array object or one past
its last element. If this condition does not hold, the expression
'ptr1 < ptr2' causes undefined behavior, too.

At first sight this restriction seems pretty esoteric, but it is
actually not: on a segmented architecture, 'ptr - 3' might result in a
pointer which looks as if it points to the end of a segment. This in
turn means that 'ptr - 3 < ptr' does not necessarily hold if 'ptr'
points to one of the first three positions in an array object.
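
The fix is cheap: compare distances instead of forming a pointer that
may not exist. Since 'to' and 'to_limit' both point into (or one past
the end of) the same buffer, their difference is always defined:

  // Same loop as quoted above, but without ever computing
  // 'to_limit - 3': 'to_limit - to >= 4' is equivalent to
  // 'to < to_limit - 3' where the latter is defined, and it stays
  // well-defined for arbitrarily small buffers.
  for (; from < from_end && to_limit - to >= 4; ++from)
  {
      // accesses buffer using *to
  }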

> The buffer is never accessed before or after the loop, so I don't see
> any undefined behaviour here.

I'd say it's time to read up on pointer arithmetic in the standard...

> Looking better, the only case that can go wrong is if the function is
> called with to == to_limit == 0, but I don't think a stream
> implementation could be so insane to do that. I can add a check for this
> case, though. Maybe under assert()?

You should defend against the buffer being smaller than you expect it
to be. As an optimization, the corresponding check can be removed if
it is known not to cause any problems on a specific platform (indeed,
most platforms I have worked on don't really cause problems when
walking off an array object; still, there are some where it does). The
default should be defensive - and still operational! An 'assert()'
won't do: it disappears in release builds.
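
Concretely, something along these lines (the macro name is invented)
keeps the check in by default and lets a port compile it away once
the platform is known not to need it:

  #include <locale>

  // Defensive by default: a buffer too small for a worst-case output
  // is reported as 'partial' rather than written past its end.
  inline std::codecvt_base::result
  require_room(wchar_t const* to, wchar_t const* to_limit)
  {
  #if !defined(BOOST_UTF_ASSUME_ROOMY_BUFFERS)  // hypothetical opt-out
      if (to_limit - to < 4)
          return std::codecvt_base::partial;
  #endif
      return std::codecvt_base::ok;
  }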

> You are right, UTF-* encoding are in fact stateless. However, for a
> reason too complex to describe here (it will be written in the
> documentation for sure!) the facets that use UTF-16 internally need to
> be state-dependent in order to support surrogates.

I don't think so. UTF-16 is a multi-byte encoding but a stateless one.
... and I don't see how library issue #76 changes anything about this!
In fact, if it does, it is probably broken and the resolution needs
fixing, not the code using it. The cost of turning an encoding into a
stateful multi-byte encoding is likely to be exactly what you called
"brain dead" in your article: for stateful encodings I'm not aware of
any alternative to doing the conversion one external character at a
time. All other encodings, i.e. fixed-width encodings and stateless
multi-byte encodings, can do much better.
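
To show why no state is needed (my sketch; host byte order assumed):
when a high surrogate arrives without its partner, the converter
leaves 'from' pointing at it and reports 'partial', so the pair is
decoded whole on the next call and nothing has to be carried over in
the state:

  enum result { ok, partial, error };

  // Stateless UTF-16 (host order) to UCS-4 decoding. An incomplete
  // surrogate pair at the end of the input is not remembered anywhere:
  // 'from' is left pointing at the high surrogate and 'partial' is
  // returned, so the caller re-presents it together with more input.
  inline result utf16_to_ucs4(unsigned short const*& from,
                              unsigned short const* from_end,
                              unsigned long*& to, unsigned long* to_end)
  {
      while (from != from_end && to != to_end)
      {
          unsigned long c = *from;
          if (c < 0xD800 || c > 0xDFFF)             // ordinary code point
          {
              *to++ = c; ++from;
          }
          else if (c <= 0xDBFF)                     // high surrogate
          {
              if (from_end - from < 2)
                  return partial;                   // need the low half
              unsigned long lo = from[1];
              if (lo < 0xDC00 || lo > 0xDFFF)
                  return error;                     // unpaired surrogate
              *to++ = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
              from += 2;
          }
          else
              return error;                         // low surrogate first
      }
      return ok;
  }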

> What would be very helpful, at this stage, is having people who could
> compile and run the test suite on their favorite platform. This will
> surely stress the library and give some precious feedback.

I haven't done this, however.

-- 
<mailto:dietmar_kuehl_at_[hidden]> <http://www.dietmar-kuehl.de/>
Phaidros eaSE - Easy Software Engineering: <http://www.phaidros.com/>
