
From: Alberto Barbati (abarbati_at_[hidden])
Date: 2003-01-08 17:51:38

Dietmar Kuehl wrote:
> Alberto Barbati wrote:
>>One can use a char traits class different from
>>std::char_traits<T>, that defines a suitable state type.
> This is not really viable due to paragraph 4:
> An instance of basic_filebuf behaves as described in lib.filebuf
> provided traits::pos_type is fpos<traits::state_type>. Otherwise the
> behavior is undefined.

Thanks for pointing that out, I missed it. However, it's not really a
problem: you can add a pos_type typedef to the test_traits, like this:

template <typename T>
struct test_traits : public std::char_traits<T>
{
     typedef boost::uint32_t state_type;
     typedef std::fpos<boost::uint32_t> pos_type;
};

> It would be possible to create a conversion stream buffer (which is
> probably a good idea anyway) which removes this requirement but even
> then things don't really work out: streams using different character
> traits are not compatible with the normal streams. I haven't worked
> much with wide characters and don't know how important it is to have
> eg. the possibility of using a wide character file stream and one of
> the standard wide character streams (eg. 'std::wcout') be replaceable.
> I think it is crucial that the library supports 'std::mbstate_t'
> although this will require platform specific stuff. It should be
> factored out and documented such that porting to new platform consists
> basically of looking up how 'std::mbstate_t' is defined.

That's a better argument. I will think about it. As I said, I'm
definitely not against adding accessors to mbstate_t; I just have to
think about the best way to do it.

>>I forgot to say in my previous post that this version of the library
>>only supports platforms where type char is exactly 8 bits. This
>>assumption is required because I have to work at the octet level while
>>reading from/writing to a stream.
> I don't see why this would be required, however. This would only be
> necessary if you try to cast a sequence of 'char's into a sequence
> of 'wchar_t's. Converting between these is also possible in a portable
> way (well, at least portable across platforms with identical size of
> 'wchar_t' even if 'char' has different sizes).

The problem is that if char does not have exactly 8 bits, then I cannot
be sure that the underlying implementation reads from a file 8 bits at a
time. Please correct me if I'm wrong on this point. That requirement is
essential for the UTF-8 encoding.
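The assumption can be made concrete: a UTF-8 decoder works on the bit
pattern of the lead octet, which only lines up with char when char is
exactly one octet. A minimal sketch (not the library's code) of
classifying a lead byte:

```cpp
#include <climits>

// Sketch: a UTF-8 decoder classifies the lead octet by its high bits.
// This only matches 'char' if CHAR_BIT == 8, i.e. one char read from
// the stream is exactly one octet.
int utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)           return 1; // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx: 2-octet sequence
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx: 3-octet sequence
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx: 4-octet sequence
    return 0;                            // continuation octet or invalid lead
}
```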

>>Such decision is very strong, I know. Yet, one of the main problems with
>>the acceptance of Unicode as a standard is that there are too many
>>applications around that uses only a subset of it. For example, one of
>>the first feedback I got, at the beginning of this work, was "I don't
>>need to handle surrogates, could you provide an optimized facet for that
>>case?". The answer was "Yes, I could, but I won't".
> As I said, I don't have strong feelings about this (and I have
> implemented such a facet myself already anyway...). However, note that
> I requested something quite different: I definitely want to detect if
> a character cannot be represented using the internally used character.
> In fact, I would like to see this happen even for a 16 bit internal type
> because UTF-16 processing is considerably more complex than UCS-2
> processing and I can see people falling into the trap of testing only
> cases where UCS-2 is used. That is, the implicit choice of using UTF-16
> is actually a pretty dangerous one, IMO.

I know it's dangerous, but I prefer it that way. I would like this to be
"The UTF Library", not just some "conversion library". I also want to
support the Unicode standard to its full extent. Supporting a conversion
not covered by Unicode, just because someone finds it useful, does not
go in that direction. If this position prevents my proposal from being
accepted into Boost, I will simply withdraw it.

>>There already exist a facility to select the correct facet according to
>>the byte order mark. It's very simple to use:
>> std::wifstream file("MyFile", std::ios_base::binary);
>> boost::utf::imbue_detect_from_bom(file);
>>that's it.
> I have seen this possibility and I disagree that it is very simple to use
> for several reasons:
> - There is at least one implementation which does not allow changing
> the locale after the file was opened. This is a reasonable
> restriction which seems to be covered by the standard (I thought
> otherwise myself but haven't found any statement supporting a
> different view). Thus, changing the code conversion facet without
> closing the file may or may not be possible. Closing and reopening
> a file may also be impossible for certain kinds of files.

I guess you are referring to paragraph 19 (the description of the function):

"Note: This may require reconversion of previously converted characters.
This in turn may require the implementation to be able to reconstruct
the original contents of the file."

That may indeed be a problem. In my humble opinion, the use of "may" is
quite unfortunate: it seems that implementations need not reconvert
previous characters, which leaves unspecified (not even "undefined" or
"implementation-defined") what happens if the implementation cannot
perform the reconstruction.

How is imbue implemented in the implementation you mentioned?

> - Your approach assumes a seekable stream which is not necessarily
> the case: At least on UNIXes I can open a file stream to read from
> a named pipe which is definitely non-seekable. Adjusting the state
> internally can avoid the need to do any seeking, although admittedly
> at the cost some complexity encapsulated by the facet.

We don't really need to seek, anyway. Once the BOM is extracted and
detected, I could just imbue the correct facet. Seeking back is just
overzealous: it only lets the implementation extract the BOM itself, and
I agree that it is kind of stupid.
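For what it's worth, the detection step itself needs no seeking at all:
it can classify the first few octets and then imbue the matching facet.
A hedged sketch of the classification (illustrative code, not what
boost::utf::imbue_detect_from_bom actually does; note the UTF-32LE test
must come before the UTF-16LE one, since their BOMs share a prefix):

```cpp
#include <cstddef>
#include <string>

// Classify a byte-order mark from the first octets of a stream.
std::string detect_bom(const unsigned char* p, std::size_t n)
{
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return "UTF-32BE";
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return "UTF-32LE"; // must be tested before UTF-16LE
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return "UTF-8";
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return "UTF-16BE";
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return "UTF-16LE";
    return "unknown";
}
```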

> The above two lines cause undefined behavior according to the C++
> standard. Correspondingly 'ptr1 < ptr2' is defined if and only if
> 'ptr1' and 'ptr2' are pointers of the same type (not counting
> cv-qualification) and point *into* the same array object or one behind
> the last element. If this condition does not hold, the expression
> 'ptr1 < ptr2' causes undefined behavior, too.
> At first sight, this restriction seems to be pretty esoteric but it is
> actually not: On a segmented architecture, 'ptr - 3' might result in a
> pointer which looks as if it points to the end of a segment. This in
> turn means that 'ptr - 3 < ptr' does not necessarily hold if 'ptr'
> points to one of the first two positions in an array object.

You're right on everything here. I'll add the check.
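The fix is mechanical: never form the out-of-range pointer in the first
place, and compare distances instead. A sketch, assuming both pointers
are known to point into the same buffer:

```cpp
#include <cstddef>

// Instead of writing 'ptr - 3 < end' (forming 'ptr - 3' is already
// undefined behavior if fewer than three elements precede 'ptr'),
// check how many elements are available by subtracting pointers that
// both point into the same array.
bool has_preceding(const char* begin, const char* ptr, std::ptrdiff_t n)
{
    return ptr - begin >= n; // well-defined for pointers into one array
}
```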

>>You are right, UTF-* encoding are in fact stateless. However, for a
>>reason too complex to describe here (it will be written in the
>>documentation for sure!) the facets that use UTF-16 internally need to
>>be state-dependent in order to support surrogates.
> I don't think so. UTF-16 is a multi-byte encoding but a stateless one.
> ... and I don't see how library issue #76 changes anything about this!
> In fact, if it does, it is probably broken and the resolution needs
> fixing, not the code using it. The cost of turning an encoding into a
> stateful multi byte encoding is likely to be something you called
> "brain dead" in your article: I'm not yet aware of doing the conversion
> one external character at a time. All other encoding, ie. fixed width
> encodings and stateless multi byte encodings, can do much better.

I think it is time for that complex explanation I was talking about.
The UTF facets I wrote do not involve one encoding, but two encodings
each: one on the internal side and one on the external side:

external sequence (bytes)
external encoding (UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE)
Unicode scalar value (abstract characters)
internal encoding (UTF-16, UTF-32)
internal characters (wchar_t or whatever)

The external encoding is never a problem.

If the internal encoding is UTF-32, everything's fine, because to each
Unicode scalar value there corresponds at most one internal character.
In fact, this is just a validation step: a few scalar values are
invalid and generate an error, while all the others are simply mapped
to internal characters identically.

The problems arise for the internal UTF-16 encoding, where each valid
Unicode scalar value is returned as either one *or* two internal
characters (a surrogate pair).
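The arithmetic behind the pair is fixed by the UTF-16 scheme; a
self-contained sketch of the mapping:

```cpp
#include <cstdint>

// UTF-16 mapping: scalar values above U+FFFF have 0x10000 subtracted,
// and the remaining 20 bits are split across a high (0xD800-based) and
// low (0xDC00-based) surrogate, so one Unicode scalar value becomes
// two internal characters. Returns the number of code units written.
int to_utf16(std::uint32_t scalar, std::uint16_t out[2])
{
    if (scalar < 0x10000) {
        out[0] = static_cast<std::uint16_t>(scalar);
        return 1;
    }
    std::uint32_t v = scalar - 0x10000;
    out[0] = static_cast<std::uint16_t>(0xD800 + (v >> 10));   // high surrogate
    out[1] = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)); // low surrogate
    return 2;
}
```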

But issue #76 explicitly requires that a codecvt facet must be able to
convert internal characters *one at a time*. So what do I do when I
encounter a Unicode scalar value that requires a surrogate pair, but the
implementation requested one single character? Simple: I output the first
surrogate and store the second surrogate in the state. In the next call
I return the second surrogate.
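A hedged sketch of that bookkeeping (not the library's actual facet
code; the state type and function names are illustrative):

```cpp
#include <cstdint>

// When the caller has room for only one UTF-16 unit but the scalar
// value needs a surrogate pair, the low surrogate is parked in the
// conversion state and emitted on the next call.
struct conv_state { std::uint16_t pending; }; // 0 == nothing pending

// Emit one UTF-16 code unit; a pending low surrogate from a previous
// call takes priority over converting 'scalar'.
std::uint16_t emit_one(conv_state& st, std::uint32_t scalar)
{
    if (st.pending) {                  // second half of a pair from last call
        std::uint16_t out = st.pending;
        st.pending = 0;
        return out;
    }
    if (scalar < 0x10000)
        return static_cast<std::uint16_t>(scalar);
    std::uint32_t v = scalar - 0x10000;
    st.pending = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)); // low, later
    return static_cast<std::uint16_t>(0xD800 + (v >> 10));         // high, now
}
```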

This explains why I need shift states. I challenge you to find a better
way to do this kind of processing without violating issue #76. On the
other hand, issue #76 is not under discussion: its rationale is rock
solid.

There is one more problem: some implementations require the codecvt
facet to consume at least one external character whenever it produces at
least one character. In my opinion, the standard does not allow this
assumption. I posted a DR to comp.std.c++ hoping the LWG will add
explicit wording about the issue. However, until implementations are
fixed, we have to cope with them. Fortunately, I was able to find a way
to always consume at least one character for each character produced,
without a great loss of performance, while still providing under #ifdef
the more optimal code for implementations that can handle it.

Please note that the extra cost for handling this very complex case is
negligible if the source sequence does not contain characters that need
such complex handling.


Boost list run by bdawes at, gregod at, cpdaniel at, john at