From: Alberto Barbati (abarbati_at_[hidden])
Date: 2003-01-12 06:32:27


Hi,

I just uploaded a new version of the UTF library to
http://groups.yahoo.com/group/boost/files/utf/. The changes are:

1) Added missing typename keywords and used BOOST_DEDUCED_TYPENAME in
every applicable place

2) Added safety checks on buffer size. (Thanks to Dietmar Kuehl)

3) Now the state type is not assumed to be an integer type. In order to
access the state two unqualified free functions get_state() and
set_state() are used instead. File utf_config.hpp provides a default
*non-portable* implementation that relies on reinterpret_cast, which
should be specialized for each platform. (Thanks to Dietmar Kuehl)
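
To make point 3 concrete, here is a minimal sketch of what such a pair of functions could look like. It is not the code in utf_config.hpp: the signatures and the state_value type are assumptions made for illustration, and it uses std::memcpy in place of the reinterpret_cast mentioned above. It is just as non-portable, since it still assumes the state fits in the leading bytes of std::mbstate_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

namespace utf_sketch {

// Hypothetical value type for the shift state; the post does not say what
// the library actually stores there.
typedef unsigned int state_value;

// This sketch only works where the value fits inside std::mbstate_t.
static_assert(sizeof(std::mbstate_t) >= sizeof(state_value),
              "std::mbstate_t is too small on this platform");

// Read the value stored in the leading bytes of the opaque state object.
inline state_value get_state(const std::mbstate_t& state)
{
    state_value value = 0;
    std::memcpy(&value, &state, sizeof(value));
    return value;
}

// Clear the state object, then store the value in its leading bytes.
inline void set_state(std::mbstate_t& state, state_value value)
{
    std::memset(&state, 0, sizeof(state));
    std::memcpy(&state, &value, sizeof(value));
}

// Usage: the functions are called unqualified, so a platform-specific
// replacement could be substituted without touching the caller.
inline void round_trip_example()
{
    std::mbstate_t state = std::mbstate_t();
    set_state(state, 0x2Au);
    assert(get_state(state) == 0x2Au);
}

} // namespace utf_sketch
```

A platform whose mbstate_t has a different layout would supply its own versions of the two functions.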

The suite now compiles correctly with gcc on Cygwin, but it fails to
link: the linker complains about missing wchar_t specializations. Can
anyone help me with this?

It also seems that gcc does not provide specializations of any library
class (basic_filebuf, char_traits, etc.) for internal character types
other than char and wchar_t. Could anyone confirm this? This could be a
problem if the user wants to use the UTF-32 facets but wchar_t is only
16 bits wide. I can easily provide an implementation of char_traits for
implementations lacking it. Should I do it?
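
For reference, a traits class of the kind I have in mind could look roughly like the sketch below. The names utf32_unit, utf32_char_traits and utf32_string are hypothetical and are not part of the uploaded library. The choices of std::uint32_t for the character type and of a 64-bit int_type (so that eof() cannot collide with a valid code unit) are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <cwchar>
#include <ios>
#include <string>

namespace utf_sketch {

typedef std::uint32_t utf32_unit;  // hypothetical 32-bit code unit type

struct utf32_char_traits
{
    typedef utf32_unit     char_type;
    typedef std::int64_t   int_type;   // wide enough to hold eof() as -1
    typedef std::streamoff off_type;
    typedef std::streampos pos_type;
    typedef std::mbstate_t state_type;

    static void assign(char_type& dst, const char_type& src) { dst = src; }
    static bool eq(char_type a, char_type b) { return a == b; }
    static bool lt(char_type a, char_type b) { return a < b; }

    // Lexicographic comparison of the first n code units.
    static int compare(const char_type* a, const char_type* b, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }

    // Number of code units before the terminating zero.
    static std::size_t length(const char_type* s)
    {
        assert(s != 0);
        std::size_t n = 0;
        while (!eq(s[n], char_type())) ++n;
        return n;
    }

    static const char_type* find(const char_type* s, std::size_t n,
                                 const char_type& c)
    {
        for (std::size_t i = 0; i < n; ++i)
            if (eq(s[i], c)) return s + i;
        return 0;
    }

    // move() must tolerate overlapping ranges; copy() need not.
    static char_type* move(char_type* dst, const char_type* src, std::size_t n)
    {
        if (n != 0) std::memmove(dst, src, n * sizeof(char_type));
        return dst;
    }

    static char_type* copy(char_type* dst, const char_type* src, std::size_t n)
    {
        if (n != 0) std::memcpy(dst, src, n * sizeof(char_type));
        return dst;
    }

    static char_type* assign(char_type* dst, std::size_t n, char_type c)
    {
        for (std::size_t i = 0; i < n; ++i) dst[i] = c;
        return dst;
    }

    static int_type eof() { return -1; }
    static int_type not_eof(int_type i) { return eq_int_type(i, eof()) ? 0 : i; }
    static char_type to_char_type(int_type i) { return static_cast<char_type>(i); }
    static int_type to_int_type(char_type c) { return static_cast<int_type>(c); }
    static bool eq_int_type(int_type a, int_type b) { return a == b; }
};

// Usage: pass the traits explicitly wherever the library default is missing.
typedef std::basic_string<utf32_unit, utf32_char_traits> utf32_string;

} // namespace utf_sketch
```

Passing the traits explicitly, as utf32_string does, avoids relying on a std::char_traits specialization that the library may not provide.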

Alberto Barbati wrote:
> Dietmar Kuehl wrote:
>> Alberto Barbati wrote:
> The problem is that if char does not have 8 bits, then I cannot be sure
> that the underlying implementation reads from a file 8 bits at a time.
> Please correct me if I'm wrong on this point. That requirement is
> essential for the UTF-8 encoding.

Has anyone any comment about this? I don't have access to any
implementation where char has more than 8 bits to verify.

>>> There already exists a facility to select the correct facet according to
>>> the byte order mark. It's very simple to use:
>>>
>>> std::wifstream file("MyFile", std::ios_base::binary);
>>> boost::utf::imbue_detect_from_bom(file);
>>>
>>> that's it.
>>
>>
>> I have seen this possibility and I disagree that it is very simple to use
>> for several reasons:
>>
>> - There is at least one implementation which does not allow changing
>>   the locale after the file was opened. This is a reasonable
>>   restriction which seems to be covered by the standard (I thought
>>   otherwise myself but haven't found any statement supporting a
>>   different view). Thus, changing the code conversion facet without
>>   closing the file may or may not be possible. Closing and reopening
>>   a file may also be impossible for certain kinds of files.
>
> I guess you are referring to 27.8.1.4, clause 19 (description of the
> function filebuf::imbue):
>
> "Note: This may require reconversion of previously converted characters.
> This in turn may require the implementation to be able to reconstruct
> the original contents of the file."
>
> That may indeed be a problem. In my humble opinion, the use of "may" is
> quite unfortunate... it seems that an implementation need not reconvert
> previous characters, and what happens if the implementation cannot
> perform the reconstruction is left unspecified (not even "undefined" or
> "implementation defined").
>
> In which way is imbue implemented in the implementation you were
> mentioning?

I looked deeper into the question.

Of the three implementations I checked (VS.Net/Dinkumware, STLport, gcc
3.2 prerelease), none implements clause 19. gcc even has an explicit
comment about this. All of them allow imbue() in the middle of a file.
Which implementation were you talking about?

I am considering writing a mega-facet that automatically adapts to the
file encoding according to the BOM. It could easily be done for UTF-32,
as the conversion code is already factored out of the facet classes
(split into the files utfXX_algo.hpp and utf32_strategy.hpp). I plan to
factor the UTF-16 facets the same way; this is already done for the
utf8_utf16 facet. However, please bear in mind that such a facet can't
perform as well as the little ones, because each of the
do_in/do_out/do_length functions has to be a large switch over the
several implementations, and that switch needs to be executed on each of
the several do_XXX calls made for each character.
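
To illustrate the idea, here is a rough sketch of the BOM detection step and of the kind of switch each do_XXX call would have to go through. None of these names (encoding, bom_info, detect_bom, code_unit_size) exist in the library; they are made up for this example. Only the byte sequences of the BOMs themselves are fixed by Unicode.

```cpp
#include <cassert>
#include <cstddef>

namespace utf_sketch {

enum encoding { unknown, utf8, utf16_le, utf16_be, utf32_le, utf32_be };

struct bom_info
{
    encoding    enc;     // encoding announced by the BOM, or unknown
    std::size_t length;  // number of bytes occupied by the BOM
};

// True if the buffer begins with the given byte sequence.
inline bool starts_with(const unsigned char* data, std::size_t size,
                        const unsigned char* mark, std::size_t mark_size)
{
    if (size < mark_size) return false;
    for (std::size_t i = 0; i < mark_size; ++i)
        if (data[i] != mark[i]) return false;
    return true;
}

// Inspect the first bytes of a file and report which BOM, if any, is there.
inline bom_info detect_bom(const unsigned char* data, std::size_t size)
{
    assert(data != 0 || size == 0);

    static const unsigned char m_utf32_le[] = {0xFF, 0xFE, 0x00, 0x00};
    static const unsigned char m_utf32_be[] = {0x00, 0x00, 0xFE, 0xFF};
    static const unsigned char m_utf8[]     = {0xEF, 0xBB, 0xBF};
    static const unsigned char m_utf16_le[] = {0xFF, 0xFE};
    static const unsigned char m_utf16_be[] = {0xFE, 0xFF};

    bom_info result = { unknown, 0 };
    // UTF-32LE must be tested before UTF-16LE because its BOM starts with
    // the same two bytes. A UTF-16LE file whose first character is U+0000
    // is therefore misdetected; this sketch accepts that ambiguity.
    if (starts_with(data, size, m_utf32_le, 4))      { result.enc = utf32_le; result.length = 4; }
    else if (starts_with(data, size, m_utf32_be, 4)) { result.enc = utf32_be; result.length = 4; }
    else if (starts_with(data, size, m_utf8, 3))     { result.enc = utf8;     result.length = 3; }
    else if (starts_with(data, size, m_utf16_le, 2)) { result.enc = utf16_le; result.length = 2; }
    else if (starts_with(data, size, m_utf16_be, 2)) { result.enc = utf16_be; result.length = 2; }
    return result;
}

// Stand-in for the per-call dispatch: every do_in/do_out/do_length call in
// a combined facet would need a switch of this shape.
inline std::size_t code_unit_size(encoding enc)
{
    switch (enc) {
    case utf8:                    return 1;
    case utf16_le: case utf16_be: return 2;
    case utf32_le: case utf32_be: return 4;
    default:                      return 0;
    }
}

} // namespace utf_sketch
```

The detection itself runs once per file; it is the switch on the stored encoding that is paid on every call.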

BTW, this mega-facet is ok when reading from a file. How should it
behave when writing? Would it be ok to return an error until an encoding
is chosen? In fact, reading *and* writing the same Unicode file at the
same time is IMHO a sure road to disaster, unless writing always occurs
at the end of the file with std::ios_base::app.

I am also considering adding stream classes, derived from the
std::basic_* classes (or maybe from the boost::filesystem classes?), as
a convenience. What do you think?

Alberto

