|
Boost : |
From: Dietmar Kuehl (dietmar_kuehl_at_[hidden])
Date: 2003-01-06 21:13:55
On Monday 06 January 2003 03:52 am, Alberto Barbati wrote:
> I hope that in front of a full-blown and working
> library I might get more attention.
Since the Boost list has gotten (IMO) a little bit out of hand, I just haven't
seen the previous mail. Actually, it was lucky I came across this one...
Anyway, I have a few comments on this since I'm currently working on
something very similar.
As a first comment, I think that such facets are pretty useful. However,
from just glancing at the implementation and the interfaces I have a few
points which may be worth for consideration:
- The 'state' argument for the various facet functions is normally accessed
as a plain integer in your code. However, eg. 'std::mbstate_t' is not an
integer at all. In fact, it is implementation defined and can thus be
something entirely different on each platform. I'd suggest to provide some
from of accessor functions to read or write objects of this type, possibly
in the form of some traits class. This could then be used to cope with
differences and/or more complex organization of the state type.
- Falling back to UTF-16 in cases where the Unicode characters may not
fit into the words is one possible approach to deal with the characters.
Another approach is to just indicate an error. For example, I know that
certain XML-files I'm using exclusively store ASCII characters and there
is no need to use a different internal character type than 'char'. If I ever
come across a non-ASCII character I don't really want to have it encoded
in something like UTF-8 but I want to fail reading the file (and possibly
retry reading using a bigger character type). I would appreciate if such a
possibility would be incorporated, too (this is, however, nothing I feel too
strongly about).
- In the context I want to use the facets [at least those I'm implementing]
I don't really want to bother about the details of the encoding. That is, I
just want to 'imbue()' an appropriate 'std::locale' object to the stream and
have the facet figure out what encoding is used, especially when reading
from some file. The basic idea is that each file is either started by a byte
order mark from which UTF-16BE or UTF16LE can be deduced or it is in
UTF-8. The encoding could eg. be stored in typical 'std::mbstate_t'
objects (these are often a struct with a 'wchar_t' and a count or something
like this). To enable something like this, it would be helpful if the actual
conversion functions were actually separated from the facets: The facets
would just call the corresponding conversion functions. The actual
conversion is, apart from the state argument, entirely stateless and there
is no need to bind it to any facet.
- The conversion functions seem to assume that there are input sequence
consisting of more than one character passed to it. I don't think this is
guaranteed at all. At the very minimum, the facets should defend against
being called with too small buffers (at least in 'do_out()' of 'utf16_utf32'
the constant 3 is substracted from a pointer without prior check that the
input area consists of at least three characters, thereby possibly invoking
undefined behavior). I think that it is acceptable behavior of a facet user
(like eg. a file stream buffer) to feed individual characters to the
'do_*()' functions or at least to increase the buffers char by char (I think
that at least one of the file stream buffer implementations actually does
just that in certain cases!).
- My understanding of 'encoding()' is that '-1' indicates that "shift states"
are used and that a special termination sequence returning the state into
some initial state is possibly needed. That is, although the 'state'
variable is used, the UTF-* encodings are considered to be stateless and
there is no need to 'unshift()' anything because the 'do_out()' function
will eventually write all what is necessary if it gets a big enough output
buffer. That is, the 'encoding()' members should probably return a '0' or a
positive integer for all UTF-* encodings since the encodings are essentially
stateless. My personal feeling is that UTF-16 is, however, a variable width
encoding, even if the internal encoding is also UTF-16: A perverted user of
the facet might legally present the indivual words out of order to the
facet of a fixed width encoding, eg. in a first pass all odd words and in a
second all even words. This would rip surrogate sequences apart causing
unexpected errors.
The code itself is pretty readable. However, I haven't made a thorough
review of it [yet].
-- <mailto:dietmar_kuehl_at_[hidden]> <http://www.dietmar-kuehl.de/> Phaidros eaSE - Easy Software Engineering: <http://www.phaidros.com/>
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk