Boost logo

Boost :

From: Dietmar Kuehl (dietmar_kuehl_at_[hidden])
Date: 2003-01-06 21:13:55


On Monday 06 January 2003 03:52 am, Alberto Barbati wrote:
> I hope that in front of a full-blown and working
> library I might get more attention.

Since the Boost list has gotten (IMO) a little bit out of hand, I just haven't
seen the previous mail. Actually, it was lucky I came across this one...
Anyway, I have a few comments on this since I'm currently working on
something very similar.

As a first comment, I think that such facets are pretty useful. However,
from just glancing at the implementation and the interfaces I have a few
points which may be worth for consideration:

- The 'state' argument for the various facet functions is normally accessed
  as a plain integer in your code. However, eg. 'std::mbstate_t' is not an
  integer at all. In fact, it is implementation defined and can thus be
  something entirely different on each platform. I'd suggest to provide some
  from of accessor functions to read or write objects of this type, possibly
  in the form of some traits class. This could then be used to cope with
  differences and/or more complex organization of the state type.

- Falling back to UTF-16 in cases where the Unicode characters may not
  fit into the words is one possible approach to deal with the characters.
  Another approach is to just indicate an error. For example, I know that
  certain XML-files I'm using exclusively store ASCII characters and there
  is no need to use a different internal character type than 'char'. If I ever
  come across a non-ASCII character I don't really want to have it encoded
  in something like UTF-8 but I want to fail reading the file (and possibly
  retry reading using a bigger character type). I would appreciate if such a
  possibility would be incorporated, too (this is, however, nothing I feel too
  strongly about).

- In the context I want to use the facets [at least those I'm implementing]
  I don't really want to bother about the details of the encoding. That is, I
  just want to 'imbue()' an appropriate 'std::locale' object to the stream and
  have the facet figure out what encoding is used, especially when reading
  from some file. The basic idea is that each file is either started by a byte
  order mark from which UTF-16BE or UTF16LE can be deduced or it is in
  UTF-8. The encoding could eg. be stored in typical 'std::mbstate_t'
  objects (these are often a struct with a 'wchar_t' and a count or something
  like this). To enable something like this, it would be helpful if the actual
  conversion functions were actually separated from the facets: The facets
  would just call the corresponding conversion functions. The actual
  conversion is, apart from the state argument, entirely stateless and there
  is no need to bind it to any facet.

- The conversion functions seem to assume that there are input sequence
  consisting of more than one character passed to it. I don't think this is
  guaranteed at all. At the very minimum, the facets should defend against
  being called with too small buffers (at least in 'do_out()' of 'utf16_utf32'
  the constant 3 is substracted from a pointer without prior check that the
  input area consists of at least three characters, thereby possibly invoking
  undefined behavior). I think that it is acceptable behavior of a facet user
  (like eg. a file stream buffer) to feed individual characters to the
  'do_*()' functions or at least to increase the buffers char by char (I think
  that at least one of the file stream buffer implementations actually does
  just that in certain cases!).

- My understanding of 'encoding()' is that '-1' indicates that "shift states"
  are used and that a special termination sequence returning the state into
  some initial state is possibly needed. That is, although the 'state'
  variable is used, the UTF-* encodings are considered to be stateless and
  there is no need to 'unshift()' anything because the 'do_out()' function
  will eventually write all what is necessary if it gets a big enough output
  buffer. That is, the 'encoding()' members should probably return a '0' or a
  positive integer for all UTF-* encodings since the encodings are essentially
  stateless. My personal feeling is that UTF-16 is, however, a variable width
  encoding, even if the internal encoding is also UTF-16: A perverted user of
  the facet might legally present the indivual words out of order to the
  facet of a fixed width encoding, eg. in a first pass all odd words and in a
  second all even words. This would rip surrogate sequences apart causing
  unexpected errors.

The code itself is pretty readable. However, I haven't made a thorough
review of it [yet].

-- 
<mailto:dietmar_kuehl_at_[hidden]> <http://www.dietmar-kuehl.de/>
Phaidros eaSE - Easy Software Engineering: <http://www.phaidros.com/>

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk