From: Sebastian Redl (sebastian.redl_at_[hidden])
Date: 2007-06-30 11:45:05


Jeremy Maitin-Shepard wrote:
> Sebastian Redl <sebastian.redl_at_[hidden]> writes:
>
>
>> Jeremy Maitin-Shepard wrote:
>>
>>> - Binary transport layer issue:
>>>
>>> Platforms with unusual features, like 9-bit bytes or inability to
>>> handle types less than 32-bits in size can possibly still implement
>>> the interface for a text/character transport layer, possibly on top of
>>> some other lower-level transport that need not be part of the boost
>>> library. Clearly, the text encoding and decoding would have to be
>>> done differently anyway.
>>>
>>>
>> A good point, but it does mean that the text layer dictates how the
>> binary layer has to work. Not really desirable when pure binary I/O has
>> nothing to do with text I/O.
>>
>
> I'm not sure what you mean by this exactly.
>
Platforms using 9-bit bytes need binary I/O, too, and they may need to
do it in their native 9-bit units. It would be a shame to deprive them
of that possibility just because the text streams require octets,
especially when we already have a layer in place whose purpose is to
convert between low-level data representations.
>
>> One approach that occurs to me would be to make the binary transport
>> layer use a platform-specific byte type (octets, nonets, whatever) and
>> have the binary formatting layer convert this into data suitable for
>> character coding.
>>
>
> It seems like trying to support unusual architectures at all may be
> extremely difficult. See my other post.
>
Which other post is this?
> I suppose if you can find a clean way to support these unusual
> architectures, then all the better.
>
> It seems that it would be very hard to support e.g. utf-8 on a platform
> with 9-bit bytes or which cannot handle types smaller than 32-bits.
>
I think the binary conversion can handle it. The system would work
approximately like this:
1) Every platform defines its basic I/O byte. This would be 8 bits for
most computers (including those where char is 32 bits wide), 9 or some
other number of bits for others. The I/O byte is the smallest unit that
can be read from a stream.
2) Most platforms will additionally designate an octet type; probably I
will just use uint8_t for this. They will supply a Representation for
the formatting layer that can convert a stream of I/O bytes into a
stream of octets, e.g. by truncating each byte (see the sketch after
this list). Whenever an octet stream is needed (e.g. for creating a
UTF-8 stream), this representation is inserted.
3) Platforms that do not support octets at all (or that simply do not
have a primitive type to spare for unambiguous overloads - they could
use another 9-bit type and just ignore the additional bit; character
streams, at least, do not perform arithmetic on their units, so
overflow is not an issue) have no support for this. They're out of
luck. I think this case is rare enough to be ignored.
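
To make step 2 concrete, here is a minimal standalone sketch of such a
truncating Representation. This is not the proposed library interface;
the names io_byte, octet and io_bytes_to_octets are invented purely for
illustration, and the 9-bit platform is only simulated by carrying the
I/O byte in a 16-bit integer.

#include <boost/cstdint.hpp>
#include <cstddef>
#include <vector>

// Hypothetical platform-specific I/O byte: pretend it is 9 bits wide,
// carried in a 16-bit integer for the sake of the example.
typedef boost::uint16_t io_byte; // only the low 9 bits are meaningful
typedef boost::uint8_t  octet;   // the octet type from step 2

// Illustrative "Representation": turns a stream of I/O bytes into a
// stream of octets by truncating each byte to its low 8 bits.
std::vector<octet> io_bytes_to_octets(const std::vector<io_byte>& in)
{
  std::vector<octet> out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size(); ++i)
    out.push_back(static_cast<octet>(in[i] & 0xFF)); // drop the ninth bit
  return out;
}

An octet stream produced this way could then be handed to the UTF-8
machinery unchanged.
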
>> The return type of mark(), on the other hand, can and should be opaque.
>> This allows for many interesting things to be done. For example:
>> Consider a socket. It has no mark/reset, let alone seeking support.
>> You have a recursive descent parser that requires multiple mark/reset
>> support.
>>
>
> I see. It still seems that using different names means that something
> that requires only mark/reset support cannot use a stream providing
> seek/tell support, without an additional intermediate layer.
>
Well, depends. Let's assume, for example, that the system will be
implemented as C++09 templates with heavy use of concepts. The concepts
for multimark/reset and tell/seek could look like this:

typedef implementation-defined streamsize;
enum start_position { begin, end, current };

// Random access: query the current position and jump to an absolute one.
template <typename T>
concept Seekable
{
  streamsize tell(T);
  void seek(T, start_position, streamsize);
}

// Repeatable mark/reset: remember any number of positions and return to them.
template <typename T>
concept MultiMarkReset
{
  typename mark_type;
  mark_type mark(T);
  void reset(T, mark_type);
}

Now it's trivial to make every Seekable stream also support mark/reset
by means of this simple concept map:

template <Seekable T>
concept_map MultiMarkReset<T>
{
  typedef streamsize mark_type;
  mark_type mark(const T &t) { return tell(t); }       // a mark is just the current offset
  void reset(T &t, mark_type m) { seek(t, begin, m); } // seek back to that offset
}

> The reason would be for a protocol in which little/big endian is
> specified as part of the message/data, and a typical implementation
> would always write in native format (and so it would need to determine
> which is the native format), but support both formats for reading.
>
Hmm ... makes sense. I'm not really happy, but it makes sense.
> Ideally, the cost of the virtual function calls would normally be
> mitigated by calling e.g. read/write with a large number of elements at
> once, rather than with only a single element.
>
Yes, but that's the ideal case. In practice, this means that the
application would have to do its own buffering even if it really wants
the data unit by unit. And avoiding the virtual calls altogether means
spelling out the complicated full type of the chain, which the
programmer will not want to do:

newline_filter<encoding_device<utf_8,
    native_converter<gzip_filter<buffering_filter<file_device> > > > > chain =
  open_file(filename, read)
    .attach(buffer()).attach(gunzip())
    .attach(decode(utf_8())).attach(newlines());

The programmer will want to simply write

text_stream<utf_8> chain =
  open_file(filename, read)
    .attach(buffer()).attach(gunzip())
    .attach(decode(utf_8())).attach(newlines());

But text_stream does type erasure and thus has a virtual call for
everything. If the user now proceeds to read single characters from the
stream, that's one virtual call per character. And I don't think this
can really be changed. It's better than a fully object-oriented design,
where every read here would actually mean 3 or more virtual calls down
the chain. (That's the case in Java's I/O system, for example.)
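
To illustrate the kind of type erasure meant here (a rough sketch with
invented names - reader_base, reader_holder, text_stream_sketch - not
part of any proposed interface): the concrete chain type is hidden
behind one abstract base, so every read costs exactly one virtual call
no matter how long the chain is.

#include <boost/cstdint.hpp>
#include <boost/scoped_ptr.hpp>

typedef boost::uint32_t code_point;

// Abstract interface the erased chain is accessed through.
struct reader_base
{
  virtual ~reader_base() {}
  virtual code_point read_char() = 0; // one virtual call per character
};

// Holds the fully typed chain and forwards to it non-virtually.
template <typename Chain>
struct reader_holder : reader_base
{
  explicit reader_holder(const Chain& c) : chain(c) {}
  code_point read_char() { return chain.read_char(); } // statically bound
  Chain chain;
};

// The user-visible, type-erased handle.
class text_stream_sketch
{
public:
  template <typename Chain>
  text_stream_sketch(const Chain& c) : impl(new reader_holder<Chain>(c)) {}
  code_point read_char() { return impl->read_char(); }
private:
  boost::scoped_ptr<reader_base> impl;
};

Any chain object exposing a read_char() member could be wrapped this
way; the price is the single indirection per call described above.
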
> Is it in fact the case that all character encodings that are useful to
> support encode only a subset of Unicode? (i.e. there does not exist a
> useful encoding that can represent a character that cannot be
> represented by Unicode?)
>
I think it is. If it isn't, that's either a defect the Unicode
consortium will want to correct by adding the characters to Unicode, or
the encoding is for really unusual stuff, such as Klingon text or Elven
Tengwar. Those can be seen as mappings into the private use areas of
Unicode and are by nature not convertible to other encodings.
One possible exception is characters that only exist in Unicode as
grapheme clusters but may be directly represented in other encodings.
> In any case, though, it is not clear exactly why there is a need to
> think of an arbitrary character encoding in terms of Unicode, except
> when explicitly converting between that encoding and a Unicode encoding.
>
It is convenient to have a unified concept of a character, independent
of its encoding. The Unicode charset provides such a concept. Unicode is
also convenient in that it adds classification rules and similar stuff.
This decision is not really visible to user code anyway, only to
encoding converters: it should be sufficient to provide a conversion
from and to Unicode code points to enable a new encoding to be used in
the framework.
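
As a rough illustration of that last point (not a proposed interface;
the names below are invented), a new encoding could plug into the
framework by providing little more than the two code-point conversions.
A hypothetical Latin-1 converter, for instance:

#include <boost/cstdint.hpp>

typedef boost::uint32_t code_point; // a Unicode code point

struct latin1_converter
{
  // Decode one encoded unit into a Unicode code point.
  code_point to_unicode(unsigned char unit) const
  {
    return unit; // Latin-1 maps 1:1 onto U+0000..U+00FF
  }

  // Encode one code point; returns false if it is not representable.
  bool from_unicode(code_point cp, unsigned char& unit) const
  {
    if (cp > 0xFF)
      return false; // outside Latin-1
    unit = static_cast<unsigned char>(cp);
    return true;
  }
};

A conversion between two non-Unicode encodings would then simply go
through code points as the pivot.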

Sebastian Redl

