From: Sebastian Redl (sebastian.redl_at_[hidden])
Date: 2007-06-30 12:41:12
Jeremy Maitin-Shepard wrote:
> Sebastian Redl <sebastian.redl_at_[hidden]> writes:
>> Because the exact unit of transport is still open (and the
>> current tendency I see is toward using octets, and leaving native bytes
>> to some other mechanism), I didn't want any such implication in the name.
>> The name binary isn't a very good choice either, I admit. In the end,
>> all data is binary. But the distinction between "binary" and "textual"
>> data is important, and not only at the concept level. What I have in my
>> mind works something like this:
>> Binary data is in terms of octets, bytes, primitives, or PODs,
> Perhaps the name "data stream" would be appropriate, or better yet,
> perhaps just "stream", and use the qualified name "text stream" or
> "character stream" to refer to streams of characters that are somehow
> marked (either at compile-time or run-time) with the encoding.
This might lead to ambiguity when addressing streams as a whole.
> Should text streams of arbitrary (non-Unicode) encodings be supported?
> Also, should text streams support arbitrary base types (i.e. uint8_t or
> uint16_t, or some other type), or should they be restricted to a single
> type (like uint16_t)?
Each encoding requires a specific base type. For example, UTF-8 requires
uint8_t, UTF-16 requires uint16_t, UTF-16LE and BE require uint8_t (they
do their own combining). The current binary formatting layer would be
used by the converter to get units of the desired format.
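
To make the mapping concrete, here is a minimal sketch of a traits
template that ties each encoding to its unit type. The tag and trait names
(utf8, utf16, utf16le, utf16be, encoding_traits) are made up for
illustration; they are not taken from the design document.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical encoding tags, for illustration only.
struct utf8 {};
struct utf16 {};    // native UTF-16, already assembled into 16-bit units
struct utf16le {};  // serialized UTF-16, little-endian octets
struct utf16be {};  // serialized UTF-16, big-endian octets

// Maps an encoding tag to the unit type a stream of that encoding carries.
// Left undefined for unknown encodings so misuse fails at compile time.
template <typename Encoding>
struct encoding_traits;

template <> struct encoding_traits<utf8>    { using unit_type = std::uint8_t; };
template <> struct encoding_traits<utf16>   { using unit_type = std::uint16_t; };
// The LE/BE variants combine octets themselves, so they sit on octets.
template <> struct encoding_traits<utf16le> { using unit_type = std::uint8_t; };
template <> struct encoding_traits<utf16be> { using unit_type = std::uint8_t; };

static_assert(
    std::is_same<encoding_traits<utf16>::unit_type, std::uint16_t>::value,
    "native UTF-16 is expected to use 16-bit units");
```

With something like this, a converter could ask the binary formatting layer
for encoding_traits<E>::unit_type without hard-coding the type per encoding.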
> The reason for substantial unification between both "data streams" and
> "text streams" is that despite differences in how the data they
> transport is used, the interface should essentially be the same (both
> basic read/write, as well as things like mark/reset and seeking), and a
> buffering facility should be exactly the same for both types of streams.
> Similarly, facilities for providing mark/reset support on top of a
> stream that does not support it by using buffering would be exactly the
> same for both "binary" and "text" streams.
> Even if seek may not be as useful for text streams, it still might be
> useful to some people, and there is no reason to exclude it.
> In the document you posted, for instance, you essentially just
> duplicated much of the description of binary streams for the text
> streams, which suggests a problem.
Yes, a text stream is essentially a binary stream with an encoding
instead of a data type. So the interface is the same in description, but
the types involved are different. I think this is mostly a documentation
> As I suggest below, a "text stream" might always be a very thin layer on
> top of a binary stream, that simply specifies an encoding. The issue,
> though, is how it would work to layer something like a regular buffer or
> a mark/reset-providing buffer on top of a text stream. There shouldn't
> have to be two mark/reset providers, one for data streams, and one for
> text streams, but also it should be possible to layer such a thing on top
> of a text stream directly, and still maintain the encoding annotation.
While this would be nice, I'm not sure the C++ type system supports it
directly. The library might provide templates that can do such a thing.
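
As a rough sketch of what such a template could look like: a single
mark/reset adapter that works on any source and simply forwards the
source's encoding annotation. Everything here (no_encoding, vector_source,
mark_reset, the read/mark/reset members) is hypothetical and only meant to
show that one template could serve both kinds of stream.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

struct no_encoding {};  // hypothetical tag for plain binary streams
struct utf8 {};         // hypothetical encoding tag

// Hypothetical in-memory source, standing in for a real device.
template <typename Unit, typename Encoding = no_encoding>
class vector_source {
public:
    using unit_type = Unit;
    using encoding_type = Encoding;

    explicit vector_source(std::vector<Unit> data)
        : data_(std::move(data)), pos_(0) {}

    // Reads one unit; returns false at end of data.
    bool read(Unit& out) {
        if (pos_ == data_.size()) return false;
        out = data_[pos_++];
        return true;
    }

private:
    std::vector<Unit> data_;
    std::size_t pos_;
};

// One adapter for both binary and text sources: it buffers everything read
// after mark() and replays it after reset(). The encoding tag is passed
// through unchanged, so the annotation survives the layering.
template <typename Source>
class mark_reset {
public:
    using unit_type = typename Source::unit_type;
    using encoding_type = typename Source::encoding_type;

    explicit mark_reset(Source source)
        : source_(std::move(source)), replay_pos_(0), marked_(false) {}

    // Remembers the current position; units already replayed are dropped.
    void mark() {
        buffer_.erase(buffer_.begin(),
                      buffer_.begin() + static_cast<std::ptrdiff_t>(replay_pos_));
        replay_pos_ = 0;
        marked_ = true;
    }

    // Rewinds to the last mark (a no-op if mark() was never called).
    void reset() { replay_pos_ = 0; }

    bool read(unit_type& out) {
        if (replay_pos_ < buffer_.size()) {  // replay buffered units first
            out = buffer_[replay_pos_++];
            return true;
        }
        if (!source_.read(out)) return false;
        if (marked_) {                       // remember for a later reset()
            buffer_.push_back(out);
            ++replay_pos_;
        }
        return true;
    }

private:
    Source source_;
    std::vector<unit_type> buffer_;
    std::size_t replay_pos_;
    bool marked_;
};
```

The buffer grows without bound after mark() in this sketch; a real
implementation would need a limit or a way to release the mark.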
> This suggests
> that encoding/decoding/conversion should exist as a "data stream"
It does. That's what the character converter device does.
> One thing I haven't figured out, though, is how the
> underlying unit type of the stream, i.e. uint8_t or uint16_t, would
> correspond to the encoding. In particular, the issue is what underlying
> unit type corresponds to each of the following encodings:
> - UTF-8, iso-8859-* (it seems obvious that uint8_t would be the choice)
> - UTF-16 (uint16_t looks promising, but you need to be able to read
> this from a file, which might be a uint8_t stream)
> - UTF-16-LE/UTF-16-BE (uint16_t looks promising, but also
> uint16_le_t/uint16_be_t (a special type that might be defined in the
> endian library) might be better, and furthermore you need to be able
> to read this from a file, which might be a uint8_t stream)
> Perhaps you have some ideas about the conceptual model that resolves
> these issues.
I have. There is a stream operation that allows conversion of the raw
stream (of type octet or iobyte or something) into a stream of another
primitive. That's the binary formatting layer.
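
A minimal sketch of that idea, assuming the raw stream is just a vector of
octets: a function that assembles 16-bit units in a chosen byte order. The
names byte_order and format_as_uint16 are invented for this example, and a
real layer would work on streams rather than whole vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class byte_order { little, big };

// Hypothetical helper: combines each pair of octets into one 16-bit unit.
// A trailing odd octet is ignored in this sketch.
inline std::vector<std::uint16_t>
format_as_uint16(const std::vector<std::uint8_t>& octets, byte_order order)
{
    std::vector<std::uint16_t> units;
    units.reserve(octets.size() / 2);
    for (std::size_t i = 0; i + 1 < octets.size(); i += 2) {
        const unsigned first = octets[i];
        const unsigned second = octets[i + 1];
        const unsigned unit = (order == byte_order::big)
            ? (first << 8) | second    // first octet is the high byte
            : (second << 8) | first;   // first octet is the low byte
        units.push_back(static_cast<std::uint16_t>(unit));
    }
    return units;
}

// Usage: the same four octets read as big-endian and little-endian units.
inline void format_example()
{
    const std::vector<std::uint8_t> octets = {0x00, 0x41, 0x20, 0xAC};
    const std::vector<std::uint16_t> be = format_as_uint16(octets, byte_order::big);
    const std::vector<std::uint16_t> le = format_as_uint16(octets, byte_order::little);
    assert(be[0] == 0x0041 && be[1] == 0x20AC);
    assert(le[0] == 0x4100 && le[1] == 0xAC20);
}
```

A UTF-16 text stream read from a file would then sit on top of the result,
while UTF-16LE and UTF-16BE would take the octets directly.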
> It seems that it may be useful to allow the encoding to be specified at
Only for the external encoding. The internal encoding should be fixed at
compile time. Everything else is just too confusing. That's one
important lesson I've learned trying to internationalize PHP web