From: Jeremy Maitin-Shepard (jbms_at_[hidden])
Date: 2007-07-01 13:00:47
Sebastian Redl <sebastian.redl_at_[hidden]> writes:
>> Perhaps the name "data stream" would be appropriate, or better yet,
>> perhaps just "stream", and use the qualified name "text stream" or
>> "character stream" to refer to streams of characters that are somehow
>> marked (either at compile-time or run-time) with the encoding.
> This might lead to ambiguity when addressing streams as a whole.
Okay, that is a good point. "Data stream" would probably be best then.
I am quite averse to "binary stream", since it would really be a misuse,
albeit a common misuse, of "binary".
>> Should text streams of arbitrary (non-Unicode) encodings be supported?
>> Also, should text streams support arbitrary base types (i.e. uint8_t or
>> uint16_t, or some other type), or should they be restricted to a single
>> type (like uint16_t)?
> Each encoding requires a specific base type. For example, UTF-8 requires
> uint8_t, UTF-16 requires uint16_t, UTF-16LE and BE require uint8_t (they
> do their own combining). The current binary formatting layer would be
> used by the converter to get units of the desired format.
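The encoding-to-base-type mapping described in the quoted text can be
sketched as a compile-time traits class. This is only an illustration:
the tag names and the base_type template are hypothetical and are not
taken from any proposed interface.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical encoding tags (names invented for this sketch).
struct utf8_tag {};
struct utf16_tag {};    // native-endian UTF-16: stream of 16-bit units
struct utf16le_tag {};  // serialized UTF-16: stream of bytes
struct utf16be_tag {};  // serialized UTF-16: stream of bytes

// Primary template is left undefined, so an unknown encoding
// fails to compile instead of silently picking a type.
template <class Encoding> struct base_type;

template <> struct base_type<utf8_tag>    { using type = std::uint8_t;  };
template <> struct base_type<utf16_tag>   { using type = std::uint16_t; };
template <> struct base_type<utf16le_tag> { using type = std::uint8_t;  };
template <> struct base_type<utf16be_tag> { using type = std::uint8_t;  };
```

With such a mapping, the base type is a consequence of the encoding
rather than an independent choice.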
I see. You are suggesting, I suppose, that in addition to providing
formatting of individual values, the binary formatting layer also
provides stream filters for converting a stream of one type into a
stream of another type with a particular formatting. I like this idea.
This seems to suggest, then, that to convert UTF-32 (native endian) text
in a file to, say, a UTF-16 (native endian) text stream, you have to
first convert the file's uint8_t stream to a uint32_t stream (native
endian formatting), then mark this uint32_t stream as UTF-32, and
finally use a text encoding conversion filter to convert this UTF-32
stream to a UTF-16 (native endian) uint16_t stream.
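A minimal sketch of that three-step pipeline, using whole buffers in
place of real stream filters. All names here (bytes_to_u32_native,
text_units, utf32_to_utf16, the tag types) are hypothetical and only
illustrate the shape of the steps.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Step 1: read the file's uint8_t stream as native-endian uint32_t units.
std::vector<std::uint32_t> bytes_to_u32_native(const std::vector<std::uint8_t>& bytes) {
    if (bytes.size() % sizeof(std::uint32_t) != 0)
        throw std::invalid_argument("byte count is not a multiple of 4");
    std::vector<std::uint32_t> units(bytes.size() / sizeof(std::uint32_t));
    if (!units.empty())
        std::memcpy(units.data(), bytes.data(), bytes.size());
    return units;
}

// Step 2: "mark" a unit stream with its encoding via a compile-time tag.
struct utf32_tag {};
struct utf16_tag {};

template <class Encoding, class Unit>
struct text_units {
    std::vector<Unit> units;
};

// Step 3: text encoding conversion from UTF-32 to UTF-16.
text_units<utf16_tag, std::uint16_t>
utf32_to_utf16(const text_units<utf32_tag, std::uint32_t>& in) {
    text_units<utf16_tag, std::uint16_t> out;
    for (std::uint32_t cp : in.units) {
        // Reject values outside the Unicode range and lone surrogates.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid UTF-32 code point");
        if (cp < 0x10000) {
            out.units.push_back(static_cast<std::uint16_t>(cp));
        } else {
            // Encode as a surrogate pair.
            std::uint32_t v = cp - 0x10000;
            assert(v <= 0xFFFFF);
            out.units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));
            out.units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}
```

Note that the caller has to name uint32_t and the UTF-32 tag explicitly
before the conversion step can be applied.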
The trouble is, though, that if you then have a file with UTF-8 encoded
text, you have to use different types to obtain the same UTF-16 uint16_t
stream.
Furthermore, the encoding of the file might be supplied (by name) by the
user at run-time as one of a large number of supported encodings; the
base type of this encoding should not be particularly important.
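One way to picture run-time selection is a registry keyed by encoding
name, where every decoder takes the raw byte stream and produces UTF-16
units, so the caller never sees the base type of the source encoding.
This is a rough sketch only: the function names and the small set of
encodings are assumptions chosen to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

using byte_vec  = std::vector<std::uint8_t>;
using utf16_vec = std::vector<std::uint16_t>;
using decoder   = std::function<utf16_vec(const byte_vec&)>;

// ISO-8859-1: each byte maps directly to the same code point.
utf16_vec decode_latin1(const byte_vec& in) {
    return utf16_vec(in.begin(), in.end());
}

// Serialized UTF-16: combine byte pairs in the stated byte order.
utf16_vec decode_utf16(const byte_vec& in, bool big_endian) {
    if (in.size() % 2 != 0)
        throw std::invalid_argument("odd byte count for UTF-16");
    utf16_vec out;
    out.reserve(in.size() / 2);
    for (std::size_t i = 0; i < in.size(); i += 2) {
        std::uint16_t hi = big_endian ? in[i] : in[i + 1];
        std::uint16_t lo = big_endian ? in[i + 1] : in[i];
        out.push_back(static_cast<std::uint16_t>((hi << 8) | lo));
    }
    assert(out.size() == in.size() / 2);
    return out;
}

// Hypothetical registry of decoders, looked up by name at run-time.
const std::map<std::string, decoder>& decoder_registry() {
    static const std::map<std::string, decoder> registry = {
        {"ISO-8859-1", decode_latin1},
        {"UTF-16LE", [](const byte_vec& b) { return decode_utf16(b, false); }},
        {"UTF-16BE", [](const byte_vec& b) { return decode_utf16(b, true); }},
    };
    return registry;
}

utf16_vec decode_named(const std::string& encoding, const byte_vec& in) {
    auto it = decoder_registry().find(encoding);
    if (it == decoder_registry().end())
        throw std::invalid_argument("unsupported encoding: " + encoding);
    return it->second(in);
}
```

A call such as decode_named(user_supplied_name, file_bytes) then yields
the same uint16_t result type whichever encoding the user names.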
> Yes, a text stream is essentially a binary stream with an encoding
> instead of a data type. So the interface is the same in description, but
> the types involved are different. I think this is mostly a documentation
Instead of a data type? But presumably both the data type and the
encoding must be specified. Also, it seems like it may be useful to be
able to specify the encoding at run-time, rather than just at
compile-time.
>> This suggests
>> that encoding/decoding/conversion should exist as a "data stream"
> It does. That's what the character converter device does.
Well, the question is under what interface character encoding conversion
should be done. It could be a text stream to text stream interface.
>> It seems that it may be useful to allow the encoding to be specified at
>> run-time.
> Only for the external encoding. The internal encoding should be fixed at
> compile time. Everything else is just too confusing. That's one
> important lesson I've learned trying to internationalize PHP web
What does internal or external really mean though? That is somewhat of
an artificial distinction in itself. It may be a reasonable one, but
you'll have to define those terms. Which encodings will be supported at
compile-time, then? Just UTF-8, UTF-16, and UTF-32?
-- Jeremy Maitin-Shepard