
From: Sebastian Redl (sebastian.redl_at_[hidden])
Date: 2007-06-30 10:46:36


Hi John,

I'm responding to both your mails in a single reply (and mixing your
quotes), because they are closely interrelated.

John Hayes wrote:
> While working on ordinary web software, there are actually a lot more
> variations on data encodings than just text and binary:
And not only in web software. This is exactly what the filters and
devices are supposed to support. However, with some encodings, the line
between filters and devices is a bit blurred.
> A binary format may itself be encoded as bytes (of varying endianness), or in
> Base64 for email attachments (RFC 2045) or Base32 for URLs or form post data
> (RFC 3548).
> I don't think any of the transformations are accurately represented as
> encoding a byte stream as text. I'll quickly address base-64 because it's
> different from the others; this is a bitstream representation that happens
> to tolerate being interpreted as character data in most scenarios
> (base-32 also tolerates case conversion - so it's suitable for HTTP
> headers).
>
From my understanding of Base-64, I'd say I disagree. Base-64 is not a
bitstream representation that tolerates being interpreted as characters.
This would mean that the bit pattern for the Base-64 version of a given
blob is defined. That's not the case, though. The Base-64 transformation
is defined in terms of abstract characters: the bit-hextet 000000
corresponds to A, 000001 to B, and so on. The actual representation of
these characters does not matter - cannot matter! The encoding was
designed to survive re-encoding of the resulting text.
Therefore, writing a Base64Device that wraps a character stream and
provides a binary stream seems to be a very appropriate way of
implementing Base-64 to me.
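To make that concrete, here is a rough sketch of the encoding step
(standalone code, not the interface of my proposed library):

#include <string>

// Minimal sketch: Base-64 encodes each 6-bit group as an abstract
// character from this alphabet. Which bit pattern the output has is
// decided entirely by the character stream underneath.
static const char base64_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode one full group of 3 input bytes into 4 characters.
std::string encode_group(unsigned char b0, unsigned char b1, unsigned char b2)
{
    std::string out;
    out += base64_alphabet[b0 >> 2];                         // top 6 bits of b0
    out += base64_alphabet[((b0 & 0x03) << 4) | (b1 >> 4)];  // 2 of b0, 4 of b1
    out += base64_alphabet[((b1 & 0x0F) << 2) | (b2 >> 6)];  // 4 of b1, 2 of b2
    out += base64_alphabet[b2 & 0x3F];                       // low 6 bits of b2
    return out;
}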
> When encoding in a plain-text format (after encoding into a narrow character
> set), there might still be escaping depending on the container. C, JS, XML
> attributes, elements and CDATAs, SQL (by database) all have different
> escaping rules. This fails to mention sillier issues like newline
> representation.
>
> For the other escaping, these represent a text-representation of text data.
>
This is what text filters are for. While non-trivial, it would certainly
be possible to implement stateful filters that can escape string
literals. Or you can implement simpler filters that do the encoding but
are context-insensitive. (Then you're responsible for inserting and
removing the filters from your chain as context requires.)
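To illustrate the context-insensitive kind, here is a sketch written
against the existing Boost.Iostreams filter concept (the filter interface
of my proposed library differs, but the idea is the same):

#include <boost/iostreams/concepts.hpp>    // output_filter
#include <boost/iostreams/operations.hpp>  // put

namespace io = boost::iostreams;

// Escapes characters for a C string literal. It knows nothing about its
// context; the caller pushes and pops it as the surrounding syntax requires.
struct c_escape_filter : io::output_filter {
    c_escape_filter() : escape_written(false) {}

    template<typename Sink>
    bool put(Sink& dest, int c) {
        if ((c == '"' || c == '\\') && !escape_written) {
            if (!io::put(dest, '\\'))
                return false;        // sink full; caller retries with the same c
            escape_written = true;   // remember, so a retry doesn't double it
        }
        if (!io::put(dest, c))
            return false;
        escape_written = false;
        return true;
    }

    bool escape_written;
};

Such a filter would be pushed onto the chain on entering a string-literal
context and popped again on leaving it.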
> but the interesting question to ask is what support would make
> these operations implementable without rebuffering (to perform translations
> that aren't immediately supported by the stream library).
>
I think in a system that works by combining components (and I think
everything else would be too inflexible) you cannot implement
functionality that changes the size of the data without some
rebuffering, even if only into a small buffer. Any quoting means that
for one incoming character, two characters might get forwarded - or one
for two, going in the other direction. A Base-64 translator must always
buffer some data, because it needs groups of 3 bytes before it can do
encoding and groups of 4 characters before it can do decoding.
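The decoding direction shows this nicely; a sketch (again standalone
code, with error and padding handling omitted):

#include <cstring>
#include <vector>

// Sketch of the buffering a Base-64 decoder cannot avoid: nothing can
// be produced until a full group of 4 characters has arrived.
class base64_decoder {
public:
    base64_decoder() : pending_count(0) {}

    // Feed one input character; emits 3 bytes per complete group.
    void put(char c, std::vector<unsigned char>& out) {
        pending[pending_count++] = value_of(c);
        if (pending_count < 4)
            return;                  // must keep buffering
        pending_count = 0;
        out.push_back((pending[0] << 2) | (pending[1] >> 4));
        out.push_back(((pending[1] & 0x0F) << 4) | (pending[2] >> 2));
        out.push_back(((pending[2] & 0x03) << 6) | pending[3]);
    }

private:
    static unsigned char value_of(char c) {
        // Reverse lookup into the alphabet; '=' padding and bad input
        // are not handled in this sketch.
        static const char* alphabet =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
        return static_cast<unsigned char>(std::strchr(alphabet, c) - alphabet);
    }

    unsigned char pending[4];
    int pending_count;
};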
> Buffering is also an interesting problem because in some formats, buffering
> events (like flush, overflow or EOF) produce streaming output to indicate an
> explicit end of stream, a minimum remaining distance, or differences in
> distance (like how many bytes to the next chunk in a stream).
That's an excellent and important observation. But I think my current
design supports this.
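Base-64 is once more a good example: on end of stream, the encoder must
flush its partial group and emit explicit '=' padding. Continuing the
earlier sketch:

#include <string>

// Called on EOF/close: flush the 1 or 2 bytes still buffered and mark
// the end of the stream with '=' padding.
std::string finish_group(const unsigned char* pending, int count)
{
    static const char* alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    if (count == 1) {
        out += alphabet[pending[0] >> 2];
        out += alphabet[(pending[0] & 0x03) << 4];
        out += "==";
    } else if (count == 2) {
        out += alphabet[pending[0] >> 2];
        out += alphabet[((pending[0] & 0x03) << 4) | (pending[1] >> 4)];
        out += alphabet[(pending[1] & 0x0F) << 2];
        out += '=';
    }
    return out;
}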
> From my limited research, the most complete description of a stream
> encoding is hidden in the description of HTTP 1.1 entities - this defines a
> 3-layer model for streaming:
>
> Buffering events: How to determine how large the stream is (TE,
> Content-Length, Trailer headers)
I don't think this should be part of the stream stack. Determining the
size looks like an application concern to me. The stream simply supplies
the data the application requests. (Or tries to.)
> Transformations: Preprocessing required before the stream can be
> interpreted (Content-Encoding: gzip, deflate, could include byte encodings)
This would be the domain of filters. However, determining the required
transformations is still an application issue. Like above, I don't think
the stream stack should build itself based on data it parses. (However,
it would be an interesting domain for a support library.)
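Such a support library might look roughly like this, sketched on top of
the existing Boost.Iostreams decompressors (a version for the proposed
library would be analogous):

#include <stdexcept>
#include <string>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/iostreams/filter/zlib.hpp>

namespace io = boost::iostreams;

// The application parses the Content-Encoding header and asks a helper
// to push the matching filter; the stream stack itself never inspects
// the data it transports.
void push_content_decoder(io::filtering_istream& in, const std::string& coding)
{
    if (coding == "gzip")
        in.push(io::gzip_decompressor());
    else if (coding == "deflate")
        in.push(io::zlib_decompressor());  // HTTP "deflate" is zlib-wrapped
    else if (coding != "identity")
        throw std::runtime_error("unsupported Content-Encoding: " + coding);
}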
> Type: What class should further interpret the content, and for text
> entities, the character set encoding (Content-Type).
Same thing. Let the application find out the encoding and what to do
with the data.
> 1. Text encoding - how are numbers formatted (are numbers going directly to
> primitive encoding), how are strings escaped and delimited in a text stream.
> If writing to a string buffer, then the stream may terminate here. Text
> encoding may alter the character set - for example, Punycode changes Unicode
> into ASCII (which simplifies the string encode process).
>
All except the first could be accomplished using text filters. The first
seems to be a very domain-specific question that is better handled by
the decision of which interface - the binary or the text - to use in the
first place.
> 2. String encoding - how do strings get reduced to a stream of primitives
> (if the text format matches the encoding format then there's nothing to do
> - true for SBCS, MBCS).
This would be the character conversion device.
> How is a variable length string delimited in binary
> (length prefixes, null termination, maximum size, padding).
>
This looks like a question of serialization to me, and thus outside the
domain of the library.
> 3. Primitive encoding - Endianness, did we really mean IEEE 754 floats,
That's the binary formatting.
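For instance, writing an integer in a fixed byte order regardless of the
host (a minimal sketch of what such a formatting layer does):

#include <cstdint>
#include <ostream>

// Binary formatting concern: emit a 32-bit integer in big-endian byte
// order, independent of the host's endianness.
void write_uint32_be(std::ostream& out, std::uint32_t value)
{
    char bytes[4];
    bytes[0] = static_cast<char>((value >> 24) & 0xFF);
    bytes[1] = static_cast<char>((value >> 16) & 0xFF);
    bytes[2] = static_cast<char>((value >> 8) & 0xFF);
    bytes[3] = static_cast<char>(value & 0xFF);
    out.write(bytes, 4);
}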
> are
> we sending whole bytes or only a subset of bits (an int is expedient for a
> memory image, but there are only 8 significant bits),
Interesting idea here. May be binary formatting, may be serialization,
may be a simple matter of casting the data before feeding it to the
stream. The main problem I see in integrating this into the stream is
that it is highly context-dependent. Which int is there because the
range is needed, and which is only there because the hardware processes
it faster? There could be both kinds within a single structure, which is
why I'm inclined to leave this to the application or the serialization.
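If it is left to the application, the cast is trivial; a sketch
(write_byte is a made-up stand-in for whatever the binary sink offers):

#include <cassert>

// The application, which knows that only 8 bits are significant here,
// narrows the value itself before handing it to the binary sink.
template<typename BinarySink>
void write_small_int(BinarySink& sink, int value)
{
    assert(0 <= value && value <= 0xFF);  // context guarantees the range
    sink.write_byte(static_cast<unsigned char>(value));
}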
> are there alignment
> issues (a file format that was originally a memory image may word-align
> records or fields).
>
Another serialization issue.
> 4. Bitstream encoding - if the output is octets then this layer is optional,
> otherwise chop up bits into Base64 or less.
>
Binary filters can do this, although as I argued above, I don't think
Base-64 is a good example of such a use.
> Tagging the format can most likely be ignored at the stream level. Most
> file formats will either externally or internally specify their encoding
> formats.
I don't think it's even possible, with reasonable effort, to support
this at the stream level. Tagging is very dependent on the data format.
> The most helpful thing to do is provide factory functions that
> convert from existing character set descriptors (
> http://www.iana.org/assignments/character-sets) into an actual operator and
> allow changing the operators at a specific stream position. This will help
> most situations where character encoding is specified in a header.
>
Yes, I agree. The semantics of changing the stack in the middle of the
stream must be defined.
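A factory along those lines might look like this (an entirely
hypothetical interface - character_converter and the concrete converters
are made-up names standing in for whatever the library will provide):

#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical converter interface standing in for the real character
// conversion device.
struct character_converter {
    virtual ~character_converter() {}
    // ... conversion operations would live here ...
};

struct utf8_converter   : character_converter {};
struct latin1_converter : character_converter {};

// Factory keyed by the IANA charset name the application parsed from a
// header; "UTF-8" and "ISO-8859-1" are registered IANA names.
std::unique_ptr<character_converter> make_converter(const std::string& iana_name)
{
    if (iana_name == "UTF-8")
        return std::unique_ptr<character_converter>(new utf8_converter);
    if (iana_name == "ISO-8859-1")
        return std::unique_ptr<character_converter>(new latin1_converter);
    throw std::runtime_error("unknown charset: " + iana_name);
}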

Sebastian Redl

