
From: John Hayes (john.martin.hayes_at_[hidden])
Date: 2007-06-25 17:38:15


I don't think any of the transformations are accurately represented as
encoding a byte stream as text. I'll quickly address base-64 because it's
different from the others; this is a bitstream representation that happens
to tolerate being interpreted as character data in most scenarios
(base-32 also tolerates case conversion - so it's suitable for HTTP
headers).

The other escapes are text representations of text data. That sounds dumb,
but look at it from the perspective of encoding a 32-bit int - when that
gets streamed, there are two choices:

1. Submit 32 bits for encoding - the stream requires enough parameters to
figure out the endianness, and then how the bits are represented.
2. Convert it into text and submit the text - the string requires the
parameters for this conversion (base, leading 0s - printf parameters), then
downstream the text encoding has its own parameters.
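
To make the two choices concrete, here is a minimal sketch. The names
encode_binary, encode_text and ByteOrder are made up for illustration and
belong to no existing library; encode_text assumes a base between 2 and 16.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Choice 1: emit the raw 32 bits. The only parameter is the byte order.
enum class ByteOrder { Big, Little };

std::vector<std::uint8_t> encode_binary(std::uint32_t value, ByteOrder order) {
    std::vector<std::uint8_t> out(4);
    for (int i = 0; i < 4; ++i) {
        // Big-endian puts the most significant byte first.
        int shift = (order == ByteOrder::Big) ? 8 * (3 - i) : 8 * i;
        out[i] = static_cast<std::uint8_t>((value >> shift) & 0xFFu);
    }
    return out;
}

// Choice 2: convert to text first. The parameters are printf-like
// (base, minimum width padded with leading zeros). A later stage still
// has to pick a character encoding for the resulting string.
// Assumption: 2 <= base <= 16.
std::string encode_text(std::uint32_t value, unsigned base, std::size_t min_width) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    do {
        out.insert(out.begin(), digits[value % base]);
        value /= base;
    } while (value != 0);
    if (out.size() < min_width) out.insert(0, min_width - out.size(), '0');
    return out;
}
```

With these, encode_binary(0x01020304, ByteOrder::Big) yields the four octets
01 02 03 04, while encode_text(255, 16, 4) yields the string "00ff", which
still needs a text encoding before it becomes octets.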

Escaping is just another way of saying "what is the text representation of
this text". There's more that could be included; however, some really basic
operators like escape operators and stream termination operators would go a
long way towards making it easy to interpret a lot of file formats. These
operators work not on binary data, but on text data (something that can
interpret the grapheme clusters directly). Otherwise, escaping will be buggy
the first time it's applied to characters outside of the bottom 48.
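
As a rough sketch of what a text-to-text escape operator could look like:
the function escape_text below is hypothetical, and as a simplification it
works on Unicode code points (std::u32string) rather than on grapheme
clusters. The escape syntax is loosely modelled on C++ string literals.

```cpp
#include <string>

// Hypothetical escape operator: text in (code points), text out (ASCII only).
// Because it sees whole code points, it never splits a multi-byte character.
std::string escape_text(const std::u32string& in) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (char32_t c : in) {
        if (c == U'\\') {
            out += "\\\\";
        } else if (c == U'"') {
            out += "\\\"";
        } else if (c == U'\n') {
            out += "\\n";
        } else if (c >= 0x20 && c < 0x7F) {
            out += static_cast<char>(c);  // printable ASCII passes through
        } else {
            // Everything else becomes \UXXXXXXXX (eight hex digits).
            out += "\\U";
            for (int shift = 28; shift >= 0; shift -= 4)
                out += hex[(c >> shift) & 0xF];
        }
    }
    return out;
}
```

The point of the design is that the operator takes text and returns text;
turning the result into octets is left to a later layer.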

At some point, the line between streaming, serialization and text parsing
gets blurred (the smarter the text parsing the deeper we go into unicode
issues) - but the interesting question to ask is what support would make
these operations implementable without rebuffering (to perform translations
that aren't immediately supported by the stream library).

The complete stack needed to support all of these requirements has a bunch
of layers, but depending on the application most of them are optional:

1. Text encoding - how are numbers formatted (are numbers going directly to
primitive encoding), how are strings escaped and delimited in a text stream.
If writing to a string buffer, then the stream may terminate here. Text
encoding may alter the character set - for example, punycode changes Unicode
into ASCII (which simplifies the string encoding process).
2. String encoding - how do strings get reduced to a stream of primitives
(if the text format matches the encoding format then there's nothing to do
- true for SBCS, MBCS). How is a variable length string delimited in binary
(length prefixes, null termination, maximum size, padding).
3. Primitive encoding - Endianness, did we really mean IEEE 754 floats, are
we sending whole bytes or only a subset of bits (an int is expedient for a
memory image, but there are only 8 significant bits), are there alignment
issues (a file format that was originally a memory image may word-align
records or fields).
4. Bitstream encoding - if the output is octets then this layer is optional,
otherwise chop up bits into Base64 or less.
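
A small sketch of the lower layers, under stated assumptions: encode_string
stands in for layers 2 and 3 (a string delimited by a 16-bit big-endian
length prefix, already in its wire encoding and shorter than 65536 bytes),
and to_base64 stands in for layer 4 using the standard Base64 alphabet with
'=' padding. Both names are invented for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Octets = std::vector<std::uint8_t>;

// Layers 2 and 3: length-prefixed string, prefix is 16 bits, big-endian.
// Assumption: s.size() < 65536 and s is already in its wire encoding.
Octets encode_string(const std::string& s) {
    Octets out;
    out.push_back(static_cast<std::uint8_t>((s.size() >> 8) & 0xFF));
    out.push_back(static_cast<std::uint8_t>(s.size() & 0xFF));
    out.insert(out.end(), s.begin(), s.end());
    return out;
}

// Layer 4: chop the octets into 6-bit groups (standard Base64 alphabet).
std::string to_base64(const Octets& in) {
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    for (std::size_t i = 0; i < in.size(); i += 3) {
        std::size_t n = in.size() - i < 3 ? in.size() - i : 3;
        std::uint32_t chunk = 0;
        // Pack up to three octets into 24 bits, zero-filling a short tail.
        for (std::size_t j = 0; j < 3; ++j)
            chunk = (chunk << 8) | (j < n ? in[i + j] : 0u);
        out += alphabet[(chunk >> 18) & 0x3F];
        out += alphabet[(chunk >> 12) & 0x3F];
        out += n > 1 ? alphabet[(chunk >> 6) & 0x3F] : '=';
        out += n > 2 ? alphabet[chunk & 0x3F] : '=';
    }
    return out;
}
```

Composing them as to_base64(encode_string(text)) shows why the layers are
optional: if the destination already accepts octets, the last call is simply
dropped.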

Tagging the format can most likely be ignored at the stream level. Most
file formats will either externally or internally specify their encoding
formats. The most helpful thing to do is provide factory functions that
convert from existing character set descriptors (
http://www.iana.org/assignments/character-sets) into an actual operator and
allow changing the operators at a specific stream position. This will help
most situations where character encoding is specified in a header.
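
One possible shape for such a factory, purely as a sketch: make_encoder,
Encoder and TextWriter are hypothetical names, only two charsets are
handled, and the UTF-8 encoder does not validate its input (surrogates and
out-of-range values are not rejected).

```cpp
#include <cctype>
#include <functional>
#include <stdexcept>
#include <string>

// An "operator" here is just a function from code points to octets.
using Encoder = std::function<std::string(const std::u32string&)>;

std::string encode_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else if (c < 0x800) {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (c >> 18));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

std::string encode_latin1(const std::u32string& in) {
    std::string out;
    for (char32_t c : in) {
        if (c > 0xFF) throw std::range_error("not representable in ISO-8859-1");
        out += static_cast<char>(c);
    }
    return out;
}

// Factory: map a charset name (matched case-insensitively) to an encoder.
Encoder make_encoder(std::string name) {
    for (char& ch : name)
        ch = static_cast<char>(std::toupper(static_cast<unsigned char>(ch)));
    if (name == "UTF-8") return encode_utf8;
    if (name == "ISO-8859-1" || name == "LATIN1") return encode_latin1;
    throw std::invalid_argument("unsupported charset: " + name);
}

// A writer whose operator can be swapped at any stream position.
struct TextWriter {
    Encoder encode = encode_utf8;
    std::string bytes;
    void set_charset(const std::string& name) { encode = make_encoder(name); }
    void write(const std::u32string& text) { bytes += encode(text); }
};
```

A reader would work the same way in reverse: parse the header with a
default operator, read the charset name from it, then call something like
set_charset before continuing with the body.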

John

On 6/22/07, Jeremy Maitin-Shepard <jbms_at_[hidden]> wrote:
>
> "John Hayes" <john.martin.hayes_at_[hidden]> writes:
>
> > While working on ordinary web software, there are actually a lot more
> > variations on data encodings than just text and binary:
>
> It seems fairly logical to me to have the following organization:
>
> - Streams of arbitrary POD types
>
> For instance, you might have uint8_t streams, uint16_t streams, etc.
>
> - A byte stream would be a uint8_t stream.
>
> - A text stream holding utf-16 encoded text would be a uint16_t stream,
> while a text stream holding utf-8 encoded text would be a uint8_t
> stream. A text stream holding iso-8859-1 encoded text would also be
> a uint8_t stream.
>
> There is the issue of whether it is useful to have a special text stream
> type that is tagged (either at compile-time or at run-time) with the
> encoding in which the data either going in or out of it are supposed to
> be. How exactly this tagging should be done, and to what extent it
> would be useful, remains to be explored.
>
> It seems that your various examples of filters/encoding, like BASE-64,
> URL encoding, CDATA escaping, and C++ string escaping, might well fit
> into the framework I described in the previous paragraphs. Many of
> these filters can be viewed as encoding a byte stream as text.
>
> Let me know your thoughts, though.
>

