From: Jeremy Maitin-Shepard (jbms_at_[hidden])
Date: 2007-06-24 20:28:15


Sebastian Redl <sebastian.redl_at_[hidden]> writes:

> Jeremy Maitin-Shepard wrote:
>> - One idea from [Boost.IOStreams] to consider is the
>> direct/indirect device distinction.
>>
> I never noticed this distinction before. It seems useful, but there are
> issues not unlike the AsyncIO issues.
> Direct devices provide a different interface. A programmer can take
> advantage of this interface for some purposes, but for most, I fear, the
> advantages would be lost. Consider:
> - A direct device cannot be wrapped by filters that do dynamic data
> rewriting (such as (de)compression). The random access aspect would be lost.
> - A direct device cannot participate in the larger stack without
> propagating the direct access model throughout the stack. (And this
> stops at the text level anyway, because the character recoder does
> dynamic data rewriting.) Propagating another interface means a lot of
> additional implementation effort and complexity.

Okay. I'm inclined to agree with this.

>> - Binary transport layer issue:
>>
>> Platforms with unusual features, like 9-bit bytes or an inability to
>> handle types smaller than 32 bits, can possibly still implement
>> the interface for a text/character transport layer, possibly on top of
>> some other lower-level transport that need not be part of the Boost
>> library. Clearly, the text encoding and decoding would have to be
>> done differently anyway.
>>
> A good point, but it does mean that the text layer dictates how the
> binary layer has to work. Not really desirable when pure binary I/O has
> nothing to do with text I/O.

I'm not sure what you mean by this exactly.

> One approach that occurs to me would be to make the binary transport
> layer use a platform-specific byte type (octets, nonets, whatever) and
> have the binary formatting layer convert this into data suitable for
> character coding.

It seems like trying to support unusual architectures at all may be
extremely difficult. See my other post.

I suppose if you can find a clean way to support these unusual
architectures, then all the better.

It seems that it would be very hard to support e.g. UTF-8 on a platform
with 9-bit bytes, or on one that cannot handle types smaller than 32 bits.

>> - Seeking:
>>
>> Maybe make multiple mark/reset use the same interface as seeking, for
>> simplicity. Just define that a seeking device has the additional
>> restriction that the mark type is an offset, and the argument to seek
>> need not be the result of a call to tell.
>>
>> Another issue is whether to standardize the return type from tell,
>> like std::ios_base::streampos in the C++ iostreams library.
>>
> These are incompatible requirements, and the reason I want to keep the
> interfaces separate. Standardizing the tell return type is a good idea,
> and necessary both for type erasure to work efficiently and for simple
> use of arbitrary values in seek(). The type must be transparent.
>
> The return type of mark(), on the other hand, can and should be opaque.
> This allows for many interesting things to be done. For example:
> Consider a socket. It has no mark/reset, let alone seeking support.
> You have a recursive descent parser that requires multiple mark/reset
> support.

I see. It still seems that using different names means that something
that requires only mark/reset support cannot use a stream providing
seek/tell support, without an additional intermediate layer.
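
To illustrate the kind of intermediate layer I mean, here is a rough sketch.
All of the names (string_source, mark_reset_adapter, mark_type) are
hypothetical and are not taken from the proposal or from any existing
library; the only assumption is a source offering read/tell/seek with a
transparent offset. The adapter is thin, but it still has to be written and
applied by someone:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical seekable source: tell() returns a transparent offset.
class string_source {
public:
    explicit string_source(std::string data) : data_(std::move(data)) {}

    // Copies up to n characters into out; returns the number copied.
    std::size_t read(char* out, std::size_t n) {
        const std::size_t avail = data_.size() - pos_;
        const std::size_t count = n < avail ? n : avail;
        data_.copy(out, count, pos_);
        pos_ += count;
        return count;
    }

    std::size_t tell() const { return pos_; }

    void seek(std::size_t offset) {
        assert(offset <= data_.size());
        pos_ = offset;
    }

private:
    std::string data_;
    std::size_t pos_ = 0;
};

// Hypothetical adapter: exposes mark()/reset() in terms of tell()/seek().
// The mark is opaque to the caller even though it is just an offset inside.
template <typename Seekable>
class mark_reset_adapter {
public:
    class mark_type {
        friend class mark_reset_adapter;
        explicit mark_type(std::size_t offset) : offset_(offset) {}
        std::size_t offset_;
    };

    explicit mark_reset_adapter(Seekable& source) : source_(source) {}

    std::size_t read(char* out, std::size_t n) { return source_.read(out, n); }
    mark_type mark() { return mark_type(source_.tell()); }
    void reset(const mark_type& m) { source_.seek(m.offset_); }

private:
    Seekable& source_;
};
```

A parser written against mark/reset could then be handed a
mark_reset_adapter<string_source>, but only because the wrapper was put in
place explicitly.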

[snip]

>> It should probably also
>> be possible to determine using the library at compile time what the
>> native format is.
> To what end? If the native format is one of the special predefined ones,
> it will hopefully be optimized in the platform-aware special
> implementation (well, I can dream) anyway.

The reason would be a protocol in which the byte order (little or big
endian) is specified as part of the message/data. A typical implementation
would always write in native format (and so it would need to determine
which is the native format), but support both formats for reading.
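
A minimal sketch of such a protocol follows. The message layout (one tag
byte followed by a 32-bit value) and all names are made up for illustration.
The compile-time detection here relies on the __BYTE_ORDER__ macros
predefined by GCC and Clang, which is an assumption about the compiler; a
library facility would hide that detail, and C++20 later added std::endian
in <bit> for the same purpose.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

enum class byte_order : unsigned char { little = 0, big = 1 };

// Compile-time knowledge of the native format (GCC/Clang macros assumed).
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
constexpr byte_order native_order = byte_order::little;
#elif defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
constexpr byte_order native_order = byte_order::big;
#else
#error "native byte order unknown; this sketch relies on GCC/Clang macros"
#endif

// Hypothetical layout: 1 tag byte, then 4 value bytes in the tagged order.
constexpr std::size_t message_size = 5;

// Writer: always emits the native format and tags the message with it,
// so no byte swapping is needed on the writing side.
void write_message(std::uint32_t value, unsigned char* out) {
    out[0] = static_cast<unsigned char>(native_order);
    std::memcpy(out + 1, &value, sizeof value);
}

// Reader: accepts either format, decoding according to the tag.
std::uint32_t read_message(const unsigned char* in) {
    assert(in[0] <= 1);  // tag must name a known byte order
    const bool little = in[0] == static_cast<unsigned char>(byte_order::little);
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        const std::size_t shift = little ? 8 * i : 8 * (3 - i);
        value |= static_cast<std::uint32_t>(in[1 + i]) << shift;
    }
    return value;
}
```

The writer is the only part that needs to know the native format; the
reader is written so that it works regardless of the host.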

>> - Header vs Precompiled:
>>
>> I think as much should be separately compiled as possible, but I also
>> think that type erasure should not be used in any case where it will
>> significantly compromise performance.
>>
> I'm thinking of a system where components are templates on the component
> they wrap, so as to allow direct calls upwards. I'm thinking of using
> the common separately compiled template specialization extension of
> compilers to provide pre-compiled versions of the standard components
> instantiated with the erasure components. This is very similar to how
> Spirit works, except that it doesn't have pre-compiled stuff. In Spirit,
> rule is the erasure type, but the various parsers can be directly
> linked, too.

Ideally, the cost of the virtual function calls would normally be
mitigated by calling e.g. read/write with a large number of elements at
once, rather than with only a single element.
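
As a rough illustration (the interface and class names are hypothetical, not
part of any proposal so far), consider a type-erased source whose only
virtual function is a bulk read. The toy source below counts how often it is
called, so the effect of the block size on the number of virtual calls can
be observed directly:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical erasure interface: a single virtual bulk read.
class byte_source {
public:
    virtual ~byte_source() = default;
    // Reads up to n bytes into out; returns the number read, 0 at end.
    virtual std::size_t read(unsigned char* out, std::size_t n) = 0;
};

// Toy source producing `total` bytes and counting calls to read().
class counting_source : public byte_source {
public:
    explicit counting_source(std::size_t total) : remaining_(total) {}

    std::size_t read(unsigned char* out, std::size_t n) override {
        ++calls_;
        const std::size_t count = n < remaining_ ? n : remaining_;
        for (std::size_t i = 0; i < count; ++i)
            out[i] = static_cast<unsigned char>(next_++);
        remaining_ -= count;
        return count;
    }

    std::size_t calls() const { return calls_; }

private:
    std::size_t remaining_;
    std::size_t next_ = 0;
    std::size_t calls_ = 0;
};

// Drains the source through a buffer of block_size bytes and returns the
// sum of all bytes seen. One virtual call is made per block.
unsigned long drain(byte_source& source, std::size_t block_size) {
    assert(block_size > 0);
    std::vector<unsigned char> block(block_size);
    unsigned long sum = 0;
    std::size_t got = 0;
    while ((got = source.read(block.data(), block.size())) > 0)
        for (std::size_t i = 0; i < got; ++i)
            sum += block[i];
    return sum;
}
```

Draining 4096 bytes one element at a time costs 4097 virtual calls
(including the final call that reports the end), whereas a block size of
1024 costs 5, for the same result.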

> Then, if the performance is needed, the programmer can hand-craft his
> chain so that no virtual calls are made, at the cost of compiling his
> own copy of the components.
>
> I'm afraid I don't see a better way of doing this. I'm wide open to
> suggestions.
>> - The "byte" stream and the character stream, while conceptually
>> different, should probably both be considered just "streams" of
>> particular POD types.
> I have explained in a different post why I don't think this is a good idea.
>> - Text transport:
>>
>> I don't think this layer should be restricted to Unicode encodings.
>>
> I have no plans of doing so. I just consider all encodings as encodings
> of the universal character set. An encoding is defined by how it maps
> the UCS code points onto groups of octets, words, or other primitives.

Is it in fact the case that all character encodings that are useful to
support encode only a subset of Unicode? (i.e. there does not exist a
useful encoding that can represent a character that cannot be
represented by Unicode?)

In any case, though, it is not clear exactly why there is a need to
think of an arbitrary character encoding in terms of Unicode, except
when explicitly converting between that encoding and a Unicode encoding.
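
For concreteness, here is how I read the "encoding as a mapping from UCS
code points onto groups of octets" view, sketched for two encodings. The
function names and the bool-returning interface are my own invention for
this example and are not taken from the proposal:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using code_point = std::uint32_t;
using octets = std::vector<unsigned char>;

// ISO-8859-1 (Latin-1): covers only U+0000..U+00FF, one octet each.
// Returns false if the code point is not representable.
bool encode_latin1(code_point cp, octets& out) {
    if (cp > 0xFF) return false;
    out.push_back(static_cast<unsigned char>(cp));
    return true;
}

// UTF-8: covers every Unicode scalar value, using 1 to 4 octets.
// Returns false for surrogates and values above U+10FFFF.
bool encode_utf8(code_point cp, octets& out) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    auto push = [&out](code_point v) {
        out.push_back(static_cast<unsigned char>(v));
    };
    if (cp < 0x80) {
        push(cp);
    } else if (cp < 0x800) {
        push(0xC0 | (cp >> 6));
        push(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        push(0xE0 | (cp >> 12));
        push(0x80 | ((cp >> 6) & 0x3F));
        push(0x80 | (cp & 0x3F));
    } else {
        push(0xF0 | (cp >> 18));
        push(0x80 | ((cp >> 12) & 0x3F));
        push(0x80 | ((cp >> 6) & 0x3F));
        push(0x80 | (cp & 0x3F));
    }
    return true;
}

// Convenience wrapper for callers that already know cp is encodable.
octets utf8_of(code_point cp) {
    octets out;
    const bool ok = encode_utf8(cp, out);
    assert(ok);
    (void)ok;
    return out;
}
```

Under this view Latin-1 is simply a partial mapping: U+00E9 encodes as the
single octet 0xE9 in Latin-1 and as 0xC3 0xA9 in UTF-8, while U+20AC has no
Latin-1 encoding at all.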

[snip]

-- 
Jeremy Maitin-Shepard
