
From: Jeremy Maitin-Shepard (jbms_at_[hidden])
Date: 2007-07-02 12:12:51


Sebastian Redl <sebastian.redl_at_[hidden]> writes:

[snip]

> Platforms using 9-bit bytes have need for binary I/O, too. They might
> have need for doing it in their native 9-bit units. It would be a shame
> to deprive them of this possibility just because the text streams
> require octets. Especially if we already have a layer in place whose
> purpose is to convert between low-level data representations.

It seems that the primary interface for the data formatting layer should
be in terms of fixed-size types like {u,}int{8,16,32,64}_t. It is more
the job of a serialization library to support platform-dependent types
like short, int, long, etc., which would be of use primarily for
producing serialization output that will only be read back by the exact
same program.

I suppose an alternative is for the read/write functions in the data
formatting layer to always specify an explicit number of bits. For
example,
write_{u,}int<32> or read_{u,}int<32>.

read_int<N> always returns intN_t, and it is a compile-time error if
that type does not exist.

write_int<N> casts its argument to intN_t, and thus avoids any
ambiguity between distinct types of the same size, like int and long on
most 32-bit platforms/compilers.

This interface supports architectures with a 36-bit word
(e.g. write_int<36>), but since everything is made explicit, avoids any
confusion that might otherwise result from such support.

Floating point types are somewhat more difficult to handle, and I'm not
sure what the best approach is. One possibility is to also specify the
number of bits explicitly, and to assume that the IEEE 754 format will
be used as the external format. For example,

write_float<32> or write_float<64>

or perhaps

write_ieee754<32>.

It should just be a compile-time error if the compiler/platform doesn't
provide a suitable type.
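
One way to get that error, again only a sketch (ieee754_type is a
made-up name, and I assume the candidates are the native float and
double types):

  #include <climits>
  #include <limits>
  #include <boost/static_assert.hpp>

  template<int Bits> struct ieee754_type;  // undefined for other widths
  template<> struct ieee754_type<32> { typedef float  type; };
  template<> struct ieee754_type<64> { typedef double type; };

  template<int Bits, class Stream>
  void write_float(Stream& s, typename ieee754_type<Bits>::type v)
  {
      typedef typename ieee754_type<Bits>::type fp;
      // Fail at compile time unless the native type really is an
      // IEEE 754 type of the requested width.
      BOOST_STATIC_ASSERT(std::numeric_limits<fp>::is_iec559);
      BOOST_STATIC_ASSERT(sizeof(fp) * CHAR_BIT == Bits);
      s.write(&v, sizeof v);
  }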

[snip]

>> It seems like trying to support unusual architectures at all may be
>> extremely difficult. See my other post.
>>
> Which other post is this?

My comments there probably weren't very important anyway.

I think it is worth considering, though, that given the rarity of
platforms without 8-bit bytes, it is probably not worth spending much
time supporting them, and, more importantly, it is not worth
complicating the interface for 8-bit-byte platforms in order to support
them.

>> I suppose if you can find a clean way to support these unusual
>> architectures, then all the better.
>>
>> It seems that it would be very hard to support e.g. utf-8 on a platform
>> with 9-bit bytes or which cannot handle types smaller than 32-bits.
>>
> I think the binary conversion can do it. The system would work
> approximately like this:
> 1) Every platform defines its basic I/O byte. This would be 8 bits for
> most computers (including those where char is 32 bits large), 9 or some
> other number of bits for others. The I/O byte is the smallest unit that
> can be read from a stream.
> 2) Most platforms will additionally designate an octet type. Probably I
> will just use uint8_t for this. They will supply a Representation for
> the formatting layer that can convert a stream of I/O bytes to a stream
> of octets. (E.g. by truncating each byte.) If an octet stream is then
> needed (e.g. for creating a UTF-8 stream) this representation will be
> inserted.

This padding/truncating would need to be done as an explicit way of
encoding an octet stream as a nonet stream, and should probably not be
done implicitly, unless this sort of conversion is always assumed on
those platforms.
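
For illustration, on a 9-bit-byte platform the explicit conversion
might amount to nothing more than this (all names are made up; on such
a platform unsigned char would be 9 bits wide):

  typedef unsigned char io_byte;  // the platform's 9-bit I/O byte

  // Encode an octet as an I/O byte: the value fits, so the top bit
  // simply ends up zero.
  inline io_byte octet_to_io_byte(unsigned octet) { return octet & 0xFF; }

  // Decode by truncating back to the low 8 bits.
  inline unsigned io_byte_to_octet(io_byte b) { return b & 0xFF; }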

> 3) Platforms that do not support octets at all (or simply do not have a
> primitive to spare for unambiguous overloads - they could use another
> 9-bit type and just ignore the additional byte; character streams, at
> least, do not perform arithmetic on their units so overflow is not an
> issue) do not have support for this. They're off bad. I think this case
> is rare enough to be ignored.

Okay.

>>> The return type of mark(), on the other hand, can and should be opaque.
>>> This allows for many interesting things to be done. For example:
>>> Consider a socket. It has no mark/reset, let alone seeking support.
>>> You have a recursive descent parser that requires multiple mark/reset
>>> support.
>>>
>>
>> I see. It still seems that using different names means that something
>> that requires only mark/reset support cannot use a stream providing
>> seek/tell support, without an additional intermediate layer.
>>
> Well, depends. Let's assume, for example, that the system will be
> implemented as C++09 templates with heavy use of concepts.

I think it may not be a good idea to target this new I/O library at a
language that does not yet exist, and, more importantly, that is not
yet supported by any compiler, except perhaps Douglas Gregor's
experimental ConceptGCC. As its release notes state, ConceptGCC is
extremely slow, although the notes also claim that performance can be
improved. I suppose it may work fine to write the library (using the
preprocessor) so that it can be compiled under existing compilers
without concept support, and to include a small amount of additional
functionality or more convenient syntax when concept support is
available.

I would be very, very wary of anything that would increase the
compile-time for users of the library, though.

[snip]

>> The reason would be for a protocol in which little/big endian is
>> specified as part of the message/data, and a typical implementation
>> would always write in native format (and so it would need to determine
>> which is the native format), but support both formats for reading.
>>
> Hmm ... makes sense. I'm not really happy, but it makes sense.

What do you mean you're not happy? I think all that would really be
needed would be a macro to indicate the endianness. Of course any code
that depends on this would likely depend even more on 8-bit bytes, but
that is another issue.
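
A sketch of what I have in mind (the macro name is made up, not taken
from any existing library):

  // Defined by the platform configuration, e.g.
  //   #define BOOST_IO_BIG_ENDIAN 1   // or 0 on little-endian targets

  char native_byte_order_tag()
  {
  #if BOOST_IO_BIG_ENDIAN
      return 'B';  // tag the message: payload is big-endian
  #else
      return 'L';  // tag the message: payload is little-endian
  #endif
  }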

>> Ideally, the cost of the virtual function calls would normally be
>> mitigated by calling e.g. read/write with a large number of elements at
>> once, rather than with only a single element.
>>
> Yes, but that's the ideal case. In practice, this means that the
> application would have to do its own buffering even if it really wants
> the data unit by unit.

Possibly this issue can be mitigated by exposing, at the type level,
only a buffer around a text stream, although I agree that there is no
perfect solution.
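
The usage pattern I had in mind, which amortizes one virtual call over
a whole block (the read signature here is an assumption, not taken from
the proposal):

  boost::uint8_t buf[4096];
  while (std::size_t n = in.read(buf, sizeof buf))
      process(buf, n);  // one virtual call per block, not per element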

> The programmer will not want to construct the complicated full type
> for this:

> newline_filter<encoding_device<utf_8,
>     native_converter<gzip_filter<buffering_filter<file_device> > > > >& chain =
>         open_file(filename, read)
>             .attach(buffer())
>             .attach(gunzip())
>             .attach(decode(utf_8()))
>             .attach(newlines());

> The programmer will want to simply write:

> text_stream<utf_8> chain =
>     open_file(filename, read)
>         .attach(buffer())
>         .attach(gunzip())
>         .attach(decode(utf_8()))
>         .attach(newlines());

I notice that these code examples suggest that all streams will be
reference counted (and cheaply copied). Is that the intention? A
potential drawback to that approach is that a buffer filter would be
forced to allocate its buffer on the heap, when it otherwise might be
able to use the stack.

[snip]

> It is convenient to have a unified concept of a character, independent
> of its encoding. The Unicode charset provides such a concept. Unicode is
> also convenient in that it adds classification rules and similar stuff.
> This decision is not really visible to user code anyway, only to
> encoding converters: it should be sufficient to provide a conversion
> from and to Unicode code points to enable a new encoding to be used in
> the framework.

I am basically content using only Unicode for text handling in my own
programs, but I think it would be useful to hear what others who care
about efficiency for certain operations (and who work with languages
that are not represented very efficiently using UTF-8) think about
this.

-- 
Jeremy Maitin-Shepard
