From: Sebastian Redl (sebastian.redl_at_[hidden])
Date: 2007-07-01 13:54:46
Jeremy Maitin-Shepard wrote:
> Okay, that is a good point. "Data stream" would probably be best then.
> I am quite averse to "binary stream", since it would really be a misuse,
> albeit a common misuse, of "binary".
>
I'm using "unstructured stream" in the next iteration of the design
document. Does that seem appropriate to you?
> I see. You are suggesting, I suppose, that in addition to providing
> formatting of individual values, the binary formatting layer also
> provides stream filters for converting a stream of one type into a
> stream of another type with a particular formatting. I like this idea.
>
Yes, exactly. This can be very useful for reading data. However, it's
not quite sufficient for runtime selection of a character encoding. That
requires an interface that is not a stream of a single data type, but
one that provides extraction of any data type at any time.
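To illustrate the difference in shape (hypothetical names, not the
proposed interface):

    #include <cstddef>

    // A typed stream fixes the element type once, at compile time:
    template <typename T>
    class typed_input_stream
    {
    public:
        std::size_t read(T *out, std::size_t n); // only T can be extracted
    };

    // Runtime encoding selection instead needs something like this,
    // where the caller chooses the type per call:
    class untyped_input
    {
    public:
        template <typename T>
        std::size_t read(T *out, std::size_t n); // any primitive type, any time
    };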
> This seems to suggest, then, that if you want to convert UTF-32 (native
> endian) in a file to say, a UTF-16 (native endian) text stream, you have
> to first convert the file uint8_t stream to a uint32_t stream (native
> endian formatting), and then mark this uint32_t stream as UTF-32, and
> then use a text encoding conversion filter to convert this UTF-32 stream
> to a UTF-16 (native endian) uint16_t stream.
>
Again, not quite. I suppose I should first define external and internal
encoding, as you suggest below, because the distinction is fundamental
to this issue.
I am of the opinion that an application gains nothing from processing
string-like data whose encoding is not known at compile time. For any
given string the application actually processes, the encoding of the
data must be known at compile time; everything else is a mess. String
types should be tagged with the encoding. Text streams should be tagged.
Buffers for text should be tagged. This compile-time-known encoding is
the internal encoding. (An application can have different internal
encodings for different data, but not for the same piece of data.)
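As a sketch of what "tagged" could mean in code (the names here are
illustrative only, not the proposed interface):

    #include <cstdint>
    #include <vector>

    struct utf_16 { typedef std::uint16_t code_unit; };
    struct latin_1 { typedef std::uint8_t code_unit; };

    template <typename Encoding>
    class basic_text
    {
        // The code unit type and its interpretation are fixed by Encoding.
        std::vector<typename Encoding::code_unit> units;
    };

    // basic_text<utf_16> and basic_text<latin_1> are distinct types, so
    // mixing encodings for the same piece of data is a compile error.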
The external encoding is whatever encoding external data is in: files,
network streams, user input, etc. When reading a file as text, the
application must specify the external encoding of that file. (Or it can
fall back to a default, but this is often unacceptable.) The external
encoding must be specifiable at run time, obviously, because different
files and different network connections can be in different encodings.
Suppose, then, that my application uses UTF-16 internally for
everything. Endianness does not matter - it does not even exist -
because UTF-16 uses uint16_t as the underlying type, and from C++'s
point of view a uint16_t has no observable byte order as long as it
isn't viewed as its component bytes.
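To make that concrete:

    #include <cstdint>
    #include <cstdio>

    int main()
    {
        std::uint16_t unit = 0x20AC; // one UTF-16 code unit (euro sign)
        // The value is 0x20AC on every platform; byte order only becomes
        // visible when the object representation is inspected:
        unsigned char const *bytes =
            reinterpret_cast<unsigned char const *>(&unit);
        std::printf("%02X %02X\n", bytes[0], bytes[1]); // "AC 20" or "20 AC"
    }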
To read in a text file in UTF-32, native endian, I could do this
(tentative interface), if I know at compile time that the file is UTF-32:
text_input_stream<utf_16> stream =
    open_file_input("filename.txt")
        // a file_input_stream
    .filter(buffer())
        // a buffered_input_stream<iobyte, file_input_stream>
    .filter(assembler<uint32_t, native>())
        // an assembler_input_stream<uint32_t, native,
        //     buffered_input_stream<iobyte, file_input_stream>>
    .filter(text_decode<utf_16, utf_32>())
        // a text_decode_stream<utf_32,
        //     assembler_input_stream<uint32_t, native,
        //         buffered_input_stream<iobyte, file_input_stream>>>
    ;
More likely, however, I would do this:
auto assembler = generic_assembler<native_rules>(
    open_file_input("filename.txt").filter(buffer()));
text_input_stream<utf_16> stream =
    text_decoder<utf_16>(assembler, "UTF-32");
assembler would be of type generic_assembler_t<native_rules,
buffered_input_stream<iobyte, file_input_stream>> and would provide a
single template member, read<T>(buffer<T> &target), that allows
extracting any (primitive) type; the assembler follows its formatting
rules when assembling values of that type. The text_decoder would then
call this function with a type determined by the encoding specified at
run time. Yes, the read function would be instantiated for all types
regardless of whether they are used, because the encoding is a run-time
decision. But at some point you need to bridge the gap between
compile-time and run-time decisions.
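Roughly, the shape might be something like this (everything beyond the
names quoted above is assumption - in particular the buffer layout and
the underlying read_bytes call):

    #include <cstddef>

    // Stand-in for the library's buffer<T>.
    template <typename T>
    struct buffer { T *data; std::size_t size; };

    template <typename Rules, typename ByteStream>
    class generic_assembler_t
    {
        ByteStream source;
    public:
        explicit generic_assembler_t(ByteStream s) : source(s) {}

        // Assemble groups of sizeof(T) bytes into values of T.
        template <typename T>
        std::size_t read(buffer<T> &target)
        {
            std::size_t bytes_read = source.read_bytes(
                reinterpret_cast<unsigned char *>(target.data),
                target.size * sizeof(T));
            // A big- or little-endian Rules policy would swap the bytes
            // of each element here; native_rules would do nothing.
            return bytes_read / sizeof(T);
        }
    };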
Yet another alternative would be a direct_text_decoder<utf_16> that
reads from a uint8_t stream and expects either that the encoding
specifies the endianness (like "UTF-16BE") or that a byte order mark is
present. Such a decoder would not be able to decode encodings where
neither is the case.
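For UTF-16, the BOM check itself is simple enough (a sketch; the names
are made up):

    #include <cstddef>
    #include <cstdint>

    enum byte_order { big_endian, little_endian, unknown_order };

    // U+FEFF serialized as UTF-16BE is FE FF; as UTF-16LE it is FF FE.
    byte_order detect_utf16_bom(const std::uint8_t *bytes, std::size_t n)
    {
        if (n >= 2) {
            if (bytes[0] == 0xFE && bytes[1] == 0xFF) return big_endian;
            if (bytes[0] == 0xFF && bytes[1] == 0xFE) return little_endian;
        }
        return unknown_order; // no BOM: the name must say it, e.g. "UTF-16BE"
    }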
>> Yes, a text stream is essentially a binary stream with an encoding
>> instead of a data type. So the interface is the same in description, but
>> the types involved are different. I think this is mostly a documentation
>> issue.
>>
>
> Instead of a data type? But presumably both the data type and the
> encoding must be specified. Also, it seems like it may be useful to be
> able to specify the encoding at run-time, rather than just
> compile-time.
>
Instead of a data type. The data type of a text_stream<Encoding> is
base_type<Encoding>::type. This is for internal use - the external use
is different anyway.
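base_type would be a simple metafunction along these lines (only the
mapping is implied by the discussion; the exact spelling is guesswork):

    #include <cstdint>

    struct utf_8; struct utf_16; struct utf_32;

    template <typename Encoding> struct base_type;
    template <> struct base_type<utf_8>  { typedef std::uint8_t  type; };
    template <> struct base_type<utf_16> { typedef std::uint16_t type; };
    template <> struct base_type<utf_32> { typedef std::uint32_t type; };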
> Which encodings will be supported at
> compile-time, then? Just UTF-8, UTF-16, and UTF-32?
>
Whichever the library supplies. I think these three plus ASCII and
Latin-1 would make a reasonable minimum requirement.
Sebastian Redl