Boost logo

Boost :

Subject: Re: [boost] [string] proposal
From: Brent Spillner (spillner_at_[hidden])
Date: 2011-01-29 14:17:55


On Jan 29, 2011, at 8:43 AM, David Bergman wrote:
>There are four layers in play here:
>1. The sequence of characters/symbols, as in CS string; totally abstract but precise... (one such CS string can be represented as Unicode or Latin-1 at #2, for instance.)
>2. The sequence of code points, in a given character set. Yes, one CS string (as in #1) can have multiple distinct manifestations at this level. They could be identical in >integral sequences.
>3. The sequence of code values, using an encoding form such as UTF-8 or UTF-16 for a Unicode code point.
>4. The byte storage representing the code values; could be a contiguous sequence of bytes or chunks, etc.

Which, IMHO, is exactly why it's (past) time to radically rethink the
standard string interface. Your four layers share the properties
that:
  (a) They can all be physically represented within the same data
structure (i.e. layer #4), and
  (b) For any layer, there already exists a lot of legacy code that
conceptualizes passed-in strings at that layer and that makes
corresponding assumptions.

To me, these facts imply that a 'view'-based approach is the right way
to go. Furthermore, views should not be concrete data structures
themselves, but simply iterator pairs (ranges) that internally point
to some arbitrary data structure, and externally present a model of a
linear sequence of atomic characters/codepoints/etc. In software
development Nirvana, APIs and algorithms will operate on
iterator_ranges at the appropriate conceptual level, rather than
intruding into the underlying data structures, and at the interfaces
with reality we'll just copy data to and from more concrete
representations using the chosen encoding. Some of those interfaces
(with low-level device drivers or with memory-mapped hardware itself)
might never go away, but many (e.g. a GUI API that accepts a string
and returns a texture of its glyphs) would hopefully eventually
migrate to the iterator model.

For example, one might write

std::string underlying_data = obtain_validated_data_from(somewhere);

text<ucs4, codept_view, decltype(underlying_data)>
codepoints(assume_utf8(underlying_data));

text<utf8, char_view, decltype(underlying_data)>
characters(assume_utf8(underlying_data));

if (codepoints.length() > characters.length())
   contains_multipoint_characters = true; // converse not necessarily
true if there are codepoints for multi-character ligatures

const char *byteptr = linear_representation(underlying_data); // may
or may not accurately track later changes, especially if the
underlying data is non-contiguous, but that's really a usage protocol
question and not for the data structure to solve--- anyway, strdup()
if you want a snapshot from a given point in time

text<utf8, char_view, characters::internal_representation> substring =
all_after(characters.begin(), characters.end(), "phrase to match");

std::vector<wide_char_t> char_array;
std::copy(substring.begin(), substring.end(), char_array.begin());

etc. When you think your usage pattern favors a rope or a chain over
a std::string, you only have to change the declaration of the
underlying_data object and the rest works seamlessly.

I think Dean favors immutability because that makes it easy for the
iterator-adaptors to cache data in the representation they need (or in
more local storage in a NUMA environment), but like specifying a
particular byte encoding for the underlying data this would be more
for convenience and performance optimization than a firm requirement.
When the underlying representation is a mutable type, read-only view
adapters are still free to cache chunks of data as long as they check
for a fingerprint that writable view adapters leave on the underlying
structure whenever they commit a change. Consider the underlying data
structure as a sequence of arbitrary 'tokens' (bytes, characters,
codepoints, variable-length compressed bitstrings... whatever makes
sense); a read-only substring cache that depends on tokens i though j
(inclusive) of version v of the underlying representation can be
described by the tuple r(v, i, j), while a serialized write operation
that changes the tokens in positions i though j from version v to
version v+1 can be described by the tuple w(v+1, i, j). A write w(v,
x, y) invalidates all or part of a read cache r(v', x', y') if and
only if v' < v && x' <= y && y' >= x (and if partially invalidated, we
know exactly which segment(s) are still OK.) The dereference operator
of the read-only iterator then needs to check its cached v' against
the underlying data structure's latest version, and if necessary skim
through a small ring buffer of recent w(v, x, y) tuples to see if part
of its cache must be invalidated (if v' + 1 is less than the earliest
v in the ring, the reader can conservatively invalidate its entire
cache.) This is a negligible amount of overhead (in the common case,
a single fetch, compare, and predictable branch, and in the worst case
a new transcoding that you would have had to perform anyway) per
character/codepoint/byte access, and when even that is unacceptable
you can simply cache a representation of the string in a buffer of
known encoding and make everything else adapt to /your/ data layout.
I'm intentionally not delving into the details of synchronizing reads
and writes between different threads, as std::string and char[] don't
provide that either, but if you really need some views to provide
their own synchronization this is a (relatively) straightforward place
to build it in.

>1. Why can't this byte storage type not be used for all kinds of things; is not 'string' a quite bad name for it, since it is neither a string according to most programming >languages (see above) nor according to that CS definition that you are alluding to (unless you consider uninterpreted bytes to be symbols, but be quite aware that those >'symbols' would have nothing - or very little - to do with the symbols of the text represented through your construct.)

Wholeheartedly agree; the view should have a name like 'text' (or
'characters', 'codepoints', 'bytes', etc.) and the underlying data
structure can be string, rope, vector<char>, or
boost::not_invented_yet.

Sure, views don't completely solve the problem of legacy APIs that
require a persistent, writable reference to a block of text (e.g.a
text entry widget that maps its input to a character buffer); if
you're lucky to can map the widget to the underlying_representation
structure and create the views you need for other parts of the
program. If you're not lucky, you might have to make a separate copy
for each legacy interface and check periodically or on-demand for
changes, but I'd still much rather use a data structure that forces me
to make explicit copies than one that encourages me to risk sharing a
single buffer between APIs that may not make the same encoding
assumptions.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk