Subject: Re: [boost] [string] proposal
From: Patrick Horgan (phorgan1_at_[hidden])
Date: 2011-01-24 01:45:08


On 01/23/2011 06:16 PM, Dean Michael Berris wrote:
> On Sat, Jan 22, 2011 at 5:01 AM, Nevin Liber <nevin_at_[hidden]> wrote:
>> On 21 January 2011 06:07, Dean Michael Berris <mikhailberis_at_[hidden]> wrote:
>>
>>> 4. Looks like a real STL container except the iterator type is smarter
>>> than your average iterator.
>>>
>> What does "smarter" mean?
> The way I was thinking about it, "smarter" would mean something along
> the lines of "knows more than your average <thing>" where <thing> is a
> bare iterator.
>
> In the context of strings, I was thinking it should be able to know
> what string it came from, what encoding is the string supposed to be
> interpreted in, or whether there are special computations that an
> iterator for string might need. One example that comes to mind is
> having a tokenizing iterator which returns a string when dereferenced
> and knows what the delimiters of the string are -- to do that
> correctly your iterator would need to know which string it came from
> and where in the string its internal "counter" is already "parked" at
> from the last dereference.
>
> This would require that iterators be built externally from the string,
> something like:
>
>   auto it  = encoded<utf8_encoding>(original_string),
>        end = encoded<utf8_encoding>();
I like that idea, but I was toying with a different paradigm: a template
argument, similar to a locale, that would contain the information needed
to compare elements and to iterate over them. If it made sense to change
those rules later, an imbue-style mechanism for comparisons and
iterators could work.
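
To make the template-argument idea a bit more concrete, here is a
minimal sketch covering the comparison half only. All of the names
(policy_string, byte_policy, ascii_nocase_policy) are made up for
illustration and are not part of any proposal in this thread; a real
design would also have to carry iteration rules and handle more than
ASCII.

```cpp
#include <algorithm>
#include <string>
#include <utility>

// Hypothetical policy: plain byte-wise comparison.
struct byte_policy {
    static bool eq(char a, char b) { return a == b; }
    static bool lt(char a, char b) {
        return static_cast<unsigned char>(a) < static_cast<unsigned char>(b);
    }
};

// Hypothetical policy: ASCII-only case-insensitive comparison.
struct ascii_nocase_policy {
    static char fold(char c) {
        return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
    }
    static bool eq(char a, char b) { return fold(a) == fold(b); }
    static bool lt(char a, char b) { return byte_policy::lt(fold(a), fold(b)); }
};

// Hypothetical string wrapper: the comparison rules come from Policy,
// much as std::basic_string takes its rules from a traits class.
template <class Policy>
class policy_string {
public:
    explicit policy_string(std::string s) : data_(std::move(s)) {}

    friend bool operator==(const policy_string& a, const policy_string& b) {
        return a.data_.size() == b.data_.size() &&
               std::equal(a.data_.begin(), a.data_.end(),
                          b.data_.begin(), Policy::eq);
    }

    friend bool operator<(const policy_string& a, const policy_string& b) {
        return std::lexicographical_compare(a.data_.begin(), a.data_.end(),
                                            b.data_.begin(), b.data_.end(),
                                            Policy::lt);
    }

private:
    std::string data_;
};
```

With this, policy_string<ascii_nocase_policy>("Boost") compares equal to
policy_string<ascii_nocase_policy>("bOOST"), while the byte_policy
versions do not. Because the policy is fixed at compile time, this is
the non-imbue variant; swapping rules at run time would need the policy
to be an object held by the string instead.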

> Here `it` could interpret the original string as UTF-8 and you can
> possibly assume that dereferencing this iterator can return an
> appropriate (possibly variant) type that is convertible to the
> appropriate holder (char, wchar_t, uint32_t (for utf32)). From here
> you can build ranges appropriately and deal with ranges and just know
> that the encoding is explicitly defined in the iterator.
I like the idea of segmenting an encoded string into ranges, where a
"character" would be a range capturing one or more of the underlying
encoding's characters and combining characters. It solves the problem
of what to return when dereferencing the iterators. Of course, you'd
have to be able to compare two (for example) UTF-16 ranges meaningfully
based on some locale, just as if a human who knew the symbols were
comparing the glyphs that would be drawn for each range.

Another idea I like, though, is that dereferencing an iterator would
return one UCS code point, and it would be up to a higher level of
abstraction to fetch the combining characters and form the final glyph.
That way, any string that encoded UCS, whether it was UTF-32, UTF-16,
or UTF-8, could return char32_t from dereferencing an iterator.

I suspect that either or both of these, as well as other variations,
would at times be the better idea, because the interpretation of the
underlying code varies so much. Lots of places share the same scripts
but have quite different rules about what to do with them and how to
combine or compare them. Beware of a naive solution if the intent is to
make a completely general solution. I'm not even sure that's possible
without a layered approach.
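
As a rough illustration of the lowest layer, the "one code point per
dereference" idea for UTF-8 could look something like the sketch below.
The names utf8_codepoint_iterator and to_code_points are hypothetical.
It assumes the caller supplies a [pos, end) byte range, replaces
malformed or truncated sequences with U+FFFD, and does not reject
overlong encodings or surrogates, so it is not a validating decoder.

```cpp
#include <cstddef>
#include <iterator>
#include <string>
#include <utility>

// Hypothetical iterator: walks UTF-8 bytes and yields one code point
// (char32_t) per dereference. Dereferencing at the end is undefined,
// as with ordinary iterators.
class utf8_codepoint_iterator {
public:
    // operator* returns by value, so this only claims to be an input iterator.
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    utf8_codepoint_iterator(const char* pos, const char* end)
        : pos_(pos), end_(end) {}

    char32_t operator*() const { return decode().first; }

    utf8_codepoint_iterator& operator++() {
        pos_ += decode().second;
        return *this;
    }
    utf8_codepoint_iterator operator++(int) {
        utf8_codepoint_iterator old = *this;
        ++*this;
        return old;
    }

    friend bool operator==(const utf8_codepoint_iterator& a,
                           const utf8_codepoint_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf8_codepoint_iterator& a,
                           const utf8_codepoint_iterator& b) {
        return !(a == b);
    }

private:
    using decoded = std::pair<char32_t, std::size_t>;

    // Returns the code point at pos_ and the number of bytes it occupies.
    // Malformed input yields U+FFFD and consumes a single byte.
    decoded decode() const {
        const unsigned char b0 = static_cast<unsigned char>(*pos_);
        std::size_t len = 0;
        char32_t cp = 0;
        if (b0 < 0x80) return decoded(b0, 1);
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return decoded(U'\uFFFD', 1);  // stray continuation or invalid lead byte

        if (static_cast<std::size_t>(end_ - pos_) < len)
            return decoded(U'\uFFFD', 1);   // sequence truncated by end of input

        for (std::size_t i = 1; i < len; ++i) {
            const unsigned char b = static_cast<unsigned char>(pos_[i]);
            if ((b & 0xC0) != 0x80) return decoded(U'\uFFFD', 1);
            cp = (cp << 6) | (b & 0x3F);
        }
        return decoded(cp, len);
    }

    const char* pos_;
    const char* end_;
};

// Convenience helper: collect every code point of a UTF-8 string.
inline std::u32string to_code_points(const std::string& s) {
    const char* first = s.data();
    const char* last = first + s.size();
    return std::u32string(utf8_codepoint_iterator(first, last),
                          utf8_codepoint_iterator(last, last));
}
```

A UTF-16 version would have the same interface and differ only in
decode(), which is what would let a higher layer group code points into
combining sequences and apply locale-specific comparison without caring
about the underlying encoding.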

Patrick

