Subject: Re: [boost] [string] proposal
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2011-01-24 06:02:07


On Mon, Jan 24, 2011 at 2:45 PM, Patrick Horgan <phorgan1_at_[hidden]> wrote:
> On 01/23/2011 06:16 PM, Dean Michael Berris wrote:
>>
>> In the context of strings, I was thinking it should be able to know
>> what string it came from, what encoding the string is supposed to be
>> interpreted in, and whether there are special computations that an
>> iterator for a string might need. One example that comes to mind is
>> a tokenizing iterator which returns a string when dereferenced
>> and knows what the delimiters of the string are -- to do that
>> correctly your iterator would need to know which string it came from
>> and where in the string its internal "counter" was "parked" by the
>> last dereference.
>>
>> This would require that iterators be built externally from the string,
>> something like:
>>
>>   auto it  = encoded<utf8_encoding>(original_string),
>>        end = encoded<utf8_encoding>();
>
> I like that idea but was toying with a different paradigm.  A template
> argument similar to a locale in that it would contain the information needed
> to compare elements and to iterate elements.  If it made sense to change
> things, an imbue-style mechanism for comparisons and iterators could work.
>

Right, unfortunately that kind of information doesn't seem to fit well
as part of the string's type. I would think that algorithms
that apply to the string should be external, leaving the string
type to behave like a value that you can deal with. The simple reason
this kind of information is best kept external is runtime switching:
you may want to choose at runtime which encoding to interpret the
data in.
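
To make the runtime-switching point concrete, here is a rough sketch, not
a proposed interface. The names `encoding`, `next_utf8` and `decode` are
made up for illustration, and the UTF-8 decoder assumes well-formed input
and does no validation. The string stays a plain value (std::string here)
and the interpretation is chosen by the caller at runtime:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical names throughout; not part of any proposed Boost interface.
enum class encoding { latin1, utf8 };

// Decode one UTF-8 code point starting at s[i] and advance i past it.
// Assumes well-formed input; no validation is done, for brevity.
inline char32_t next_utf8(const std::string& s, std::size_t& i) {
    unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;                       // single-byte (ASCII)
    int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
    char32_t cp = lead & (0x3F >> extra);               // payload bits of lead byte
    while (extra-- > 0 && i < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// The encoding is an argument of the algorithm, not of the string's type,
// so the same bytes can be interpreted differently at runtime.
inline std::vector<char32_t> decode(const std::string& bytes, encoding enc) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        switch (enc) {
        case encoding::latin1:
            out.push_back(static_cast<unsigned char>(bytes[i++]));
            break;
        case encoding::utf8:
            out.push_back(next_utf8(bytes, i));
            break;
        }
    }
    return out;
}
```

With this, the two bytes 0xC3 0xA9 decode to the single code point U+00E9
under `encoding::utf8` and to two code points (U+00C3, U+00A9) under
`encoding::latin1`, without the string's type changing at all.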

>> Here `it` could interpret the original string as UTF-8 and you can
>> possibly assume that dereferencing this iterator can return an
>> appropriate (possibly variant) type that is convertible to the
>> appropriate holder (char, wchar_t, uint32_t (for utf32)). From here
>> you can build ranges appropriately and deal with ranges and just know
>> that the encoding is explicitly defined in the iterator.
>
> I like the idea of segmenting an encoded string into ranges where a
> "character" would be a range capturing one or more of the underlying
> encoding's characters and combining characters.  It solves the problem of
> what to return when dereferencing the iterators.  Of course you'd have to be
> able to compare two (for example) utf-16 ranges meaningfully based on some
> locale, just as if a human who knew the symbols was comparing the glyphs
> that would be drawn for each range.

Right.

I'm not sure though whether that kind of "human" intelligence can be
succinctly described in an algorithm -- unless of course it boils down
to a simple "nested switch" statement (that could be code-gen'ed
anyway).

> Another idea I like though, is that
> dereferencing an iterator would return one UCS codepoint and it would be up
> to a higher level of abstraction to fetch the combining characters and form
> the final glyph.  That way, any string that encoded UCS, whether it was
> utf-32, utf-16, or utf-8, could return char32_t from dereferencing an
> iterator.  I suspect that either or both of these as well as other
> variations would at times be the better idea, because the interpretation of
> the underlying code varies so much.  Lots of places share the same scripts
> but with quite different rules about what to do with them, and how to
> combine or compare them.  Beware of a naive solution if the intent is to
> make a completely general solution.  I'm not even sure if it's possible
> without doing a layered approach.
>

I agree in general that the naive solution can be misleading and can
potentially be worse than if you had the string encoding information
directly in a string type. This is where I think things like Proto or
the smart use of template metaprogramming (on a micro-scale,
especially with iterator nesting) can allow the fusing of certain
transformations/encoding+decoding techniques, but only in cases where
you have the nesting statically defined.
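
As a very rough illustration of what I mean by statically defined
nesting (C++14; `mapped_iterator`, `mapped` and the two functors are
hypothetical names, and the transformations are placeholders standing in
for real decode/encode steps), the whole chain is visible in the
iterator's type, so the compiler is free to inline it into one loop:

```cpp
#include <string>

// Hypothetical adaptor: applies F to whatever the wrapped iterator yields.
template <class Iter, class F>
struct mapped_iterator {
    Iter base;
    F f;
    auto operator*() const { return f(*base); }
    mapped_iterator& operator++() { ++base; return *this; }
    bool operator!=(const mapped_iterator& o) const { return base != o.base; }
};

template <class Iter, class F>
mapped_iterator<Iter, F> mapped(Iter it, F f) { return {it, f}; }

// Placeholder transformations (ASCII only).
struct to_upper_ascii {
    char operator()(char c) const {
        return (c >= 'a' && c <= 'z') ? static_cast<char>(c - 'a' + 'A') : c;
    }
};
struct rot13_upper {
    char operator()(char c) const {
        return (c >= 'A' && c <= 'Z')
                   ? static_cast<char>('A' + (c - 'A' + 13) % 26) : c;
    }
};

// The nesting is fixed at compile time: rot13_upper(to_upper_ascii(*it)).
inline std::string upper_then_rot13(const std::string& in) {
    auto first = mapped(mapped(in.begin(), to_upper_ascii{}), rot13_upper{});
    auto last  = mapped(mapped(in.end(),   to_upper_ascii{}), rot13_upper{});
    std::string out;
    for (; first != last; ++first) out.push_back(*first);
    return out;
}
```

If the chain were instead chosen at runtime, each layer would need some
form of type erasure and that opportunity for fusing would mostly be lost.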

Of course the proof will be in the pudding once that implementation
starts to bake. ;)

-- 
Dean Michael Berris
about.me/deanberris
