
From: Zach Laine (whatwasthataddress_at_[hidden])
Date: 2020-06-13 23:25:37


On Fri, Jun 12, 2020 at 4:15 PM Rainer Deyke via Boost
<boost_at_[hidden]> wrote:
>
> On 12.06.20 21:56, Zach Laine via Boost wrote:
> >> (And no,
> >> unencoded_rope would not be a better choice. I can't memmap ropes, but
> >> I can memmap string_views.)
> >
> > You can easily memmap string_views 2GB at a time, though. That is a
> > simple workaround for this corner case you have mentioned, but it is
> > truly a corner case compared to the much more common case of using
> > strings for holding contiguous sequences of char. Contiguous
> > sequences of char really should not be anywhere near 2GB, for
> > efficiency reasons.
>
> A memmapped string_view /is/ a contiguous sequence of char. I don't see
> the difference.

The difference is mutability. There's no perf concern with erasing
the first element of a string_view, since that's not even a supported
operation.
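
For concreteness, the 2GB-at-a-time workaround I mentioned might look
roughly like this (a minimal POSIX sketch, nothing from the library;
error handling omitted):

#include <algorithm>
#include <cstddef>
#include <string_view>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Visit a memory-mapped file in read-only string_view chunks of at most 2GB.
template<typename F>
void for_each_chunk(char const * path, F f)
{
    int const fd = ::open(path, O_RDONLY);
    struct stat st;
    ::fstat(fd, &st);
    std::size_t const size = st.st_size;
    char const * base = static_cast<char const *>(
        ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0));
    std::size_t const chunk = std::size_t(1) << 31; // 2GB
    for (std::size_t offset = 0; offset < size; offset += chunk)
        f(std::string_view(base + offset, std::min(chunk, size - offset)));
    ::munmap(const_cast<char *>(base), size);
    ::close(fd);
}

Nothing about string_view requires viewing the whole file at once.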

> > I don't really get what you mean about the runtime cost. Could you be
> > more explicit?
>
> Somewhere in the implementation of operator[] and operator(), there has
> to be a branch on index < 0 (or >= 0) in order for that negative index
> trick to work, which the compiler can't always optimize away. Branches
> are often affordable but they're not free.

Ah, I see, thanks. Would it make you feel better if negative indexing
were only used when getting substrings?
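
Just to make sure we're talking about the same thing, I take it you
mean the branch in something like this (hypothetical sketch, not the
actual implementation):

#include <cstddef>

struct sv_like  // stand-in for the string type; names are made up
{
    char operator[](std::ptrdiff_t i) const noexcept
    {
        if (i < 0)      // <-- the branch in question
            i += size_;
        return data_[i];
    }

    char const * data_;
    std::ptrdiff_t size_;
};

If negative indexing were only honored by the substring operations,
that branch would disappear from the per-element accessors.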

> >> One thing that appears to be missing is normalization-preserving
> >> append/insert/erase operations. These are available in the text layer,
> >> but that means being tied to the specific text classes provided by that
> >> layer.
> >
> > Hm. I had not considered making these available as algorithms, and I
> > generally like the approach. But could you be more specific? In
> > particular, do you mean that insert() would take a container C and do
> > C.insert(), then renormalize? This is the approach used by C++20's
> > erase() and erase_if() free functions. Or did you mean something
> > else?
>
> I hadn't thought through the interface in detail. I just saw that this
> was a feature of the text layer, and thought it would be nice to have in
> the unicode layer, because I don't want to use the text layer (in its
> current form).

I don't need a detailed interface. Pseudocode would be fine too.
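
For instance, is something like this (pure pseudocode, made-up names)
roughly what you have in mind?

#include <iterator>

// Pseudocode: insert a range of code points at 'at', then renormalize.
// A real implementation would renormalize only the neighborhood of the
// insertion, not the whole container.
template<typename Container, typename CPRange>
typename Container::iterator
insert(Container & c, typename Container::const_iterator at, CPRange const & r)
{
    auto const it = c.insert(at, std::begin(r), std::end(r));
    renormalize_to_fcc(c); // hypothetical function, for illustration only
    return it;
}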

> >> Text layer: overall, I don't like it.
> >>
> >> On one hand, there is the gratuitous restriction to FCC. Why can't
> >> other normalization forms be supported, given that the unicode layer
> >> supports them?
> >
> > Here is the philosophy: If you have a template parameter for
> > UTF-encoding and/or normalization form, you have an interoperability
> > problem in your code. Multiple text<...>'s may exist for which there
> > is no convenient or efficient interop story. If you instead convert
> > all of your text to one UTF+normalization that you use throughout your
> > code, you can do all your work in that one scheme and transcode and/or
> > renormalize at the program input/output boundaries.
>
> Having to renormalize at API boundaries can be prohibitively expensive.

Sure. Anything can be prohibitively expensive in some context. If
that's the case in a particular program, I think it is likely to be
unacceptable to use text::operator+(string_view) as well, since that
also does on-the-fly normalization. Someone, somewhere, has to pay
that cost if you want to use two chunks of text in
encoding/normalization A and B. You might be able to keep working in
A for some text and keep working in B separately for other text, but I
think code that works like that is going to be hard to reason about,
and will be as common as code that freely mixes wstring and string
(and I mean not only at program boundaries). That is, not very
common.

However, that is a minority of cases. The majority case is that texts
have to be able to interop within your program arbitrarily, and so you
need to pay the conversion cost somewhere eventually anyway.
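
To be concrete about what paying at the boundaries means, the shape I
have in mind is roughly this (sketch only; I'm assuming text's
converting construction from a UTF-8 string, and the header path may
not be exactly right):

#include <boost/text/text.hpp>  // assumed header path

#include <iostream>
#include <iterator>
#include <string>

// Read raw bytes at the input boundary and convert once, so that
// everything downstream works in a single scheme (UTF-8 + FCC).
boost::text::text read_all(std::istream & is)
{
    std::string raw((std::istreambuf_iterator<char>(is)),
                    std::istreambuf_iterator<char>());
    return boost::text::text(raw); // transcoding/normalization happens here, once
}

At the output boundary you do the reverse conversion once, into
whatever form the consumer expects.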

FWIW, I'm planning to write standardization papers for the Unicode
layer stuff for C++23, and the text stuff in the C++26 timeframe. My
hope is that we will adopt my text design here into Boost in plenty of
time to see whether it is actually as workable as I claim. I'm open
to the idea of being wrong about its design and changing it to a
template if a nontemplate design turns out to be problematic.

> > Because, again, I want there to be trivial interop. Having
> > text<text::string> and text<std::string> serves what purpose exactly?
> > That is, I have never seen a compelling use case for needing both at
> > the same time. I'm open to persuasion, of course.
>
> The advantage of text<std::string> is API interop with functions that
> accept std::string arguments.

Sure. That exists now, though it does require a copy. It could also
be done via a move if I replace text::string with std::string within
text::text, which I expect to do as a result of this review.

> I'm not sure what the advantage of
> text<boost::text::string> is. But if we accept that boost::text::rope
> (which would just be text<boost::text::unencoded_rope> in my scheme)

That does not work. Strings and ropes have different APIs.

> is useful, then it logically follows that
> text<some_other_string_implementation> could also be useful.

That's what I don't get. Could you explain how text<A> and text<B>
are useful in a specific case? "Could also be useful" is not
sufficient motivation to me. I understand the impulse, but I think
that veers into over-generality in a way that I have found to be
problematic over and over in my career.

> >> .../the_unicode_layer/searching.html: the note at the end of the
> >> page is wrong, assuming you implemented the algorithms correctly. The
> >> concerns for searching NFD strings are similar to the concerns for
> >> searching FCC strings.
> >>
> >> In both FCC and NFD:
> >> - There is a distinction between A+grave+acute and A+acute+grave,
> >> because they are not canonically equivalent.
> >> - A+grave is a partial grapheme match for A+grave+acute.
> >> - A+acute is not a partial grapheme match for A+grave+acute.
> >> - A+grave is not a partial grapheme match for A+acute+grave.
> >> - A+acute is a partial grapheme match for A+acute+grave.
> >> But:
> >> - A is a partial grapheme match for A+grave+acute in NFD, but not in FCC.
> >
> > Hm. That note was added because the specific case mentioned fails to
> > work for NFD, but works for FCC and NFC. I think the grapheme
> > matching analysis above addresses something different from what the
> > note is talking about -- the note is concerned with code-point-level
> > searching results producing (perhaps) surprising results. The
> > grapheme-based search does not, so which CPs are matched by a
> > particular grapheme does not seem to be relevant. Perhaps I'm missing
> > something?
>
> I am talking about code point level matches here. ("Partial grapheme
> match" means "matches some code points within a grapheme, not not the
> whole grapheme".)

Ah, I see. Thanks. I'll update the note.
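
For the record, the kind of case the note was getting at looks like
this at the code point level (illustrative only, plain std::search
rather than the library's search API):

#include <algorithm>
#include <string>

int main()
{
    // A + combining grave + combining acute, fully decomposed (NFD):
    std::u32string const nfd = {U'A', U'\u0300', U'\u0301'};
    // In FCC/NFC, A+grave composes to U+00C0, so the sequence becomes:
    std::u32string const fcc = {U'\u00C0', U'\u0301'};
    std::u32string const needle = {U'A'};

    bool const hit_nfd =
        std::search(nfd.begin(), nfd.end(),
                    needle.begin(), needle.end()) != nfd.end();
    bool const hit_fcc =
        std::search(fcc.begin(), fcc.end(),
                    needle.begin(), needle.end()) != fcc.end();
    // hit_nfd == true, hit_fcc == false: the bare 'A' only "matches" in
    // NFD, which is the surprising code-point-level result.
}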

Zach

