From: Zach Laine (whatwasthataddress_at_[hidden])
Date: 2020-06-16 16:33:17


On Sun, Jun 14, 2020 at 7:25 AM Rainer Deyke via Boost
<boost_at_[hidden]> wrote:
>
> On 14.06.20 01:25, Zach Laine via Boost wrote:
> > On Fri, Jun 12, 2020 at 4:15 PM Rainer Deyke via Boost
> > <boost_at_[hidden]> wrote:
> >> A memmapped string_view /is/ a contiguous sequence of char. I don't see
> >> the difference.
> >
> > The difference is mutability. There's no perf concern with erasing
> > the first element of a string_view, if that's not even a supported
> > operation.
>
> A /lot/ of strings, probably the vast majority, will never be mutated.

Ok, then those should more appropriately be string_views.

> And for the rest, the majority will only be mutated by appending.

That does not help, unless the capacity is so large that a
reallocation is unnecessary.
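
To make the capacity point concrete, here is a minimal sketch using std::string. The helper name append_reallocates is made up for illustration, and the "no reallocation while the result fits in capacity()" behavior is what mainstream standard library implementations do in practice, not something this thread establishes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if appending `extra` to `s` moved the
// character buffer, i.e. the append caused a reallocation.
// `extra` must not alias `s`.
inline bool append_reallocates(std::string& s, const std::string& extra) {
    const char* before = s.data();
    s += extra;
    return s.data() != before;
}
```

With s.reserve(256) done up front, appending 100 bytes is expected to leave data() where it was; appending more than the remaining capacity has to move the buffer.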

> Erasing the first element is a nice to have but expensive and rarely
> used feature. If you find yourself doing that a lot, then you probably
> do want a rope.

Any mutation might cause a reallocation. I named one of the
worst-case operations rhetorically, but appending is also bad if it
causes that reallocation. It's not a question of what kind of
mutating operation you're doing, but whether you're mutating or not.

> >> I hadn't thought through the interface in detail. I just saw that this
> >> was a feature of the text layer, and thought it would be nice to have in
> >> the unicode layer, because I don't want to use the text layer (in its
> >> current form).
> >
> > I don't need a detailed interface. Pseudocode would be fine too.
>
> insert_nfd(string, position, thing_to_insert)
> // Insert 'thing_to_insert' into 'string' at 'position'. Both 'string'
> // and 'thing_to_insert' are required to be in NFD. The area around the
> // insertion is renormalized to NFD.

I see -- no surprises here. As I said, I like this idea a lot!
However, see below.
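
For readers following along, a rough sketch of what such a function could look like. This is not Boost.Text code. It assumes code points stored in a std::u32string, uses a toy combining-class table covering three marks (real code needs the full Unicode Character Database), and relies on the fact that concatenating two NFD strings can only break canonical ordering around the seams, never decomposition.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Toy canonical combining class lookup (assumption: only three marks are
// known; everything else is treated as a starter with class 0).
inline int ccc(char32_t cp) {
    switch (cp) {
    case 0x0323: return 220;  // COMBINING DOT BELOW
    case 0x0301: return 230;  // COMBINING ACUTE ACCENT
    case 0x0308: return 230;  // COMBINING DIAERESIS
    default: return 0;
    }
}

// Inserts `ins` into `s` at code point index `pos`. Both must already be
// in NFD. Only the combining sequences touching the insertion are reordered.
inline void insert_nfd(std::u32string& s, std::size_t pos,
                       const std::u32string& ins) {
    s.insert(pos, ins);

    // Widen [lo, hi) to cover non-starters adjacent to the inserted range.
    std::size_t lo = pos;
    while (lo > 0 && ccc(s[lo - 1]) != 0) --lo;
    std::size_t hi = pos + ins.size();
    while (hi < s.size() && ccc(s[hi]) != 0) ++hi;

    // Stable-sort every maximal run of non-starters by combining class.
    std::size_t i = lo;
    while (i < hi) {
        if (ccc(s[i]) == 0) { ++i; continue; }
        std::size_t j = i;
        while (j < hi && ccc(s[j]) != 0) ++j;
        std::stable_sort(s.begin() + static_cast<std::ptrdiff_t>(i),
                         s.begin() + static_cast<std::ptrdiff_t>(j),
                         [](char32_t a, char32_t b) { return ccc(a) < ccc(b); });
        i = j;
    }
}
```

For example, inserting U+0323 after "a" followed by U+0301 yields "a", U+0323, U+0301, since class 220 sorts before class 230.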

> >> Having to renormalize at API boundaries can be prohibitively expensive.
> >
> > Sure. Anything can be prohibitively expensive in some context. If
> > that's the case in a particular program, I think it is likely to be
> > unacceptable to use text::operator+(string_view) as well, since that
> > also does on-the-fly normalization.
>
> Hopefully only on the string_view and the area immediately surrounding
> the insertion.

No, that's why I picked string_view, and not text_view. text_view
insertion does not normalize the incoming text, but string_view
insertion does. This is in keeping with the philosophy:

- At program I/O boundaries (not all API boundaries), convert to UTF-8 and FCC.
- Internal interfaces that take UTF-8/FCC will not transcode or normalize.
- Internal interfaces that take non-UTF-8/FCC will transcode and
normalize as needed.

text::operator+(string_view sv) does not know the normalization of sv,
so it normalizes. The alternative is clunky -- you have to make a new
string somewhere to normalize into, and then use operator+() on the
result.
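
The overload split can be sketched like this (C++17). Everything in namespace sketch is hypothetical and is not the Boost.Text interface: normalize_stub only copies and counts calls in place of a real FCC normalizer, and the sketch ignores any renormalization needed where the two pieces join.

```cpp
#include <string>
#include <string_view>

namespace sketch {

int normalize_calls = 0;  // how many times the stand-in normalizer ran

// Stand-in for a real FCC normalizer: copies the input and counts the call.
inline std::string normalize_stub(std::string_view sv) {
    ++normalize_calls;
    return std::string(sv);
}

// Hypothetical text type whose invariant is "contents are normalized".
class text {
public:
    text() = default;
    explicit text(std::string_view unnormalized)
        : storage_(normalize_stub(unnormalized)) {}

    // Normalization of `sv` is unknown, so normalize before appending.
    text& operator+=(std::string_view sv) {
        storage_ += normalize_stub(sv);
        return *this;
    }

    // `other` already upholds the invariant, so append without normalizing.
    text& operator+=(const text& other) {
        storage_ += other.storage_;
        return *this;
    }

    const std::string& str() const { return storage_; }

private:
    std::string storage_;
};

}  // namespace sketch
```

The caller pays for normalization once, at the point where unnormalized bytes enter; appending one text to another costs no normalization pass in this sketch.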

> > Someone, somewhere, has to pay
> > that cost if you want to use two chunks of text in
> > encoding/normalization A and B. You might be able to keep working in
> > A for some text and keep working in B separately for other text, but I
> > think code that works like that is going to be hard to reason about,
> > and will be as common as code that freely mixes wstring and string
> > (and I mean not only at program boundaries). That is, not very
> > common.
>
> Which is why I want to avoid just that.
>
> Your suggestions:
>
> void f() {
>     // renormalizes to fcc
>     text::text t = api_function_that_returns_nfd();
>     do_something_with(t);
>     string s;
>     text::normalize_to_nfd(t.extract(), back_inserter(s));
>     api_function_that_accepts_nfd(s);
> }
>
> My suggestion:
>
> void f() {
>     text::text<nfd, std::string> t = api_function_that_returns_nfd();
>     do_something_with(t);
>     api_function_that_accepts_nfd(t.extract());
> }

Right, I get it. I just think you're leaving out the lack of
interoperability with text::text<nfc, std::wstring>, etc. That's not
a trivial concern.
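
A small sketch of why that is. The template below is a hypothetical stand-in for the proposed text<nfd, std::string>, not an existing type: once normalization form and storage are template parameters, every combination is a distinct type, and nothing converts between them unless someone writes that conversion.

```cpp
#include <string>
#include <type_traits>

namespace sketch {

// Hypothetical tag for the normalization form carried in the type.
enum class nf { c, d, fcc };

// Hypothetical parameterized text type; it only holds its storage.
template <nf Normalization, typename String>
struct text {
    String storage;
};

using nfd_text  = text<nf::d, std::string>;
using nfc_wtext = text<nf::c, std::wstring>;

// Distinct, unrelated types: a function taking one cannot accept the other
// without an explicit transcode-and-renormalize step.
static_assert(!std::is_same<nfd_text, nfc_wtext>::value,
              "different parameters give different types");
static_assert(!std::is_convertible<nfd_text, nfc_wtext>::value,
              "no implicit conversion exists between them");

}  // namespace sketch
```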

If you have code that needs to stay NFD as in your example, you should
be able to use std::string and insert_nfd() and friends. This is yet
another case where a perf tradeoff forces you to write a bit more
code. That does not seem onerous to me.

Zach

