From: Rainer Deyke (rdeyke_at_[hidden])
Date: 2020-06-12 20:53:42
On 12.06.20 21:56, Zach Laine via Boost wrote:
>> (And no,
>> unencoded_rope would not be a better choice. I can't memmap ropes, but
>> I can memmap string_views.)
> You can easily memmap string_views 2GB at a time, though. That is a
> simple workaround for this corner case you have mentioned, but it is
> truly a corner case compared to the much more common case of using
> strings for holding contiguous sequences of char. Contiguous
> sequences of char really should not be anywhere near 2GB, for
> efficiency reasons.
A memmapped string_view /is/ a contiguous sequence of char. I don't see
why it should be limited to 2GB.
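For concreteness, the chunked-mapping workaround Zach describes would look
something like this (a POSIX sketch of my own, not part of the library;
error handling omitted, and the mapping must outlive the views):

#include <algorithm>
#include <cstddef>
#include <string_view>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>

std::vector<std::string_view> map_in_chunks(char const * path)
{
    int const fd = ::open(path, O_RDONLY);
    struct stat st;
    ::fstat(fd, &st);
    auto const size = static_cast<std::size_t>(st.st_size);
    auto const base = static_cast<char const *>(
        ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0));
    constexpr std::size_t chunk = std::size_t(1) << 31; // 2 GiB
    std::vector<std::string_view> views;
    for (std::size_t off = 0; off < size; off += chunk)
        views.emplace_back(base + off, std::min(chunk, size - off));
    return views;
}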
>> I like the use of negative indices for indexing from the end in
>> Python, but I am ambivalent about using the same feature in C++. None
>> of the other types I regularly use in C++ work like that, and the
>> runtime cost involved is a lot more noticeable in C++.
> I don't really get what you mean about the runtime cost. Could you be
> more explicit?
Somewhere in the implementation of operator[] and operator(), there has
to be a branch on index < 0 (or >= 0) in order for that negative index
trick to work, which the compiler can't always optimize away. Branches
are often affordable but they're not free.
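For illustration, a minimal sketch of such an operator[]:

#include <cstddef>

struct cp_view
{
    char const * data_;
    std::ptrdiff_t size_;

    char operator[](std::ptrdiff_t i) const
    {
        if (i < 0)       // the branch in question
            i += size_;  // negative indices count from the end
        return data_[i];
    }
};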
>> Also, using the
>> same type of counting-from-end indices and counting-from-beginning
>> indices seems unsafe. A separate type for counting-from-end would be
>> safer and faster, at the cost of being more syntactically heavy.
> Interesting. Do you mean something like a strong typedef for "index"
> and "negative-index", or something else?
Yes, something like that.
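For example (a sketch, with hypothetical names):

#include <cstddef>

struct from_end
{
    std::size_t offset; // 1 = last element, 2 = second-to-last, ...
};

struct str_view
{
    char const * data_;
    std::size_t size_;

    // from the front: no sign check needed
    char operator[](std::size_t i) const { return data_[i]; }
    // from the back: explicit at the call site, and also branch-free
    char operator[](from_end e) const { return data_[size_ - e.offset]; }
};

// usage: v[0] is the first element, v[from_end{1}] the last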
>> One thing that appears to be missing is normalization-preserving
>> append/insert/erase operations. These are available in the text layer,
>> but that means being tied to the specific text classes provided by that
>> layer.
> Hm. I had not considered making these available as algorithms, and I
> generally like the approach. But could you be more specific? In
> particular, do you mean that insert() would take a container C and do
> C.insert(), then renormalize? This is the approach used by C++ 20's
> erase() and erase_if() free functions. Or did you mean something else?
I hadn't thought through the interface in detail. I just saw that this
was a feature of the text layer, and thought it would be nice to have in
the unicode layer, because I don't want to use the text layer (in its
current form).
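For illustration, the free-function shape might look something like this
(a sketch; `renorm` stands in for whatever renormalization entry point
the unicode layer exposes, so the name and signature are hypothetical):

#include <iterator>

template<typename Container, typename Iter, typename Renorm>
void normalizing_insert(
    Container & c,
    typename Container::iterator at,
    Iter first,
    Iter last,
    Renorm renorm)
{
    auto const pos = std::distance(c.begin(), at);
    c.insert(at, first, last);
    // A production version would renormalize only around the two
    // insertion boundaries, not the whole inserted range.
    renorm(c, pos, pos + std::distance(first, last));
}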
>> Requiring Unicode text to be in Stream-Safe Format is another time bomb
>> waiting to go off, but it's also a usability issue. The library should
>> provide an algorithm to put unicode text in Stream-Safe Format, and
>> should automatically apply that algorithm whenever text is normalized.
>> This would make it safe to use Boost.Text on data from an untrusted
>> source so long as the data is normalized first, which you have to do
>> with untrusted data anyway.
> This seems like a good idea to me. I went back and forth over whether
> or not to supply the SSF algorithm, since it's not an official Unicode
> algorithm, but adding it to text's normalization step would be reason
> enough to do so.
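For reference, the Stream-Safe Text Process from UAX #15 is simple enough
to sketch (here `ccc` stands in for a real canonical-combining-class
lookup, and the subtleties involving long decompositions are ignored):

#include <vector>

std::vector<char32_t>
stream_safe(std::vector<char32_t> const & in, int (*ccc)(char32_t))
{
    std::vector<char32_t> out;
    int nonstarters = 0;
    for (char32_t cp : in) {
        if (ccc(cp) != 0) {
            if (++nonstarters > 30) {
                out.push_back(U'\u034F'); // CGJ: a starter, resets the run
                nonstarters = 1;
            }
        } else {
            nonstarters = 0;
        }
        out.push_back(cp);
    }
    return out;
}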
>> Text layer: overall, I don't like it.
>> On one hand, there is the gratuitous restriction to FCC. Why can't
>> other normalization forms be supported, given that the unicode layer
>> supports them?
> Here is the philosophy: If you have a template parameter for
> UTF-encoding and/or normalization form, you have an interoperability
> problem in your code. Multiple text<...>'s may exist for which there
> is no convenient or efficient interop story. If you instead convert
> all of your text to one UTF+normalization that you use throughout your
> code, you can do all your work in that one scheme and transcode and/or
> renormalize at the program input/output boundaries.
Having to renormalize at API boundaries can be prohibitively expensive.
> Because, again, I want there to be trivial interop. Having
> text<text::string> and text<std::string> serves what purpose exactly?
> That is, I have never seen a compelling use case for needing both at
> the same time. I'm open to persuasion, of course.
The advantage of text<std::string> is API interop with functions that
accept std::string arguments. I'm not sure what the advantage of
text<boost::text::string> is. But if we accept that boost::text::rope
(which would just be text<boost::text::unencoded_rope> in my scheme)
is useful, then it logically follows that
text<some_other_string_implementation> could also be useful.
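In outline, the parameterization I have in mind looks like this (a
hypothetical shape, not the library's actual declaration):

#include <string>

template<typename String>
class basic_text
{
    String storage_; // UTF-8 code units, kept in a fixed normal form
    // insert()/erase()/replace() would re-establish that invariant
};

using std_text = basic_text<std::string>; // interops with std::string APIs
// basic_text<boost::text::unencoded_rope> would then play the role of
// the library's rope type.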
> The SSF assumption is explicitly allowed in the Unicode standard, and
> it's less onerous than not checking array-bounds access in operator[]
> in one's array-like types. Buffer overflows are really common, and
> SSF violations are not. That being said, I can add the
> SSF-conformance algorithm as mentioned above.
Unintentional SSF violations are rare. Intentional SSF violations can be
used as an attack vector, if "undefined behavior" translates to "memory
corruption".
>> .../the_unicode_layer/searching.html: the note at the end of the
>> page is wrong, assuming you implemented the algorithms correctly. The
>> concerns for searching NFD strings are similar to the concerns for
>> searching FCC strings.
>> In both FCC and NFD:
>> - There is a distinction between A+grave+acute and A+acute+grave,
>> because they are not canonically equivalent.
>> - A+grave is a partial grapheme match for A+grave+acute.
>> - A+acute is not a partial grapheme match for A+grave+acute.
>> - A+grave is not a partial grapheme match for A+acute+grave.
>> - A+acute is a partial grapheme match for A+acute+grave.
>> - A is a partial grapheme match for A+grave+acute in NFD, but not in FCC.
> Hm. That note was added because the specific case mentioned fails to
> work for NFD, but works for FCC and NFC. I think the grapheme
> matching analysis above addresses something different from what the
> note is talking about -- the note is concerned with code-point-level
> searching results producing (perhaps) surprising results. The
> grapheme-based search does not, so which CPs are matched by a
> particular grapheme does not seem to be relevant. Perhaps I'm missing
> something.
I am talking about code point level matches here. ("Partial grapheme
match" means "matches some code points within a grapheme, but not the
whole grapheme.")
>> I was not able to get the library to build, so I was not able to test
>> it. But it does look like it should be a good ICU replacement for my
>> purposes, assuming it doesn't have any serious bugs.
> Huh. What was the build problem? Was it simply that it takes forever
> to build all the tests?
The configuration step failed because it tries to compile and run test
programs in order to gather information about my environment, and I was
running in a cross-compile context which prevents CMake from running the
programs that it compiles. Probably not too hard to work around on my
part by simply not using a cross-compile context.
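For the record, the usual workaround when cross-compiling is to give
CMake an emulator for the target binaries, or to pre-seed the try_run()
results in the toolchain file (general CMake practice, not specific to
this library; the per-check variable names depend on the individual
try_run() call):

# option 1: let CMake run test binaries through an emulator
set(CMAKE_CROSSCOMPILING_EMULATOR "qemu-aarch64;-L;/usr/aarch64-linux-gnu")

# option 2: pre-seed the result a given try_run() would have produced,
# using that check's RUN_RESULT_VAR name (SOME_CHECK is a placeholder)
set(SOME_CHECK 0 CACHE STRING "try_run exit code when cross-compiling")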
-- Rainer Deyke (rainerd_at_[hidden])