From: Zach Laine (whatwasthataddress_at_[hidden])
Date: 2020-06-12 19:56:05


On Fri, Jun 12, 2020 at 12:36 PM Rainer Deyke via Boost
<boost_at_[hidden]> wrote:
>
> My name is Rainer Deyke. My experience of the problem domain is mostly
> as a user of other unicode libraries, ICU in particular. Here is my
> formal review of Boost.Text.
>
> > - What is your evaluation of the library's:
> > * Design
>
> String layer: overall, a solution looking for a problem.

The problem this library was designed to solve was the problem that I
hate std::string. :) It solves this problem for me quite nicely by
being a minimal alternative to same. That being said, as I mentioned
earlier in this thread, the string layer (except for unencoded_rope)
can all just go away if that is reviewers' preference.

> The use of 'int' as the size_type in string/string_view is a time
> bomb waiting to go off. I don't /typically/ have strings longer than
> 2GB, but I do use some text-based file formats for which there is no
> /inherent/ reason why the individual files can't exceed 2GB.

There is also no inherent reason why max_size() is less than SIZE_MAX,
and yet it is on every implementation popular today. To be clear, my
point is that typical use is going to be well within either of these
for mutable string data. Anything atypical can use something else.

> (And no,
> unencoded_rope would not be a better choice. I can't memmap ropes, but
> I can memmap string_views.)

You can easily memmap string_views 2GB at a time, though. That is a
simple workaround for the case you mention, but it is truly a corner
case compared to the much more common case of using strings to hold
contiguous sequences of char. Contiguous sequences of char really
should not be anywhere near 2GB, for efficiency reasons.
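
For instance, something along these lines (a minimal sketch;
std::string_view stands in here for text::string_view, whose size
type is int, hence the INT_MAX-sized chunks):

    #include <algorithm>
    #include <climits>
    #include <cstddef>
    #include <string_view>

    // Walk a memory-mapped region in chunks no larger than INT_MAX,
    // viewing each chunk through a string_view.  mapped/total_size
    // would come from your memory-mapping facility of choice.
    void process(char const * mapped, std::size_t total_size)
    {
        std::size_t const chunk_size = INT_MAX;
        for (std::size_t offset = 0; offset < total_size;
             offset += chunk_size) {
            std::size_t const len =
                std::min(chunk_size, total_size - offset);
            std::string_view chunk(mapped + offset, len);
            // ... operate on chunk ...
        }
    }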

> I like the use of negative indices for indexing from the end in
> Python, but I am ambivalent about using the same feature in C++. None
> of the other types I regularly use in C++ work like that, and the
> runtime cost involved is a lot more noticeable in C++.

I don't really get what you mean about the runtime cost. Could you be
more explicit?

> Also, using the
> same type of counting-from-end indices and counting-from-beginning
> indices seems unsafe. A separate type for counting-from-end would be
> safer and faster, at the cost of being more syntactically heavy.

Interesting. Do you mean something like a strong typedef for "index"
and "negative-index", or something else? Some example code might be
instructive.
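
Something like this, perhaps (a hedged sketch of what I understood,
not anything currently in the library)?

    #include <cstddef>

    // A distinct type for from-the-end indices, so the two kinds of
    // index cannot be mixed up and no runtime sign check is needed.
    struct from_end_t { std::size_t offset; };

    constexpr from_end_t from_end(std::size_t n) { return {n}; }

    template <typename String>
    char at(String const & s, std::size_t i) { return s[i]; }

    template <typename String>
    char at(String const & s, from_end_t i)
    { return s[s.size() - i.offset]; }

    // usage: at(str, 0) is the first char; at(str, from_end(1)) the last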

> My first impression for unencoded_rope_view was that it's useless.
> Then I saw the undo/redo example for segmented_vector, and I have since
> amended my opinion to "mostly useless". To clarify:
> - In the most common use for strings, where the string is created
> once, used, and discarded without ever being modified, unencoded_rope is
> worse than a contiguous string: slower lookup, slower traversal, and
> worse data locality.

Correct. That's not its use case.

> - The most common forms of "modification", in my experience, are
> most efficiently done by reconstructing the entire string, which again
> makes unencoded_rope useless. (And concatenation, I guess, for which
> unencoded_rope is actually great.)

Well, that really depends on the length of the string, the number and
kind of future edits, etc.

> - Even if my modifications are primarily of the type for which
> unencoded_rope is optimized, there would need to be a fairly high
> modification-to-read-operation ratio for unencoded_rope to be a net win.

Agreed. If you use it in cases for which it is designed, it works
better than if you don't. I'm not trying to be snarky; my point is
that it is a niche type -- it is useful anywhere you would like COW
semantics, or anywhere you would like to edit very long strings -- and
you should not use it outside of its niche. How long "very long"
needs to be is best determined by measuring performance. For most
repeated editing cases, I expect it to be far shorter than 2GB -- the
limit of text::string.

> - But the fact that copies on unencoded_ropes are cheap, and the
> fact that these copies remain cheap when one copy is modified, makes
> unencoded_rope very useful for implementing undo/redo in a text
> editor operating on very large text files.

Exactly. That is only one of its two primary use cases, though, as I
indicated above. It also has an even more niche case, in that it is
easily used in a threadsafe manner. My point is that it has its uses,
even if you don't typically need them yourself.

> Unicode layer: overall, very useful.
>
> One thing that appears to be missing is normalization-preserving
> append/insert/erase operations. These are available in the text layer,
> but that means being tied to the specific text classes provided by that
> layer.

Hm. I had not considered making these available as algorithms, and I
generally like the approach. But could you be more specific? In
particular, do you mean that insert() would take a container C and do
C.insert(), then renormalize? This is the approach used by C++20's
erase() and erase_if() free functions. Or did you mean something
else?
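
Something like this is what I would picture (a sketch only;
renormalize() is a hypothetical helper, not part of the library):

    #include <cstddef>
    #include <iterator>

    // Hypothetical: renormalize c in the subrange [first, last),
    // extended outward to the nearest stable code points.
    template <typename Container>
    void renormalize(Container & c, std::ptrdiff_t first,
                     std::ptrdiff_t last);

    // Insert r into c at `at`, then repair normalization around the
    // insertion point, in the style of C++20's erase()/erase_if()
    // free functions.
    template <typename Container, typename CharRange>
    void normalizing_insert(Container & c,
                            typename Container::iterator at,
                            CharRange const & r)
    {
        auto const pos = at - c.begin();
        auto const len = std::distance(std::begin(r), std::end(r));
        c.insert(at, std::begin(r), std::end(r));
        renormalize(c, pos, pos + len);
    }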

> Requiring unicode text to be in Stream-Safe Format is another time bomb
> waiting to go off, but it's also a usability issue. The library should
> provide an algorithm to put unicode text in Stream-Safe Format, and
> should automatically apply that algorithm whenever text is normalized.
> This would make it safe to use Boost.Text on data from an untrusted
> source so long as the data is normalized first, which you have to do
> with untrusted data anyway.

This seems like a good idea to me. I went back and forth over whether
or not to supply the SSF algorithm, since it's not an official Unicode
algorithm, but adding it to text's normalization step would be reason
enough to do so.
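
For reference, the transformation itself is simple. Here is a rough
sketch (ignoring the refinement that a code point's canonical
decomposition also counts toward the run; ccc() stands in for the
combining class lookup):

    #include <cstdint>
    #include <vector>

    int ccc(uint32_t cp); // canonical combining class lookup (stand-in)

    // UAX #15 Stream-Safe Text Format: break any run of more than 30
    // non-starters by inserting U+034F COMBINING GRAPHEME JOINER,
    // which has ccc 0 and so resets the run.
    std::vector<uint32_t> stream_safe(std::vector<uint32_t> const & cps)
    {
        std::vector<uint32_t> result;
        int nonstarters = 0;
        for (uint32_t cp : cps) {
            if (ccc(cp) != 0) {
                if (nonstarters == 30) {
                    result.push_back(0x034F); // CGJ
                    nonstarters = 0;
                }
                ++nonstarters;
            } else {
                nonstarters = 0;
            }
            result.push_back(cp);
        }
        return result;
    }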

> Text layer: overall, I don't like it.
>
> On one hand, there is the gratuitous restriction to FCC. Why can't
> other normalization forms be supported, given that the unicode layer
> supports them?

Here is the philosophy: If you have a template parameter for
UTF-encoding and/or normalization form, you have an interoperability
problem in your code. Multiple text<...>'s may exist for which there
is no convenient or efficient interop story. If you instead convert
all of your text to one UTF+normalization that you use throughout your
code, you can do all your work in that one scheme and transcode and/or
renormalize at the program input/output boundaries.
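
In sketch form, that discipline looks like this (read_bytes() and
to_utf8_fcc() are hypothetical stand-ins for your I/O and for the
transcoding+normalization steps, whatever you use for them):

    #include <istream>
    #include <string>

    std::string read_bytes(std::istream &);       // hypothetical
    std::string to_utf8_fcc(std::string const &); // hypothetical

    // Whatever encoding and normalization arrive at the input
    // boundary, convert once, here; everything inside the program
    // then assumes UTF-8 + FCC, and the matching conversion happens
    // only at the output boundary.
    std::string load_text(std::istream & in)
    {
        return to_utf8_fcc(read_bytes(in));
    }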

Practically, the world seems to have standardized on UTF-8, NFC (or is
doing so rapidly). I'm not thrilled that I have to use FCC instead of
NFC, but it's close. The reason for using FCC is that it removes the
need to normalize to NFD before doing collation. Collation is
expensive enough without adding memory allocation and NFD
normalization to it. I'm experimenting with adding extra tailorings
to Boost.Text's collation tables to make collation compatible with
NFC. I'm not yet sure if that can be made to work.

> On the other hand, there is the restriction to the base types from the
> string layer. Why can't 'text' be a class template taking the
> underlying container class as an argument? (Including std::string?)

Because, again, I want there to be trivial interop. Having
text<text::string> and text<std::string> serves what purpose exactly?
That is, I have never seen a compelling use case for needing both at
the same time. I'm open to persuasion, of course.

> I have a lot of code, much of it in third-party libraries (such as
> Boost!), that uses std::string as a basic vocabulary type. Converting
> between Boost.Text strings and std::strings at all API boundaries is not
> viable. If I can't use the text layer with std::string, then I don't
> want it.

I understand. See my previous comments about my lack of preciousness
about text::string. :)

> > * Implementation
>
> I didn't look at it in detail.
>
> > * Documentation
>
> Overall, very clear and useful.
>
> The warning about undefined behavior if the data is not in Stream-Safe
> Format should be at the top of every page to which it applies, in big
> and bold text, on a red background, ideally blinking. Or, the library
> could take measures to make it reasonably safe for handling untrusted
> data.

The SSF assumption is explicitly allowed in the Unicode standard, and
it's less onerous than not checking array-bounds access in operator[]
in one's array-like types. Buffer overflows are really common, and
SSF violations are not. That being said, I can add the
SSF-conformance algorithm as mentioned above.

> .../the_unicode_layer/collation.html: Unicode Technical Note #5
> describes an /extension/ to the unicode collation algorithm which /also/
> allows it to handle FCC /in addition to/ NFD. If your collation
> algorithm doesn't handle NFD correctly, then there is something
> seriously wrong with it, and I expect that it will also fail on the
> corner cases of FCC.

It does not fail for FCC, but does fail for NFC. I honestly have not
tried it with NFD yet. I will do so.

> Describing NFD as "memory-hogging" is FUD, as the actual difference in
> memory usage between NFD and FCC is trivial for most real-world data.
> (I suppose Korean Hangul would be a major exception to this.)

Ok. I can certainly remove that from the docs.

> .../the_unicode_layer/searching.html: the note at the end of the
> page is wrong, assuming you implemented the algorithms correctly. The
> concerns for searching NFD strings are similar to the concerns for
> searching FCC strings.
>
> In both FCC and NFD:
> - There is a distinction between A+grave+acute and A+acute+grave,
> because they are not canonically equivalent.
> - A+grave is a partial grapheme match for A+grave+acute.
> - A+acute is not a partial grapheme match for A+grave+acute.
> - A+grave is not a partial grapheme match for A+acute+grave.
> - A+acute is a partial grapheme match for A+acute+grave.
> But:
> - A is a partial grapheme match for A+grave+acute in NFD, but not in FCC.

Hm. That note was added because the specific case mentioned fails to
work for NFD, but works for FCC and NFC. I think the grapheme
matching analysis above addresses something different from what the
note is talking about -- the note is concerned with code-point-level
searches producing (perhaps) surprising results. The
grapheme-based search does not, so which CPs are matched by a
particular grapheme does not seem to be relevant. Perhaps I'm missing
something?

> > * Tests
>
> I found it hard to get a good overview of the tests, seeing as a lot of
> them are dynamically generated test files.

Yes, and it's difficult for me, too. There are a lot of files in
there. As Emil suggested on another thread, I'm going to partition
the tests into generated and non-. Unfortunately, that will happen
after the review.

> One thing I noticed is that there are extensive normalization tests for
> the four basic unicode normalization forms, but none for FCC.

Yeah, that's because they don't exist, as far as I know. There are no
such tests in ICU or on the Unicode website that I can find. That
being said, since everything related to collation and the Text layer
uses FCC, and it all works, I think that sufficient coverage exists.
If it makes you feel any more sure, the algorithm I use for
normalization is just an adaptation of ICU's code, and the NFC and FCC
code paths differ in only a couple of places, controlled by a bool
passed to the normalization algorithm -- it's 99.9% the same code.

> In the text_.cpp test file, there are some raw unicode code points
> without either the actual character or its name, making it hard to see
> what's going on.

That was to make it clear to me which code point is being used. :) To
each his own, I guess.

> In the text_.cpp test file, in the normalization tests, there are some
> edge cases that do not seem to be tested:
> - Erasing a combining character such that another combining character
> merges into the previous base character. (Example: removing the cedilla
> in a A+cedilla+ring_above sequence, yielding the precomposed
> A_with_ring_above character.)
> - Inserting a combining character such that this causes a precomposed
> character to decompose. (Example: inserting a cedilla after
> A_with_ring_above, yielding an A+cedilla+ring_above.)
> - Inserting a combining character that requires existing combining
> characters to be reordered.

Thanks! I'll add those.
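
For instance, I imagine the first one looking something like this
(normalize_fcc() is a hypothetical stand-in; I'm not tying the sketch
to the exact text API):

    #include <cassert>
    #include <string>

    std::string normalize_fcc(std::string const &); // stand-in

    // Erasing the cedilla (U+0327, ccc 202) from A+cedilla+ring_above
    // should let the ring above (U+030A, ccc 230) compose with the
    // base, yielding precomposed U+00C5.
    void test_erase_rejoins_combiners()
    {
        std::string cps = "A\u0327\u030A"; // UTF-8: 41 CC A7 CC 8A
        cps.erase(1, 2);                   // drop the cedilla's bytes
        assert(normalize_fcc(cps) == "\u00C5");
    }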

> "empty" is misspelled as "emtpy" in text_.cpp.

Ah, thanks.

> > * Usefulness - Did you attempt to use the library?
>
> I was not able to get the library to build, so I was not able to test
> it. But it does look like it should be a good ICU replacement for my
> purposes, assuming it doesn't have any serious bugs.

Huh. What was the build problem? Was it simply that it takes forever
to build all the tests?

Zach

