From: Rainer Deyke (rdeyke_at_[hidden])
Date: 2020-06-12 17:04:53


My name is Rainer Deyke. My experience of the problem domain is mostly
as a user of other Unicode libraries, ICU in particular. Here is my
formal review of Boost.Text.

> - What is your evaluation of the library's:
> * Design

String layer: overall, a solution looking for a problem.

The use of 'int' as the size_type in string/string_view is a time
bomb waiting to go off. I don't /typically/ have strings longer than
2GB, but I do use some text-based file formats for which there is no
/inherent/ reason why the individual files can't exceed 2GB. (And no,
unencoded_rope would not be a better choice. I can't memmap ropes, but
I can memmap string_views.)
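
To make the hazard concrete, here is a minimal illustration (my own
code, not the library's): a memory-mapped file reports a 64-bit size,
and anything past INT_MAX bytes cannot survive narrowing to an
int-sized view.

    #include <climits>
    #include <cstdint>
    #include <iostream>

    int main()
    {
        // Size as reported by the OS for a hypothetical 3 GiB memory-mapped file.
        std::uint64_t mapped_size = std::uint64_t(3) * 1024 * 1024 * 1024;

        // A view whose size_type is int cannot represent this length.
        std::cout << (mapped_size > std::uint64_t(INT_MAX)) << '\n';  // prints 1
        std::cout << static_cast<int>(mapped_size) << '\n';  // wraps to a negative value on typical implementations
    }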

I like the use of negative indices for indexing from the end in
Python, but I am ambivalent about using the same feature in C++. None
of the other types I regularly use in C++ work like that, and the
runtime cost involved is a lot more noticeable in C++. Also, using the
same type for counting-from-end indices and counting-from-beginning
indices seems unsafe. A separate type for counting-from-end would be
safer and faster, at the cost of being more syntactically heavy.
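
For illustration, a distinct from-end index type along the lines I mean
could look like this (the names are hypothetical, not part of
Boost.Text):

    #include <cstddef>

    struct from_end
    {
        std::size_t offset;  // distance back from one past the last element
    };

    template <typename String>
    char at(String const& s, std::size_t i) { return s[i]; }

    template <typename String>
    char at(String const& s, from_end i) { return s[s.size() - i.offset]; }

    // at(str, 0) is the first character; at(str, from_end{1}) is the last.
    // The intent is visible at the call site, and no runtime sign check is needed.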

My first impression of unencoded_rope_view was that it's useless.
Then I saw the undo/redo example for segmented_vector, and I have since
amended my opinion to "mostly useless". To clarify:
   - In the most common use for strings, where the string is created
once, used, and discarded without ever being modified, unencoded_rope is
worse than a contiguous string: slower lookup, slower traversal, and
worse data locality.
   - The most common forms of "modification", in my experience, are
most efficiently done by reconstructing the entire string, which again
makes unencoded_rope useless. (And concatenation, I guess, for which
unencoded_rope is actually great.)
   - Even if my modifications are primarily of the type for which
unencoded_rope is optimized, there would need to be a fairly high
modification-to-read-operation ratio for unencoded_rope to be a net win.
   - But the fact that copies of unencoded_ropes are cheap, and the
fact that these copies remain cheap when one copy is modified, makes
unencoded_rope very useful for implementing undo/redo in a text
editor operating on very large text files (sketched below).
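
A rough sketch of that undo/redo pattern, with std::string standing in
for whatever rope-like type actually backs the editor (with a real rope
the snapshots share almost all of their storage, so each edit is cheap):

    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    template <typename Text>
    class history
    {
    public:
        explicit history(Text initial) : snapshots_{std::move(initial)} {}

        Text const& current() const { return snapshots_[current_]; }

        void edit(Text next)
        {
            snapshots_.resize(current_ + 1);        // drop any redo branch
            snapshots_.push_back(std::move(next));  // cheap copy if Text is a rope
            ++current_;
        }

        void undo() { if (current_ > 0) --current_; }
        void redo() { if (current_ + 1 < snapshots_.size()) ++current_; }

    private:
        std::vector<Text> snapshots_;
        std::size_t current_ = 0;
    };

    // Usage: history<std::string> h("hello"); h.edit("hello world"); h.undo();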

Unicode layer: overall, very useful.

One thing that appears to be missing is normalization-preserving
append/insert/erase operations. These are available in the text layer,
but that means being tied to the specific text classes provided by that
layer.

Requiring Unicode text to be in Stream-Safe Format is another time bomb
waiting to go off, but it's also a usability issue. The library should
provide an algorithm to put Unicode text in Stream-Safe Format, and
should automatically apply that algorithm whenever text is normalized.
This would make it safe to use Boost.Text on data from an untrusted
source so long as the data is normalized first, which you have to do
with untrusted data anyway.
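
For reference, the transformation I have in mind is the Stream-Safe
Text Process from UAX #15: break any run of more than 30 non-starters
by inserting U+034F COMBINING GRAPHEME JOINER. A simplified sketch
(it uses a stand-in combining-class lookup and ignores the refinement
that counts non-starters in each character's canonical decomposition):

    #include <cstdint>
    #include <vector>

    // Stand-in for a real canonical-combining-class lookup; Boost.Text has
    // the full data.  Treating the basic combining diacritics block as
    // non-starters is enough to show the shape of the algorithm.
    std::uint8_t ccc(char32_t cp)
    {
        if (cp == U'\u034F')
            return 0;  // CGJ is itself a starter
        return (cp >= 0x0300 && cp <= 0x036F) ? 230 : 0;
    }

    std::vector<char32_t> stream_safe(std::vector<char32_t> const& in)
    {
        std::vector<char32_t> out;
        int non_starters = 0;
        for (char32_t cp : in) {
            if (ccc(cp) != 0) {
                if (non_starters == 30) {
                    out.push_back(U'\u034F');  // CGJ resets the run
                    non_starters = 0;
                }
                ++non_starters;
            } else {
                non_starters = 0;
            }
            out.push_back(cp);
        }
        return out;
    }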

Text layer: overall, I don't like it.

On one hand, there is the gratuitous restriction to FCC. Why can't
other normalization forms be supported, given that the Unicode layer
supports them?

On the other hand, there is the restriction to the base types from the
string layer. Why can't 'text' be a class template taking the
underlying container class as an argument? (Including std::string?)

I have a lot of code, much of it in third-party libraries (such as
Boost!), that uses std::string as a basic vocabulary type. Converting
between Boost.Text strings and std::strings at all API boundaries is not
viable. If I can't use the text layer with std::string, then I don't
want it.
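
What I have in mind is something along these lines (basic_text here is
my own sketch, not the library's actual class):

    #include <string>
    #include <utility>

    template <typename Storage = std::string>
    class basic_text
    {
    public:
        explicit basic_text(Storage s) : storage_(std::move(s))
        {
            // A real implementation would normalize here, exactly as the
            // current text layer does, independently of Storage.
        }

        // Hand the underlying container back to std::string-based APIs.
        Storage const& storage() const { return storage_; }

        // insert/erase/replace would re-normalize around the edit, as the
        // existing text type does.

    private:
        Storage storage_;
    };

    using text = basic_text<>;  // std::string-backed by default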

> * Implementation

I didn't look at it in detail.

> * Documentation

Overall, very clear and useful.

The warning about undefined behavior if the data is not in Stream-Safe
Format should be at the top of every page to which it applies, in big
and bold text, on a red background, ideally blinking. Or, the library
could take measures to make it reasonably safe to handle untrusted
data.

.../the_unicode_layer/collation.html: Unicode Technical Note #5
describes an /extension/ to the Unicode Collation Algorithm which /also/
allows it to handle FCC /in addition to/ NFD. If your collation
algorithm doesn't handle NFD correctly, then there is something
seriously wrong with it, and I expect that it will also fail on the
corner cases of FCC.

Describing NFD as "memory-hogging" is FUD, as the actual difference in
memory usage between NFD and FCC is trivial for most real-world data.
(I suppose Korean Hangul would be a major exception to this.)

.../the_unicode_layer/searching.html: the note at the end of the
page is wrong, assuming you implemented the algorithms correctly. The
concerns for searching NFD strings are similar to the concerns for
searching FCC strings.

In both FCC and NFD:
   - There is a distinction between A+grave+acute and A+acute+grave,
because they are not canonically equivalent.
   - A+grave is a partial grapheme match for A+grave+acute.
   - A+acute is not a partial grapheme match for A+grave+acute.
   - A+grave is not a partial grapheme match for A+acute+grave.
   - A+acute is a partial grapheme match for A+acute+grave.
But:
   - A is a partial grapheme match for A+grave+acute in NFD, but not in FCC.
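
For concreteness, here are the sequences above spelled out as code
points (U+0300 is COMBINING GRAVE ACCENT, U+0301 is COMBINING ACUTE
ACCENT; both have combining class 230, so their relative order is
significant and the first two strings are not canonically equivalent):

    #include <string>

    std::u32string const a_grave_acute = U"A\u0300\u0301";  // A + grave + acute
    std::u32string const a_acute_grave = U"A\u0301\u0300";  // A + acute + grave
    std::u32string const a_grave       = U"A\u0300";  // partial match for the first only
    std::u32string const a_acute       = U"A\u0301";  // partial match for the second only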

> * Tests

I found it hard to get a good overview of the tests, seeing as a lot of
them are dynamically generated test files.

One thing I noticed is that there are extensive normalization tests for
the four basic Unicode normalization forms, but none for FCC.

In the text_.cpp test file, there are some raw Unicode code points
without either the actual character or its name, making it hard to see
what's going on.

In the text_.cpp test file, in the normalization tests, there are some
edge cases that do not seem to be tested:
   - Erasing a combining character such that another combining character
merges into the previous base character. (Example: removing the cedilla
in an A+cedilla+ring_above sequence, yielding the precomposed
A_with_ring_above character.)
   - Inserting a combining character that causes a precomposed
character to decompose. (Example: inserting a cedilla after
A_with_ring_above, yielding an A+cedilla+ring_above sequence.)
   - Inserting a combining character that requires existing combining
characters to be reordered.

"empty" is misspelled as "emtpy" in text_.cpp.

> * Usefulness - Did you attempt to use the library?

I was not able to get the library to build, so I was not able to test
it. But it does look like it should be a good ICU replacement for my
purposes, assuming it doesn't have any serious bugs.

My vote: ACCEPT the Unicode layer, but only the Unicode layer. ABSTAIN
on the other layers, with the caveat that if they are accepted, the
issues I have raised should be addressed.

-- 
Rainer Deyke (rainerd_at_[hidden])
