Boost logo

Boost :

From: Zach Laine (whatwasthataddress_at_[hidden])
Date: 2020-06-22 06:31:26


On Thu, Jun 18, 2020 at 1:22 PM Phil Endecott via Boost
<boost_at_[hidden]> wrote:
>
> This is my review of Zach Laine's proposed Boost.Text.

Thanks for the review, Phil.

> Zach is offering several things here: Unicode, a "better string",
> a trie, and more. I'm pleased that these components are all
> individually exposed to users and are independent of each other,
> rather than some being hidden as implementation details.
>
> I've spent a couple of days looking at a few parts of the library.
> I'll discuss at the Unicode stuff first.
>
> Boost gained support for Unicode functionality when Boost.Locale
> was accepted almost a decade ago, but I was not particularly
> impressed by that library and have not used it in my own work.
> Does the current proposal do better? I think it does, and
> though it's not perfect I think it would be worth accepting but
> subject to a concern about copyright.
>
> See another thread for the copyright discussion. Basically,
> if I understand correctly, Zach's code includes tables that are
> derived from Unicode data and the licence for that data requires
> a copyright attribution. The Boost licence does not require
> attribution and I support that. So if I've understood the
> situation correctly - and I am not a lawyer! - this means that
> sadly Boost must REJECT the affected code.

Agreed.

> I'll say some more about the other aspects of the Unicode code,
> just in case the copyright issue turns out to be surmountable.

Me too. I've sent an email to someone at ICU, as alluded to earlier
in this thread.

> I've tried just a couple of things. Firstly, I tried the UTF-8
> to UTF-32 conversion. This seems to work, and the API is
> reasonable. Performance is not bad, but it could be better
> (see another thread); that could be fixed later. I think it
> would be worth having just a UTF library in Boost!

Could you provide the exact scenario you tried, and point me to the
conversion technique that you used that was faster?

> Secondly I looked at the line-splitting algorithm, i.e.
>
> template<typename CPIter, typename Sentinel, typename Extent,
> typename CPExtentFunc>
> unspecified lines(CPIter first, Sentinel last, Extent max_extent,
> CPExtentFunc cp_extent, bool break_overlong_lines = true);
>
> Again this seems to work, and the API is reasonable. Reference
> documentation is somewhat lacking though; in the description of
> lines(), the requirements for CPExtentFunc need to be described,
> and the "unspecified" return type needs to document that it includes
> members including iter and hard_line_break. I think there are
> probably similar omissions in much of the reference documentation.

Thanks! I think all those omissions in the reference docs are covered
by the tutorial docs, but yeah, I definitely need to add explanations
there.

> I'm unconvinced by the use of the word "sentinel" here. A sentinel
> is a value with a special meaning, e.g. a terminating null; what
> you have here is an end-iterator. It's good that you're allowing
> the begin and end iterators to have different types, but I think
> you should name that type something like CPEndIter.

This is standard terminology, as of C++20. See
http://eel.is/c++draft/ranges for numerous uses. It means
end-iterator-or-other-type-that-acts-as-the-end-tag.

> The implementation of line splitting seems to call its word
> length callback more often than necessary, i.e. when a word
> overflows a line, that word is subsequently measured a second
> time. That should be easily fixable.

Thanks, I'll revisit that.

> Moving on to other parts of the proposal, I'm less happy with
> text::string. The first thing I noticed was the very strange
> decision to use int, rather than (signed?) size_t, for indexing;
> yes I do work with chunks of text larger than 2 GB! I would
> REJECT text::string for this alone.

Sure. You're in good company.

> I also dislike the use of operator() to extract substrings; the
> meaning just isn't obvious:
>
> auto x = y(4);

I think this is true because of lack of familiarity. Seeing this a
few times would make it familiar pretty quickly. That being said, I'd
prefer if I could write y[0:4] or y[0,4] but I can't, at least not
without using some kind of strong typedef and a UDL, like
y[0_something,4]. Anyway, the string types are moot at this point.

> Is that the 4th character of y? Or the substring from 4 to the
> end? No, it's the substring from the start to 3. I'd need to
> add a comment to explain what it's doing. But actually there's
> another problem with that line: x is a non-owning view of that
> substring. I think that methods that return a view (or span,
> etc.) should have a name that makes that clear, e.g. "subview".
> (This wouldn't have been a problem before auto.) This came up in
> the static_string review, where the substr() method was originally
> proposed to return a view; that was renamed subview() with
> substr() returning a copied string.

Again, this is a question of familiarity. If std::string did not have
substr() (as text::string does not) there would not be confusion.
Silent memory allocations (substr() semantics) are a larger footgun
than dangling references (op() semantics) in the age of ASan, IMO.

> I don't mind that the string isn't templated on the char type,
> but I do miss the allocator support; I've used strings with arena
> allocators (for performance) and with shared memory allocators.
> I know allocator support is a lot of work but it is necessary, IMO.

I disagree. Allocators are a poor design, in that they are too
general. Consider std::vector. An insert operation has different
optimal implementations depending on whether the vector is a fixed
capacity, inline-storage type or a heap-allocated type. An allocator
changed the allocation strategy, but leaves all the other code that
often depends on the allocation semantics unchanged. A better design
is to have a separate templates for inline storage (e.g.
boost::container::static_vector), sbo-optimized heap storage (e.g.
boost::container::small_vector), and heap storage (e.g. std::vector).
Many other libraries have come to the same conclusion, including FB
Folly, LLVM's internal lib, and more.

> text::text seems to be the obvious combination of text::string
> with unicode functionality, and I've not looked at that.

If you have time, please take a look. text::text is almost certainly
soon to be reimplemented in terms of std::string.

> I have had a look at the trie; I have sometimes wondered to what
> extent a trie can be implemented in C++ that works as much as
> possible like a standard container so this is interesting.
>
> The docs should mention briefly why trie is here, i.e. where it
> is used, just in anticipation of that obvious question from users.
>
> trie::operator[] returns an "optional ref". But it's not like
> a std::optional; as well as operator* it has conversion operators
> to access the value. Zach notes that this is "wonky" when the
> value type is bool, but it also seems to break for me when it's
> int:
>
> boost::trie::trie_map<std::string, std::string> counts;
> auto x = counts["Hello"];
> if (x) std::cout << "present\n"; else std::cout << "absent\n";
> // Prints absent, OK.
>
> boost::trie::trie_map<std::string, int> int_counts; // Now int.
> auto y = int_counts["Hello"];
> if (y) std::cout << "present\n"; else std::cout << "absent\n";
>
> // ../text/include/boost/text/trie.hpp:78:
> // boost::trie::optional_ref<T, Const>::operator T&() &
> // [with T = int; bool Const = false]: Assertion `t_' failed.

Thanks! I was missing an operator bool() overload for *mutable*
lvalue ref. It works when I add that.

> operator T&() is being called by if(y), rather than
> operator bool(). There aren't has_value() and value() methods
> that I can use to work around this.
>
> Like a few other things in this proposal, Zach has implemented
> trie::operator[] how he wishes the standard library worked.
> As another example, dereferencing a trie_map iterator doesn't
> return a pair<KEY,VALUE> as a normal associative container
> would, but rather a struct with members called key and value.
>
> At worst, this will hit problems that the standard solutions
> avoid (as above). At best, it will make the library more
> difficult to learn for users who are familiar with the standard
> library (including its problems),

I think Boost should be a place to experiment with approaches to
addressing those problems, rather than propagating them. I find
.first and .second to be horrible names for the key-value pair used to
represent the mapping of a std::map. I don't think it's appropriate
for new code. Code that uses structured bindings should not be
affected.

> and make generic code less
> likely to work with it.
>
> As a test I wrote a simple program to count word frequencies:
>
> using counts_t = std::map<std::string,int>;
> counts_t counts;
>
> while (std::cin.good()) {
> std::string s;
> std::cin >> s;
> counts[s]++;
> }
>
> for (auto i: counts) {
> std::cout << i.first << " = " << i.second << "\n";
> }
>
> It would be great if I could just change it to:
>
> using counts_t = boost::trie_map<std::string,int>;
>
> but I can't do that; I need to change things:
>
> struct count { int value = 0; }; // Wrap the int.
> boost::trie::trie_map<std::string, count> counts;
>
> while (std::cin.good()) {
> std::string s;
> std::cin >> s;
>
> // operator[] doesn't insert, so:
> auto r = counts[s];
> if (r) r->value++;
> else counts.insert(s,count{1});
> }
>
> for (auto i: counts) {
> std::cout << i.key << " = " << i.value.value << "\n";
> }
>
> That works; how about performance? Trie should perform
> well when the stored strings frequently share common prefixes,
> so I've tested with a large web server log file; this has
> timestamps and URLs which share common prefixes. The results:
>
> std::map: run time 1.2 seconds, memory 1007 pages
> trie_map: run time 1.8 seconds, memory 4868 pages
>
> So that's rather disappointing.

First, it's important to note that frequent common prefixes don't
necessarily make the trie more lookup-efficient. If you are looking
for "fred" and you actually have a trie containing f -> r- > e -> d as
nodes, you'll visit all 4 nodes as you look up "fred". For a string
of 10 chars, you'll visit all 10 nodes for a match, etc. For an R/B
tree like std::map, you'll only visit log(N) nodes, where N is the
size() of the map. This is unlikely to be nearly as large as the
longest string for many data sets.

The docs explicitly state that the trie_map and trie_set templates are
there for convenience, and that if you care about performance you
should use trie. Moreover, if you're trying to eat the longest token
that matches some list of tokens encoded in a trie, the trie will be a
lot faster, since you don't have to redo a log(N) search M times,
where M is the length of the token you're eating.

If you're using a trie outside of its niche, you should not expect
ideal outcomes. You should use the thing that actually delivers ideal
outcomes in that case you care about.

> Some general comments:
>
> Some people care about whether libraries are header-only or not,
> and the docs should mention this, and possibly say which
> functionality can be expected to work header-only.
>
> I got quite a few missing field initializer and unused parameter
> warnings; please fix these if at all possible. (gcc 6.3, -W -Wall).
> I think you should re-order the docs so that the "Hello World"
> example is much earlier.
>
> Some functions clearly need to embed large Unicode tables; it
> would be useful if the docs mentioned roughly how much the
> executable size is increased by.
>
> Some source files are executable.
>
> Conclusions:
>
> I fear that the only part of this that Boost should accept is
> the UTF transcoding. The implementation is not perfect but we
> could work on that. The other Unicode stuff would be good to
> have but the copyright issue is a blocker.

Zach


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk