From: Matt Borland (matt_at_[hidden])
Date: 2024-12-13 18:20:18
> > I received the following review directly from Ivan Matek, because the review is written as a PDF that exceeds the allowable size of the ML. The document can be accessed here: https://drive.proton.me/urls/A6NCZVTK58#yuGnqqnzYlhA
>
>
> Please post this at a minimum as an HTML email. Preferably reformat to be text-only.
>
> Our future plan for the new boost.org is that we will collect and index reviews in the site database and cross-reference them from library pages. Reviews submitted as nicely formatted PDFs unfortunately will not be picked up by this. It would be best if all reviews were posted to the mailing list so they can be accessed by tooling.
Below is my attempt at converting this to text for posterity.
Boost.Hash2 Review - Ivan Matek December 2024
About my Preferences/Experience/Expectations
I am not a Boost developer. I am a longtime C++ developer.
I know of the issues with std::hash, but I am not familiar with Boost.ContainerHash. I am pretty sure I used hash_combine years ago, but I am not sure. I prefer modern C++ and dislike features that are easy to use wrong. I do not like WG21 ABI/keyword (e.g. co_await) decisions, as I feel they are ruining the language. In terms of safety vs performance my opinion is that it depends: in general I feel underlying code can use unsafe constructs, as those components are much more used/tested than user-facing API; when it comes to user code I prefer safer, harder-to-misuse constructs, unless we are talking about low latency or HPC domains.
When first hearing the name Hash2 I originally expected that the library would help many C++ developers that are frustrated with the fact C++23 std does not know how to hash std::pair or std::vector. That is still the main use case I am considering in this review, but during discussions on the mailing list and reading the documentation I have learned that it is good to also consider Hash2 as a general framework for decoupling type value representation and hashing algorithms.
My knowledge of cryptography is basic.
Bias
I have not interacted in person with the primary author. But over the years he has replied many times to my questions on the mailing list, and I have noticed he is active on reddit where he often provides very good comments/answers; I often use the named argument trick I learned from his blog. Boost.Mp11 is one of my favorite Boost libraries. Besides that, I know the author did a lot of work in Boost, e.g. fast hash maps and Boost.Describe.
Due to all things above I consider him a world class C++ expert.
With regards to the second author, if I remember correctly we have never interacted. As for the authors of the original old WG21 proposal, I know that Howard is the author of std::chrono, has provided many great answers on SO, and IIRC was the author of the original versions of libc++, so I also consider him a world class C++ expert.
Limitations
I did not follow any other discussion except the mailing list.
Design
I think the general C++ design of the library is great, so I will not spend much time on that, beyond saying that, much like the STL separates iterators and algorithms, it separates hash algorithms and object representations. Besides that, not hashing the size when the size of a type is constant is a smart optimization.
C++11
With regard to the decision to support C++11, I think new Boost libraries should not support C++11, but that is a topic that has been discussed at length, and in any case it is not a reason for library acceptance or rejection. It is possible that with C++17 or C++20 the API could be nicer, but I can not judge that without actually seeing it, beyond assuming that a concept for the algorithm could give nicer error messages.
HashAlgorithm Concept
Unfortunately I do not know enough about the domain to judge it, or whether it would be better to have more than one concept.
Describe
With regards to the dependency on Describe: I feel Describe is a great library (although I have never used it in production), so I see this as a huge benefit, not a liability. I am very happy to see Boost libraries combined to produce nice code. I understand that people worry about paying for the compilation of includes they do not use, but I think the benefits outweigh this unfortunate downside.
Variant
I do not understand why variant is not supported.
There are three main areas where I think there is opportunity for improvement:
- Easy to use, hard to misuse wrappers for users
- More advanced avoidance of hashing size
- Cryptographic guarantees
Easy to Use Hard to Misuse
It is possible that I am misunderstanding the reasons for the signature of the tag_invoke function, but it is not clear to me why the current:
template <class Hash, class Flavor>
friend void tag_invoke(boost::hash2::hash_append_tag const&,
    Hash& h, Flavor const& f, X const& v)
{
    boost::hash2::hash_append(h, f, v.a);
    boost::hash2::hash_append(h, f, v.b);
}
can not be
template <class HashAppender>
friend void tag_invoke(boost::hash2::hash_append_tag const&,
    HashAppender& ha, X const& v)
{
    ha.hash_append(v.a);
    ha.hash_append(v.b);
}
(Flavor is encoded in type HashAppender).
This removes from the user the ability to mistakenly call update or result, or to pass the wrong flavor (e.g. by passing {}). It also removes the ability to copy-construct a local copy of h, but it is not clear to me whether use cases for that would be common.
Notice that this would remove access to update from most users. I mention this because there was a lot of discussion about span vs char*, len.
In my opinion we can have the cake and eat most of it with wrapper helpers.
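As an illustration, here is a minimal sketch of the kind of wrapper helper I have in mind, layered on top of the existing hash_append; the name hash_appender and its interface are my own invention, not part of Hash2:

#include <boost/hash2/hash_append.hpp>

// Carries both the hash object and the flavor, and exposes only
// hash_append, so user-side tag_invoke overloads cannot accidentally
// call update()/result() or pass a wrong flavor.
template <class Hash, class Flavor>
class hash_appender
{
    Hash& h_;
    Flavor const& f_;

public:
    hash_appender(Hash& h, Flavor const& f): h_(h), f_(f) {}

    template <class T>
    void hash_append(T const& v)
    {
        boost::hash2::hash_append(h_, f_, v);
    }
};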
A similar example is the mailing list suggestion to replace result() with finalize() and make further calls UB. I have no strong opinion here, since I do not understand how realistic repeatedly calling result() is, what benefits it provides, and, if it is useful, why (according to the mailing list) no other library supports that use, so we are certainly discussing more than just naming here, and those design discussions are much more important than the addition of a simple wrapper…
But I think this is another opportunity for a wrapper that uses result() internally, but only exposes finalize() to the user.
Here the wrapper may have a small potential overhead compared to a finalize() design, since result() mutates the state, while finalize() could rely on further calls being UB and skip the state update. It is possible that an optimizer might realize this even when the wrapper is used, but I would guess that is too much to expect from it.
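A rough sketch of such a wrapper, assuming the HashAlgorithm requirements expose result_type (the name finalizing_hasher is mine); it trades the UB of a real finalize() for a debug assertion:

#include <boost/hash2/hash_append.hpp>
#include <cassert>

template <class Hash>
class finalizing_hasher
{
    Hash h_;
    bool finalized_ = false;

public:
    template <class T>
    void append(T const& v)
    {
        assert(!finalized_); // appending after finalize() is a bug
        boost::hash2::hash_append(h_, {}, v);
    }

    // result() is called internally, exactly once
    typename Hash::result_type finalize()
    {
        assert(!finalized_);
        finalized_ = true;
        return h_.result();
    }
};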
Another case mentioned on the mailing list is hashing a char*, which is probably not what the user wants since it hashes the pointer, not the pointed-to string. I know the library must allow it, e.g. how else would a user hash the char* inside a std::tuple<double, char*> as an integral value, and sometimes a raw pointer is the key in a map… but it could be a longer-named function the user must call explicitly, while the common-case hash_append does not allow pointer values.
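Something along these lines, where the name hash_append_pointer_value is purely hypothetical:

#include <boost/hash2/hash_append.hpp>
#include <cstdint>

// Deliberately hashes the pointer value itself, never the pointee,
// and says so in its name.
template <class Hash, class Flavor, class T>
void hash_append_pointer_value(Hash& h, Flavor const& f, T* p)
{
    boost::hash2::hash_append(h, f, reinterpret_cast<std::uintptr_t>(p));
}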
Another opportunity for wrappers is to have different wrappers for when the user is concatenating the bytes of a byte stream versus when the user is concatenating values. Here the helper makes the choice of whether to append the size for types like std::span<char> or vector<uint8_t>, i.e. ranges of bytes, clearer, since the choice is in the helper's name.
Simple examples:
simple_hash_values<fnv1a_32> h;
std::string x{"ab"};
std::string y{"c"};
h.append(x);
h.append(y);
will produce a different result than if we did
simple_hash_values<fnv1a_32> h;
std::string x{"a"};
std::string y{"bc"};
h.append(x);
h.append(y);
but these two cases produce the same result:
simple_hash_bytes<fnv1a_32> h;
std::string x{"ab"};
std::string y{"c"};
char z[2] = {'d', 'e'};
h.append(x);
h.append(y);
h.append(z);

simple_hash_bytes<fnv1a_32> h;
std::string_view x{"abcd"};
std::array y = {'e'};
h.append(x);
h.append(y);
It is not a requirement for a framework to provide wrappers; e.g. Boost.ASIO does not have to provide websocket code, it is fine for Beast to do it. But I wonder if in this case the potential for bugs is so large that it would be beneficial to add helpers to the "core" library.
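For reference, a minimal sketch of what the hypothetical simple_hash_values / simple_hash_bytes helpers used above could look like; the names and the append interface are mine, not part of Hash2:

#include <boost/hash2/hash_append.hpp>
#include <iterator>

template <class Hash>
class simple_hash_values
{
    Hash h_;

public:
    // Appends a complete value; for variable-size ranges the size is
    // appended too, so "ab" + "c" differs from "a" + "bc".
    template <class T>
    void append(T const& v) { boost::hash2::hash_append(h_, {}, v); }

    typename Hash::result_type result() { return h_.result(); }
};

template <class Hash>
class simple_hash_bytes
{
    Hash h_;

public:
    // Appends only the elements of a contiguous range, never its size,
    // so any chunking of the same byte stream hashes identically.
    template <class R>
    void append(R const& r)
    {
        boost::hash2::hash_append_range(h_, {}, std::data(r), std::data(r) + std::size(r));
    }

    typename Hash::result_type result() { return h_.result(); }
};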
More Advanced Avoidance of Hashing Size
Currently the library smartly avoids hashing the size for types whose representation size is fixed, e.g. int, std::array<char, 10>, etc., but it encodes the size for types with a variable representation size, e.g. std::string_view, std::vector<float>.
This is correct in the general case: to know object boundaries in the serialization stream we can afford at most one object of variable size, but we do not know what objects future calls to append may hash, so the only safe general solution is to always hash the size.
But there is a specific case where we can avoid hashing the size of a variable-size object: when we know it is the only object of variable representation size that we append between construction and destruction of the hasher.
This happens, for example, when we hash the std::string key in an unordered_map. We do not need to encode its size, since it is the only object used to produce the hash.
Here the library could provide a helper that understands this case and, as a bonus, as with the previous wrappers, prevents wrong use (i.e. result() is called by the wrapper, not the user).
This could be extended to more complex types, e.g. std::pair<int, std::string>, since here we only have one type with variable representation size, but I presume the most common case is when the key is a simple type. Experiments with the performance gains in this regard can be seen in the Performance section. In any case this is not a big issue, because users can write this kind of wrapper for their use case relatively easily, assuming they avoid some mistakes such as not appending anything in the case of an empty string.
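A sketch of such a helper for the single std::string key case, assuming hash_append_range and fnv1a_32 as used elsewhere in this review; the dummy byte roughly follows the approach mentioned later in the Performance section and also handles the empty-string pitfall:

#include <boost/hash2/hash_append.hpp>
#include <boost/hash2/fnv1a.hpp>
#include <string_view>

auto hash_single_key(std::string_view key)
{
    boost::hash2::fnv1a_32 h;

    // Always feed one dummy byte so that an empty key still hashes something.
    unsigned char dummy = 0;
    boost::hash2::hash_append(h, {}, dummy);

    // No size is appended; this is only safe because nothing else is
    // hashed with this hasher before result() is taken.
    boost::hash2::hash_append_range(h, {}, key.data(), key.data() + key.size());

    return h.result();
}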
Cryptographic Guarantees
This is the biggest concern I have about the library. As I am writing this there is a discussion on the mailing list suggesting that hashing unordered ranges degrades the cryptographic properties of the underlying hash function. I know it is unlikely that this will matter for most users, but as far as I understand this is currently not mentioned in the documentation or in the source code. I think it should be explicitly mentioned in both places.
Just to make this point clear: the user does not even have to explicitly call this function, due to the dispatch rules of hash_append. This may sound too nitpicky, but I really believe this is a serious issue; for an example of how a simple decision can have a big impact you can check this $160 million github issue.
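To illustrate what I mean by implicit dispatch (a sketch, assuming I am reading the dispatch rules correctly): nothing below mentions unordered ranges, yet hash_append on a std::unordered_set goes through the unordered-range handling whose combining step the mailing list discussion is about.

#include <boost/hash2/hash_append.hpp>
#include <boost/hash2/sha2.hpp>
#include <unordered_set>

auto digest_of(std::unordered_set<int> const& s)
{
    boost::hash2::sha2_256 h;
    boost::hash2::hash_append(h, {}, s); // dispatches to the unordered-range path
    return h.result();
}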
Performance
I have tried to benchmark hashing performance in a few scenarios. All done with Clang 19.1 on Linux with libstdc++.
Describe Overhead
I was curious whether the performance of a described class (a class that uses Boost.Describe to achieve reflection) is the same as when I manually call hash_append for each member. This was initially not the case, but the author told me it was probably due to Clang not inlining; when I increased Clang's inlining "budget" the performance difference disappeared, so this is good proof that Describe in itself is zero overhead.
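For context, this is roughly the shape of code I compared (a simplified sketch, not my actual benchmark); as I understand it, hash_append picks up the BOOST_DESCRIBE_STRUCT metadata on its own, while the manual variant would spell out hash_append for each member inside a tag_invoke overload.

#include <boost/describe/class.hpp>
#include <boost/hash2/hash_append.hpp>
#include <boost/hash2/fnv1a.hpp>

struct point
{
    int x;
    int y;
};

BOOST_DESCRIBE_STRUCT(point, (), (x, y))

auto hash_point(point const& p)
{
    boost::hash2::fnv1a_32 h;
    boost::hash2::hash_append(h, {}, p); // members enumerated via Boost.Describe
    return h.result();
}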
FNV1a for Short String
I was curious whether not encoding the size would speed up hashing in cases where we hash just one value and then call result, e.g. what happens in insert/find of a hash map. I focused on the FNV1a variants since they were the fastest.
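The pattern being measured, sketched with fnv1a_32 (not the actual benchmark harness): hash a single string key, then take the result, as a hash map would do on every insert/find.

#include <boost/hash2/hash_append.hpp>
#include <boost/hash2/fnv1a.hpp>
#include <string_view>

auto hash_key(std::string_view key)
{
    boost::hash2::fnv1a_32 h;
    boost::hash2::hash_append(h, {}, key); // appends the characters, then the size
    return h.result();
}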
In some cases hashing was so fast that I had to check with an online tool that the results were actually correct, because Clang unrolled the loops so aggressively. Not much to say here except that this confirms that, for this case, the library abstractions do not use any techniques that would make the Clang optimizer fail.
But back to the general goal of determining whether encoding the size matters or not…
I tried it with a collection of short strings; a performance difference exists, but it is a bit complicated.
For std::string the performance difference was almost 2x, because Clang gave up on inlining in the case where we were hashing the size.
For boost_static_string the performance difference was noticeable, but tiny, obviously depending on the length of the string.
For very short boost_static_strings hashing without the size was 15% faster, but as the size increased the difference stayed the same in absolute time, so the relative performance difference kept dropping.
To keep this short I will not repeat the details here; if you want to know more you can search for "What is 7 bytes among friends"
in the mailing list (to satisfy the need to hash something, we always hash a dummy byte if we do not hash the size).
While this is a micro-optimization, I believe there is design space to exploit it, as discussed in detail in the Design part of this document.
Comparing to SIMD SHA256
I am not a Rust expert, so I thought it would be a nice exercise to check how fast/slow a simple program written in Rust is, using what I quickly googled to be the default SHA2 implementation (the point was to simulate quick research and coding instead of a long evaluation of alternatives). As Hash2 does not use SIMD I expected Rust to be multiple times faster, and that turned out to be the case.
On my 13700H machine this program runs 6x faster than the Boost.Hash2 equivalent.
use sha2::{Digest, Sha256}; // imports added; they were missing from the extraction
use std::time::Instant;

fn main() {
    const NUM_INTS: usize = 42_000_000;
    const VALUE: u32 = 42;
    let data: Vec<u32> = vec![VALUE; NUM_INTS];
    let byte_data = unsafe {
        std::slice::from_raw_parts(data.as_ptr() as *const u8,
                                   data.len() * std::mem::size_of::<u32>())
    };
    let start = Instant::now();
    let mut hasher = Sha256::new();
    hasher.update(byte_data);
    let hash_result = hasher.finalize();
    let duration = start.elapsed();
    println!("{} microseconds", duration.as_micros()); // was as_millis(), label said microseconds
    println!("{:?}", hash_result.as_slice());
}
I picked SHA256 since I know it benefits from SIMD instructions, did not check other algorithms.
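For completeness, the C++ side looked roughly like this (a sketch, not the exact benchmark; I am assuming the sha2_256 name and header as used in the Hash2 documentation):

#include <boost/hash2/sha2.hpp>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<std::uint32_t> data(42'000'000, 42u);

    auto start = std::chrono::steady_clock::now();

    boost::hash2::sha2_256 h;
    h.update(data.data(), data.size() * sizeof(std::uint32_t)); // one big update
    auto digest = h.result();
    (void)digest;

    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::steady_clock::now() - start);
    std::printf("%lld microseconds\n", static_cast<long long>(us.count()));
}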
Documentation
I find the documentation thorough and very well written, but hard to read for non-experts, partly because the problem solved is non-trivial (fundamental differences from std::hash) and partly because the library does not expose easy-to-use wrappers. A few examples follow.
What is the relationship of this library with Boost.ContainerHash? If I am a C++ beginner, how do I decide which one to use for my use case? I know that in general it is not nice to claim one library is better than another, but the primary author of Hash2 is the maintainer of Boost.ContainerHash and Hash2 depends on it, so I think it will not be a problem.
The biggest pitfall users need to be aware of is that passing ranges that do not have a constant size will encode their size. This is obvious to me in retrospect, but I never had to think about it in the old world of std::hash, because std::hash does not know how to hash more than one value.
In particular I believe users need to be told explicitly about the following:
hash_append and hash_append_range are not just the same logic with different arguments, in the way that std::ranges::find(v.begin(), v.end(), 74656) and std::ranges::find(v, 74656) are.
They encode different information; as an example, the documentation could explain that if you want to hash chunks of a byte stream using std::span<char> you can not call hash_append, because it will encode the size of each chunk, you must call hash_append_range.
Currently the above is possible to understand if you read the documentation carefully, but I do not feel the average C++ developer will be able to read the hash_append dispatch rules (which make sense, but are very long) and figure out what is happening when they pass a std::span, so I think we need a few more // WRONG! code snippets in the documentation and a mention of this in the introduction.
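The kind of // WRONG! snippet I have in mind (a sketch, assuming C++20 std::span):

#include <boost/hash2/hash_append.hpp>
#include <boost/hash2/fnv1a.hpp>
#include <cstddef>
#include <span>
#include <string>

auto wrong_chunked_hash(std::string const& s)
{
    boost::hash2::fnv1a_32 h;
    std::size_t half = s.size() / 2;
    // WRONG! each span is a variable-size range, so its size is appended too;
    // the digest now depends on how the byte stream happened to be chunked.
    boost::hash2::hash_append(h, {}, std::span<char const>(s.data(), half));
    boost::hash2::hash_append(h, {}, std::span<char const>(s.data() + half, s.size() - half));
    return h.result();
}

auto right_chunked_hash(std::string const& s)
{
    boost::hash2::fnv1a_32 h;
    std::size_t half = s.size() / 2;
    // OK: hash_append_range appends only the elements, so any chunking of the
    // same bytes gives the same digest.
    boost::hash2::hash_append_range(h, {}, s.data(), s.data() + half);
    boost::hash2::hash_append_range(h, {}, s.data() + half, s.data() + s.size());
    return h.result();
}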
I do think it is great to link the proposal in the documentation, but I would move it out of the introduction, since I believe reading proposals is too difficult for most users.
Regarding the unordered_map hash example: I understand that hash maps are not the focus of the documentation, but I presume many users will be desperate to get a performant transparent comparator for stringy types, so I think it would be good to add an example with a transparent hasher and comparator that handles std::string, std::string_view, char* in a performant way. For an example of how difficult this is to do in C++20 you can see the reddit example (make sure to read about ABI in the comments).
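A sketch of the kind of example I would like to see, assuming fnv1a_64 and hash_append_range; the hasher name is mine, and the approach hashes only the characters so that std::string, std::string_view and char const* all agree:

#include <boost/hash2/hash_append.hpp>
#include <boost/hash2/fnv1a.hpp>
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <unordered_map>

struct transparent_string_hash
{
    using is_transparent = void;

    std::size_t operator()(std::string_view s) const
    {
        boost::hash2::fnv1a_64 h;
        // hash only the characters; char const* and std::string convert
        // to std::string_view, so all three stringy types hash identically
        boost::hash2::hash_append_range(h, {}, s.data(), s.data() + s.size());
        return static_cast<std::size_t>(h.result());
    }
};

// heterogeneous lookup: map.find("key") works without constructing a std::string
using string_map = std::unordered_map<std::string, int,
                                      transparent_string_hash, std::equal_to<>>;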
Regarding the other examples, I feel that using non-RAII fopen/fclose and char* will scare away some users. I understand the example is about Hash2, not about modern C++ or Boost.Scope, but I feel this is one more argument that wrapper helpers would be nice, e.g. we could wrap our char array and the read size in a std::span.
The example with Mp11 is amazing because it replaces the switch needed in a naive implementation, but here again I wonder if most users will be confused and not understand it.
A potential fix would be to use simple switch code in the documentation and link to more advanced examples in the examples folder (assuming the documentation has a way of detecting dead links, so that this does not become a liability).
Performance
I believe it should be mentioned that the algorithms are not implemented using SIMD instructions. I fully understand this is not the task of a hashing framework, but I still believe users will be unhappy if they find out after using the library for months that their SHA256 is 5x slower than it could be. For the average user, the explanation that Hash2 is a hashing framework, not a collection of SIMD algorithms for different architectures, will probably not mean much. They will just remember that Boost is slow, potentially sharing that opinion with coworkers and/or on social media. So I think mentioning in the docs that no SIMD/AVX2 implementations are used is very important.
General Notes About Review Process
I personally found the review process quite interesting, although it can be a huge time sink if you follow a few rabbit holes. I know it is not the primary purpose of the review but I can say I learned a lot in a few days.
I noticed that very few comments about the documentation were about how approachable it is to beginners (or even to the average C++ developer). I would suggest that Boost consider adding this question to the list of questions asked during the review process.
I have benefited tremendously in my evaluation from Samuel's response. I noticed that he did not post very often on the mailing list, so I presume it was just luck or somebody pinged him as a domain expert. I wonder if this should be encouraged when announcing a review, since Boost has plenty of C++ experts, but probably not so many domain experts for every field in which Boost is used. This would not mean the Review Manager mailing random people he found on the internet, more of an encouragement for mailing list members to reach out to people in their circle they know might be knowledgeable about the domain, and to explain that they do not need to review the entire library; e.g. a mathematician can offer immensely valuable help reviewing some part of a hypothetical graph library even if his knowledge of C++ is basic.
I think we should update the policy page to not mention modems :)
tl;dr
What is your evaluation of the design?
I think the C++ design of the library is great, although the average C++ developer would benefit from an easier to use, higher level interface. I can not judge the cryptographic design.
What is your evaluation of the implementation?
I did not spend a lot of time looking into it; everything I saw looks fine from a C++ perspective. I can not judge its cryptographic correctness, so I defensively assume it is not suitable for cryptographic purposes. This may sound unfair, because it is, but as I have no skills to assess the library in this regard I must assume the worst.
What is your evaluation of the documentation?
Detailed, well written, but not for beginners (too many hard concepts like type traits, too few examples of what not to do).
What is your evaluation of the potential usefulness of the library? Do you already use it in industry?
I think the library is currently useful for algorithms that do not benefit from SIMD implementations. But I feel that in most companies using it would require wrapping it in higher-level wrappers that make misuse harder. I have not used it in production; I just built a few simple benchmarks and simple wrappers with it. I would never use it, or suggest using it, for something that interacts with the "outside world", since I am quite careful and I think any potential issues are much more likely here than in something popular like OpenSSL, not because OpenSSL is so great, but because it is much more popular/scrutinized.
Did you try to use the library? With which compiler(s)? Did you have any problems?
I used Clang 19.1, C++23.
The problems were just me wrongly assuming how spans of chars are "concatenated" when hashed, that doubles are trivially hashable, etc.; no problems related to the library itself.
How much effort did you put into your evaluation?
I spent a few days playing around with the library, but a lot of this time was spent reading up on the domain, figuring out how to enable clang-tidy warnings, etc., not the classic design or code review a domain expert would do.
Are you knowledgeable about the problem domain?
I know a fair bit about C++, the same can not be said about cryptography.
My evaluation is
CONDITIONAL ACCEPT with the following criteria:
- More beginner friendly documentation; no need to turn it into a C++ tutorial, but a few specific common pitfalls need to be addressed:
  - Hashing "ab", "c" and expecting it to be the same as hashing "a", "bc"
  - Hashing a char* when the user wanted to hash the string
  - Expecting hash_append and hash_append_range to produce the same result
- A beginner friendly comparison with Boost.ContainerHash to help beginners navigate the libraries. If comparing libraries is against Boost policy, my apologies for this suggestion.
- An explicit mention in the documentation that hash_append does not preserve the cryptographic properties of the used algorithm under certain conditions (this assumes the current implementation of hashing for unordered ranges).
- An explicit mention in the documentation that the algorithms are currently not implemented using SIMD.