From: Julien Blanc (julien.blanc_at_[hidden])
Date: 2024-12-13 16:01:06
This is my review of the proposed boost::hash2 library.
# What is your evaluation of the design?
The base idea, separating the hash algorithm from the actual hashing
of objects, is a brilliant one.
The library consists of two parts:
* a collection of hashing algorithms, some of them of cryptographic quality
* a C++ object hashing facility
Given this design, I would not have expected the library to provide a
thorough collection of hash algorithms, but it's a good thing to get
them. I would have expected, however, to see some guidance about how
to integrate an external hashing library. It is unfortunate that this
part is missing, because it seems a lot harder than it should be.
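To make this concrete, here is roughly the kind of guidance I was hoping for:
adapting an external implementation (OpenSSL's SHA-256, purely as an example)
to an update()/result() style interface. This is only a hedged sketch of the
easy part; the additional HashAlgorithm requirements discussed below (default
and seeded construction, copyability, result extension) are where the real
integration effort lies.

#include <openssl/sha.h> // SHA256_* (deprecated in OpenSSL 3 in favour of EVP)
#include <array>
#include <cstddef>

// Sketch only: wraps the OpenSSL context behind update()/result().
class openssl_sha256
{
    SHA256_CTX ctx_;

public:
    using result_type = std::array<unsigned char, SHA256_DIGEST_LENGTH>;

    openssl_sha256() { SHA256_Init( &ctx_ ); }

    void update( void const* data, std::size_t n )
    {
        SHA256_Update( &ctx_, data, n );
    }

    result_type result()
    {
        result_type digest{};
        SHA256_Final( digest.data(), &ctx_ );
        return digest;
    }
};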
Several design decisions could be improved in my opinion:
* the requirements on the HashAlgorithm concept are too strict, and have
no rationale in the documentation
* copy constructibility and copy assignment are unusual across current
hash implementations, as has been noted by other reviewers. I can
understand a rationale for copy construction; copy assignment looks very
weird to me. The only rationale I can think of is resetting the internal
state to a defaulted state, which could be a dedicated operation instead.
* creation with a seed (either as a uint64 or as an array), as well as
default construction, is required. This seems odd. The documentation, for
example, strongly advises against using siphash without a seed. I would
prefer it to be a compile-time error to use siphash without a seed or with
an incorrectly sized seed. I have no problem with adding a key/seed to
algorithms that do not mandate one, though.
But, since the library itself does not create the HashAlgorithm objects
(there is one notable exception to that), it should not put additional
requirements beyond what it uses. The HashAlgorithm concept could be:
struct HashAlgorithm
{
    using result_type = /*integral or array-like*/;
    static constexpr int block_size = /*...*/; // optional

    void update( void const* data, std::size_t n );
    result_type result();
};
Interestingly enough, most of the object hashing machinery works if you
provide it with such an algorithm (not copyable and not default
constructible). I failed to see where default construction is used outside
the tests, for example.
* the usage of void const* + size in the interface is controversial, as
several people would prefer a span<>. There are good reasons for not using
span, and this should not be an issue, as it should only affect
HashAlgorithm writers. It is, in my opinion, an unfortunate design
choice that this internal interface is leaked to the end user. The
functionality is already provided by hash_append_range, which could
be the advertised way of hashing bytes via the library (i.e., when mixing
hashing of objects and hashing of bytes, e.g. when implementing custom
hashing for a type).
* constexpr use mandates unsigned char const*, so it could be the primary
supported interface over void const*, to avoid interface duplication.
* the requirement on result extension is also uncommon. While I understand
the rationale, it should be noted that it is likely to cause language
interoperability issues, as result extension is not necessarily specified
by the algorithm. Implementations in other languages are unlikely to
provide it, making the feature a niche one unless the underlying algorithm
specification provides it.
* most of the functionality is provided by the hash_append function, and the
hash_append_xxx helpers. This is a neat design.
* The Flavor template argument allows customizing the hashing of objects, to
get platform-independent hashes. This is a good idea, but it comes with
some problems:
* the flavor is repeated in every hash_append call. I don't think it is
a realistic use case for the flavor to change over time. If I were to
integrate the library in our code base, the first thing I would do is
write a wrapper to get rid of that flavor argument at every call site
(see the wrapper sketch near the end of this review).
* the default flavor uses std::uint64_t for sizes. This seems odd: if I'm
not asking for a specific flavor, I'm not asking for the hash to be
consistent across different systems, so std::size_t should be used
there instead. This looked to me like a pay-for-what-you-don't-use
situation, so I ran a quick near worst-case benchmark on our ARM
platform (hashing a std::vector of the three integers 1-2-3 a million
times), and found that the code using size_t runs about 15% faster.
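For reference, the size_t-based flavor used in the benchmark looked roughly
like this (a sketch, assuming, as I read the documentation, that a flavor
only needs to expose a byte_order constant and a size_type alias, as
default_flavor does; the header path is an assumption):

#include <boost/hash2/flavor.hpp> // for boost::hash2::endian (assumed path)
#include <cstddef>

struct native_size_flavor
{
    static constexpr auto byte_order = boost::hash2::endian::native;
    using size_type = std::size_t; // instead of std::uint64_t in default_flavor
};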
# What is your evaluation of the implementation?
The implementation looks good. Code is clean, very readable.
A minor remark on hash_append_unordered_range: m should be declared as
std::size_t (or std::uint64_t). The current declaration,
typename std::iterator_traits<It>::difference_type m = 0;, is wrong.
I'm not at ease with hash_append_unordered_range. The fact that it reduces
each element's hash to 64 bits, simply sums those, and then hashes the sum
does not look right. If the goal is simply to avoid unintended collisions,
it's probably fine. If the goal is to protect against an attacker deliberately
trying to provoke collisions, it may do the trick, although intuitively xoring
the full results would be better at that job than summing truncated results.
If the goal is to preserve the cryptographic properties of the hash function,
then it looks wrong.
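For clarity, this is the scheme as I understand it, boiled down to a sketch
(assuming, for simplicity, an integral 64-bit result_type; this is not the
library's actual code):

#include <cstdint>

template<class Hash, class Flavor, class It>
void unordered_sketch( Hash& h, Flavor const& f, It first, It last )
{
    std::uint64_t sum = 0;

    for( ; first != last; ++first )
    {
        Hash h2( h );                 // per-element copy of the current state
        hash_append( h2, f, *first ); // hash the element in isolation
        sum += h2.result();           // order-independent accumulation
    }

    hash_append( h, f, sum );         // fold the sum back into the main state
}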
Another remark is about size serialization. The library relies on
std::tuple_size to identify container-like structs, and does not include the
size in the hash for those (or is it because of is_contiguously_hashable?).
This leads to unintuitive results, such as:
std::array<int, 4> a{1,2,3,4};
std::span<int, 4> s(a);
Hash h1,h2,h3,h4;
hash_append(h1, {}, a);
hash_append_range(h2, {}, begin(a), end(a));
assert(h1.result() == h2.result()); // TRUE
hash_append(h3, {}, s);
hash_append_range(h4, {}, begin(s), end(s));
assert(h3.result() == h4.result()); // FAILS
I'm not sure there's a general solution to this issue, though. But since
std::array models a container, not including its size in the digest may not
be the wisest choice (the documentation talks about plain arrays, not about
std::array, so I don't know whether it's a bug or a feature).
# What is your evaluation of the documentation?
The order of the documentation is odd. I would expect it to first explain
how to use the library with the provided algorithms to hash C++ data or
bytes, but it first goes into deep details, such as how result() can be
called multiple times, or what requirements are set on custom hash
algorithms.
A description of the scope of the library would be welcome. It provides both
hashing for use by containers and cryptographic-quality hashing. These two
seem unrelated at first, and the library lies in a not clearly defined
in-between. This has been explained during the review, and the arguments
given are convincing; they should find their place in the documentation.
The motivation for machine-architecture-independent hashes is also lacking.
Usually, the way to achieve this is to use a domain-specific canonical form,
because endianness or word size is unlikely to be the only variable at stake
here.
But overall, the documentation has all the relevant information. A tutorial /
quick start would be a nice addition.
I dislike the "single page style" documentation, and hope the tooling can be
improved in this regard.
# What is your evaluation of the potential usefulness of the library?
Do you already use it in industry?
I have a need for an easy-to-use SHA-1 that I can easily compile for embedded
targets. Having it in Boost actually lowers the number of external
dependencies, so yes, this is definitely useful.
I currently don't have a need for the object hashing facility, but would
definitely use the library for it if the need arises.
# Did you try to use the library? With which compiler(s)? Did
you have any problems?
I tried it with several flavors of gcc, cross-compiling for bare-metal
Cortex-M0 targets and embedded ARM32 Linux, as well as standard Debian x64.
Everything worked.
# How much effort did you put into your evaluation?
A glance? A quick reading? In-depth study?
I did not review the algorithm implementations. I spent around 15 hours
studying the library documentation and internals.
# Are you knowledgeable about the problem domain?
No specific knowledge of the domain.
# What is your final recommendation?
I recommend that the library be CONDITIONALLY ACCEPTED into Boost, the
conditions being:
* demonstrate the usage of an existing external hash library within the
boost::hash2 framework
* remove default construction and too-short key construction for algorithms
where this is actually harmful (such as siphash). I don't have an issue
with providing constructors with keys for algorithms that do not mandate one,
but Boost code should encourage correct usage of the algorithms it proposes,
not allow bad usage in the name of genericity.
The following improvements should, imho, be made, but I don't base my
conditional acceptance on them. Some of them can actually be implemented on
top of the proposed library:
* advertise hash_append_range as the recommended way to add untyped data to
the hash algorithm (or define an additional hash_append_bytes to fulfill
this role).
* add a tutorial / introduction to the documentation, and move the whole
'hashing byte sequences' section after the 'hashing C++ objects' section
(if the previous point is addressed, part of it would have to be included
in the 'hashing C++ objects' section).
* make copy construction a requirement on the hash algorithm only when
hash_append_unordered_range is used. Hashing unordered ranges is an
open problem, other creative solutions may be found, and I see no
reason why they would have to involve duplicating the algorithm state.
Actually, if the hash is default constructible, then copying is not
needed: you need either default or copy constructibility, not both.
* use std::size_t instead of std::uint64_t for the default_flavor, as this
is faster on some platforms.
* provide wrappers that avoid repeating the flavor over and over. It is much
more convenient for the user to write, once, using
MyHash = hash2::FlavoredHash<hash2::default_flavor> and then
MyHash::append(X) than to repeat the parameter at every call (a sketch
follows).
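Something along these lines is what I have in mind. flavored_hash below is not
part of the proposed library, just a hedged sketch of the wrapper idea; only
hash_append, fnv1a_64 and default_flavor in the usage comment are names from
the library, and the header path is an assumption.

#include <boost/hash2/hash_append.hpp> // assumed header for hash_append
#include <utility>

// Hypothetical wrapper binding an algorithm and a flavor once, so call sites
// don't repeat the flavor argument.
template<class Hash, class Flavor>
class flavored_hash
{
    Hash h_;
    Flavor f_;

public:
    using result_type = typename Hash::result_type;

    explicit flavored_hash( Hash h = Hash(), Flavor f = Flavor() )
        : h_( std::move( h ) ), f_( f ) {}

    template<class T> void append( T const& v )
    {
        boost::hash2::hash_append( h_, f_, v );
    }

    result_type result() { return h_.result(); }
};

// usage:
// using MyHash = flavored_hash<boost::hash2::fnv1a_64, boost::hash2::default_flavor>;
// MyHash h; h.append( x ); auto r = h.result();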
Finally, I want to thank Peter for developing and proposing this library
for Boost, and Matt for managing the review.
Regards,
Julien