Boost logo

Boost :

From: Peter Dimov (pdimov_at_[hidden])
Date: 2024-12-08 11:42:50


Ion Gaztañaga wrote:
> Hi,
>
> This is my review of this library

Thank you for the review. That's a lot of questions. :-)

> - "HashAlgorithm::block_size" is an int, IMHO it should std::size_t.

Maybe. The type doesn't matter that much for integral constants,
but I suppose size_t is more correct.

> - No move constructor/assigment for algorithms. I don't know if those make
> any sense, but I imagine user implementations with those members are
> perfectly valid.

Hash algorithms should be allowed to provide a move constructor or
a move assignment operator, yes.

> - The semantics for copy constructor/assignment might not be clear. The
> library internally uses the copy constructor (e.g. when implementing the
> unordered range hash), so it certainly expects something more concrete than
> "equivalent to the original".

"Equivalent to the original" is the strongest possible requirement. I don't
see how the library can possibly expect anything more than that. In particular,

> I imagine a copy/assigned algorithm is required to
> lead to the same output when updated with the same data.

... this is implied by "equivalent to the original", because if the copy produced
a different output, it wouldn't be equivalent.

> - Should the HashAlgorithm support calls to "update" and "result"
> several times, interleaving both calls?

It should, yes. Any calls are allowed, in any order.

> - I don't know if SHA-1/RIPEMD-128 should be in the library at all (insecure
> and not sure if useful in any real context).

SHA-1 is still widely used, e.g. for Git commit hashes. We added RIPEMD-128
in order to have a recommended replacement for MD5 in the case where
the digest has to be exactly 128 bits.

> - The original proposal mentions a "variadic version of hash_append",
> (https://www.open-
> std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#variadic)
> maybe that would be a useful utility in the library.

It might be, yes.

> - The original proposal uses a different syntax:
>
> void
> operator()(void const* key, std::size_t len) noexcept
> {
> state_.Update(key, len);
> }
>
> explicit
> operator result_type() noexcept
> {
> std::uint64_t h1, h2;
> state_.Final(&h1, &h2);
> return h1;
> }
>
> Which are the advantages of the new approach (update/result) vs the original
> proposal?

It seems obvious to me that named functions are better than unnamed
functions.

> - The hash_append_unordered_range function is surprising for people like me
> that are not experts on hashing, but it's an important concept for users. The
> original proposal explains the problem (https://www.open-
> std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#unordered),
> I feel the problem should be also explained in the documentation of Hash2,
> and how the library design solves it (using a copy of the HashAlgorithm, as
> proposed in n3980).
>
> N3980 also says that "And there are various other schemes. They are all
> implementable. But they each have their advantages and disadvantages.
> Therefore this proposal proposes none of them. Should the future expose the
> ideal hash_append specification for unordered containers, it can always be
> added at that time."
>
> Why does Hash2 offer a single alternative?

I don't know of a better one.

Constructing a temporary std::set is not a serious alternative. In addition
to being allocating, slow, and potentially throwing, it also relies on the
existence of operator<, which an unordered container doesn't require.

> -N3980 proposes a "type-erase hasher"
> (https://www.open-
> std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#type_erased_hasher
> ),
> would it be interesting to include it in Hash2?

I had the type erased hasher in the "planned additions" list for a long time,
partly because I thought I needed it for the hash2sum example. But it doesn't
meet the hash algorithm requirements, because it doesn't have the correct
constructors and because the copy constructor is allocating and potentially
throwing (the latter is not, strictly speaking, a violation but it's still
undesirable.)

> Is tag_invoke superior because we can declare customization points for classes
> that we can't modify?

Yes.

> But we need access to data members so we need priviledge access to that type...

Not always. E.g. std::complex or std::error_code provide accessors for
the necessary members.

> - The documentation shows (in the chapter "Use with Unordered
> Containers") several "hash" classes that are needed to be able to use
> Hash2 with std/boost unordered_xx classes (or
> boost::intrusive::unordered_xxx types). I think it would be a good idea to
> include such utilities in this Hash2 library (under names like "hasher",
> "seeded_hasher".

One of these, or maybe some different one, will eventually be included,
once I decide on which.

> - The design says "The size_type member type of the flavor affects how
> container and range sizes (typically of type size_t) are serialized.
> Since the size of size_t in bytes can vary, serializing the type directly results in
> different hash values when the code is compiled for
> 64 bit or for 32 bit. Using a fixed width type avoids this".
>
> I don't agree with this, between 32 and 64 bit compilation you will have
> differences when hashing user-defined types that have pointers, stored
> std::size_t/ptrdiff_t members, etc..

There are still a lot of types for which the size type is the only source of
variability, e.g. std::vector<int>.

> IMHO xxx_flavours defined in the library
> should define size_type as "std::size_t". Those wanting portable hashes
> between 32 and 64 bit need to do a bigger effort than configuring Hash2, and
> probably defining size_type as std::uint64_t will reduce the performance of 32
> bit users...

It doesn't.

I initially had size_type defined as uint32_t for similar reasons, but it doesn't
really matter, so I changed it to uint64_t.

> - One thing I don't like is that the library depends on container_hash/describe
> and mp11. I would expect this library to be dependency-free (other than
> Config/Assert). Specially I dislike the "Describe" dependency. Is there a way to
> support Described classes without depending on that library?

No.

> - There is an asymmetry between "writeNNle" and "writeNNbe"
> implementations (little vs big endian). In little endian implementations
> "detail::is_constant_evaluated() and endianess" is used to be able to use
> memcpy in some cases
> (https://github.com/pdimov/hash2/blob/develop/include/boost/hash2/detail
> /write.hpp#L26)
> where in the big endian implementation this is not performed
> (https://github.com/pdimov/hash2/blob/develop/include/boost/hash2/detail
> /write.hpp#L73).
>
> Probably few people use big endian platforms (except those like me that still
> need to compile for powerpc targets...), but I would expect an equivalent
> implementation for big-endian functions.

The implementations are equivalent in practice. GCC and Clang recognize
the pattern used and turn it into a single mov; the memcpy path is only
necessary for MSVC, which doesn't support big endian architectures.

> - class Str example: N is missing the type, missing <algorithm> include,
> operator== has the wrong type...

Sorry, why does operator== have the wrong type?

> - All "Use with Unordered Containers" examples should be compilable,
> including the needed headers.

The examples are actually compilable, I just stripped the #includes when
embedding them in the documentation to avoid the unnecessary repetition.
Maybe I shouldn't have. :-)


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk