
From: Ion Gaztañaga (igaztanaga_at_[hidden])
Date: 2024-12-08 19:39:49


El 08/12/2024 a las 12:42, Peter Dimov via Boost escribió:
>
> ... this is implied by "equivalent to the original", because if the copy produced
> a different output, it wouldn't be equivalent.
>
>> - Should the HashAlgorithm support calls to "update" and "result"
>> several times, interleaving both calls?
>
> It should, yes. Any calls are allowed, in any order.

Understood.

>> - I don't know if SHA-1/RIPEMD-128 should be in the library at all (insecure
>> and not sure if useful in any real context).
>
> SHA-1 is still widely used, e.g. for Git commit hashes. We added RIPEMD-128
> in order to have a recommended replacement for MD5 in the case where
> the digest has to be exactly 128 bits.

Ok.

>> - The original proposal mentions a "variadic version of hash_append",
>> (https://www.open-
>> std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#variadic)
>> maybe that would be a useful utility in the library.
>
> It might be, yes.

Let's add it to the to-do list ;-)

>> - The hash_append_unordered_range function is surprising for people like me
>> that are not experts on hashing, but it's an important concept for users. The
>> original proposal explains the problem (https://www.open-
>> std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#unordered),
>> I feel the problem should be also explained in the documentation of Hash2,
>> and how the library design solves it (using a copy of the HashAlgorithm, as
>> proposed in n3980).
>>
>> N3980 also says that "And there are various other schemes. They are all
>> implementable. But they each have their advantages and disadvantages.
>> Therefore this proposal proposes none of them. Should the future expose the
>> ideal hash_append specification for unordered containers, it can always be
>> added at that time."
>>
>> Why does Hash2 offer a single alternative?
>
> I don't know of a better one.
>
> Constructing a temporary std::set is not a serious alternative. In addition
> to being allocating, slow, and potentially throwing, it also relies on the
> existence of operator<, which an unordered container doesn't require.

Ok. Yes, the other alternative described in the original paper is much
worse. I'm not sure how much the hash quality is weakened by the current
implementation: all element hashes are converted to 64 bits, added
together, and a final hash is computed over the sum. It seems to me that
combining and shuffling all of the original result_type bits from each
individual hash would provide a better hash for the whole sequence, but
I'm certainly not able to prove that mathematically...

>> - N3980 proposes a "type-erased hasher"
>> (https://www.open-
>> std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#type_erased_hasher
>> ),
>> would it be interesting to include it in Hash2?
>
> I had the type erased hasher in the "planned additions" list for a long time,
> partly because I thought I needed it for the hash2sum example. But it doesn't
> meet the hash algorithm requirements, because it doesn't have the correct
> constructors and because the copy constructor is allocating and potentially
> throwing (the latter is not, strictly speaking, a violation but it's still
> undesirable.)

Type erasure requires trade-offs, but for things like pimpl idioms you
can't template, or it's not desirable to template, code that previously
lived in a .cpp file. One option is for boost::hash2::type_erased_hasher
to have a small buffer optimization large enough (at least for the
hashes implemented in the boost::hash2 library) that allocation can be
avoided in most cases. A good type_erased_hasher is not trivial to
write, and certainly useful.

>> - The documentation shows (in the chapter "Use with Unordered
>> Containers") several "hash" classes that are needed to be able to use
>> Hash2 with std/boost unordered_xx classes (or
>> boost::intrusive::unordered_xxx types). I think it would be a good idea to
>> include such utilities in this Hash2 library (under names like "hasher",
>> "seeded_hasher").
>
> One of these, or maybe some different one, will eventually be included,
> once I decide on which.

Great.

>> - The design says "The size_type member type of the flavor affects how
>> container and range sizes (typically of type size_t) are serialized.
>> Since the size of size_t in bytes can vary, serializing the type directly results in
>> different hash values when the code is compiled for
>> 64 bit or for 32 bit. Using a fixed width type avoids this".
>>
>> I don't agree with this, between 32 and 64 bit compilation you will have
>> differences when hashing user-defined types that have pointers, stored
>> std::size_t/ptrdiff_t members, etc..
>
> There are still a lot of types for which the size type is the only source of
> variability, e.g. std::vector<int>.
>
>> IMHO xxx_flavours defined in the library
>> should define size_type as "std::size_t". Those wanting portable hashes
>> between 32 and 64 bit need to do a bigger effort than configuring Hash2, and
>> probably defining size_type as std::uint64_t will reduce the performance of 32
>> bit users...
>
> It doesn't.

Show me the numbers ;-) Perhaps this could be demonstrated by adding
benchmark results to the documentation.

>> - There is an asymmetry between "writeNNle" and "writeNNbe"
>> implementations (little vs big endian). In little endian implementations
>> "detail::is_constant_evaluated() and endianess" is used to be able to use
>> memcpy in some cases
>> (https://github.com/pdimov/hash2/blob/develop/include/boost/hash2/detail
>> /write.hpp#L26)
>> where in the big endian implementation this is not performed
>> (https://github.com/pdimov/hash2/blob/develop/include/boost/hash2/detail
>> /write.hpp#L73).
>>
>> Probably few people use big endian platforms (except those like me that still
>> need to compile for powerpc targets...), but I would expect an equivalent
>> implementation for big-endian functions.
>
> The implementations are equivalent in practice. GCC and Clang recognize
> the pattern used and turn it into a single mov; the memcpy path is only
> necessary for MSVC, which doesn't support big endian architectures.

Ok, understood.

>
>> - class Str example: N is missing the type, missing <algorithm> include,
>> operator== has the wrong type...
>
> Sorry, why does operator== have the wrong type?

Because operator== for Str compares X instances instead of Str instances:

class Str
{
private:

     static constexpr N = 32; // error: N is missing its type

     std::uint8_t size_ = 0;
     char data_[ N ] = {};

public:

     friend constexpr bool operator==( X const& x1, X const& x2 ) // should take Str const&
     {
         return x1.size_ == x2.size_ &&
             std::equal( x1.data_, x1.data_ + x1.size_, x2.data_ );
     }

>> - All "Use with Unordered Containers" examples should be compilable,
>> including the needed headers.
>
> The examples are actually compilable, I just stripped the #includes when
> embedding them in the documentation to avoid the unnecessary repetition.
> Maybe I shouldn't have. :-)

I've found (at least in my libraries) that some users like to copy-paste
examples from the docs and expect them to compile without having to
figure out what's missing.

Best,

Ion


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk