From: Alexander Grund (alexander.grund_at_[hidden])
Date: 2024-12-04 08:37:24


>> 2) I am hashing one stream of bytes, but I do not have them all at the
>> moment, so I am passing them to the hasher as they arrive (e.g. receiving a
>> long message over TCP, but hashing it as parts arrive to minimize the latency
>> of computing the hash after the entire message is received).
> The digest of a single binary blob should be calculated as if first
> submitting all the bytes to the hash function sequentially, and then
> submitting the size of the blob in bytes as std::size_t. I don't think this
> can be done through the type hashing interface; instead, one has to
> repeatedly call the function which takes a void pointer and a size and, at
> the end of that, hash a value of type `std::size_t`.
I think this is a valid concern as it might be a common use case.
IMO this should be considered in the interface and/or at least be
covered in the examples of the documentation.

I.e. what to do with code like this:

> span<byte> buffer;
>
> while(connection.readsome(buffer)) {
>   update_hash(buffer);
> }
> hash1 = hash_result();
>
> buffer = connection.readall();
> update_hash(buffer);
> hash2 = hash_result();
>
> assert(hash1 == hash2);


That is, the result should be independent of the sizes of the "partial"
buffers, which currently isn't the case because each call appends the size
of its own buffer. Keeping track of the total size at the call site might
also be error-prone. Maybe this could be done internally by providing an
interface that keeps the size as state.
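
A rough sketch of what such a stateful interface could look like, to make the
idea concrete. Everything here is hypothetical and not part of the proposed
library: `blob_hasher`, `fnv1a_64`, `hash_in_chunks`, and the member names
`update`/`result`/`finish` are made up for illustration. FNV-1a is only a
stand-in hash, and the total size is appended as a fixed 8-byte little-endian
value rather than as a `std::size_t`, which is my own assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Stand-in hash algorithm (64-bit FNV-1a), used only to keep the sketch
// self-contained. Any hash with an update(void const*, size_t) style
// function would do.
struct fnv1a_64 {
    std::uint64_t state = 14695981039346656037ull;

    void update(const void* data, std::size_t n) {
        const unsigned char* p = static_cast<const unsigned char*>(data);
        for (std::size_t i = 0; i < n; ++i) {
            state ^= p[i];
            state *= 1099511628211ull;
        }
    }

    std::uint64_t result() const { return state; }
};

// Hypothetical wrapper: forwards raw bytes to the hash and keeps the running
// total as state, so the call site does not have to track it.
template <class Hash>
class blob_hasher {
    Hash h_;
    std::uint64_t total_ = 0;

public:
    void update(const void* data, std::size_t n) {
        h_.update(data, n);  // bytes only, no per-call size
        total_ += n;
    }

    // Appends the total size exactly once, then returns the digest.
    // Meant to be called a single time per blob.
    auto finish() {
        unsigned char len[8];
        for (int i = 0; i < 8; ++i)
            len[i] = static_cast<unsigned char>(total_ >> (8 * i));
        h_.update(len, sizeof len);
        return h_.result();
    }
};

// Helper for experimenting: hashes msg in pieces of at most `chunk` bytes.
inline std::uint64_t hash_in_chunks(std::string_view msg, std::size_t chunk) {
    assert(chunk > 0);
    blob_hasher<fnv1a_64> h;
    for (std::size_t pos = 0; pos < msg.size(); pos += chunk) {
        std::string_view part = msg.substr(pos, chunk);
        h.update(part.data(), part.size());
    }
    return h.finish();
}
```

With this shape, `hash_in_chunks(msg, 1)` and `hash_in_chunks(msg, msg.size())`
feed the same byte sequence to the hash, so the `readsome` loop and the
`readall` call from the example above would give the same result.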

> On the other hand, it makes it trivial to generate collisions
>
> pair<string, string>( "foo", "bar" )
> pair<string, string>( "foob", "ar" )
> pair<string, string>( "fooba", "r" )
I can imagine an argument that the "collision" is intentional here, i.e.
that the `data` really is just "foobarfoobarfoobar".

So no matter which way is used in the end, it might be surprising to
some people.
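
To illustrate both sides of that trade-off, here is a small sketch that builds
the byte sequence a hash function would see for a pair of strings, with and
without a size appended after each string. The names `string_pair`, `feed`,
and `hashed_bytes` are hypothetical, and the 8-byte little-endian size
encoding is an assumption made only for this example.

```cpp
#include <cstdint>
#include <string>
#include <utility>

using string_pair = std::pair<std::string, std::string>;

// Appends the bytes of s to out, optionally followed by the length of s
// encoded as 8 little-endian bytes (an encoding assumed for this sketch).
inline void feed(std::string& out, const std::string& s, bool with_size) {
    out += s;
    if (with_size) {
        std::uint64_t n = s.size();
        for (int i = 0; i < 8; ++i)
            out.push_back(static_cast<char>((n >> (8 * i)) & 0xff));
    }
}

// Returns the exact byte sequence that would be submitted to the hash
// function for p. Equal sequences imply equal digests for any hash.
inline std::string hashed_bytes(const string_pair& p, bool with_size) {
    std::string out;
    feed(out, p.first, with_size);
    feed(out, p.second, with_size);
    return out;
}
```

Without sizes, `("foo", "bar")` and `("foob", "ar")` both produce the bytes
`foobar`, so every hash collides on them. With sizes, the sequences differ,
but then the digest of a blob depends on how it was split into strings.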



