Boost :
From: Peter Dimov (pdimov_at_[hidden])
Date: 2024-12-12 14:42:12
Samuel Neves wrote:
> > But I wonder if this preserves the cryptographic quality of the original hash.
>
> It doesn't, even if you preserve the length of the hash. There are two attack
> approaches here, depending on the size of the sets you can work with.
>
> On the low-data side, you can turn collision or second-preimage finding into a
> lattice problem. Suppose you have some set whose intermediate hashes are
> [a=13816517845079277968, b=2675547450723878905,
> c=18133324598626471763, d=10200581252073748329]. Build the lattice
> spanned by the rows
>
> [K*2^64, 0, 0, 0, 0]
> [K*a, 1, 0, 0, 0]
> [K*b, 0, 1, 0, 0]
> [K*c, 0, 0, 1, 0]
> [K*d, 0, 0, 0, 1]
>
> where K is a constant large enough to force the first coordinate of vectors to
> have more weight and sum to 0. Short vectors of this lattice will have the form
> [0, x, y, z, w], which means that x*a + y*b
> + z*c + w*d = 0 (mod 2^64), or in other words the set [a,b,c,d] and
> the set [[a]*(x+1), [b]*(y+1), [c]*(z+1), [d]*(w+1)] have the same hash.
>
> In the above case we can use K=2^32 and use LLL to obtain the vector [0,
> 2870, 27951, 8550, 46911] and thus a second-preimage for the above hash.
> Shorter second-preimages are possible by increasing the dimension. In
> particular, for 64-bit hashes it is possible to find lattice vectors containing
> coefficients only in {-1,0,1}, corresponding to removals and additions of single
> elements from the original set, at dimension ~48.
>
> Note that while finding shortest vectors in a lattice is a very hard problem as
> the dimension grows, here we don't really have to find exceedingly short
> vectors; we only need vectors whose coefficients are no smaller than -1
> (-1 being a removal, positive entries being additions).
> For collision-finding, where we control both inputs, we can start with many
> repetitions of the same element to make the problem easier and allow
> reasonably small negative coefficients as well. See also [2, Appendix B].
>
> For larger sets, the generalized birthday attacks from Wagner [1, Section 4]
> should also apply here with minimal modifications, reducing the security of
> an n-bit sum to O(2^{2*sqrt(n)}).
>
> As far as I can tell the only successful multiset hashes either require enormous
> moduli or use elliptic curves.
>
> [1] https://www.iacr.org/archive/crypto2002/24420288/24420288.pdf
> [2] https://arxiv.org/pdf/1601.06502
Thanks again for this informative post.
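As a quick sanity check, the coefficients in the lattice example above do work
out; here is a minimal C++ verification (the values are copied from the quoted
post, and unsigned arithmetic supplies the reduction mod 2^64):

    #include <cstdint>
    #include <cstdio>

    int main()
    {
        std::uint64_t a = 13816517845079277968ull;
        std::uint64_t b = 2675547450723878905ull;
        std::uint64_t c = 18133324598626471763ull;
        std::uint64_t d = 10200581252073748329ull;

        // adding 2870 extra copies of a, 27951 of b, 8550 of c and 46911 of d
        // leaves the 64-bit sum unchanged, because the weighted sum of the
        // extra copies wraps to zero
        std::uint64_t s = 2870u * a + 27951u * b + 8550u * c + 46911u * d;

        std::printf( "%llu\n", (unsigned long long)s ); // prints 0
    }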
It should probably be noted here that the current approach, inadequate as it is,
does include the size of the multiset in the hash. I think this makes it a bit
harder to attack.
I've come up with several possible approaches to "harden" the current algorithm,
and I'll use the opportunity to post them for discussion while we're all tuned to
the hash2 wavelength.
These are all independent; they can be applied separately from one another.
1. As suggested, extend the width of the accumulator from 64 bits to 256 bits,
by using uint64_t[4], with element-wise addition.
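A minimal sketch of what I mean, assuming each intermediate hash can be
widened to 256 bits (e.g. by taking more output from the per-element hash;
the widening itself is left open here):

    #include <array>
    #include <cstdint>

    struct accum256
    {
        std::array<std::uint64_t, 4> w{};

        // element-wise (per-lane) wrapping addition of a widened
        // 256-bit intermediate hash
        void add( std::array<std::uint64_t, 4> const& h )
        {
            for( int i = 0; i < 4; ++i ) w[ i ] += h[ i ];
        }
    };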
2. In addition to computing the sum of the hashes w = sum(hi), also compute
the sum of hi xor C, where C is a constant, e.g. 0x9e3779b97f4a7c15. Include it
in the final hash as well.
This can be extended to computing the sums of hi ^ C1, hi ^ C2, and so on.
It can also be generalized to computing the sum of f(hi), where f is some
arbitrary transformation function, preferably nonlinear. E.g. mulx(hi, C),
where mulx is the primitive created by performing a double width multiplication
(64x64 -> 128) and then xor-ing the high and low parts of the result.
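A sketch of the combined accumulator (C2 below is just an arbitrary odd
constant for illustration, and mulx relies on __int128, so GCC/Clang; MSVC
would need _umul128 instead):

    #include <cstdint>

    // double-width multiply (64x64 -> 128), then xor of the high and low halves
    inline std::uint64_t mulx( std::uint64_t x, std::uint64_t y )
    {
        unsigned __int128 r = (unsigned __int128)x * y;
        return (std::uint64_t)r ^ (std::uint64_t)( r >> 64 );
    }

    struct multiset_accum
    {
        static constexpr std::uint64_t C1 = 0x9e3779b97f4a7c15;
        static constexpr std::uint64_t C2 = 0xff51afd7ed558ccd; // illustrative

        std::uint64_t w = 0;   // sum of hi
        std::uint64_t w1 = 0;  // sum of hi ^ C1
        std::uint64_t w2 = 0;  // sum of mulx(hi, C2)

        void add( std::uint64_t hi )
        {
            w += hi;
            w1 += hi ^ C1;
            w2 += mulx( hi, C2 );
        }
    };

    // w, w1 and w2 (plus the multiset size) would all be included in the
    // final hash.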
3. In addition to computing the sum of the hashes, also compute a histogram
of their hex digits. That is, maintain an array of type uint64_t[16] and increment
its i-th element each time the hex digit i appears in an intermediate hash. Then
include this histogram in the final hash.
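A sketch of the histogram accumulation:

    #include <cstdint>

    struct digit_histogram
    {
        std::uint64_t counts[ 16 ] = {};

        // count the 16 hex digits of an intermediate hash
        void add( std::uint64_t hi )
        {
            for( int i = 0; i < 16; ++i )
            {
                ++counts[ hi & 0xF ];
                hi >>= 4;
            }
        }
    };

    // the 16 counters are then included in the final hash, alongside the
    // sum and the multiset size.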
4. Use a two-pass approach. On the first pass, compute the hash as is done today
and add it to the hash algorithm state, also as is done today. Then, on a second
pass, do the same calculation, but starting from the current state. That is,
effectively, compute the intermediate hashes using a hash function H' that
depends on the contents of the multiset.
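A sketch of the two-pass idea, written against a hypothetical hash algorithm
with default construction, copy construction, update() and result() (the
actual interface details may differ); the per-element hashing is passed in as
a callable, standing in for what hash_append does today:

    #include <cstddef>
    #include <cstdint>

    template<class Hash, class It, class F>
    void hash_unordered_two_pass( Hash& h, It first, It last, F hash_one )
    {
        // pass 1: sum the per-element hashes computed from a fresh state,
        // as is done today
        std::uint64_t w = 0;

        for( It it = first; it != last; ++it )
        {
            Hash h2;              // default, content-independent state
            hash_one( h2, *it );  // append one element to h2
            w += static_cast<std::uint64_t>( h2.result() );
        }

        h.update( &w, sizeof( w ) ); // fold the pass 1 sum into the state

        // pass 2: the same calculation, but each per-element hash now starts
        // from a copy of the main state, which depends on pass 1; this is the
        // multiset-dependent H'
        std::uint64_t w2 = 0;

        for( It it = first; it != last; ++it )
        {
            Hash h2( h );
            hash_one( h2, *it );
            w2 += static_cast<std::uint64_t>( h2.result() );
        }

        h.update( &w2, sizeof( w2 ) );
    }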
I can easily see that each of 1-4 makes attacks harder, but I lack the knowledge
to estimate how much harder. Maybe some of these are completely unnecessary
because they are trivial to work around. Maybe not.
Comments appreciated.