Boost logo

Boost :

From: Dave Harris (brangdon_at_[hidden])
Date: 2005-03-15 11:20:14


In-Reply-To: <d11prg$uvt$1_at_[hidden]>
daniel_at_[hidden] (Daniel James) wrote (abridged):
> > Firstly, I think we have some agreement that the signature:
> >
> > size_t hash_range( It first, It last );
> >
> > is wrong
>
> I didn't get the impression of any agreement about that.
>
> > and there should be some way to provide an initial seed.
>
> Although, I agree with this.

That's all I meant. I knew there wasn't much agreement over whether side
effects are bad :-)

My review was a lot more conservative than my comments in discussion.

> > We have some agreement that using reinterpret_cast would be better.
>
> Yes, and this will be fixed.

When I wrote it seemed there were still unresolved issues. In fact, I
don't believe anyone has yet posted an acceptable hash_value for pointers.
Presumably we want to shift out bits which are always 0 for alignment
reasons.

> It is proposed that the standard explicitly specifies the hash
> functions (and I've been criticized for not doing that) but part
> of the purpose of this library is to get real world experience
> on what the specification should be. And change it, if necessary.

OK.

> I do intend to spend more time looking into alternatives in the future
> (and will look into adding a constant to hash_combine very soon).

OK. Although I suggested it, I'm not sure it's the best fix, as compared
with using non-zero initial seeds.

    const size_t default_seed = 0x9e3779b9; // Or whatever.
    
    size_t hash_range( It first, It last, seed=default_seed );

A signature like that provides a hook to introduce variation between
similar types if the author desires, and also allows ranges to be hashed
as subranges with the output of one subrange feeding into the next. The
default value allows the issue to be safely ignored by naive users. So
it's a good idea anyway, and since it also fixes the problem with zeros
maybe we don't need to fix that twice.

This does leave the issue of other places where hash_combine is called. It
would be sufficient if we could be sure that all used a non-zero initial
seed, eg:

    size_t hash_value( std::pair<int,int> p ) {
        size_t hash = default_seed;
        combine_hash( hash, p.first );
        combine_hash( hash, p.second );
        return hash;
    }

and the documentation would encourage that, but we can't really be sure of
it. Another approach is to change all the primitive functions. For
example:

    size_t hash_value( int i ) {
        return static_cast<unsigned int>(i) + default_seed;
    }

Introducing the non-zero constant here is surely as efficient as in
combine_hash, if not more so. Admittedly it doesn't eliminate the zero, it
just moves it around, but it should massively reduce its dynamic
frequency. (Actually it will eliminate the zero where it matters most, ie
for types like bool, char and short which are smaller than size_t.)

> Did you see Peter's original proposal?

Yes, I had followed most of the links.

> > For example, the proposal has no specialisation for chars at all. I
> > can't tell if that is an oversight, or whether it is a design
> > aim [...]
> I don't think it's either. Peter?

Peter has now said it is a design aim. That even you weren't sure
illustrates my point :-)

> When hash_range is called with a seed argument, it would be equivalent
> to calling hash_combine for pair<int, int>.

If I understand the principles being advocated here, pair< pair<int,int>,
pair<int,int> > should have the same hash_value as int[4]. How do you
achieve that without giving hash_value( pair ) an initial seed?

-----
I find myself increasingly drawn to the idea of a hasher object. The main
signature should be like:

    template <typename T>
    Hasher &hash( Hasher &hasher, const T &t );

with implementations looking like, eg:

    template <typename First, typename Second>
    Hasher &hash( Hasher &hasher, const pair<First,Second> &p ) {
        hash( hasher, first );
        hash( hasher, second );
        return hash;
    }

Then hash_value, if we still need it, would be like:

    template <typename T>
    size_t hash_value( T &t ) {
        Hasher hasher;
        hash( hasher, t );
        hasher.finalise();
        return hasher.value();
    }

but would rarely be called directly. This solves the problem of the
default value, allows us to store intermediate results bigger than size_t
(which may be important for cryptographic-strength hashing), and the
finalise() step provides a hook for the final mixing/avalanching which
some algorithms use (eg Paul Hsieh's).

> Perhaps it would be better spelt as:
>
> template <class It>
> void hash_combine(std::size_t& seed, It begin, It end);

Yes, I considered that. I wasn't sure whether overloading the name would
help or hinder - hash_range and hash_combine are not substitutable for
each other.

> Well, we did discus the type of review before it started, and nobody
> raised any objections.

I don't object to quick reviews in principle (although if someone can only
contribute, eg, on Sunday mornings, a 5-day period is going to leave them
out completely).

> > I am not an expert on hashing, but I have implemented my own hashing
> > library. I have encountered practical problems due to different
> > objects having the same hash values.
>
> Is this in C++?

Yes. The main problem was zeros. My hash primitives produced zeros and my
hash_combine turned zeros into more zeros. Just as with the initial boost
proposal.

> I would have thought that type safety would have helped avoid
> such problems.

This was specifically an issue with optional or alternate types. For
example, I had something similar to:

    size_t MyType::hash_value() const {
        return use_a ? hash_value(a) : hash_value(b);
    }

where a and b could be different types. We can avoid that mistake in
library classes like boost:variant, but that doesn't prevent it recurring
in user code.

When I am dealing with types without much intrinsic variation, eg bools, I
found it important to exploit such variation as I did have as much as
possible. Hashing false and 0 to different values helped.

> Thank you for reviewing the library and bringing up several good points.

My pleasure.

-- Dave Harris, Nottingham, UK


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk