From: Peter Dimov (pdimov_at_[hidden])
Date: 2005-03-15 12:53:15


Dave Harris wrote:
> In-Reply-To: <002901c527e7$9f9fb620$6501a8c0_at_pdimov2>
> pdimov_at_[hidden] (Peter Dimov) wrote (abridged):
>> The problem is that hash functions vary wildly in quality. Leaving
>> the default function unspecified (or implementation defined) is good
>> if you are lucky and the implementation you happen to use is good.
>> It's a tradeoff.
>
> I think if we provide a good quality default implementation, and also
> a good rationale (and the rationale is important here), then vendors
> are unlikely to change it to something worse. Vendors avoid
> unnecessary work.
>
> The eventual standard could recommend a specific implementation in a
> non-normative Note.
>
>
>> A fixed and predictable default means that you don't have to measure
>> the quality of the hash function for every implementation to which
>> you port your program. Your original tests are still valid.
>
> In general, performance varies between platforms. If one
> implementation has the fixed string optimisation and the other
> doesn't, my benchmarks for one will not apply to the other. I am not
> yet convinced that hashing deserves to be a special case. On the
> other hand, it does seem to be an area which can benefit from
> innovation. E.g. a platform may have a super-fast
> near-cryptographic-strength algorithm implemented in hardware, and
> we're saying they can't use it?

I think that you are being too optimistic.

The problem with hash functions is that the performance of an unordered_map
lookup is dominated by the number of collisions the hash function produces
over the specific key set.

Since the vendors do not have access to your key sets, they can't optimize
their hash functions so that they are collision-free. This will inevitably
lead to situations where a vendor supplies a hash function which passes
their regression tests with flying colors but slows down your code to a
crawl.
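
To make the failure mode concrete, here is a deliberately contrived
sketch (the function and the key set are hypothetical, not taken from
any real implementation). A prefix-sampling string hash like this can
pass a vendor's regression tests over random strings, yet a perfectly
ordinary key set defeats it:

    #include <cstddef>
    #include <string>

    // Hypothetical vendor hash, for illustration only: it samples at
    // most the first four characters, which looks fine when tested
    // over random strings.
    std::size_t weak_hash(std::string const & s)
    {
        std::size_t h = 0;
        for (std::size_t i = 0; i < s.size() && i < 4; ++i)
            h = h * 31 + static_cast<unsigned char>(s[i]);
        return h;
    }

    // A realistic key set defeats it completely: "user_0001",
    // "user_0002", ..., "user_9999" all share the sampled prefix, so
    // every key hashes to the same value and every lookup degenerates
    // into a linear scan of a single bucket.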

If you _know_ how the default hash function will perform on your key set,
you can decide whether to use it, and if the hash function is standard,
you do not have to re-evaluate that decision each time you change
platforms.
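
One way to obtain that knowledge is to measure the collision behavior
of the candidate default over your own keys. A minimal sketch, assuming
boost::hash as the candidate (worst_bucket_load is a made-up helper
name, not an existing interface):

    #include <boost/functional/hash.hpp>
    #include <algorithm>
    #include <cstddef>
    #include <map>
    #include <string>
    #include <vector>

    // Returns the load of the most crowded bucket that 'keys' would
    // produce for the given bucket count. If the hash function is
    // fully specified, this number is a property of the key set alone
    // and remains valid on every conforming implementation.
    std::size_t worst_bucket_load(std::vector<std::string> const & keys,
                                  std::size_t buckets)
    {
        boost::hash<std::string> h;
        std::map<std::size_t, std::size_t> load;
        std::size_t worst = 0;
        for (std::size_t i = 0; i < keys.size(); ++i)
            worst = std::max(worst, ++load[h(keys[i]) % buckets]);
        return worst;
    }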

>> The other side of predictability - that the hash function is not
>> affected by types or the particular representation of a sequence of
>> values - means that you don't necessarily have to re-run the
>> performance benchmark of your unordered_maps when you refactor the
>> key to use int[3] instead of struct { short, short, short }, or a
>> wstring instead of a string.
>
> I think you need to repeat your benchmarks anyway.
>
> You may find that values which used to fit in cache no longer do so.
> Or vice versa. If you are hashing pointer values then the collisions will
> vary from one run of the program to the next.
>
> Even if you can get stability for some set of program edits, there
> will always be other edits which don't get it. Reordering the
> hash_combines, or refactoring an object's state, for example.

We are talking about orders-of-magnitude slowdowns here, not percentage
points. We are talking about a std::map beating an unordered_map by a
factor of four or more. That is the kind of porting problem at stake.
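
For a rough illustration of the scale (the container spelling below is
an assumption; in a TR1-era implementation the header and namespace
vary), pair the prefix-sampling hash from the sketch above with keys
that share its sampled prefix. Every key lands in one bucket, each
lookup becomes a linear scan, and the tree-based map wins by orders of
magnitude:

    #include <cstdio>
    #include <map>
    #include <string>
    #include <unordered_map> // assumption: exact spelling varies pre-C++11

    struct weak_hash_t // same prefix-sampling hash as in the sketch above
    {
        std::size_t operator()(std::string const & s) const
        {
            std::size_t h = 0;
            for (std::size_t i = 0; i < s.size() && i < 4; ++i)
                h = h * 31 + static_cast<unsigned char>(s[i]);
            return h;
        }
    };

    int main()
    {
        std::map<std::string, int> m;
        std::unordered_map<std::string, int, weak_hash_t> um;

        for (int i = 0; i < 10000; ++i)
        {
            char buf[32];
            std::sprintf(buf, "user_%06d", i);
            m[buf] = i;
            um[buf] = i; // all 10000 keys collide under weak_hash_t
        }

        // Each um lookup scans a 10000-element bucket (~5000 string
        // compares on average); each m lookup costs ~14 compares.
        long sum = 0;
        for (int i = 0; i < 10000; ++i)
        {
            char buf[32];
            std::sprintf(buf, "user_%06d", i);
            sum += m.count(buf) + um.count(buf);
        }
        std::printf("%ld\n", sum);
    }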

