Boost logo

Boost :

From: Joaquin M López Muñoz (joaquinlopezmunoz_at_[hidden])
Date: 2025-05-02 16:31:23


El 02/05/2025 a las 3:05, Ivan Matek via Boost escribió:
> On Fri, May 2, 2025 at 2:10 AM Jean-Louis Leroy via Boost <
> boost_at_[hidden]> wrote:
>
>> Maybe Joaquín wants to jump in, I don't want to put words in his mouth ;-)
>>
> Would be nice, from what I got from his talk some performance I have seen
> in my experiments, e.g. finding perfect hash for 368 items in 512 buckets
> seems almost impossible when we apply formula from the talk.

Ok,
boost::openmethod::policies::fast_perfect_hash::hash_initialize(first,
last) takes a range of N class info objects, calculates M as the lowest
shift so that 1<<M >= N and then randomly tries multipliers C until one
of them is such that

(type_id*C)>>(64-M)

produces no collisions, that is, it acts as perfect hash for the
type_ids in [first, last) into 2^M buckets. 100,000 random multipliers
are tried, and if all fail then M is bumped by 1, 2, 3, and if nothing
works the function bails out. For N = 368, the values of M tried are 9,
10, 11 and 12, which correspond to 512, 1024, 2048 and 4096 buckets,
respectively. Now, according to the formulae from the talk you mention,
the average number of random tries necessary to find a perfect
multiplier (assuming the input is random) is, for each number of
buckets: 512 buckets: 1.63840297782e+80 tries 1024 buckets:
7.17526892837e+32 tries 2048 buckets: 1.83078061406e+15 tries 4096
buckets: 24222060 tries

all of them much greater than the 100,000 limit for each value of M
imposed by hash_initialize. So, as you point out, registration of 368
classes should fail with extremely high probability, right? I wrote this
small program that mocks up the registration of 368 classes and calls
fast_perfect_hash::hash_initialize:

#include <array> #include <boost/openmethod.hpp> #include
<boost/unordered/unordered_flat_set.hpp> #include <iostream> using
namespace boost::openmethod; struct fake_class { fake_class(type_id id)
{ type_ids[0] = id; } auto type_id_begin() { return &type_ids[0]; } auto
type_id_end() { return &type_ids[1]; } type_id type_ids[1]; }; static
constexpr std::size_t num_classes = 368; int main() {
std::vector<fake_class> data; boost::unordered_flat_set<type_id> unique;
std::default_random_engine rnd(13081963);
std::uniform_int_distribution<type_id> d; for(int i=0;i<num_classes;++i)
{ for(;;) { auto id = d(rnd); if(unique.insert(id).second) {
data.push_back(id); break; } } } using fast_perfect_hash=
policies::fast_perfect_hash<BOOST_OPENMETHOD_DEFAULT_POLICY>;
fast_perfect_hash::hash_initialize(data.begin(), data.end()); // locally
tweaked fast_perfect_hash so that these members are public
std::cout<<"hash_mult: "<<fast_perfect_hash::hash_mult<<"\n";
std::cout<<"M: "<<(sizeof(std::size_t) * CHAR_BIT -
fast_perfect_hash::hash_shift)<<"\n"; } As type_ids we simply use random
numbers, and, sure enough, the program fails: could not find hash
factors after 400000s using 8192 buckets
D:\openmethods\x64\Debug\openmethods.exe (process 24820) exited with
code 3. Now, let's change the uniform distribution (d) as follows:
std::uniform_int_distribution<type_id> d(1, 1000); and, with this
change, registration succeeds for 1024 buckets! hash_mult:
16501117525188094459 M: 10 D:\openmethods\x64\Debug\openmethods.exe
(process 24332) exited with code 0.

What's hapenning? In the second run, all the type_ids are contained in a
range that is much smaller than the entire space of 64-bit numbers, so
it is *much much* more likely that a random multiplier will be perfect.
In real life, type_ids are just the addresses of typeid(X) for each
registered class X, and as it happens these usually occupy a contiguous
region of memory instead of being scattered across the 2^64 bytes of the
memory space. To sum up, fast_perfect_hash::hash_initialize works
against theoretical expectation because type_info objects in a program
occupy a small, contiguous range of addresses within the memory space.

Joaquin M Lopez Munoz


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk