Subject: Re: [boost] [random] new threefry random engine
From: Thijs van den Berg (thijs_at_[hidden])
Date: 2014-04-19 18:35:09
On 19 Apr 2014, at 19:46, Steven Watanabe <watanabesj_at_[hidden]> wrote:
> On 04/18/2014 02:00 AM, Thijs van den Berg wrote:
>> This algorithm is what they call the 4x64 variant. The paper
>> also talks about a 2x64 and 4x32 variants (but e.g. not 8x32,
>> 2x32). However Im not sure if we need to implement all the
>> variant theve studied (maybe yes, maybe no). They recommend
>> 4x64 with 20 rounds as the best choice for CPUs and thats
>> the main one I want to provide.
> If the algorithm is basically the
> same and only differs by some constants
> in the mixing function, then these constants
> can be turned into template parameters.
> If the differences are more significant,
> then it should be handled by a separate
> engine template, and there's no need to
> worry about it at this point.
Right now Im having performance issues with just this 4x64 version & an arbitrary rounds extension.
* I have a un-rolled version encrypt_counter() that handles up to 20 rounds (or less), and which has all the constants hard coded in the code. This version times 13.8 CPU cycles for 20 rounds.
* I have an arbitrary round version encrypt_counter_0() which looks much nicer from a code point of view, it has a loop and static const arrays containing the rotation constants (that repeat every 8 round). However, it takes 42 CPU cycles for the same 20 rounds. Ive tried many variants and I just cant make the loop version at fast as the un-rolled one
Perhaps someone can explain why that is?
* initially I had a 13 and 20 rounds version. The 13 round version was supposed to be the lowest rounds version to pass all the BigCrush test, but in my tests it failed one out of the 160 tests. Thats perhaps still better than MT which failed 2 or 3 tests (depending on the version of MT) but "failing only 1 test" is not a perfect selling point. Perhaps 14 round would solve that? The 20 rounds version advised in the paper has 7 extra round as "a big safety margin" and that one passes indeed all BigCrunch tests.
My view now is:
* I think we should just provide the 20 rounds version by default and allow developers to pick less rounds -and gain a little speed- if they know what they are doing.
* I like the "big safety margin for a default option, I gives you trust that you can pick this engine if you want the the highest quality random numbers. I think we shouldnt predefine the 13 rounds version with a typedef.
* The factor 2.5 in speed drop for writing an arbitrary round loop version is not worth it IMO.
So I propose:
* to only typedef the 20 rounds version, allow for less round via a template argument:
* the predefined typedef engines could be named threefry4x64_20 (20 rounds, 32 bit integers result_type) and threefry4x64_20_64 (20 rounds, 64 bit integers result_type). This naming conventions allows all the other variant to be added later if we want to.
* have an upper bound of 20 rounds instead of arbitrary rounds because of performance issues I cant solve.
..this give an engine with the follow properties..
* It is the authors pick
* it is very fast, 13.8 CPU cycles, I dont expect I can improve the speed of the un-rolled code any further
* it has the highest random quality
* it has great features for large scale parallel computing
Whats your view on limiting the round to <=20 for the template and providing only the 20 round as a typedef?
I have addressed most other points youve mentioned, but the performance issue of a generic rounds version has failed me.
>> AFAIK the state and the
>> return_type need not be the same. Any 32 bit engine can be
>> turned into a 64 engine by patching two returns together.
> This can be done outside of threefry by
> In Christ,
> Steven Watanabe
> Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk