Subject: Re: [boost] Name and Namespace for Potential Boost Extended Floating-Point
From: Simonson, Lucanus J (lucanus.j.simonson_at_[hidden])
Date: 2011-09-03 20:19:27
Vicente J. Botet Escriba wrote:
> Le 02/09/11 20:34, Simonson, Lucanus J a écrit :
>> John Maddock wrote:
>>>> Gmp mpq_class expressions allocate for temporaries in compound
>>>> expressions.
>>> Nod, one thing I've worked hard on is eliminating temporaries inside
>>> expressions whenever possible, so if you write say:
>>>
>>> a = b + c + d + e + f;
>>>
>>> Then you get no temporaries at all - everything gets accumulated
>>> inside variable a.
>>>
>>> However, something like:
>>>
>>> a = b*c + d*e;
>>>
>>> does require one temporary for *one* of the multiplications (the
>>> other uses "a" as working space).
>> I really like this, and I think it may be about the best we can do.
>> If you could somehow abstract out the compile time logic of doing
>> this with an expression template system from your application of it
>> to bignum then it could become a boost library extension that might
>> find a home in proto (assuming you use proto). I'd imagine Joel F.
>> would like to discuss this more with you. You've already pushed the
>> expression templates about as far as they can go to accomplish this,
>> and farther than I ever have.
>>
>>> One thing I've thought of is walking the expression tree at compile
>>> time to calculate the largest number of temporaries in the longest
>>> "branch", then creating these as an array at the outermost level of
>>> the expression evaluation and passing them down, so that:
>>>
>>> a = b*c + b*d + b*e + b*f;
>>>
>>> still only needs one temporary - which gets reused for each branch.
>>>
>>> This needs some more thought though - to make sure I don't shoot
>>> myself in the foot and reuse something that shouldn't be.
>> You could cache the array of temporaries in "a" for reuse when "a"
>> is reused.
>>
>> bignum a, b, c, d, e, f;
>> for (int i = 0; i < 1000; ++i) {
>>     b = B[i];
>>     c = C[i];
>>     d = D[i];
>>     e = E[i];
>>     f = F[i];
>>     a = b*c + b*d + b*e + b*f;
>>     foo(a);
>> }
>>
>> Allocates a,b,c,d,e,f and two temporaries for a total of 8
>> allocations instead of 6006 with gmpq_class.
>>
>> The only problem is that it is more natural to declare a-f inside the
>> loop or just write:
>>
>> for (int i = 0; i < 1000; ++i) {
>>     foo(b[i]*c[i] + b[i]*d[i] + b[i]*e[i] + b[i]*f[i]);
>> }
>>
>> We could make the array mutable and cache it in any of the arguments
>> of the expression, but that wouldn't help in this case, and wouldn't
>> help if they were all rvalues.
>>
>> for (int i = 0; i < 1000; ++i) {
>>     foo(b(i)*c(i) + b(i)*d(i) + b(i)*e(i) + b(i)*f(i));
>> }
>>
>> If we try to fall back to a custom allocator we have all the
>> problems with thread safety, and we have only really accomplished
>> improving on normal heap allocation by applying a pool. The problem
>> with creating an arena of gmpq_class objects is that what we really
>> want is to cache the object that has the right *size* buffer
>> internally to store the result of the operation. In general that is
>> different from one temporary to the next in the code. If you use an
>> arena you get an arbitrary object, and you end up growing all of the
>> buffers to store the largest temporary, which is a lot more
>> allocations and a lot more memory consumed. Also, with the arena
>> you can't release the memory without putting that burden on the
>> library user.
>>
>> In the end there are problems with any approach that goes beyond
>> what you've already done and I'm not entirely sure what our best
>> option is. I'm afraid we can never abstract away the management of
>> memory from the user of the code and give them something that can be
>> as efficient as they could be if they managed their allocations
>> explicitly.
>>
> Hi,
>
> I don't know if this could help.
>
> The library could provide a class that caches the temporary objects
> built and used while evaluating an expression assigned to it. This
> cache would provide an explicit bignum conversion operator. The user
> could declare this cache outside the loop to avoid the temporary
> allocations, for example:
>
> bignum_cache tmp;
>
> for (int i = 0; i < 1000; ++i) {
>     tmp = b[i]*c[i] + b[i]*d[i] + b[i]*e[i] + b[i]*f[i];
>     foo( bignum(tmp) ); // explicit conversion
> }
>
> We could see this cache as a smart collection of registers.
Yes; however, the ideal solution would allow a person who wrote code in a templated context, with numerical data type T assumed to be a builtin floating point type, to substitute the bignum datatype for T and get optimal code. One of my suggestions was to put the bignum_cache inside the bignum datatype itself and make its use implicit.
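For the implicit variant, something along these lines is what I have in mind. This is only a minimal sketch with hypothetical names, not a proposed interface; the point is just that the result object owns the scratch buffers, so a loop that keeps assigning expressions to the same bignum stops allocating after the first iteration.

#include <cstddef>
#include <vector>

class bignum {
public:
    // ... constructors and arithmetic interface elided ...

    // Hand out the i-th cached scratch buffer, growing the pool on demand.
    // The buffers keep their capacity between assignments, so expression
    // evaluation can reuse them instead of allocating fresh temporaries.
    std::vector<unsigned>& scratch(std::size_t i) {
        if (pool_.size() <= i)
            pool_.resize(i + 1);
        return pool_[i];
    }

private:
    std::vector<unsigned> digits_;               // the value itself
    std::vector<std::vector<unsigned> > pool_;   // cached temporaries
};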
If you look in the implementation details of Polygon, you will find what I call a compute_intersection_pack:

struct compute_intersection_pack {
    typedef typename high_precision_type<Unit>::type high_precision;
    high_precision y_high, dx1, dy1, dx2, dy2, x11, x21, y11, y21, x_num, y_num, x_den, y_den, x, y;
    ...implementation of line intersection operation...
};
That is a struct with registers for all of the variables I use in the long expressions for computing the intersection point of a pair of line segments. I make the pack a member of the object that encapsulates the larger all-pairs line segment intersection algorithm, so the high_precision values it caches are recycled. There are almost as many temporaries as explicit variables in my expression, but even so, caching rather than reallocating these dozen or so variables each time I performed the computation made a significant performance improvement, before I applied lazy-exact arithmetic and relegated high precision arithmetic to Amdahl's prison of code that could be faster but isn't worth the effort.
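The recycling pattern looks roughly like this; the enclosing type and the cut-down pack below are illustrative stand-ins, not Polygon's actual internals.

#include <vector>

struct high_precision { std::vector<unsigned> digits; };  // stand-in type

struct compute_intersection_pack {
    high_precision x_num, y_num, x_den, y_den, x, y;  // a few of the registers
};

struct pairwise_intersector {
    compute_intersection_pack pack_;  // constructed once with the algorithm object

    void handle_pair(/* two segments */) {
        // Every intersection computation writes into pack_'s members, so
        // their internal buffers are allocated for the first pair and then
        // recycled for every subsequent pair.
    }
};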
Similarly, my colleague Andrii used a static array of gmp values as an arena and treated them like registers; unfortunately that is not thread safe, so it is an optional feature of his implementation.
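One way the static register bank could be made thread safe is to give each thread its own bank; the sketch below uses boost::thread_specific_ptr and is my illustration, not Andrii's actual code.

#include <cstddef>
#include <vector>
#include <boost/thread/tss.hpp>
#include <gmpxx.h>

// Each thread lazily allocates its own bank of mpq_class "registers" on
// first use and then reuses the same objects for every later expression.
inline mpq_class& scratch_register(std::size_t i) {
    static boost::thread_specific_ptr<std::vector<mpq_class> > bank;
    if (!bank.get())
        bank.reset(new std::vector<mpq_class>(16));  // illustrative bank size
    return (*bank)[i];
}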
We would like to see a general solution in a boost multiprecision library, but it is important that it not impose a burden on the user of the library, or its use in a templated context would be prohibitive. Multiprecision::mp_float needs to behave in all ways as much like a regular float as reasonably possible to make its substitution into other boost libraries feasible. The design of the multi-precision library will necessarily involve some trade-offs. It looks like the current bignum expression templates will improve the performance of my own code (with respect to the number of allocations) if I substitute them for gmpq_class, and I'm pretty happy with that overall. I don't think a boost multiprecision library needs to be perfect; I'll be satisfied if it is an improvement on gmp's own expression templates.
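To make the accumulate-into-the-destination behaviour John describes concrete, here is a toy sketch of the technique. It is my own reconstruction, not John's implementation, and it uses a plain long as a stand-in for the real digit storage.

// operator+ builds a lightweight expression node; assignment walks the
// left-leaning tree using the destination itself as the only working value,
// so a = b + c + d + e + f; creates no bignum temporaries at all.
struct bignum {
    long value;                               // stand-in for real digit storage
    bignum(long v = 0) : value(v) {}
    void assign(const bignum& x) { value = x.value; }
    void add(const bignum& x)    { value += x.value; }
    template <typename Expr>
    bignum& operator=(const Expr& e) { eval_to(e, *this); return *this; }
};

// Evaluating a terminal into the destination is just assignment.
inline void eval_to(const bignum& x, bignum& result) { result.assign(x); }

template <typename L>
struct add_expr {
    const L&      lhs;   // a bignum or a nested add_expr
    const bignum& rhs;   // terminals stay on the right in a left-leaning chain
};

template <typename L>
inline void eval_to(const add_expr<L>& e, bignum& result) {
    eval_to(e.lhs, result);   // accumulate the left subtree into 'result'
    result.add(e.rhs);        // then fold in the right operand, no temporary
}

// The operators only record the operands; nothing is computed until assignment.
inline add_expr<bignum> operator+(const bignum& a, const bignum& b) {
    add_expr<bignum> e = { a, b };
    return e;
}
template <typename L>
inline add_expr<add_expr<L> > operator+(const add_expr<L>& a, const bignum& b) {
    add_expr<add_expr<L> > e = { a, b };
    return e;
}

The mixed case a = b*c + d*e needs the same walk plus one real temporary for the second product, which is where the temporary counting John mentions comes in.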
Regards,
Luke