mpfr_float stack allocation

Hi Boost Users, I use Boost.Multiprecision's wrappers for MPFR, particularly the boost::multiprecision::mpfr_float. I've had good success with this library and its types so far, enjoying performance gains because of expression templates, and easy access to mpfr and gmp for floats, ints, and rationals. However, in my complex number type, I am struggling with named real temporaries in arithmetic operators. I am looking for help with multiplication, and I can generalize to all other operations needing a real temporary later. Here's my current code for multiplication, implemented in terms of *=. /** Complex multiplication. uses a single temporary variable 1 temporary, 4 multiplications */ complex& operator*=(const complex & rhs) { mpfr_float re_cache(real_*rhs.real_ - imag_*rhs.imag_); // cache the real part of the result imag_ = real_*rhs.imag_ + imag_*rhs.real_; real_ = re_cache; return *this; } I am using the naive evaluation formula, for better or worse. But the particular method of evaluation doesn't matter to me right now, it's the temporary variable re_cache used to cache the real part is killing me. Profiling using Allinea, gprof, and Callgrind reveals that a significant percentage of time in the program is consumed creating and destroying re_cache, as well as other temporaries scattered throughout my code. In an effort to eliminate heap allocations, I first tried replacing my using mpfr_float = boost::multiprecision::number<boost::multiprecision::mpfr_float_backend<0,boost::multiprecision::allocate_dynamic>, boost::multiprecision::et_off>; with allocate_stack, just for these named temporary variables. However, the documentation <http://www.boost.org/doc/libs/1_60_0/libs/multiprecision/doc/html/boost_multiprecision/tut/floats/mpfr_float.html> states that allocate_stack only works in fixed precision, so the 0 digits indicating variable precision should cause problems? I expected 0 digits to fail to compile, but compile it does. The conversion from the type with allocate_stack to allocate_dynamic is not provided by the = operator, so I can write re_cache.convert_to<mpfr_float_dynamic>(). But using variable precision with stack allocation appears to cause all sorts of problems. Everything depending on complex multiplication in my program breaks, indicated by massive failures in my test suite. Thus, I currently regard allocate_stack as a non-solution. My question fundamentally regards how to deal with re_cache and the *= operator for complex multiplication. Anyone have ideas on how to get rid of constant heap de/allocation of re_cache without inducing a ton more arithmetic? Using a static for it is not a solution, both because I anticipate multithreading in this application in the future, and the fact that precision will vary though the run, so I'd have to check the precision of re_cache on every evaluation. The C version of the program I am re-implementing used OpenMP for threads, and used a thread-id-indexed global for re_cache. That solution then forces OpenMP onto anyone wanting to use the complex class in multiple threads. Hence, I view this as a non-solution, too. Again, my initial thought was 'heap allocation is the problem here, so I'll use the stack allocated backend'. But variable precision doesn't appear to work with allocate_stack. And statics are no good. Any thoughts? Thanks for your time, Daniel Brake Postdoc University of Notre Dame Applied and Computational Math and Stats

In an effort to eliminate heap allocations, I first tried replacing my using mpfr_float = boost::multiprecision::number<boost::multiprecision::mpfr_float_backend<0,boost::multiprecision::allocate_dynamic>, boost::multiprecision::et_off>; with allocate_stack, just for these named temporary variables.
Shouldn't you be using et_on based on your comments above? In any case you've found a bug, using mpfr_float_backend<0,boost::multiprecision::allocate_stack> creates a one-bit float, which I suspect is not what you wanted ;) I'll fix that so that it causes a compiler error.
However, the documentation <http://www.boost.org/doc/libs/1_60_0/libs/multiprecision/doc/html/boost_multiprecision/tut/floats/mpfr_float.html> states that allocate_stack only works in fixed precision, so the 0 digits indicating variable precision should cause problems? I expected 0 digits to fail to compile, but compile it does. The conversion from the type with allocate_stack to allocate_dynamic is not provided by the = operator, so I can write re_cache.convert_to<mpfr_float_dynamic>(). But using variable precision with stack allocation appears to cause all sorts of problems. Everything depending on complex multiplication in my program breaks, indicated by massive failures in my test suite. Thus, I currently regard allocate_stack as a non-solution.
My question fundamentally regards how to deal with re_cache and the*= operator for complex multiplication. Anyone have ideas on how to get rid of constant heap de/allocation of re_cache without inducing a ton more arithmetic? Using a static for it is not a solution, both because I anticipate multithreading in this application in the future, and the fact that precision will vary though the run, so I'd have to check the precision of re_cache on every evaluation. The C version of the program I am re-implementing used OpenMP for threads, and used a thread-id-indexed global for re_cache. That solution then forces OpenMP onto anyone wanting to use the complex class in multiple threads. Hence, I view this as a non-solution, too.
Again, my initial thought was 'heap allocation is the problem here, so I'll use the stack allocated backend'. But variable precision doesn't appear to work with allocate_stack. And statics are no good.
Any thoughts?
Yes, but none you may like. * You could use Boost.Thread for thread-local statics. * In C++11 you can use thread_local storage, see http://en.cppreference.com/w/cpp/language/storage_duration * For pre-C++11 you could use __thread or __declspec(thread) in a non-portable compiler-specific way. * If this was C99 you could use a variable-length-array and initialize a temporary mpfr_t yourself. * Ditto, but using alloca (this does work in some C++ implementations but not all). * You could use a temporary buffer big enough for the largest precision you ever use, and use that to initialize an mpfr_t yourself. Of course this may run you out of stack space ;) And finally... you may not see as much speedup as you expect unless the precision is low - for any significant number of digits the cost of the arithmetic typically far outweighs everything else. HTH, John.
participants (2)
-
Daniel Brake
-
John Maddock