boost.proto optimization on MSVC (2005/2008)

Hi, I've been trying out boost.proto lately and I've noticed that for my expressions it tends to force my operands to be created on the stack. For example, instead of optimizing out an object completely when everything about it is known at compile-time, the compiler (MSVC 2005 or 2008) is creating that object on the stack anyway.. even though it's not necessary. Even worse, it actually folds the simpler expressions at compile-time and never even touches the data that it just put on the stack. The problem is compounded with larger expressions involving many operands. It even seems to build the larger expression objects on the stack unnecessarily (and this looks like it starts to break constant folding). Here's some of the generated assembly: int main() { 00401800 sub esp,18h vector2< float[3] > v1 = { 0, 1, 2 }; 00401803 xorps xmm0,xmm0 vector2< float[3] > v2 = { 3, 4, 5 }; vector2< float[3] > v3 = { 6, 7, 8 }; // Add two vectors lazily and get the 2nd element. #if 1 // this sticks temporary junk all over the stack, even with 2 terms std::cout << (v1 + v2)[0]; 00401806 fld dword ptr [__real@40400000 (402128h)] 0040180C movss dword ptr [esp],xmm0 00401811 movss xmm0,dword ptr [__real@3f800000 (402124h)] 00401819 movss dword ptr [esp+4],xmm0 0040181F movss xmm0,dword ptr [__real@40000000 (402120h)] 00401827 movss dword ptr [esp+8],xmm0 0040182D movss xmm0,dword ptr [__real@40400000 (402128h)] 00401835 movss dword ptr [esp+0Ch],xmm0 0040183B movss xmm0,dword ptr [__real@40800000 (40211Ch)] 00401843 push ecx 00401844 mov ecx,dword ptr [__imp_std::cout (402044h)] 0040184A fstp dword ptr [esp] 0040184D movss dword ptr [esp+14h],xmm0 00401853 movss xmm0,dword ptr [__real@40a00000 (402118h)] 0040185B movss dword ptr [esp+18h],xmm0 00401861 call dword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (402040h)] std::cin.get(); 00401867 mov ecx,dword ptr [__imp_std::cin (40203Ch)] 0040186D call dword ptr [__imp_std::basic_istream<char,std::char_traits<char> >::get (402038h)] return 0; 00401873 xor eax,eax } 00401875 add esp,18h 00401878 ret I've come up with a couple workarounds but these only work for specific cases: // (1) This optimizes out nicely, but only works for adding 2 objects. typedef proto::terminal< vector2< float[3] > >::type expr_type; std::cout << (vector_expr< expr_type >( expr_type::make(v1) ) + vector_expr< expr_type >( expr_type::make(v2) ))[0]; // (2) This never optimizes out std::cout << (v1 + v2)[0]; I'd noticed that the reference at traits.hpp line 144 seemed to play a major role (it was the only significant difference between the expression trees generated by the above two snippets.) // (3) This makes (2) optimize just like (1), but doesn't help 3+ // HACK TO REMOVE REF AT boost\xpressive\proto\traits.hpp LINE 144 namespace boost { namespace proto { namespace result_of { template<typename T, typename EnableIf> struct as_arg<T, VectorDomain, EnableIf> { typedef expr<proto::tag::terminal, args0<T /*&*/> > expr_type; typedef typename VectorDomain::template apply<expr_type>::type type; template<typename T2> static type call(T2 &t) { return VectorDomain::make(expr_type::make(t)); } }; }}} Finally, here is the entire source I'm using: #include <cstddef> #include <iostream> #include <boost/xpressive/proto/proto.hpp> #include <boost/xpressive/proto/context.hpp> using namespace boost; template< typename T> struct vector2; template< typename T > struct vector2<T[3]> { T v[3]; typedef T value_type; T const & operator [](std::size_t i) const { return this->v[i]; } T & operator [](std::size_t i) { return this->v[i]; } }; // This grammar describes which lazy vector expressions // are allowed; namely, vector terminals and addition // and subtraction of lazy vector expressions. struct VectorGrammar : proto::or_ < proto::terminal< vector2< proto::_ > > , proto::plus< VectorGrammar, VectorGrammar > > { }; // Expressions in the lazy vector domain must conform // to the lazy vector grammar struct VectorDomain; // Here is an evaluation context that indexes into a lazy vector // expression, and combines the result. struct subscript_context { private: std::size_t i; public: subscript_context(std::size_t i) : i(i) { } // Use default_eval for all the operations ... template< typename Expr, typename Tag = typename Expr::proto_tag > struct eval : proto::default_eval< Expr, subscript_context const > { }; // ... except for terminals, which we index with our subscript template< typename Expr > struct eval< Expr, proto::tag::terminal > { typedef typename proto::result_of::arg< Expr
::type::value_type result_type;
result_type operator ()(Expr const & expr, subscript_context const & ctx) const { return proto::arg(expr)[ctx.i]; } }; }; // Here is the domain-specific expression wrapper, which overrides // operator [] to evaluate the expression using the subscript_context. template< typename Expr > struct vector_expr : proto::extends<Expr, vector_expr< Expr >, VectorDomain
{ typedef proto::extends< Expr, vector_expr< Expr >, VectorDomain > base_type; vector_expr(Expr const & expr) : base_type(expr) { } // Use the subscript_context to implement subscripting // of a lazy vector expression tree. typename proto::result_of::eval< Expr, subscript_context const
::type operator [](std::size_t i) const { subscript_context const ctx(i); return proto::eval(*this, ctx); } };
// Tell proto that in the VectorDomain, all // expressions should be wrapped in vector_expr<> struct VectorDomain : proto::domain< proto::generator< vector_expr >, VectorGrammar > { }; template< typename T > struct is_vector : mpl::false_ { }; template< typename T > struct is_vector< vector2< T > > : mpl::true_ { }; BOOST_PROTO_DEFINE_BINARY_OPERATOR(+, boost::proto::tag::plus, is_vector, VectorDomain); int main() { vector2< float[3] > v1 = { 0, 1, 2 }; vector2< float[3] > v2 = { 3, 4, 5 }; vector2< float[3] > v3 = { 6, 7, 8 }; // Add two vectors lazily and get the 2nd element. #if 1 // this sticks write-only temporary junk all over the stack, even with 2 terms std::cout << (v1 + v2)[0];// + v3 + v3 + v3 + v3 + v3)[0]; #else // this optimizes out completely for 2 terms... expanding to 3 terms does not typedef proto::terminal< vector2< float[3] > >::type expr_type; std::cout << (vector_expr< expr_type >( expr_type::make(v2) ) + vector_expr< expr_type >( expr_type::make(v3) ))[0]; #endif std::cin.get(); return 0; } Thanks in advance. // AJD < n.outpost@gmail.com >

AJD <n.outpost <at> gmail.com> writes:
Here's some of the generated assembly:
int main() { 00401800 sub esp,18h vector2< float[3] > v1 = { 0, 1, 2 }; 00401803 xorps xmm0,xmm0 vector2< float[3] > v2 = { 3, 4, 5 }; vector2< float[3] > v3 = { 6, 7, 8 };
// Add two vectors lazily and get the 2nd element. #if 1 // this sticks temporary junk all over the stack, even with 2 terms std::cout << (v1 + v2)[0]; 00401806 fld dword ptr [__real <at> 40400000 (402128h)] 0040180C movss dword ptr [esp],xmm0 00401811 movss xmm0,dword ptr [__real <at> 3f800000 (402124h)] 00401819 movss dword ptr [esp+4],xmm0 0040181F movss xmm0,dword ptr [__real <at> 40000000 (402120h)] 00401827 movss dword ptr [esp+8],xmm0 0040182D movss xmm0,dword ptr [__real <at> 40400000 (402128h)] 00401835 movss dword ptr [esp+0Ch],xmm0 0040183B movss xmm0,dword ptr [__real <at> 40800000 (40211Ch)] 00401843 push ecx 00401844 mov ecx,dword ptr [__imp_std::cout (402044h)] 0040184A fstp dword ptr [esp] 0040184D movss dword ptr [esp+14h],xmm0 00401853 movss xmm0,dword ptr [__real <at> 40a00000 (402118h)] 0040185B movss dword ptr [esp+18h],xmm0 00401861 call dword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (402040h)]
std::cin.get(); 00401867 mov ecx,dword ptr [__imp_std::cin (40203Ch)] 0040186D call dword ptr [__imp_std::basic_istream<char,std::char_traits<char> >::get (402038h)]
return 0; 00401873 xor eax,eax } 00401875 add esp,18h 00401878 ret
Just to be clear, the following is the expected assembly: 00401800 fld dword ptr [__real@41100000 (402118h)] 00401806 push ecx 00401807 mov ecx,dword ptr [__imp_std::cout (402044h)] 0040180D fstp dword ptr [esp] 00401810 call dword ptr [__imp_std::basic_ostream<char,std::char_traits<char> >::operator<< (402040h)] 00401816 mov ecx,dword ptr [__imp_std::cin (40203Ch)] 0040181C call dword ptr [__imp_std::basic_istream<char,std::char_traits<char> >::get (402038h)] 00401822 xor eax,eax 00401824 ret Notice how in both versions the actual printed value is folded as a constant at compile-time, loaded onto the stack through the fp instructions, and printed. In the first version of course all that extra data (the vectors) is being copied onto the stack through xmm0, and this data is never touched.

"AJD" <n.outpost@gmail.com> wrote in message news:fj77jn$f3b$1@ger.gmane.org...
Hi, I've been trying out boost.proto lately and I've noticed that for my expressions it tends to force my operands to be created on the stack.
Hi AJD, What compiler (MSVC 2005 or 2008) and what compiler options did you use to generate the assembly code you posted? Also, are you using Proto from release 1.34.1 or from Trunk? Thanks, Dave Jenkins

AJD wrote:
Hi,
I've been trying out boost.proto lately and I've noticed that for my expressions it tends to force my operands to be created on the stack. <snip>
The way you were defining your vector terminals confused proto into thinking they needed to be wrapped in your vector_expr wrapper. The extra layer of indirection is what is flummoxing msvc's optimizer. I can only assume you went through those contortions in order to preserve the nice aggregate initialization for your vector terminals. Try the following instead. #include <cstddef> #include <iostream> #include <boost/xpressive/proto/proto.hpp> #include <boost/xpressive/proto/context.hpp> using namespace boost; // This grammar describes which lazy vector expressions // are allowed; namely, vector terminals and addition // and subtraction of lazy vector expressions. struct VectorGrammar : proto::or_< proto::and_< proto::terminal< proto::_ > , proto::if_< is_array<proto::result_of::arg<mpl::_> > > > , proto::plus< VectorGrammar, VectorGrammar > > {}; // Expressions in the lazy vector domain must conform // to the lazy vector grammar struct VectorDomain; // Here is an evaluation context that indexes into a lazy vector // expression, and combines the result. struct subscript_context { subscript_context(std::size_t i) : i(i) { } // Use default_eval for all the operations ... template< typename Expr, typename Tag = typename Expr::proto_tag > struct eval : proto::default_eval< Expr, subscript_context const > { }; // ... except for terminals, which we index with our subscript template< typename Expr > struct eval< Expr, proto::tag::terminal > { typedef typename Expr::value_type result_type; result_type operator ()(Expr const & expr, subscript_context const & ctx) const { return proto::arg(expr)[ctx.i]; } }; private: std::size_t i; }; // Here is the domain-specific expression wrapper, which overrides // operator [] to evaluate the expression using the subscript_context. template< typename Expr > struct vector_expr { BOOST_PROTO_EXTENDS(Expr, vector_expr<Expr>, VectorDomain) // Use the subscript_context to implement subscripting // of a lazy vector expression tree. typename proto::result_of::eval< Expr, subscript_context const >::type operator [](std::size_t i) const { subscript_context const ctx(i); return proto::eval(*this, ctx); } }; template< typename T, typename U = proto::is_proto_expr > struct vector2; template< typename T > struct vector2<T[3]> { BOOST_PROTO_EXTENDS(typename proto::terminal<T[3]>::type, vector2<T[3]>, VectorDomain) typedef T value_type; T & operator [](std::size_t i) { return proto::arg(*this)[i]; } T const & operator [](std::size_t i) const { return proto::arg(*this)[i]; } }; // Tell proto that in the VectorDomain, all // expressions should be wrapped in vector_expr<> struct VectorDomain : proto::domain< proto::pod_generator< vector_expr > , VectorGrammar > { }; int main() { vector2< float[3] > v1 = { 0, 1, 2 }; vector2< float[3] > v2 = { 3, 4, 5 }; vector2< float[3] > v3 = { 6, 7, 8 }; // Add two vectors lazily and get the 2nd element. std::cout << (v1 + v2)[0]; return 0; } For me, the line "std::cout << (v1 + v2)[0]" generates: ; 107 : // Add two vectors lazily and get the 2nd element. ; 108 : std::cout << (v1 + v2)[0]; fldz push ecx fadd QWORD PTR __real@4008000000000000 fstp DWORD PTR tv227[esp+12] fld DWORD PTR tv227[esp+12] fstp DWORD PTR [esp] call ??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QAEAAV01@M@Z ; std::basic_ostream<char,std::char_traits<char> >::operator<< That looks pretty good to me. HTH, -- Eric Niebler Boost Consulting www.boost-consulting.com

"Eric Niebler" <eric@boost-consulting.com> wrote in message news:<475A0ED4.1010307@boost-consulting.com>...
template< typename T, typename U = proto::is_proto_expr > struct vector2;
template< typename T > struct vector2<T[3]> { BOOST_PROTO_EXTENDS(typename proto::terminal<T[3]>::type, vector2<T[3]>, VectorDomain)
typedef T value_type;
T & operator [](std::size_t i) { return proto::arg(*this)[i]; }
T const & operator [](std::size_t i) const { return proto::arg(*this)[i]; } };
// Tell proto that in the VectorDomain, all // expressions should be wrapped in vector_expr<> struct VectorDomain : proto::domain< proto::pod_generator< vector_expr > , VectorGrammar > { };
Thanks, this works great. The only problem is that vector<> is intended as a wrapper class, and I just implemented it as an array for convenience in this example. It could have any underlying type, not necessarily an array. Is there a proper way to do this without exposing vector<>'s underlying type?

AJD wrote:
Thanks, this works great. The only problem is that vector<> is intended as a wrapper class, and I just implemented it as an array for convenience in this example. It could have any underlying type, not necessarily an array. Is there a proper way to do this without exposing vector<>'s underlying type?
I'm confused. So your vector2<> type is intended as a wrapper for any kind of terminal, not arrays? So how is your example different than any of the examples in proto's docs? Have you tried using proto::extends<> (or BOOST_PROTO_EXTENDS) to extend terminal<T>::type instead of terminal<float[3]>::type ? You'll also need to patch up your grammar to accept any kind of terminal. -- Eric Niebler Boost Consulting www.boost-consulting.com

"Eric Niebler" <eric@boost-consulting.com> wrote in message news:475AEB69.9070304@boost-consulting.com...
I'm confused. So your vector2<> type is intended as a wrapper for any kind of terminal, not arrays?
vector2<> doesn't wrap terminals, it just wraps regular types to provide uniform access to them with a vector (math vector) interface. This way vector2<float[3]>, vector2<float(&)[3]>, and vector2<D3DXVECTOR> could all be used with expressions regardless of their underlying type, and additional types can be used just by specializing vector2<T> and providing the subscript operator as well as some traits.
So how is your example different than any of the examples in proto's docs?
I based it off of the lazyvector example, but the code does not optimize as expected. (In fact, all I really did was swap out the vector type.)
Have you tried using proto::extends<> (or BOOST_PROTO_EXTENDS) to extend terminal<T>::type instead of terminal<float[3]>::type ? You'll also need to patch up your grammar to accept any kind of terminal.
I'm not sure that's the correct approach... proto should not have to know about the internal type of vector2<> at all. I could do this with a helper type where vector2 extends vector2_internal which contains all the functionality, but then I'm faced with the problem that the many specializations of vector2_internal<> could have different constructors which would no longer be accessible through the vector2 interface.

AJD wrote:
"Eric Niebler" <eric@boost-consulting.com> wrote in message news:475AEB69.9070304@boost-consulting.com...
I'm confused. So your vector2<> type is intended as a wrapper for any kind of terminal, not arrays?
vector2<> doesn't wrap terminals, it just wraps regular types to provide uniform access to them with a vector (math vector) interface. This way vector2<float[3]>, vector2<float(&)[3]>, and vector2<D3DXVECTOR> could all be used with expressions regardless of their underlying type, and additional types can be used just by specializing vector2<T> and providing the subscript operator as well as some traits.
OK...
So how is your example different than any of the examples in proto's docs?
I based it off of the lazyvector example, but the code does not optimize as expected. (In fact, all I really did was swap out the vector type.)
Your code doesn't optimize because you replaced the terminal type with something that proto doesn't recognize as a terminal.
Have you tried using proto::extends<> (or BOOST_PROTO_EXTENDS) to extend terminal<T>::type instead of terminal<float[3]>::type ? You'll also need to patch up your grammar to accept any kind of terminal.
I'm not sure that's the correct approach... proto should not have to know about the internal type of vector2<> at all. I could do this with a helper type where vector2 extends vector2_internal which contains all the functionality, but then I'm faced with the problem that the many specializations of vector2_internal<> could have different constructors which would no longer be accessible through the vector2 interface.
Proto builds trees. Trees have terminals and non-terminals. Any type which proto doesn't recognize is assumed to be a terminal and is automatically wrapped. Types which proto recognizes are: - proto::expr<> - Anything that inherits from proto::extends<> - Anything type that uses the BOOST_PROTO_EXTENDS macro Your vector terminals must conform to one of those. Otherwise, proto will make it conform by wrapping it, and that makes your optimizer's job harder. The code I sent has a specialization for vector2<float[3]>. Can't you just add additional specializations to handle the terminals you're interested in? Or define vector2 once as: template< typename T, typename U = proto::is_proto_expr > struct vector2 { BOOST_PROTO_EXTENDS( typename proto::terminal<T>::type , vector2<T> , VectorDomain ) typedef typename vector_traits<T>::value_type value_type; value_type & operator [](std::size_t i) { return vector_traits<T>::index(proto::arg(*this), i); } value_type const & operator [](std::size_t i) const { return vector_traits<T>::index(proto::arg(*this), i); } }; And then specialize vector_traits<>. -- Eric Niebler Boost Consulting www.boost-consulting.com
participants (3)
-
AJD
-
Dave Jenkins
-
Eric Niebler