[proto] : Using proto with NVIDIA CUDA nvcc (redux)

I remember there were postings on this mailing list before about trying to use proto with NVIDIA's CUDA compiler, and Joel Falcou has contributed some code to make proto work with nvcc. Well, I was trying a few experiments with the most recent CUDA compiler release (3.2). I believe one of the exciting things about using proto with CUDA is the ability to write mini programs in different DSELs using proto and ship the mini programs over to the "device" side and execute them there. The idiom looks as follows:

    template<typename F>
    __global__ void kernel(F f) { f(...); }

    kernel<<<blocksize, gridsize>>>(a+b /* a big proto expression */);

I tried to write the simplest form of the above use case:

    #include <boost/proto/proto.hpp>
    namespace proto = boost::proto;

    template<typename> struct expr;

    struct domain
      : proto::domain<proto::pod_generator<expr> >
    {};

    template<typename E>
    struct expr
    {
        BOOST_PROTO_BASIC_EXTENDS(E, expr<E>, domain)
    };

    template<typename F>
    __global__ void kernel(F f) {}

    struct term {};

    expr<proto::terminal<term>::type> a = {{{}}};
    expr<proto::terminal<term>::type> b = {{{}}};
    expr<proto::terminal<term>::type> c = {{{}}};

    int main()
    {
        kernel<<<1,1>>>(proto::deep_copy(b+c));
        return 0;
    }

Note the deep_copy when passing the proto expression to the kernel function. Since proto captures terminals by reference, the host-side pointers held in a proto expression won't make sense on the device side, so at the very least we have to make copies of the terminals before passing the expression to the device.

Now, compiling the above program with nvcc (CUDA 3.2, host compiler gcc 4.4.3, Ubuntu 10.04) results in the following errors (besides spewing a bunch of warnings):

    $ nvcc -v -keep t.cu -I $HOME/boost/boost-trunk/
    ...
    t.cudafe1.stub.c: In function ‘void __device_stub__Z6kernelI4exprIN5boost5proto7exprns_10basic_exprINS2_3tag4plusENS2_7argsns_5list2IS0_INS4_IS6_NS8_IS0_INS4_INS5_8terminalENS7_4termI4termEELl0EEEESE_EELl2EEEES0_INS4_IS9_NSA_IiEELl0EEEEEELl2EEEEEvT_(_Z4exprIN5boost5proto7exprns_10basic_exprINS1_3tag4plusENS1_7argsns_5list2IS_INS3_IS5_NS7_IS_INS3_INS4_8terminalENS6_4termI4termEELl0EEEESD_EELl2EEEES_INS3_IS8_NS9_IiEELl0EEEEEELl2EEEE&)’:
    t.cudafe1.stub.c:8: error: invalid cast from type ‘const expr<boost::proto::exprns_::basic_expr<boost::proto::tag::address_of, boost::proto::argsns_::list1<expr<boost::proto::exprns_::basic_expr<boost::proto::tag::plus, boost::proto::argsns_::list2<expr<boost::proto::exprns_::basic_expr<boost::proto::tag::plus, boost::proto::argsns_::list2<expr<boost::proto::exprns_::basic_expr<boost::proto::tag::terminal, boost::proto::argsns_::term<term>, 0l> >, expr<boost::proto::exprns_::basic_expr<boost::proto::tag::terminal, boost::proto::argsns_::term<term>, 0l> > >, 2l> >, expr<boost::proto::exprns_::basic_expr<boost::proto::tag::terminal, boost::proto::argsns_::term<int>, 0l> > >, 2l> >&>, 1l> >’ to type ‘char*’
    t.cudafe1.stub.c:8: error: invalid cast from type ‘const expr<boost::proto::exprns_::basic_expr<boost::proto::tag::address_of, boost::proto::argsns_::list1<expr<boost::proto::exprns_::basic_expr<boost::proto::tag::plus, boost::proto::argsns_::list2<expr<boost::proto::exprns_::basic_expr<boost::proto::tag::plus, boost::proto::argsns_::list2<expr<boost::proto::exprns_::basic_expr<boost::proto::tag::terminal, boost::proto::argsns_::term<term>, 0l> >, expr<boost::proto::exprns_::basic_expr<boost::proto::tag::terminal, boost::proto::argsns_::term<term>, 0l> > >, 2l> >, expr<boost::proto::exprns_::basic_expr<boost::proto::tag::terminal, boost::proto::argsns_::term<int>, 0l> > >, 2l> >&>, 1l> >’ to type ‘size_t’
    # --error 0x1 --

The CUDA compiler works by splitting the original C++ code into "host" code and "device" code. The device code is completely lowered to C. So it seems the above error occurs when passing a proto expression to the C-lowered version of the device code. The proto expression is a simple POD type, so I am perplexed by the above error. (BTW, when the host compiler is gcc 4.1.3, the above program causes a segmentation fault in gcc, but that's a matter for another forum.)

This might well be the CUDA compiler's problem, but I dug a little deeper. I transformed the proto expression into a simpler POD type as follows:

    #include <boost/proto/proto.hpp>
    namespace proto = boost::proto;

    struct filler {};

    template<typename tag, typename C0, typename C1 = filler>
    struct myexpr
    {
        C0 c0;
        C1 c1;
    };

    struct myadd {};
    struct myterm {};

    struct xform_callable : proto::callable
    {
        template<typename> struct result;

        template<typename This, typename LHS>
        struct result<This(LHS&)>
        {
            typedef myexpr<myterm, LHS> type;
        };

        template<typename LHS>
        typename result<xform_callable(LHS&)>::type
        operator()(LHS &l)
        {
            myexpr<myterm, LHS> e = {l};
            return e;
        }

        template<typename This, typename tag, typename LHS, typename RHS>
        struct result<This(tag, LHS, RHS)>
        {
            typedef myexpr<tag, LHS, RHS> type;
        };

        template<typename tag, typename LHS, typename RHS>
        typename result<xform_callable(tag, LHS, RHS)>::type
        operator()(tag, LHS l, RHS r)
        {
            myexpr<tag, LHS, RHS> e = {l, r};
            return e;
        }
    };

    struct xform
      : proto::or_<
            proto::when<
                proto::plus<proto::_, proto::_>
              , xform_callable(myadd(), xform(proto::_left), xform(proto::_right))
            >
          , proto::when<
                proto::terminal<proto::_>
              , xform_callable(proto::_value)
            >
        >
    {};

    template<typename> struct expr;

    struct domain
      : proto::domain<proto::pod_generator<expr> >
    {};

    template<typename E>
    struct expr
    {
        BOOST_PROTO_BASIC_EXTENDS(E, expr<E>, domain)
    };

    template<typename F>
    __global__ void kernel(F f) {}

    struct term {};

    expr<proto::terminal<term>::type> a = {{{}}};
    expr<proto::terminal<term>::type> b = {{{}}};
    expr<proto::terminal<term>::type> c = {{{}}};

    int main()
    {
        kernel<<<1,1>>>(xform()(b+c));
        return 0;
    }

The above program compiles fine with CUDA 3.2. That leaves me with the question: what does a proto expression contain, beyond myexpr above, that causes nvcc to fail? Again, this might be the CUDA compiler's problem, and they are the ones who should fix it. However, if someone on this forum can think of a simple fix to proto (or to some other boost library that is actually causing the problem), that would be great too. The uniform template expansion across device and host code in CUDA programs is a powerful feature, and I envision some cool libraries that could be written using proto with CUDA. So making proto work with CUDA is a big win.

Manjunath
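As an aside on the deep_copy point, here is a minimal host-only sketch of the by-reference vs. by-value distinction (illustrative only; the ship() helper is a hypothetical stand-in for the kernel launch):

    #include <boost/proto/proto.hpp>
    namespace proto = boost::proto;

    proto::terminal<int>::type i = {42};

    // Stands in for the kernel launch: takes the expression by value.
    // If Expr still held references to host objects, this copy would
    // carry dangling host pointers over to the device; after deep_copy
    // it is a self-contained POD blob that is safe to pass by value.
    template<typename Expr>
    void ship(Expr expr_by_value)
    {
        (void)expr_by_value;
    }

    int main()
    {
        ship(proto::deep_copy(i + i));  // terminals copied, not referenced
        return 0;
    }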

On 10/28/2010 9:07 PM, Manjunath Kudlur wrote:
t.cudafe1.stub.c:8: error: invalid cast from type ‘const expr<boost::proto::exprns_::basic_expr<boost::proto::tag::address_of,
Proto overloads the address-of operator (unary &) to build a tree node. It seems somewhere someone is trying to take the address of a proto expression and getting a new expression object instead. (Whoever it is should be using boost::addressof, or the equivalent. That's busted.) Proto address_of nodes are implicitly convertible to an expression pointer precisely to avoid this sort of problem, but it looks like the compiler is trying to cast to "char*" instead. (That's also busted.) The workaround would be to disable Proto's operator& overload:

    struct domain
      : proto::domain<
            proto::pod_generator<expr>
          , proto::not_<proto::address_of<proto::_> >
        >
    {};

HTH,

--
Eric Niebler
BoostPro Computing
http://www.boostpro.com
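To make the failure mode concrete, here is a minimal host-only sketch (assuming proto's default operator overloads are in effect) of the difference between & and boost::addressof on a proto terminal:

    #include <boost/proto/proto.hpp>
    #include <boost/utility/addressof.hpp>
    namespace proto = boost::proto;

    proto::terminal<int>::type t = {0};

    void demo()
    {
        // &t does NOT yield a pointer here: proto's overloaded unary &
        // builds a new expression node tagged proto::tag::address_of.
        // boost::addressof bypasses the overload and returns the object's
        // real address, which is what low-level code (like nvcc's
        // generated stub) actually needs.
        proto::terminal<int>::type *p = boost::addressof(t);
        (void)p;
    }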

The workaround would be to disable Proto's operator overload.
    struct domain
      : proto::domain<
            proto::pod_generator<expr>
          , proto::not_<proto::address_of<proto::_> >
        >
    {};
HTH,
That does help, and it works like a charm. It seems the C-lowering of the device code is a nasty business, and it's hard to pinpoint what is busted. But this workaround should be sufficient for many use cases, I think.

Manjunath
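For the archives, here is the first test case with Eric's restricted domain folded in (assembled from the snippets above; a sketch, not independently re-verified):

    #include <boost/proto/proto.hpp>
    namespace proto = boost::proto;

    template<typename> struct expr;

    // The domain's grammar matches everything except address_of nodes,
    // so proto's operator& overload is disabled (SFINAE'd out) for
    // expressions in this domain.
    struct domain
      : proto::domain<
            proto::pod_generator<expr>
          , proto::not_<proto::address_of<proto::_> >
        >
    {};

    template<typename E>
    struct expr
    {
        BOOST_PROTO_BASIC_EXTENDS(E, expr<E>, domain)
    };

    template<typename F>
    __global__ void kernel(F f) {}

    struct term {};

    expr<proto::terminal<term>::type> b = {{{}}};
    expr<proto::terminal<term>::type> c = {{{}}};

    int main()
    {
        kernel<<<1,1>>>(proto::deep_copy(b+c));
        return 0;
    }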

On 29/10/10 07:25, Manjunath Kudlur wrote:
The workaround would be to disable Proto's operator overload.
    struct domain
      : proto::domain<
            proto::pod_generator<expr>
          , proto::not_<proto::address_of<proto::_> >
        >
    {};
HTH,
That does help, and it works like a charm. It seems the C-lowering of the device code is a nasty business, and it's hard to pinpoint what is busted. But this workaround should be sufficient for many use cases, I think.
Wow, Jolly Jumper! This is news. We were looking at OpenCL for a while, as the TMP story on CUDA was bleak, but this is interesting!

Wow, Jolly Jumper! This is news. We were looking at OpenCL for a while, as the TMP story on CUDA was bleak, but this is interesting!

Glad to be of help. A higher-level programming model for the GPU, say, a CUDA "backend" for NT2, should be very cool.

Manjunath

On 29/10/10 08:07, Manjunath Kudlur wrote:
Wow, Jolly Jumper! This is news. We were looking at OpenCL for a while, as the TMP story on CUDA was bleak, but this is interesting!

Glad to be of help. A higher-level programming model for the GPU, say, a CUDA "backend" for NT2, should be very cool.
We're actively working on one.

On 29/10/10 08:07, Manjunath Kudlur wrote:
Wow, Jolly Jumper! This is news. We were looking at OpenCL for a while, as the TMP story on CUDA was bleak, but this is interesting!

Glad to be of help. A higher-level programming model for the GPU, say, a CUDA "backend" for NT2, should be very cool.
We're deep into this. Our current experiments on OpenCL are OKish in terms of performance, but they require some fine-tuning of the CL JIT part to avoid recompilation. CUDA could be better, but it's NVIDIA-centric. I guess we'll have to do what we did for SIMD and offer multiple back-ends.

That does help, and it works like a charm. It seems the C-lowering of the device code is a nasty business, and it's hard to pinpoint what is busted. But this workaround should be sufficient for many use cases, I think.
I ran into one more problem with passing proto expressions to code running on the GPU. Once I do a proto::deep_copy(), the expression gets copied faithfully to the device. But the problem is in evaluating the expression on the GPU. The "canonical" way of evaluating a lambda-like language in proto is to stash the parameters of the operator() function in a fusion::vector and pass it down as the state to the evaluator. But the at_c functions of fusion::vector don't have the __device__ attribute.

OK, so this needs some background on CUDA. In CUDA, all the functions that you intend to execute on the GPU device have to be "decorated" with the __device__ attribute, like so:

    __device__ void foo(int a) { ... }

This is so that the CUDA compiler can rip them out into a separate file and compile them for the device. Obviously, a __device__ function can only call other __device__ functions.

Now, coming back to evaluating a proto expression: I can easily mark the expression's operator() function as __device__. But if I use fusion vectors for passing around parameters, then I can't call the at_c functions to access the parameters, because they don't have the __device__ attribute. Technically, I could modify the Boost headers and add the __device__ attributes everywhere, but that is intrusive and not desirable. So, is there a way to evaluate a lambda-like language in proto some other way?

Also, one more problem I can see in proto itself is access to the values stored in terminals. The proto::left() or proto::child_c<0>() functions also don't have the __device__ attribute, so I think I have to resort to direct access of the fields of a proto expression.

So, the basic question is: how can I write an evaluator in such a way that I don't call internal proto or other boost functions during any of the intermediate steps?

Manjunath
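Not an answer from the thread, but one possible shape for such an evaluator, as a sketch with stated assumptions: it uses the expr wrapper from the earlier programs, relies on proto_expr_/child0/child1 being the public POD data members laid down by BOOST_PROTO_BASIC_EXTENDS and basic_expr, and assumes the terminals hold float values after deep_copy. The idea is a set of tag-dispatched __device__ overloads that touch only those members:

    // Hypothetical device-side evaluator: overloads are selected on the
    // node's tag and arity, and children are reached through the raw POD
    // members instead of proto::child_c/proto::value, which lack
    // __device__ attributes.

    // Terminal node: the stored value lives in proto_expr_.child0.
    __device__ float eval(
        expr<proto::basic_expr<proto::tag::terminal
                             , proto::term<float>, 0> > const &e)
    {
        return e.proto_expr_.child0;
    }

    // Plus node: recurse into both children.
    template<typename L, typename R>
    __device__ float eval(
        expr<proto::basic_expr<proto::tag::plus
                             , proto::list2<L, R>, 2> > const &e)
    {
        return eval(e.proto_expr_.child0) + eval(e.proto_expr_.child1);
    }

    // Usage inside a kernel:
    template<typename F>
    __global__ void kernel(F f, float *out)
    {
        *out = eval(f);
    }

Every operator tag the DSEL supports needs its own overload, which is tedious but keeps proto and fusion entirely out of the device code path.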

You just hit the first stumbling block that made us go to OpenCL: the intrusive __device__ that has to be added on EVERY friggin' function touching GPU code. I see no way to do that without reimplementing a large subset of fusion/proto with these modifiers on.
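To make that concrete, the "reimplement a subset" route amounts to writing device-friendly stand-ins for the handful of primitives an evaluator needs, e.g. a two-element tuple with __device__ accessors in place of fusion::vector2/at_c (hypothetical names, sketch only):

    // A minimal device-usable stand-in for fusion::vector2 + at_c.
    template<typename T0, typename T1>
    struct device_vector2
    {
        T0 t0;
        T1 t1;
    };

    template<int I> struct device_at_c;

    template<> struct device_at_c<0>
    {
        template<typename T0, typename T1>
        static __device__ T0 const &get(device_vector2<T0, T1> const &v)
        { return v.t0; }
    };

    template<> struct device_at_c<1>
    {
        template<typename T0, typename T1>
        static __device__ T1 const &get(device_vector2<T0, T1> const &v)
        { return v.t1; }
    };

An evaluator's state can then be a device_vector2, with device_at_c<0>::get(state) standing in for fusion::at_c<0>(state); the pain, as noted above, is that every such primitive has to be duplicated by hand.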

On 11/2/2010 1:39 PM, Manjunath Kudlur wrote: <snip>
So, the basic question is: how can I write an evaluator in such a way that I don't call internal proto or other boost functions during any of the intermediate steps?
Seriously? I'm afraid I have no suggestions for you. How does anybody use any third-party library on CUDA?

--
Eric Niebler
BoostPro Computing
http://www.boostpro.com

Seriously? I'm afraid I have no suggestions for you. How does anybody use any third-party library on CUDA?
The short answer probably is: nobody does. The only libraries you can use from the device side of CUDA are those you wrote yourself (so you would have put the __device__ attributes on the functions) or those you have the source code for, so you can painstakingly add __device__ attributes everywhere.

Manjunath

On 02/11/10 23:39, Manjunath Kudlur wrote:
Seriously? I'm afraid I have no suggestions for you. How does anybody use any third-party library on CUDA?
The short answer probably is: nobody does. The only libraries you can use from the device side of CUDA are those you wrote yourself (so you would have put the __device__ attributes on the functions) or those you have the source code for, so you can painstakingly add __device__ attributes everywhere.
Yup, and this is a MAJOR PITA :/ I tried to come up with a preprocessor trick, but nothing ended up working. I wish it were the other way around: default to being callable on the device, and annotate the host-only functions.
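For reference, the usual preprocessor pattern in CUDA-aware headers looks like the sketch below (MY_HOST_DEVICE is a hypothetical name). The catch is that it only helps for headers you own: the annotation has to appear at each function's definition, so it cannot be bolted onto fusion or proto from the outside, which is why a non-intrusive trick keeps failing.

    // Annotate inline functions for both the host and device compilers.
    #if defined(__CUDACC__)
    #  define MY_HOST_DEVICE __host__ __device__
    #else
    #  define MY_HOST_DEVICE
    #endif

    template<typename T>
    MY_HOST_DEVICE inline T const &first(T const &a, T const &)
    {
        return a;
    }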

I'm not sure why they imposed these limitations for inlineable functions. It seems that unless there is an ambiguity (you need both a __host__ and a __device__ version of a function), and the function is inlineable, nvcc should easily be able to consume these functions. In fact, in early versions this often worked (though perhaps that was a bug).

Brian

On Tue, Nov 2, 2010 at 11:34 PM, joel falcou <joel.falcou@lri.fr> wrote:
On 02/11/10 23:39, Manjunath Kudlur wrote:
Seriously? I'm afraid I have no suggestions for you. How does anybody use any third-party library on CUDA?
The short answer probably is: nobody does. The only libraries you can use from the device side of CUDA are those you wrote yourself (so you would have put the __device__ attributes on the functions) or those you have the source code for, so you can painstakingly add __device__ attributes everywhere.
Yup, and this is a MAJOR PITA :/ I tried to come up with a preprocessor trick, but nothing ended up working. I wish it were the other way around: default to being callable on the device, and annotate the host-only functions.
participants (5)
- Brian Budge
- Eric Niebler
- joel falcou
- Joel Falcou
- Manjunath Kudlur