
I remember there were postings on this mailing list a while back about trying to use proto with NVIDIA's CUDA compiler, and Joel Falcou has contributed some code to make proto work with nvcc. Well, I have been trying a few experiments with the most recent CUDA compiler release (3.2). One of the exciting things about using proto with CUDA is the ability to write mini programs in different DSELs using proto and ship those mini programs over to the "device" side to execute there. The idiom looks like this:

    template<typename F>
    __global__ void kernel(F f) { f(...); }

    kernel<<<blocksize, gridsize>>>(a+b /* a big proto expression */);

I tried to write the simplest form of the above use case:

    #include <boost/proto/proto.hpp>
    namespace proto = boost::proto;

    template<typename> struct expr;

    struct domain : proto::domain<proto::pod_generator<expr> > {};

    template<typename E>
    struct expr
    {
        BOOST_PROTO_BASIC_EXTENDS(E, expr<E>, domain)
    };

    template<typename F>
    __global__ void kernel(F f) {}

    struct term {};

    expr<proto::terminal<term>::type> a = {{{}}};
    expr<proto::terminal<term>::type> b = {{{}}};
    expr<proto::terminal<term>::type> c = {{{}}};

    int main()
    {
        kernel<<<1,1>>>(proto::deep_copy(b+c));
        return 0;
    }

Note the deep_copy when passing the proto expression to the kernel function. Since proto captures terminals by reference, the host-side pointers that are held in a proto expression won't make sense on the device side, so at the very least we have to make copies of the terminals before passing the expression to the device.

Now, compiling the above program with nvcc (CUDA 3.2, host compiler gcc 4.4.3, Ubuntu 10.04) results in the following error (besides spewing a bunch of warnings):

    $ nvcc -v -keep t.cu -I $HOME/boost/boost-trunk/
    ...
    t.cudafe1.stub.c: In function ‘void __device_stub__Z6kernelI4exprIN5boost5proto7exprns_10basic_exprINS2_3tag4plusENS2_7argsns_5list2IS0_INS4_IS6_NS8_IS0_INS4_INS5_8terminalENS7_4termI4termEELl0EEEESE_EELl2EEEES0_INS4_IS9_NSA_IiEELl0EEEEEELl2EEEEEvT_(_Z4exprIN5boost5proto7exprns_10basic_exprINS1_3tag4plusENS1_7argsns_5list2IS_INS3_IS5_NS7_IS_INS3_INS4_8terminalENS6_4termI4termEELl0EEEESD_EELl2EEEES_INS3_IS8_NS9_IiEELl0EEEEEELl2EEEE&)’:
    t.cudafe1.stub.c:8: error: invalid cast from type ‘const expr<boost::proto::exprns_::basic_expr<boost::proto::tag::address_of, boost::proto::argsns_::list1<expr<boost::proto::exprns_::basic_expr<boost::proto::tag::plus, boost::proto::argsns_::list2<expr<boost::proto::exprns_::basic_expr<boost::proto::tag::plus, boost::proto::argsns_::list2<expr<boost::proto::exprns_::basic_expr<boost::proto::tag::terminal, boost::proto::argsns_::term<term>, 0l> >, expr<boost::proto::exprns_::basic_expr<boost::proto::tag::terminal, boost::proto::argsns_::term<term>, 0l> > >, 2l> >, expr<boost::proto::exprns_::basic_expr<boost::proto::tag::terminal, boost::proto::argsns_::term<int>, 0l> > >, 2l> >&>, 1l> >’ to type ‘char*’
    t.cudafe1.stub.c:8: error: invalid cast from type ‘const expr<boost::proto::exprns_::basic_expr<boost::proto::tag::address_of, boost::proto::argsns_::list1<expr<boost::proto::exprns_::basic_expr<boost::proto::tag::plus, boost::proto::argsns_::list2<expr<boost::proto::exprns_::basic_expr<boost::proto::tag::plus, boost::proto::argsns_::list2<expr<boost::proto::exprns_::basic_expr<boost::proto::tag::terminal, boost::proto::argsns_::term<term>, 0l> >, expr<boost::proto::exprns_::basic_expr<boost::proto::tag::terminal, boost::proto::argsns_::term<term>, 0l> > >, 2l> >, expr<boost::proto::exprns_::basic_expr<boost::proto::tag::terminal, boost::proto::argsns_::term<int>, 0l> > >, 2l> >&>, 1l> >’ to type ‘size_t’
    # --error 0x1 --

The CUDA compiler works by splitting the original C++ code into "host" code and "device" code.
The device code is completely lowered to C, so it appears the error occurs when the proto expression is passed to the C-lowered version of the device code. The proto expression is a simple POD type, so I am perplexed by the error. (BTW, when the host compiler is gcc 4.1.3, the above program causes a segmentation fault in gcc, but that's a topic for another forum.) This might well be the CUDA compiler's problem, but I dug a little deeper and transformed the proto expression into a simpler POD type as follows:

    #include <boost/proto/proto.hpp>
    namespace proto = boost::proto;

    struct filler {};

    template<typename tag, typename C0, typename C1 = filler>
    struct myexpr
    {
        C0 c0;
        C1 c1;
    };

    struct myadd {};
    struct myterm {};

    struct xform_callable : proto::callable
    {
        template<typename> struct result;

        template<typename This, typename LHS>
        struct result<This(LHS&)>
        {
            typedef myexpr<myterm, LHS> type;
        };

        template<typename LHS>
        typename result<xform_callable(LHS&)>::type
        operator()(LHS &l)
        {
            myexpr<myterm, LHS> e = {l};
            return e;
        }

        template<typename This, typename tag, typename LHS, typename RHS>
        struct result<This(tag, LHS, RHS)>
        {
            typedef myexpr<tag, LHS, RHS> type;
        };

        template<typename tag, typename LHS, typename RHS>
        typename result<xform_callable(tag, LHS, RHS)>::type
        operator()(tag, LHS l, RHS r)
        {
            myexpr<tag, LHS, RHS> e = {l, r};
            return e;
        }
    };

    struct xform
      : proto::or_<
            proto::when<proto::plus<proto::_, proto::_>,
                        xform_callable(myadd(), xform(proto::_left), xform(proto::_right))>
          , proto::when<proto::terminal<proto::_>,
                        xform_callable(proto::_value)>
        >
    {};

    template<typename> struct expr;

    struct domain : proto::domain<proto::pod_generator<expr> > {};

    template<typename E>
    struct expr
    {
        BOOST_PROTO_BASIC_EXTENDS(E, expr<E>, domain)
    };

    template<typename F>
    __global__ void kernel(F f) {}

    struct term {};

    expr<proto::terminal<term>::type> a = {{{}}};
    expr<proto::terminal<term>::type> b = {{{}}};
    expr<proto::terminal<term>::type> c = {{{}}};

    int main()
    {
        kernel<<<1,1>>>(xform()(b+c));
        return 0;
    }

The above program compiles fine with CUDA 3.2. That leaves me with the question: what does a proto expression contain, over and above myexpr here, that causes nvcc to fail? Again, this might be the CUDA compiler's problem, and they are the ones who should fix it. However, if someone on this forum can think of a simple fix to proto (or to whichever other boost library is actually causing the problem), that would be great too. Uniform template expansion across device and host code is a powerful feature of CUDA, and I can envision some cool libraries being written using proto with CUDA, so making proto work with CUDA would be a big win.

Manjunath