Subject: Re: [Boost-users] mpi/serialization:
From: Riccardo Murri (riccardo.murri_at_[hidden])
Date: 2012-02-01 16:32:51


Hello,

I tried to run your code but it's still too big and complex for me to be able
to say anything without a long debugging session, which I cannot do
now. So please take this email with a grain of salt, as I could have
totally misunderstood the code...

I compiled the code you sent with Boost.MPI 1.45 and OpenMPI 1.4.
Running it on two MPI ranks, I always get the same two errors:

* Rank 1 dies with the "archive_exception / array size too short" error.
* Rank 0 dies with a segmentation fault.

I managed to get rid of the "archive_exception / array size too
short" error from rank 1 (modified slbmpi.h attached), but rank 0
still segfaults.

(1) Concerning the "array size" error: your code reads:

        reqs.push_back( world.isend( Sender , SendTag ,
                                     &Neighbor2Proc[ i ] , Msg2Send_size ) ) ;

[Aside: I think there's a typo here: the first argument to "world.isend" is the
*destination rank*, so I guess you have "Sender" where "Receiver"
should be...]

This sends Msg2Send_size elements of type "lattice_type" starting at
location "&Neighbor2Proc[i]". However, the corresponding
"world.irecv" has:

        reqs.push_back( world.irecv( msg.source() , msg.tag() ,
                                     Neighbor4Proc[ i ] ) ) ;
        // , Msg2Send_size: not compiling: request irecv(int source, int tag,
        // T * values, int n) const; @
        // http://www.boost.org/doc/libs/1_48_0/doc/html/boost/mpi/communicator.html#id444949-bb

so the sender transmits an array of "Msg2Send_size" elements, but the
receiver expects a single value of type "lattice_type".

If you change the sender line to:

        reqs.push_back( world.isend( Receiver , SendTag , Neighbor2Proc[i] ) );

then the type of the sent object and the receiving slot do match, and
the error is gone. If you want to send more than one element of
Neighbor2Proc, then the recv call has to use exactly the corresponding
type and element count.

(2) Regarding the segfault: after adding some debug statements to
slbmpi.h, I can see that it dies when executing "world.isend(...,
Neighbor2Proc[i])":

    rmurri_at_xenia:~/tst$ mpirun -np 2 --tag-output ./a.out
    ...
    [1,0]<stdout>:Process=0's MiniGridSize= 3 3 3
    ...
    [1,0]<stdout>:init_internal_neighbors_wf: Process 0 of 2 about to
exchange (if necessary) w/+/-1! Sender=0, Receiver=1,
Neighbor2Proc.size()=1, Msg2Send_size=1, i=0 @ idx= 0
           0 0
    [1,0]<stdout>:
    [1,0]<stdout>:Pause @ "init_internal_neighbors_wf: _slbmpi_h: 108:
pre-exchange" if 1 process: <Enter> or <Return> continues; ^C aborts:
    [1,0]<stderr>:DEBUG: slbmpi.h:124 <==== THIS IS JUST BEFORE
world.isend(...)
    [1,0]<stderr>:[xenia:06768] *** Process received signal ***
    [1,0]<stderr>:[xenia:06768] Signal: Segmentation fault (11)
    [1,0]<stderr>:[xenia:06768] Signal code: (128)
    [1,0]<stderr>:[xenia:06768] Failing at address: (nil)
    [1,0]<stderr>:[xenia:06768] [ 0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10060) [0x7f6453010060]
    [1,0]<stderr>:[xenia:06768] [ 1]
./a.out(_ZN5boost7archive4saveINS_3mpi15packed_oarchiveEKP12lattice_typeEEvRT_RT0_+0x14)
[0x44b4aa]
    ...
    [1,0]<stderr>:[xenia:06768] [29]
./a.out(_ZN5boost7archive4saveINS_3mpi15packed_oarchiveEKNS_13serialization5arrayIKP12lattice_typeEEEEvRT_RT0_+0x23)
[0x44a1f0]
    [1,0]<stderr>:[xenia:06768] *** End of error message ***

From the backtrace it seems that the code dies when performing
serialization of a "lattice_type" element. This leads me to think
that the "lattice_type" element "Neighbor2Proc[i]" has not been fully
initialized.

Now the serialization code for "lattice_type" reads:

    struct lattice_type
    {
        ...
    public:
        lattice_type* neighbors[ en - 1 ];
        ....
    protected:
        // serializes (boost::mpi::packed_iarchive& ar) &
        // deserializes (boost::mpi::packed_oarchive& ar)
        template<class Archive>
        inline void serialize( Archive & ar , const unsigned int )
        {
            // for 'packing' (& unpacking) for message-passing:
            // put together in 'series' to exchange
            ar & neighbors ;
        }
        ...

As far as I understand, this means the boost::serialization code will
try to dereference each pointer in the "neighbors" array and serialize
the pointed-to elements.

Could it be that some elements of "lattice_type::neighbors" are NULL
pointers? (It would make sense for elements at a corner of the grid.)
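
A quick way to check this is to count the null entries right before the
send. The sketch below uses a stand-in "lattice_type" with a made-up array
bound "kNeighbors" (your real bound is "en - 1") and a hypothetical helper
"count_null_neighbors"; it only inspects the pointers and does not involve
Boost at all.

```cpp
#include <cstddef>

// Made-up bound for illustration; the real array is sized "en - 1".
constexpr std::size_t kNeighbors = 6;

// Stand-in for the real struct: only the "neighbors" member is reproduced.
struct lattice_type {
    lattice_type* neighbors[kNeighbors] = {};
};

// Hypothetical helper: count how many entries of "neighbors" are null.
std::size_t count_null_neighbors(const lattice_type& cell) {
    std::size_t nulls = 0;
    for (const lattice_type* p : cell.neighbors) {
        if (p == nullptr) {
            ++nulls;
        }
    }
    return nulls;
}
```

Printing count_null_neighbors(Neighbor2Proc[i]) to stderr just before
"world.isend(...)" would show whether the element being sent has any null
neighbours.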

Also, IIRC, the "serialize" code is responsible for serializing *all*
fields in a struct: in this case you are basically transmitting just a
tiny part of the "lattice_type" and will thus get garbage on the
receiving side...

Hope this helps,
Riccardo



