Subject: Re: [Boost-users] [MPI, serialization] Segmentation fault in heterogeneous cluster
From: Martin Huenniger (m.huenniger_at_[hidden])
Date: 2010-09-20 11:49:04
Hi Matthias,
yes, the "// and so on" part does further serialization. Here is the code; there are
some redundancies kept for later design decisions:
void Ball<Float>::serialize( Archive & ar, const unsigned int version )
{
    ar & BOOST_SERIALIZATION_NVP( membership ); // vector<bool>
    ar & BOOST_SERIALIZATION_NVP( members );    // vector<int>
    ar & BOOST_SERIALIZATION_NVP( r );          // int
    ar & BOOST_SERIALIZATION_NVP( is_infinity ); // bool
    ar & BOOST_SERIALIZATION_NVP( up_to_date );  // bool
    ar & BOOST_SERIALIZATION_NVP( *QR );         // own type, see below
    //if( !up_to_date )
    //    update();
    for( int i=0; i<members.size(); ++i ) {
        std::cout << members[i] << " ";
    }
    std::cout << std::endl;
    for( int i=0; i<membership.size(); ++i ) {
        std::cout << membership[i] << " ";
    }
    std::cout << std::endl;
    std::cout << r << std::endl;
    std::cout << is_infinity << std::endl;
}
void Subspan<Float>::serialize( Archive & ar,
                                const unsigned int version )
{
    ar & BOOST_SERIALIZATION_NVP( membership ); // vector<bool>
    ar & BOOST_SERIALIZATION_NVP( members );    // vector<int>
    for( int i=0; i<dim; ++i )
        for( int j=0; j<dim; ++j )
            ar & BOOST_SERIALIZATION_NVP( Q[i][j] ); // double
    for( int i=0; i<dim; ++i )
        for( int j=0; j<dim; ++j )
            ar & BOOST_SERIALIZATION_NVP( R[i][j] ); // double
    for( int i=0; i<dim; ++i )
        ar & BOOST_SERIALIZATION_NVP( x[i] ); // double
    for( int i=0; i<dim; ++i )
        ar & BOOST_SERIALIZATION_NVP( d[i] ); // double
    ar & BOOST_SERIALIZATION_NVP( r ); // int
}
I don't think the problem lies in my serialization routine, because the
output of the serialized data looks right. Rather, I think that the
serialization is broken in some way, because if I transmit the
serialized object through a text_archive between 2 processes, the data
sometimes gets corrupted.
Example:
Master sends the data to worker 1
22 serialization::archive 7 0 0 10 1 0 0 0 1 0 0 0 1 0 4 0 4 8 0 1 2 0 1
0 0 10 1 0 0 0 1 0 0 0 1 0 4 0 4 8 0 1 -0.0057421862890573716
0.34419941995529357 -0.93887900530316526 0.026836519283445603
-0.93850327402166567 -0.34422580653310031 -0.99962334332956182
-0.027172853237254602 -0.0038480537386582914 1.3412866491336766 0 0
0.25571772247364938 0.83569200171657654 0 -1.3694313541089012
1.1400579617129889 -0.7062870570269757 -0.68908851940039961
-0.014041993697918429 -0.0017429988348310888 0.95929305277075005
0.17642171085439082 0.22052713856798853 2
Worker 1 is receiving job from Master
22 serialization::archive 7 0 0 10 1 0 0 0 1 0 0 0 1 0 4 0 4 8 0 1 2 0 1
0 0 10 1 0 0 0 1 0 0 0 1 0 4 0 4 8 0 1 -0.0057421862890573716
0.34419941995529357 -0.93887900530316526 0.026836519283445603
-0.93850327402166567 -0.34422580653310031 -0.99962334332956182
-0.027172853237254602 -0.0038480537386582914 1.3412866491336766 0 0
0.25571772247364938 0.83569200171657654 0 -1.3694313541089012
1.1400579617129889 -0.7062870570269757 -0.68908851940039961
-0.014041993697918429 -0.0017429988348310888 0.95929305277075005
0.17642171085439082 0.220527138567988P
You may notice the "P" instead of the "53 2" at the end of the
transmission. I have no clue how this can happen. When it does happen,
it is only long after my program has started, after the same piece of code
that generates this output has already been called about 50 times.
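To pin down where the two dumps above diverge without comparing them by eye, the sent and received strings can be compared byte by byte. The following is only a minimal sketch: first_mismatch and self_check are hypothetical helper names, not part of Boost or of the program discussed here, and the sketch assumes both buffers are available as std::string.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical debugging helper (an assumption, not from the original code):
// returns the index of the first byte at which the two buffers differ.
// If one buffer is a strict prefix of the other, the length of the shorter
// one is returned. If both are identical, std::string::npos is returned.
std::size_t first_mismatch(const std::string& sent, const std::string& received)
{
    const std::size_t n = std::min(sent.size(), received.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (sent[i] != received[i]) {
            return i;
        }
    }
    return sent.size() == received.size() ? std::string::npos : n;
}

// Small self-check on the tail of the two dumps shown above.
inline void self_check_first_mismatch()
{
    const std::string sent     = "0.22052713856798853 2";
    const std::string received = "0.220527138567988P";
    // The common prefix "0.220527138567988" is 17 characters long.
    assert(first_mismatch(sent, received) == 17);
    assert(first_mismatch(sent, sent) == std::string::npos);
}
```

Logging this index together with the sizes of both buffers on each transfer would show whether the received data is shorter than what was sent or merely altered near the end.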
If I use a binary_archive the program terminates with the error:
Master sends size of data to worker 1
Master sends the data to worker 1
serialization::archive
G?<??Ô????M?????????????)???k*
x???|?n???x???A"????r?H??<?:??t_??!???*?????X?0?????????J?W??V?M???N?z???f<???f<????g?P4?
Worker 1 is receiving job from Master
terminate called after throwing an instance of
'boost::archive::archive_exception'
what(): invalid signature
[ipc858:10286] *** Process received signal ***
[ipc858:10286] Signal: Aborted (6)
[ipc858:10286] Signal code: (-6)
[ipc858:10286] [ 0] /lib/libpthread.so.0 [0x7f0d10392a80]
[ipc858:10286] [ 1] /lib/libc.so.6(gsignal+0x35) [0x7f0d10062ed5]
[ipc858:10286] [ 2] /lib/libc.so.6(abort+0x183) [0x7f0d100643f3]
[ipc858:10286] [ 3]
/usr/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x114)
[0x7f0d10b02294]
[ipc858:10286] [ 4] /usr/lib/libstdc++.so.6 [0x7f0d10b00696]
[ipc858:10286] [ 5] /usr/lib/libstdc++.so.6 [0x7f0d10b006c3]
[ipc858:10286] [ 6] /usr/lib/libstdc++.so.6 [0x7f0d10b007aa]
[ipc858:10286] [ 7]
/home/pirx/local/lib/libboost_serialization.so.1.44.0(_ZN5boost7archive21basic_binary_iarchiveINS0_15binary_iarchiveEE4initEv+0x166)
[0x7f0d11d1c726]
[ipc858:10286] [ 8]
./my_complex(_ZN2FC6WorkerINS_4BallIdEES2_E8get_workERS2_+0x227) [0x441677]
[ipc858:10286] [ 9]
./my_complex(_ZN2FC12My_complexIdE13working_horseERSo+0x1a4) [0x442c24]
[ipc858:10286] [10]
./my_complex(_ZN2FC12Flow_complexIdE7computeERSo+0x87) [0x445097]
[ipc858:10286] [11] ./my_complex(main+0xc44) [0x42c9f4]
[ipc858:10286] [12] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f0d1004f1a6]
[ipc858:10286] [13] ./my_complex(__gxx_personality_v0+0x191) [0x42b6a9]
[ipc858:10286] *** End of error message ***
mpirun noticed that job rank 0 with PID 10285 on node ipc858 exited on
signal 15 (Terminated).
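One thing that could be checked here, as an assumption rather than a diagnosis: a binary archive may contain NUL bytes, so any step that treats the buffer as a C string would cut it short, while a text archive would pass through the same step unharmed. The sketch below uses only the standard library; payload_from_buffer and payload_as_c_string are hypothetical names for illustration and do not appear in the original program.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical receive-side helper (an assumption for illustration):
// rebuilds the payload from a raw buffer and the size that was
// transmitted separately, so embedded NUL bytes are preserved.
std::string payload_from_buffer(const char* buf, std::size_t size)
{
    return std::string(buf, size);
}

// Hypothetical counter-example: treats the buffer as a C string,
// so the payload ends at the first NUL byte.
std::string payload_as_c_string(const char* buf)
{
    return std::string(buf);
}

// Small self-check with a 3-byte buffer that has a NUL in the middle.
inline void self_check_payload()
{
    const char raw[] = { 'a', '\0', 'b' };
    assert(payload_from_buffer(raw, 3).size() == 3); // all bytes kept
    assert(payload_as_c_string(raw).size() == 1);    // cut at the NUL
}
```

If the size announced by the master and the number of bytes actually handed to the archive on the worker differ, the archive header would be incomplete or shifted, which could be one way to end up with an "invalid signature" exception.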
Regards,
Martin
Matthias Troyer wrote:
> On Sep 16, 2010, at 1:21 PM, Martin Hünniger wrote:
>
>> Hello,
>>
>> I have a similar problem here. I try to send data from one process to another (mpirun -np 2). The data I use is serialized in the appropriate way. If I write it to a text archive, I can restore it again from this text archive and all is OK. But when I try to send the data between the processes, something goes wrong and the data is not restored correctly.
>>
>> I have the following serialization routine:
>>
>> template<typename Float>
>> template<class Archive>
>> void Ball<Float>::serialize( Archive &ar, const unsigned int version )
>> {
>>     ar & BOOST_SERIALIZATION_NVP( members ) // std::vector<int> members
>>     // and so on...
>>
>>     // Testing:
>>     for( int i=0; i<members.size(); ++i )
>>         std::cout << members[i] << " ";
>>     std::cout << std::endl;
>> }
>>
>> the output on sending such an object is (for example)
>> 0 1 3 2
>>
>> the output on receiving the same object is
>> 0 0 0 0
>>
>> So something seems to go wrong either while storing the data in ar or while restoring it.
>>
>> Has anyone in this thread already found a solution to this problem?
>>
>> Cheers,
>> Martin
>
> I'll take a look this weekend. I assume the "// and so on.." does not do any further serialization?
>
> Matthias