From: David Abrahams (dave_at_[hidden])
Date: 2005-11-25 22:56:48

Next message: Pedro Lamarão: "Re: [boost] boost::defer - generalised execution deferral"
Previous message: Jeff Garland: "Re: [boost] [Boost-bugs] [ boost-Feature Requests-1366658 ] Support for DevC++"
In reply to: Robert Ramey: "Re: [boost] [serialization] fast array serialization (10x speedup)"
Next in thread: David Abrahams: "Re: [boost] [serialization] fast array serialization (10x speedup)"

"Robert Ramey" <ramey_at_[hidden]> writes:

> David Abrahams wrote:
>>> "Robert Ramey" <ramey_at_[hidden]> writes:
>
>>> Furthermore, it's not a fair comparison unless you first measure
>>> the number of bytes you have to save so you can preallocate the
>>> buffer. In general the only way to do that is with a special
>>> counting archive, so you have to account for the time taken up by
>>> the counting. Of course we did that test too. The code and test
>>> results are attached.
>
> Without seeing the implementation of binary_oprimitive you plan to use
> I can only speculate what would be the closest test.

?? We're not hiding any code. The code posted compiles as is.

> Assuming that performance is an issue, I wouldn't expect you to use
> the current binary_oarchive which is based on stream i/o. So if
> that's an important factor then it shouldn't be used for benchmarking.

Did you look at the attached test code at all?

> I presume that is why Matthias chose not to use it. On the other
> hand it's not clear why one would choose to use a buffer based on
> std::vector<char> for this purpose either. I chose an implementation
> which I thought would be closest to the one that would actually end
> up being used for a network protocol.
>
> The question is what is the time difference between one invocation
> of save_binary with data N bytes long vs N invocations of
> save_binary 1 byte long. That is really all that is being measured
> here. So using an implementation of save_binary based on
> stream write isn't really very interesting unless one is really
> going to use that implementation. Of course I don't really
> know if you are going to do that - I just presumed you weren't.

No we are not. But as we have said many times we are not planning to
copy the bytes anywhere. We are going to point MPI at the bytes and
let the network hardware send them directly over the wire. We are
supplying benchmark figures for binary_archive because it's presumably
a case that you care about and understand. The MPI archives will do
something completely different, but with similar performance
characteristics.

>>> In case it isn't obvious to you by now Matthias Troyer is a
>>> world-class expert in high performance computing. You don't get to be
>>> a recognized authority in that area without developing the ability to
>>> create tests that accurately measure performance. You also develop
>>> some generalized knowledge about what things will lead to slowdowns
>>> and speedups. It's really astounding that you manage to challenge every
>>> assertion he makes in a domain where he is an expert and you are not,
>>> especially in a domain with so many subtle pitfalls waiting for the
>>> naive tester.
>
> wow - well the benchmark was posted and I took that
> as an indication that it was ok to check it out.

Absolutely it's ok to check it out. Please ask questions if you don't
understand something. Please let us help you.

That said, your willingness to casually label Matthias' work "a
classic case of premature optimization" is really appalling. That's
something you might do with a greenhorn novice who has a lot to learn
about optimization, but to someone with Matthias' distinction it is
inappropriate. Matthias says he doesn't care about whether he is
perceived as credible but I have a hard time not being offended on
his behalf. As a practical matter, it seems as though you are making
it unreasonably difficult to demonstrate anything to your
satisfaction. Matthias and I (as far as I am able as a non-expert)
are happy to explain the basic facts of performance and large data
sets and to help you understand how these things work, but Matthias'
credentials ought to at least exempt us from having to argue with you
about the validity of his tests, and earn him the right to be treated
with respect.

> Sorry about that - Just go back to the std::vector<char>
> implementation of buffer and we'll let it go at that.

I don't understand what you mean.

>>>> a) the usage of save_array does not have a huge
>>>> effect on performance. It IS measurable. It seems
>>>> that it saves about 1/3 the time over using a loop
>>>> of saves in the best case. (1)
>>>
>>> In the best case, even with your flawed test, it's a factor of 2
>>> as shown above.
>
> which is a heck of a lot less than 10x

First of all, as demonstrated by Ian, your test is fatally flawed, so
it means nothing that it's a factor of 2 rather than a factor of 10.
Secondly, to anyone who cares about performance, even a factor of two
would be a cause for orgiastic and debauched celebration. A factor of
two performance improvement is rarely available as low-hanging fruit.

>>>> b) In the worst case, it's even slower than a loop of saves!!! (2)
>>>> and even slower than the raw serialization system (3)
>>>
>>> That result is completely implausible. If you can get someone
>>> else to reproduce it using a proper test protocol I'll be *quite*
>>> impressed.
>
> Well, at least we can agree on that. We've corrected the
> benchmark and made a few more runs. The anomaly
> above disappears and things still vary but things don't
> change all that much.

?? With the bug corrected, using msvc-8.0, I get

Run #2:
  Time using serialization library: 4.297
  Size is 100000004
  Time using direct calls to save in a loop: 1.766
  Size is 100000000
  Time using direct call to save_array: 0.296
  Size is 100000000

Run #3:
  Time using serialization library: 4.328
  Size is 100000004
  Time using direct calls to save in a loop: 1.781
  Size is 100000000
  Time using direct call to save_array: 0.281
  Size is 100000000

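These show save_array running roughly 15x faster than the
serialization library, and about 6x faster than the loop of saves.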
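
For readers without the attachment, the shape of the comparison being
timed can be sketched roughly as below. This is not the posted test
code: the function names are hypothetical, and a plain
std::vector<char> stands in for whatever buffer the archive writes to.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a loop of single-element saves:
// one call (with its own capacity check) per byte.
inline void save_bytes_one_by_one(std::vector<char>& buffer,
                                  const char* data, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        buffer.push_back(data[i]);
    }
}

// Hypothetical stand-in for a single array save:
// one call for the whole block.
inline void save_bytes_bulk(std::vector<char>& buffer,
                            const char* data, std::size_t count) {
    buffer.insert(buffer.end(), data, data + count);
}

// Wall-clock seconds taken by one invocation of `work`.
template <typename Work>
double seconds_taken(Work work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling seconds_taken once around each function, on the same input and
with the destination reserved in advance, would give two figures
comparable in spirit to the "loop" and "save_array" lines above. The
absolute numbers will of course differ by compiler and machine.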

> BTW, the program has a value
> type which can be set to either char or double which
> tests different primitives. If the results the rest of
> are showing are way different than yours

I can't understand what you're trying to say. The code you posted has
the value_type as char, so that's what I tested. Of course doubles
are faster to write individually as they are forced into a more
favorable alignment. However, we're still talking about a factor of
2x.

> that might be an explanation.
>
>>>> c) the overhead of the serialization library isn't too
>>>> bad. It does show up when doing 100M characters
>>>> one by one, but generally it doesn't seem to be a big
>>>> issue.
>>>>
>>>> In my view, it does support my contention that
>>>> implementing save_array - regardless of how it is
>>>> in fact implemented - represents a premature optimization.
>>>> I suspect that the net benefit in the kind of scenario you
>>>> envision using it will be very small.
>>>>
>>>> Obviously, this test raises more questions than it
>>>> answers
>>>
>>> Like what questions?
>
> a) Like the anomaly above - which I don't think is an issue anymore
> b) Will the current stream based implementation of binary_oarchive
> be used?

Used where?

As we have stated many times, we don't plan to do *any* copying in
memory for MPI serialization, so we wouldn't be writing to a stream,
which has a stream buffer and thus necessitates copying.

> or would it be substituted for a different one.
>
> c) What would the results be for the actual archive you plan to use?

If we serialized every double through MPI individually rather than
using a single MPI array send call it would be a factor of at least
1000x. While network bandwidth is the same as memory bandwidth in
fast parallel systems, network latency is much higher than memory
latency (about 3K CPU cycles), and you pay that price for each
individual send.
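
The latency argument can be put as a back-of-the-envelope cost model.
The sketch below is an assumption-laden illustration, not a
measurement: it reads the "about 3K CPU cycles" figure as the fixed
latency of each send, and the per-element transfer cost is a
placeholder that the caller has to supply.

```cpp
#include <cstddef>

// Assumed cost model (not measured): each send pays a fixed latency,
// then a per-element transfer cost.
struct LinkCost {
    double latency_cycles;      // fixed cost per send ("about 3K" above)
    double cycles_per_element;  // transfer cost per element (placeholder)
};

// n separate sends: the latency is paid n times.
inline double cycles_for_individual_sends(std::size_t n,
                                          const LinkCost& link) {
    return static_cast<double>(n) *
           (link.latency_cycles + link.cycles_per_element);
}

// One array send: the latency is paid once.
inline double cycles_for_single_array_send(std::size_t n,
                                           const LinkCost& link) {
    return link.latency_cycles +
           static_cast<double>(n) * link.cycles_per_element;
}
```

For large n the ratio of the two approaches (latency + per-element
cost) / per-element cost, so the cheaper each element is to transfer,
the more the per-send latency dominates.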

If we copied into a buffer first -- and remember, we can't copy into a
preallocated buffer for the entire batch of data that needs to be sent
because there isn't enough memory per CPU to copy the data -- we'd pay
the cost of overflow checks on each individual write into the buffer
(similar to what happens with streams) plus an additional 2x speed
penalty just for copying all the data to memory before sending it over
the wire (memory bandwidth being equal to net bandwidth).
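
To make the two costs in that paragraph concrete, here is a minimal
sketch of a fixed-size staging buffer. Everything in it is
hypothetical (the class, its members, and the `sent` vector standing
in for the wire); it only illustrates where the per-write overflow
check and the extra copy come from.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical fixed-capacity staging buffer; `sent` stands in for the wire.
class StagingBuffer {
public:
    // A zero capacity is bumped to one byte so write() always makes progress.
    explicit StagingBuffer(std::size_t capacity)
        : storage_(capacity > 0 ? capacity : 1), used_(0) {}

    // Copy `size` bytes in, flushing whenever the buffer is full.
    void write(const void* data, std::size_t size) {
        const char* bytes = static_cast<const char*>(data);
        while (size > 0) {
            if (used_ == storage_.size()) flush();  // overflow check per write
            const std::size_t room = storage_.size() - used_;
            const std::size_t chunk = size < room ? size : room;
            std::memcpy(storage_.data() + used_, bytes, chunk);  // extra copy
            used_ += chunk;
            bytes += chunk;
            size -= chunk;
        }
    }

    // "Send" the staged bytes and reset the buffer.
    void flush() {
        sent.insert(sent.end(), storage_.begin(), storage_.begin() + used_);
        used_ = 0;
        ++flush_count;
    }

    std::vector<char> sent;       // everything handed to the fake wire
    std::size_t flush_count = 0;  // number of flushes performed

private:
    std::vector<char> storage_;
    std::size_t used_;
};
```

Every element written one at a time passes through the capacity check
and the memcpy before it reaches the wire, whereas pointing the
transport directly at the source array skips both.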

For MPI serialization, in our application, there really is no
alternative to sending large data sets as single batches.

Based on past experience, I would expect you to challenge the claims
in the three paragraphs above and demand benchmarks that prove their
validity. Then, I would expect you to challenge the validity of the
tests. I really hope you will violate my expectations this time,
since it would be a waste of your time as well as ours.

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com
