Boost :


Subject: Re: [boost] RFC - Updated MapReduce library
From: Craig Henderson (cdm.henderson_at_[hidden])
Date: 2009-08-09 06:24:19

Next message: joel: "Re: [boost] RFC - Updated MapReduce library"
Previous message: Jarl Lindrud: "Re: [boost] [Serialization] Bizarre bug"
In reply to: joel: "Re: [boost] RFC - Updated MapReduce library"
Next in thread: joel: "Re: [boost] RFC - Updated MapReduce library"
Reply: joel: "Re: [boost] RFC - Updated MapReduce library"

> joel wrote:
>
> Some comments:
> * Why do your function objects expose a static method instead of being
> real functors or polymorphic function objects? What's the rationale
> behind this choice?

This interface has changed several times, and I still can't decide which form
is the most appropriate.
I have provided a base class to define the required types, hence the use of
a function object. However, it is dangerous to use a real functor with an
instance because the Map Tasks are independent of each other and run in
different threads. If they had instance data, then synchronization would
become an issue but, more importantly, it would break the programming model.
In a true distributed system, map tasks will run on separate machines and
will therefore be unable to share data. Support for this is intended for a
later release of the library, so I need to keep the design pure.
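
To illustrate the point about instance data, here is a minimal sketch of a
stateless task driven through a static method. It is not the library's
code: length_task and run_map_tasks are hypothetical names, and it uses the
C++11 std::thread rather than Boost.Thread so that it stands alone. Each
call receives everything it needs through its parameters and writes to its
own output slot, so no task object exists and nothing needs locking.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stateless task: all data arrives through the parameters,
// so there is no instance state to share between threads.
struct length_task
{
    static void map(std::string const &value, std::size_t &out)
    {
        out = value.size();
    }
};

// Hypothetical driver: runs Task::map once per input, each call on its
// own thread. No Task object is ever constructed, and every thread
// writes to a distinct element of outputs, so no locking is required.
template<typename Task>
std::vector<std::size_t> run_map_tasks(std::vector<std::string> const &inputs)
{
    std::vector<std::size_t> outputs(inputs.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < inputs.size(); ++i)
        workers.emplace_back(&Task::map, std::cref(inputs[i]),
                             std::ref(outputs[i]));
    for (std::thread &worker : workers)
        worker.join();
    return outputs;
}
```

Calling run_map_tasks<length_task>(inputs) returns one length per input
string. A functor with member data would instead need either one copy per
thread or a lock around every access to the shared members.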

>
> * Why are timing stats embedded into the library? I may not want to
> time things, and I don't want to pay whatever memory footprint those
> additional members carry.

These stats are very useful for research and testing, but I agree they are
less important in a production environment. The timings need to be built into
the library infrastructure because the library user does not have access to
the granularity of timing (without writing a bespoke schedule_policy). I can
look at making the timing another policy class, but I don't think the
overhead is really that significant, is it?
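
For illustration only, timing as a policy class might look something like
the sketch below. None of these names (no_timing, wall_clock_timing,
timed_runner) exist in the library; they are assumptions made for the
example, which uses the C++11 std::chrono clock. The idea is that an empty
policy costs nothing when timing is not wanted.

```cpp
#include <chrono>

// Hypothetical policy that records nothing. It is an empty class, so a
// runner that inherits from it typically pays no storage for timing.
struct no_timing
{
    void start() {}
    void stop() {}
    double elapsed_seconds() const { return 0.0; }
};

// Hypothetical policy that accumulates elapsed wall-clock time.
class wall_clock_timing
{
    typedef std::chrono::steady_clock clock_type;
public:
    wall_clock_timing() : total_(clock_type::duration::zero()) {}
    void start() { begin_ = clock_type::now(); }
    void stop() { total_ += clock_type::now() - begin_; }
    double elapsed_seconds() const
    {
        return std::chrono::duration<double>(total_).count();
    }
private:
    clock_type::time_point begin_;
    clock_type::duration total_;
};

// Stand-in for a job runner parameterized on the timing policy.
template<typename TimingPolicy>
class timed_runner : private TimingPolicy
{
public:
    template<typename Work>
    void run(Work work)
    {
        this->start();   // no-op when TimingPolicy is no_timing
        work();
        this->stop();
    }
    double elapsed_seconds() const
    {
        return TimingPolicy::elapsed_seconds();
    }
};
```

A user who wants the figures would instantiate timed_runner<wall_clock_timing>;
one who does not would choose timed_runner<no_timing> and carry no timing
members at all.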

> * The interface is, on the user side, extremely verbose. I think some
> functions could extract more information from their input types to
> lessen the number of type definitions the user has to write. Using the
> result_of protocol here and there could also save people from having to
> memorize various unrelated names for inner types.

I'm disappointed you think this. I have worked really hard to make the
interface as light as possible. If you compare the library interface to
other implementations such as Phoenix, I hope you'll agree that this library
is quite light.

The minimum implementation is:
struct map_task : public boost::mapreduce::map_task<k1, v1>
{
    // Signature only; the body is supplied by the user.
    template<typename Runtime>
    static void map(Runtime &runtime, std::string const &/*key*/,
                    value_type &value);
};

struct reduce_task : public boost::mapreduce::reduce_task<k2, v2>
{
    // Signature only; the body is supplied by the user.
    template<typename Runtime, typename It>
    static void reduce(Runtime &runtime, std::string const &key,
                       It it, It const ite);
};

int main()
{
    typedef
    boost::mapreduce::job<
        map_task
      , reduce_task
    > job_t;

    boost::mapreduce::specification spec;
    boost::mapreduce::results result;
    boost::mapreduce::run<job_t>(spec, result);
}

I am keen to make it lighter, though, if you can offer some specific
suggestions.

> * Giving performance figures is OK. Comparing them to a sensible
> existing solution is better. How fast are you with respect to a similar,
> hand-coded application using Boost.Thread or even OpenMP? Instead of
> giving absolute times, a better metric could be cycles per processed
> element, as it gives an architecture-independent measure of
> performance.

Agreed on the performance figures, and I'll provide some comparisons in the
future. Jose on this list has helped with some comparison with Phoenix, and
the results are comparable with the WordCount example. You'll appreciate
that I am limited to the machines I have access to, and Phoenix isn't
available on my platform.
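
As a rough sketch of a per-element metric of the kind suggested, the helper
below reports wall-clock nanoseconds per processed element. It is an
assumption-laden stand-in: nanoseconds_per_element is a hypothetical name,
it uses the C++11 std::chrono clock, and it measures time rather than
cycles because portable C++ has no cycle counter.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: runs work() once and returns the elapsed
// wall-clock time divided by the number of elements it processed.
// Returns 0.0 when element_count is zero to avoid dividing by zero.
template<typename Work>
double nanoseconds_per_element(Work work, std::size_t element_count)
{
    typedef std::chrono::steady_clock clock_type;
    clock_type::time_point const begin = clock_type::now();
    work();
    clock_type::time_point const end = clock_type::now();
    if (element_count == 0)
        return 0.0;
    std::chrono::duration<double, std::nano> const elapsed = end - begin;
    return elapsed.count() / static_cast<double>(element_count);
}
```

Multiplying the result by the processor frequency in GHz gives an
approximate cycles-per-element figure, on the assumption that the clock
frequency stays fixed during the run.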

>
> * On the same topic, why not provide an OpenMP version for compilers
> that support it?

Only because I am not familiar with OpenMP and haven't looked at it. It's
unlikely that I'll be able to do this, but if someone in the Boost community
would like to help out, I'd be delighted.

>
> * Give examples of other types of application that can or can't
> benefit from MapReduce. Can I write image processing with it? HPC?
> Which characteristics should my original application have to benefit
> from MR? Counting and sorting words is OK for a tutorial, but it doesn't
> give an incentive on how/why to use this library.

In the documentation I did say that I am not providing a tutorial on
programming in MapReduce, but maybe I will one day. I do, however, recognize
that one example does not demonstrate the possibilities for the library, and
I will be providing more samples in the future.

Thanks for your feedback.

-- Craig

