Boost logo

Boost :

From: Craig Henderson (cdm.henderson_at_[hidden])
Date: 2002-09-10 02:57:49


"Baptiste Lepilleur" <blepctc_at_[hidden]> wrote in message
news:002001c25850$a4716cc0$295ec3d4_at_gaia...
> > template<typename T>
> > class lcs_change : boost::noncopyable
> > {
> > private:
> > std::size_t rec_num_; // record number
> > record_type type_; // record type (see enum above)
> > const T &data_; // record data
>
> Hmm, are all this data really needed ? Couldn't lcs_change be reduced to
> record_type ? It should be possible to recompute other using the original
> sequence.

'Needed' is a good question :-) There is a balance between performance and
storage here. record_type (bad name, really, but has it has a history. I
think it would be better called change_type) on its own is not enough to
reconstruct the second sequence from the first. It simply contains an
enumerated value of KEEP, REMOVE or INSERT, it does not contain any
information which element of the sequence it is talking about. How about I
typedef record_type to be unsigned char and add constants instead of an enum
for the three values. This does loosen the design a little, but may be worth
the price for saving 3 bytes of storage per change?

The data_ member can be removed, but it does place restrictions of the
sequence data type for getting to the information. If the lcs_change class
only contains the element index and the (renamed) change_type, then access
to the data will require a random access iterator on the original sequence.
This would then limit the supported containers to vector, deque, string,
array and valarray.

> Unless I'm mistaken, the current scheme is not very space
> effecient (rougly 12 bytes per change). If you are comparing large text on
a
> character basis that easily size around 1Mo, then the *12 factor became
> really meaningful.

I agree. There are also other space concerns with this algorithm. To
calculate the lcs sequence, an array of
(1+num_first_elements)*(1+num_second_elements)*sizeof(unsigned int) is
required. This becomes very large, very quickly. A proposed extension to the
current implementation (documented in the source header file, unhelpfully to
you!) is to provide an interface for handling large sequences wher the lcs
working array size is larger than that available. This is a fairly
non-trivial sliding implementation, and as yet has not been designed or
implemented. I hadn't mentioned it on the list yet, as I didn't want to open
the discussion too far without the source available, unless anyone brought
the subject up.

>
> The 'computed' data could be provided by for convenience by sugar coat
> iterator (see by the end of the mail).
>
> [...]
>
> > template<typename Seq, typename Res=std::list<
lcs_change<Seq::value_type>
> *> >
> > class longest_common_subsequence : boost::noncopyable
>
> It seems that it should really be:
>
> template<typename ValueType, typename ResultSequence = std::deque<
> lcs_change<ValueType> > >
> class longest_common_subsequence : boost::noncopyable
>
> I believe deque offer a better memory growth scheme (page based). Making a
> double liked list of pointers on lcs_change seems overkill. Or, does the
> algorithm need to insert/remove change in the middle of the result set
> (which would explain the choice of slist) ?

No, you're right, std::deque<> would be a better choice. Thanks, I'll change
that.

> What is the rational for making lcs_change and longest_common_subsequence
> non copyable ? There does not seem to be anything wrong by allowing them
to
> be copyable.

Um, you're challenging me now :-)
These restrictions can be removed without causing any problems. The
longest_common_subsequence was made noncopyable because it held the lcs
working array (as described above) allocated for longer than the duration of
the main algorithm. I made the class noncopyable to prevent this large array
being duplicated and consuming memory unnecessarily. This is no longer the
case (you'll be pleased to hear). However, copying an instance will
duplicate any change_list, which could be large, but perhaps this isn't a
reason to disallow it, I'll mention it in the docs as a performance warning.

>
> [...]
>
> > > 4. Why are you using pointers and not a references?
[...]
> Boost follows Scott Meyers "quality programming practices" given in his
> "Effectice C++" series (see guideline on boost page). Among others, there
is
> a guideline concerning the use of pointer and reference, which, if I
> remember well, goes down to this: "use pointer only when you need to
> represent a NULL pointer, otherwise use reference."

Fair enough.

> > > Here how I invision it
> > >
> > > template<typename InIt1, typename InIt2, typename OutIt >
> > > void
> > > longest_common_subsequence( InIt1 first_begin, InIt1 first_end,
> > > InIt2 second_begin,
> > > OutIt& res_begin );
> > >
> > > template<typename InIt, typename DiffIt, typename OutIt >
> > > void
> > > apply_diff( InIt input_begin, InIt input_end, InIt diff_begin, OutIt
> > > res_begin );
> >
> > Yep, pretty much my though if I get rid of the class.
>
> I actually prefer this interface: it keeps the algorithm interface as
simple
> as possible and mesh very well with existing STL algorithm.

I was toying with the idea of removing the class, but overnight I have
decided not to. I think the interface could expand to provide a richer set
of lcs algorithm functions, and should remain a class. For example, yet
another (!) addition will be to add a method to calculate the length, but
not the content of the longest common subsequence. This requires far less
memory and is a faster algorithm.

>
> It makes it less clumsy to add functions to extend the interface, and
allows
> for automatic type deduction. It also keep the algorithm apart from sugar
> coat functions like apply_diff (by the way, what is it use ? Since
> lcs_change stores a reference on the original data, The second sequence
> obviously still need to exist for apply_diff to work...)

Ah, I think you've found a flaw in the design. It should not be necessary
for the second sequence to exist in order to reconstruct it! That's not
really very useful, and highlights a limitation of my test harness thus far.
I am proposing to add persistence to the lcs_change to store and retrieve
the changes and to enable the user to reconstuct a sequence from an original
sequence + a persisted change_list. As you say, this won't work with the
current design. I'll have to revisit this.

>
> Also, it is a lot more flexible: I can do some really twisted thing in the
> assigment operator of res_begin if I want to. Suppose that lcs_change have
> been simplified down to record_type, I could store the change in a bitset
> using only 2 bits, or recompose a convenient sequence change (as in the
> current lcs_change) on the fly, or do some other processing right away,
> without even needing to store the 'changes'.

Agreed.

>
> In addition, depending on iterators and not sequences provides more
> flexibility. LCS may be used by another algorithm which may only have
> iterators as input!

I provide two ctor, one takes the two sequences by reference and the other
takes begin/end iterator pairs for the sequences. The first simply calls
begin() and end() on the two sequences and _only_ stores the iterators, so
whichever you call, you end up with the same object.

>
> Keeping the algorithm interface as simple and as generic as possible is I
> think a very important element (Guru of the week #84,
> http://www.gotw.ca/gotw/084.htm, helped me a lot understanding that
aspect).
> Helper functions can then easily be added to make the interface friendlier
> if need be.

Agreed.

>
> > It looks like we are fairly close on our thoughts here. Thanks for your
> > interest and all your comments. Like I said, I'll put the code and docs
in
> > the sandbox asap. Any more thoughts, ideas or comments, please post them
> :-)
>
> I look forward to it,
> Baptiste.

Thanks, Baptiste, for your interest and constructive comments. I'll
implement the changes indicated above and they'll go into the sandbox - v.
soon. It is the docs that are holding it up at the moment :-)

Regards
-- Craig


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk