Boost logo

Boost :

From: Calum Grant (calum_at_[hidden])
Date: 2005-10-23 10:14:06


Arkadiy Vertleyb wrote
> "Calum Grant" <calum_at_[hidden]> wrote
>
> > > Calum, is it possible to see the latest code of your example?
> >
> > Here it is.
>
> I see...
>
> A couple of questions:
>
> 1) You modify counter all the time -- does RML allow to do
> this with the fields on which the table is sorted?

Yes - by using table.update_row() or table.update_where(), the indexes
are automatically updated. That's the whole point of RML - it
uses/updates indexes implicitly. If you decide to index a column, and
no code needs to change.

> 2) When you say :
>
> select(m_locations,
> m_locations.x>=loc.x-m_cutoff &&
> m_locations.x<=loc.x+m_cutoff &&
> m_locations.y>=loc.y-m_cutoff &&
> m_locations.y<=loc.y+m_cutoff &&
> m_locations.z>=loc.z-m_cutoff &&
> m_locations.z<=loc.z+m_cutoff)
>
> Do you use any kind of special index for this?

Yes - the TMP query analyser sees that m_locations.z is an indexed
column, and uses that index. If on the other hand x or y were indexed,
that index would be used in preference to z.

If I had implemented user-defined queries, I could write

        select(m_locations,
                     m_locations.x>=loc.x-m_cutoff &&
                     m_locations.x<=loc.x+m_cutoff &&
                     m_locations.y>=loc.y-m_cutoff &&
                     m_locations.y<=loc.y+m_cutoff &&
                     m_locations.z>=loc.z-m_cutoff &&
                     m_locations.z<=loc.z+m_cutoff &&
                query(test_distance(loc)) )

where test_distance would be some user-defined functor to check the
distance more precisely. Even in this scenario, the TMP query analyser
would use m_locations.z as an index. It's not magic, it just uses the
first index is sees.

> In any case this seems to prove that RML does a good job when
> searching on multiple indexes -- you do quite a bit of that.
> Also, inserts/updates are probably very efficient.

Insertion carries the same overheads as Boost.MultiIndex. Update is
possibly more efficient, since nodes can be relocated in a single index
- without needing to destroy/create a new indexed item. This is one of
the inefficiencies of std::set or std::map - if you change the key you
need to allocate a new node.

> But, as I said before, the overall efficiency of this example
> is based on the fact that this is a very smart, custom,
> hand-coded solution (see how you maintain the counter). No
> library would be able to beat its performance using a
> higher-level abstraction (If RTL had a count() operator in
> addition to generic groupby(), we could be much smarter in
> maintaining it during updates).

The RML solution is actually slower because of the incremental updates
to the "neighbours" column. Had they been calculated in one go, then
the query would I suspect run faster.

I would really like to be able to write

  for_each(
    limit(
      sort( (descending(col<2, 0>()), col<0, Location::name>() ),
        group_by( col<0, Location::id_column>(),
         select(
          (locations, locations),
          col<1, Location::x_column>() <= col<0, Location::x_column>() +
m_cutoff &&
          col<1, Location::x_column>() >= col<0, Location::x_column>() -
m_cutoff &&
          col<1, Location::y_column>() <= col<0, Location::y_column>() +
m_cutoff &&
          col<1, Location::y_column>() >= col<0, Location::y_column>() -
m_cutoff &&
          col<1, Location::z_column>() <= col<0, Location::z_column>() +
m_cutoff &&
          col<1, Location::z_column>() >= col<0, Location::z_column>() -
m_cutoff &&
          query(test_distance())
         ),
        ),
       ), 50),
    display_result()
  );

In order to achieve this, I would need to implement sorting using
multiple columns, group_by(), and query(). And this query would I
suspect run as fast as the original. (Though it would not support
dynamic update).

So the fact that I needed to split my query up is nothing to do with the
fundamental design of RML. It was because some of the functions were
not implemented. I will work towards getting these extra functions
implemented, but I have a long todo-list and it won't happen
immediately.

> But again, I don't believe this all proves RML is faster than
> RTL, since the coding approaches were so different. I don't
> even know if it's possible to compare the libraries at this
> point, since they have their strong points in different
> areas. RTL's strong point in the ability to build and run
> SQL-like views. This I tried to show in my example. RML
> seems to be very efficient at maintaining and using multiple
> indexes, but, despite its more SQL-like interface, can
> mimique only very simple SQL queries.

You're only saying that because I haven't got around to implementing
group_by and multi-column sorts. These are less commonly used features
that were not a priority to implement. But they would not be difficult.
When RML can implement the entire query in one statement, we can re-run
the tests if you'd like.

Indexed views can give performance improvements in real SQL databases -
but they can also reduce performance. I believe the same phenomenon
might be occurring in your in-memory relational models.

I believe that you would gain a huge performance increase by using
Boost.MultiIndex as your underlying data structure (as opposed to
std::vector), and indexing the columns you require. That is I believe
the fundamental architectural difference between RML and RTL. By
storing the columns pre-indexed, you don't have any extra overhead.

Regards, Calum


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk