Boost logo

Boost :

From: Calum Grant (calum_at_[hidden])
Date: 2005-10-09 05:52:48


> > Here is the equivalent in RML. Forget 3000 rows, what about adding
> > 1000000 rows, then adding another 1000000 to it? What time
> does the
> > equivalent RTL take?
>
> Calum, I think you are missing my point. The above example was
> provided to illustrate my two statements:
>
> 1) RTL can index any (arbitrary complicated) expression, not just
> table;

In SQL/RML you can index any arbitrarily complicated expression.

If you need to index something complex, you create a column for it. For
example if you had a table of people, with (name, height, weight). If
you needed to work with bmi = weight/(height^2), you create another
column for it.

        people.insert(Person(name, height, weight,
weight/(height*height));

The RTL approach would not need to add the fourth column, and that is an
advantage that RTL has over RML.

Similarly for self-join. If you wanted to store an all-pairs distance
between two cities, you create a table (city1, city2, distance).

But in practice, in SQL I have never seen an index on anything other
than a column. Occasionally I have seen a multi-column index. So this
is why supporting computed indexes has not been a priority for RML.

> 2) RTL can _incrementally_ update any such index when the underlying
> table has been changed.

Same in RML. You add data, indexes change automatically. You change a
field in a row, indexes update automatically. Probably more
efficiently.

> Did I use the most efficient expression for this particular task? Of
> course not. I used cross-product where I would use a range (or merge)

> join otherwise. I did it _intentionally_, because I wanted to provide

> an analogy of a complicated condition, such condition for which
> optimization is simply not possible.
>
> I am pretty sure your library figured out how to do this query
> efficiently, without considering every possible combination (it didn't

> consider 10^12 combinations, did it?). However, what it would do if
> the condition was more complicated? Like calculating the distance,
> matching people by certain characteristics, etc.? RTL can build an
> index on even slow expression, after which subsequent queries become a

> few orders of magnitude faster. And the indexes are not broken when
> the tables are changed.

That is a nice idea. One could generalize - have a generic observer on
any container. It could turn your entire application data-driven.
Nice. I have seen observer-driven architectures work extremely well in
practice.

> Having said all this, RTL doesn't enforce this kind of usage. If the
> query needs to be executed only once, or if one doesn't want to consum

> memory for indexes, any appropriate relational expression can be
> built, with any number of indexes in different nodes, to achieve
> needed tradeof between speed, initial response, memory consumption,
> etc. My firm belief is that this kind of flexibility is simply
> impossible with SQL-like approach.

I wouldn't knock the SQL approach - it is an extremely effective
paradigm for data query and represention. There are some things that
would not be possible in SQL, but one can always work around them. For
example maintaining an all-pairs distance between two cities. One can
simply declare a table (city1, city2, distance), and index on the
columns you wanted. Whereas in RTL this data would be stored implicitly
in an index, RML would require an explicit table.

The reason RML uses the SQL paradigm is because I know that it works.

The RTL view approach would make this much easier to maintain. The
question is, what are the overheads of this convenience?

> > I don't think self-joins come up that often in real life however.
>
> I think they are. I just gave examples where they could be (and are)
> used. This, however, is not about a self-join. This is about any query

> which is slightly more complicated than just a simple index-based
> selection.

Self-joins don't come up in SQL.

The assumption you make is that the query analyser is some kind of black
box that will fail in mysterious ways. Actually the rules for query
optimization are very well spelt out in the tutorial. It is precisely
because the rules for indexing are so straightforward that there will be
no unexpected performance penalties in RML.

It will pick indexes in any complex multi-table conjunction for any
comparators.

> Once again, I don't believe this is a right time for any performance
> contests. They will not prove anything. I think RML will be faster
> in some cases, and RTL -- in the others. And a couple of days spent on

> optimization can dramatically change the picture.

I think you should work on that, and then we can see what's what. The
only reason I make a fuss is because you say your approach is more
efficient.

> What I would like to be discussing instead is the _design principles_
> behind the libraries.

Okay, that is a good way forward.

RTL has (from what I understand) an observer-driven architecture.
Correct me if I am wrong. This means that you don't index the
underlying data, you index views of the data.

RML on the other hand is a multi-index architecture. The query analyser
is basically a way of avoiding writing complex index probes, iteration
and filters. From an efficiency point of view, it is exactly the same
as writing out C++ to manipulate a multi-index.

So, RML has the potential to be more efficient because multi-indexes are
more efficient than multiple single-indexes.

Best Regards, Calum


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk