|
Boost : |
From: Arkadiy Vertleyb (vertleyb_at_[hidden])
Date: 2005-10-09 08:48:47
"Calum Grant" <calum_at_[hidden]> wrote
> In SQL/RML you can index any arbitrarily complicated expression.
>
> If you need to index something complex, you create a column for it. For
> example if you had a table of people, with (name, height, weight). If
> you needed to work with bmi = weight/(height^2), you create another
> column for it.
>
> people.insert(Person(name, height, weight,
> weight/(height*height));
>
> The RTL approach would not need to add the fourth column, and that is an
> advantage that RTL has over RML.
>
> Similarly for self-join. If you wanted to store an all-pairs distance
> between two cities, you create a table (city1, city2, distance).
Right, so that your user has to maintain this "index" by hand. And if you
have many such queries, your user has to manually maintain all these
"indexes". May become a major inconvenience. RTL can do better than that.
> But in practice, in SQL I have never seen an index on anything other
> than a column. Occasionally I have seen a multi-column index. So this
> is why supporting computed indexes has not been a priority for RML.
The priority of RTL has been to provide maximum flexibility for _any_ kind
of query, no matter how complex. With simple queries it's hard to explain
why a relational engine should be used rather than, say, Boost.MultiIndex.
> > 2) RTL can _incrementally_ update any such index when the underlying
> > table has been changed.
>
> Same in RML. You add data, indexes change automatically. You change a
> field in a row, indexes update automatically. Probably more
> efficiently.
Yes, but RML indexes tables, and this is the same as Boost.MultiIndex does.
RTL allows to indexes _any_ expression, which results in completely new
capabilities. I don't consider this, BTW, a weekness of RML -- I consider
this a major strength of RTL. I don't know of any other library or RDBMS
that provides the same.
> > Did I use the most efficient expression for this particular task? Of
> > course not. I used cross-product where I would use a range (or merge)
>
> > join otherwise. I did it _intentionally_, because I wanted to provide
>
> > an analogy of a complicated condition, such condition for which
> > optimization is simply not possible.
> >
> > I am pretty sure your library figured out how to do this query
> > efficiently, without considering every possible combination (it didn't
>
> > consider 10^12 combinations, did it?). However, what it would do if
> > the condition was more complicated? Like calculating the distance,
> > matching people by certain characteristics, etc.? RTL can build an
> > index on even slow expression, after which subsequent queries become a
>
> > few orders of magnitude faster. And the indexes are not broken when
> > the tables are changed.
>
> That is a nice idea. One could generalize - have a generic observer on
> any container. It could turn your entire application data-driven.
> Nice. I have seen observer-driven architectures work extremely well in
> practice.
We don't use observer -- we think it's too expensive. We utilize our
transaction mechanism instead. We store all the modifications inside the
transaction object, and, when transaction is committed, re-calculate every
index involved.
> > Having said all this, RTL doesn't enforce this kind of usage. If the
> > query needs to be executed only once, or if one doesn't want to consum
>
> > memory for indexes, any appropriate relational expression can be
> > built, with any number of indexes in different nodes, to achieve
> > needed tradeof between speed, initial response, memory consumption,
> > etc. My firm belief is that this kind of flexibility is simply
> > impossible with SQL-like approach.
>
> I wouldn't knock the SQL approach - it is an extremely effective
> paradigm for data query and represention. There are some things that
> would not be possible in SQL, but one can always work around them. For
> example maintaining an all-pairs distance between two cities. One can
> simply declare a table (city1, city2, distance), and index on the
> columns you wanted. Whereas in RTL this data would be stored implicitly
> in an index, RML would require an explicit table.
IMO, SQL is so successfull because of the client-server architecture. Any
scripting language would do -- SQL just happend to win the market. AFAIK,
the database comunity is not really considering SQL the way to proceed.
> The reason RML uses the SQL paradigm is because I know that it works.
>
> The RTL view approach would make this much easier to maintain. The
> question is, what are the overheads of this convenience?
>
> > > I don't think self-joins come up that often in real life however.
> >
> > I think they are. I just gave examples where they could be (and are)
> > used. This, however, is not about a self-join. This is about any query
>
> > which is slightly more complicated than just a simple index-based
> > selection.
>
> Self-joins don't come up in SQL.
Why?
SELECT e.name as employee, m.name AS manager
FROM employees e, employees m
WHERE e.manager = m.code
(if I remember the syntax correctly)
> The assumption you make is that the query analyser is some kind of black
> box that will fail in mysterious ways. Actually the rules for query
> optimization are very well spelt out in the tutorial. It is precisely
> because the rules for indexing are so straightforward that there will be
> no unexpected performance penalties in RML.
>
> It will pick indexes in any complex multi-table conjunction for any
> comparators.
>
> > Once again, I don't believe this is a right time for any performance
> > contests. They will not prove anything. I think RML will be faster
> > in some cases, and RTL -- in the others. And a couple of days spent on
>
> > optimization can dramatically change the picture.
>
> I think you should work on that, and then we can see what's what. The
> only reason I make a fuss is because you say your approach is more
> efficient.
Well, as follows from my other post, I am not so sure RTL is less efficient
than RML, even right now ;-)
> > What I would like to be discussing instead is the _design principles_
> > behind the libraries.
>
> Okay, that is a good way forward.
>
> RTL has (from what I understand) an observer-driven architecture.
> Correct me if I am wrong.
No, as I already mentioned, we don't use observers -- we think it's too
expensive. We let the user specify explicitly what needs to be maintained.
Otherwise, we could end up maintaining what's not required.
> This means that you don't index the
> underlying data, you index views of the data.
Correct. Data has one index, maintained by the underlying implementation,
currently sorted std::vector or Boost.MultiIndex (we use Boost.MultiIndex
not because of its multi-index capabilities, but rather because it allows
range queries on partial predicate, which std::set doesn't. All the other
indexes are applied and maintained externally. They can be built on top of
any relations, including tables.
The reason that such indexing is possible is that any relation in RTL allows
for range queries (equal_range, lower_bound, upper_bound), that are
expressed through the range queries of this relation's parameters. Also RTL
garantees uniqueness of the key for any relation. This allows for indexing,
where index stores such keys.
> RML on the other hand is a multi-index architecture. The query analyser
> is basically a way of avoiding writing complex index probes, iteration
> and filters. From an efficiency point of view, it is exactly the same
> as writing out C++ to manipulate a multi-index.
>
> So, RML has the potential to be more efficient because multi-indexes are
> more efficient than multiple single-indexes.
Maybe on simple queries, but I am still not sure. OTOH, RTL has flexibility
which allows to build complex views, maintain them automatically, and
execute them with maximum efficiency.
Regards,
Arkadiy
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk