Adam Wulkiewicz wrote:
Hi,

Jpb Via Geometry wrote:
Hi Guys,

I have an rtree where each item in the tree has a group ( Colors perhaps:
red, green, blue, orange...)
I want to do a nearest neighbor search where the search returns the nearest
k-neighbors with at most
one object from each group. 

For example, if we have 

Object A: Red,    distance 1
Object B: Red,    distance 2
Object C: Blue,   distance 3
Object D: Green, distance 4

and I do a nearest neighbor at most one from each group search with k = 2
then
I get object A and C. B is skipped since A is closer of the same group.

It looks like I could get this done by creating a specialization of
distance_query_result class and have the distance_query_result::store method
to check the groups/distances of already stored items and act accordingly.
Then create a specialization of distance query and maybe query_dispatch to
get distance_query to use the new distance_query_result. 

Is there a better way to go about this?

No, there is no "official" (i.e. exposed in the interface) way of doing it besides calling query N times for each group/category (using nearest() && satisfies()) and after that merging the results. So your approach is more optimal than that however it relies on the internals of the R-tree so have in mind that they may be changed some time in the future.

I'd suggest keeping N stacks (neighbors) one for each group and track them separately.
You also have to somehow pass the number of groups into the distance_query_result.
Furthermore you have to implement distance_query_result::greatest_comparable_distance() and return the greatest distance from all groups but only if all heaps contain at least one element each.

But I'm wondering, are the elements from all groups distributed more or less in uniform fashion? I'm asking because otherwise the query may have to traverse big chunks of the tree before finding at least one element from each group. Consider the situation where there is a clear separation between groups and groups are spatially far from each other. In this case my guess is that it'd be better to have separate R-tree for each group, perform knn query on all of them and merge the results.

An alternative to the solution above (for groups spatially far from each other) would be to have one R-tree but to store in the nodes the information about the groups of objects sotred in underlying nodes and custom query visitor taking into account currently stored neaighbors and groups covered by analysed nodes. So the idea is that the pruning of nodes would be done both using the distance to the furthest neighbor already found as well as groups of neighbors. This would not require to merge the result at the end since the query woudl handle everything. But it would require even more changes to the R-tree including introducing new nodes (containing info about underlying groups), insertion visitor (gathering the info about the groups from children and passing it to the newly created node) and packing algorithm (if needed, the same as insertion).


I just realized that you want to get at most one element from each group, not at least one element. So in this case you don't have to store N stacks but only maybe some flags indicating elements of which groups you already have.

However the lack of uniformity of distribution of various groups may still affect the performance because effectively you want to have one closest element from each group and among these pick k nearest ones. So here again if the groups were far from each other you'd probably do better if you had N R-trees one for each group and after N queries sort the elements and pick k of them.

Keep in mind that all of this is only what my intuition tells me. If you want to be sure you have to check it in practice.

Regards,
Adam