Jackrabbit searches across joined nodes

Question

I have tagged objects in a Jackrabbit repository (actually Adobe/Day CQ's CRX, but I think this is the Jackrabbit code):

asset: tags = A, B
- child asset data 1: tags = A, C, E
- child asset data 2: tags = D, E

I want to query against the union of the parent asset's set of tags and one child, i.e. "B C" would match the asset because we have those in the parent and in child 1, but "C D" would not match because there's no combination of parent and one child that matches that because C and D are split across separate child data nodes.

Is there a way to do this in Jackrabbit? We can write an XPath query

\\element(*, dam:Asset)[(@tags = 'C' or *\@tags='C')
                        and (@tags = 'D' or *\@tags='D')]

but that won't work because XPath doesn't seem to guarantee that the * joined child assets are the same, i.e. this means "any child has C/D" and so will match my asset because 1+ children have a C and 1+ children have a D. Instead I could use JCR-SQL2

SELECT * FROM dam:Asset as asset
  LEFT OUTER JOIN nt:unstructured as child ON ISCHILDNODE(child,asset)
  WHERE (asset.tags = 'C' or child.tags = 'C')
    AND (asset.tags = 'D' or child.tags = 'D')

but there's no SELECT DISTINCT in JCR-SQL2: if instead I search for "B E" I will get this asset returned twice because this matches both asset+child1 and asset+child2.

I could postprocess either query result in Java, i.e. filter out false-positive matches for the first case or filter out duplicate results for the second case, but I'm nervous how this would affect paging performance: I'd need to scan more nodes than necessary to weed out bad nodes, and I'd need to scan the lot to compute the correct result size for paging. This ought to be cheaper for the second SQL2 case because if my search is ordered I can spot duplicates based on the node path alone and all duplicates will be consecutive, so I can find a given page's worth of data with cheap scanning only hopefully without reading the whole node for each result, but I don't know the cost of scanning all results for the paging count too even for the simple path-only case.

Yet another option we've considered is denormalising the tags into a single node. In this case, to keep the search accurate, that would have to mean creating a new combined_tags attribute in each child node and perform all searches against the set of child nodes only. However this still suffers from the distinct problem should we match two child nodes beneath the same asset.

Thanks for any suggestions. This is a large instance already and will need to scale further. I've seen other questions which say ModeShape is a JCR implementation that does have SELECT DISTINCT but I think switching to ModeShape just for that would have to be the last resort, if indeed it's possible to host CQ on ModeShape.

One idea we've come up with now is to compute each union of the asset tags and child tags and combine the tags into a single string then write each value as a multivalued property of the asset, i.e. asset + child1 = "A B C E" and asset + child2 = "A B D E", so we get

asset: tags = A, B; tagUnions = "A B C E", "A B D E"

As long as we define a fixed order for combining tags into a string (e.g. alphabetical) we can the search for any combination using tagUnions LIKE '%B%C%' (except I'd use proper delimiters between tags in the real case). Whilst this will work as far as we can see I don't really like it: there's potentially large numbers of tags per asset+child, all with longer names than single letters meaning we'll end up with long strings executing LIKE queries on all of them which likely can't be indexed efficiently.

Another take on this is to make a bitmask: define A=1, B=2 etc. and so store a multivalued integer array here then carry out a bitwise comparison. However that's likely limited to 64 different tags and since we have 1,000+ I don't think we can do this - even if JCR supports bitwise operations, which I'd expect it won't.

So I'm still on the lookout for a clean database-like solution for this. You've missed the bounty I put up but there's still ticks, votes and gratitude for any help.

MrGomez MrGomez · Accepted Answer · 2012-03-28T21:29:46

From the Apache Jackrabbit mailing list:

Yes, unfortunately union queries are not supported. Any work on that area would be much appreciated.

Meanwhile the best workaround is probably to do two separate queries and to explicitly perform the union in the application code by combining the two result sets.

So, that's out as an option. Looking at the SQL you've provided:

but there's no SELECT DISTINCT in JCR-SQL2: if instead I search for "B E" I will get this asset returned twice because this matches both asset+child1 and asset+child2.

I looked at the possible solutions supported by Jackrabbit and came up empty handed. However, I agree with the solution presented here:

What I did is to do a simple SELECT with appropriated ORDER BYs... then each time I used a row, I veried that it isn't the same as the previous :-)

(Sics preserved.)

While the ORDER BY is potentially dubious unless you require database-backed sorting, is there anything preventing you from constructing a hashset in your controller to limit your results to only unique values using the JCR API?

Jackrabbit searches across joined nodes

1 Answers