I have tagged objects in a Jackrabbit repository (actually Adobe/Day CQ's CRX, but I think this is the Jackrabbit code):
- asset: tags = A, B
- child asset data 1: tags = A, C, E
- child asset data 2: tags = D, E
I want to query against the union of the parent asset's set of tags and one child, i.e. "B C" would match the asset because we have those in the parent and in child 1, but "C D" would not match because there's no combination of parent and one child that matches that because C and D are split across separate child data nodes.
Is there a way to do this in Jackrabbit? We can write an XPath query
\\element(*, dam:Asset)[(@tags = 'C' or *\@tags='C')
and (@tags = 'D' or *\@tags='D')]
but that won't work because XPath doesn't seem to guarantee that the *
joined child assets are the same, i.e. this means "any child has C/D" and so will match my asset because 1+ children have a C and 1+ children have a D. Instead I could use JCR-SQL2
SELECT * FROM dam:Asset as asset
LEFT OUTER JOIN nt:unstructured as child ON ISCHILDNODE(child,asset)
WHERE (asset.tags = 'C' or child.tags = 'C')
AND (asset.tags = 'D' or child.tags = 'D')
but there's no SELECT DISTINCT
in JCR-SQL2: if instead I search for "B E" I will get this asset returned twice because this matches both asset+child1 and asset+child2.
I could postprocess either query result in Java, i.e. filter out false-positive matches for the first case or filter out duplicate results for the second case, but I'm nervous how this would affect paging performance: I'd need to scan more nodes than necessary to weed out bad nodes, and I'd need to scan the lot to compute the correct result size for paging. This ought to be cheaper for the second SQL2 case because if my search is ordered I can spot duplicates based on the node path alone and all duplicates will be consecutive, so I can find a given page's worth of data with cheap scanning only hopefully without reading the whole node for each result, but I don't know the cost of scanning all results for the paging count too even for the simple path-only case.
Yet another option we've considered is denormalising the tags into a single node. In this case, to keep the search accurate, that would have to mean creating a new combined_tags attribute in each child node and perform all searches against the set of child nodes only. However this still suffers from the distinct problem should we match two child nodes beneath the same asset.
Thanks for any suggestions. This is a large instance already and will need to scale further. I've seen other questions which say ModeShape is a JCR implementation that does have SELECT DISTINCT
but I think switching to ModeShape just for that would have to be the last resort, if indeed it's possible to host CQ on ModeShape.
One idea we've come up with now is to compute each union of the asset tags and child tags and combine the tags into a single string then write each value as a multivalued property of the asset, i.e. asset + child1 = "A B C E" and asset + child2 = "A B D E", so we get
- asset: tags = A, B; tagUnions = "A B C E", "A B D E"
As long as we define a fixed order for combining tags into a string (e.g. alphabetical) we can the search for any combination using tagUnions LIKE '%B%C%'
(except I'd use proper delimiters between tags in the real case). Whilst this will work as far as we can see I don't really like it: there's potentially large numbers of tags per asset+child, all with longer names than single letters meaning we'll end up with long strings executing LIKE
queries on all of them which likely can't be indexed efficiently.
Another take on this is to make a bitmask: define A=1, B=2 etc. and so store a multivalued integer array here then carry out a bitwise comparison. However that's likely limited to 64 different tags and since we have 1,000+ I don't think we can do this - even if JCR supports bitwise operations, which I'd expect it won't.
So I'm still on the lookout for a clean database-like solution for this. You've missed the bounty I put up but there's still ticks, votes and gratitude for any help.