Are Mondrian / OLAP the wrong tool for joining large dimensions/sets?

Question

Summary: Most of the examples I've seen of MDX joins have involved joining relatively small sets, say with tens or hundreds of items each. But I find myself also wanting to try joining (in particular "non-empty joining") sets that have thousands or tens of thousands of items each, and it's not working well so far. I'm wondering if this could be made to work, or if I perhaps need to consider using something other than Mondrian/OLAP.

To be concrete, I have a cube that records interactions between Firms (n=7000) and Clients (n=27000). Currently both Firm and Client are completely flat hierarchies; there's the All level and the individual-company level, with no other levels in between. There is a central fact table, and separate dimension tables for Firms and for Clients.

My users at least appear to want to get summary reports along these lines, aggregating all the non-empty interactions between Firms and Clients:

select
  [Measures].[Amount] on columns,
  NonEmptyCrossJoin([Firm].Children,
                      [Client].Children) on rows
from MyCube

But this query and variations on it don't work in my test Mondrian setup. Either I get an OutOfMemoryException (on a 2GB Java heap), or Java seems to spend impossibly long time in mondrian.rolap.RolapResult$AxisMember.mergeTuple(TupleCursor). (I can provide a more complete stack trace if it would help.) By "impossibly long" I mean Java will stay slaving away at the query for hours and hours before I give up.

I initially expected the above query to perform ok, because conceptually it could be done somewhat efficiently by just doing a SQL query along these lines:

select Firm, Client, Sum(Amount) as n
from fact, firm, client
where fact.firmid = firm.firmid and fact.clientid = client.clientid
group by Firm, Client

(In fact, if I execute something like this directly in MySql it doesn't take more than 15sec to execute.)

But from the debug logs Mondrian doesn't seem to attempt this optimization. Instead it appears to be doing the join internally, and in a way that ends up being particularly slow. I've set mondrian.native.crossjoin.enable=true in my mondrian.properties, but this doesn't seem like one of the join types that Mondrian is able to "make native". (If I turn on mondrian.native.unsupported.alert=ERROR then I get the corresponding exception.)

I'm left wondering whether I need to prevent my users from attempting joins on such large dimensions/sets, or whether Mondrian is maybe not the tool I'm looking for here. But maybe I'm just doing something wrong.

Chris Chris · Accepted Answer · 2011-11-30T03:25:22

To follow up, I tried setting up an analogous cube in Sql Server Analysis Services (Sql Server 2008), and it seems that icCube has a point about different OLAP tools performing differently:

Even before I learned much about SSAS best practices, performance on this type of MDX was much improved. A query along these lines

select
  [Measures].[Amount] on columns,
  NON EMPTY
  crossjoin([Firms].[Firm Name].Children,
            [Clients].[Client Name].Children)
  on rows
from MyCube

went from being non-viable with Mondrian to taking around ten seconds under Sql Server. Conceivably this has to do with MS' Business Intelligence Development Studio guiding me into creating a MOLAP cube by default, or perhaps SSAS has a smarter query planner.

In any case, perhaps this is fast enough for me. If not, I'm not yet sure how much more optimized SSAS can get in this case. (One disappointing thing is that, even when I re-run the query a second time, it still takes about 10 sec; I was hoping caching might have a more dramatic effect.)

Tangentially, you may notice in the just-quoted MDX I've replaced my original NonEmptyCrossJoin with a normal crossjoin combined with NON EMPTY. This is because, at least in the Sql Server world, NonEmptyCrossJoin is apparently regarded as a deprecated bad practice. (This is noted in Microsoft's MDX Language Reference. Mosha, one of the former SSAS devs, describes the situation in an article called MDX: NonEmpty, Exists and evil NonEmptyCrossJoin. Short version is that NonEmptyCrossJoin has confusing semantics and limited application, and since Sql Server 2005 or so, the query optimizer has been smart enough to make your query fast without NonEmptyCrossJoin.) So I've substituted a more modern approved equivalent in the above MDX. (It does still work with NonEmptyCrossJoin as well, though NonEmptyCrossJoin does not speed things up at all.)

Are Mondrian / OLAP the wrong tool for joining large dimensions/sets?

4 Answers